Writing ARM Assembly

= Overview =

This page goes over the basics of writing ARM assembly on the OMAP platform against the GCC family of compilers and assemblers. If you have assembly that is in NASM format, you can port it over using the guide at Porting NASM Assembly to GCC. For OMAP4 specifics, see Assembly Optimizations for OMAP4.

= Reference =

For assembly instruction references, refer to ARM's site http://infocenter.arm.com/help/index.jsp for the specific processor type in the OMAP you are using.

To figure out which instruction set you can use and thus if you can have NEON or some subset of parallel instructions, see this table:

(1) Some variation exists.

= Makefiles =

You'll need to make sure that your Makefile supports cross compiling against the ARM assemblers. See OMAP Platform Support Tools. When compiling or assembling the assembly files, be sure to set your $(CC) and $(AS).

CC=$(CROSS_COMPILE)gcc
AS=$(CROSS_COMPILE)as

= Assembly Files =

Assembly files have historically been named with a .S or .s extension. Use .S to be able to pass the file through the C preprocessor as well as the assembler.

Parameters are named r0-r3 here to show which registers carry which parameters. Parameters beyond the fourth are pushed onto the stack. If you can't avoid having more than four, you can pull the additional parameters off the stack into the r4-r11 registers in the prolog.

Comments
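Comments should be written either as /* comment */ or with the per-line comment character @. A # only starts a comment when it is the first character on a line, and in .S files it can collide with preprocessor directives.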

Calling C functions from Assembly
Calling C functions from assembly is largely an issue of setting up the parameters correctly and then branching to the function.

In your C files, define your function (no need to declare it unless other C functions call it):

int somefunc(int r0, int r1, int r2, int r3)
{
    // does something
    return 0;
}

In the assembly file:

.extern somefunc

And in the subroutine itself:

@ move parameters manually to r0, r1, r2, r3
bl somefunc
@ return code is in r0; r4-r11 are preserved by the callee

The bl instruction overwrites lr, so save lr in your prolog if your own routine has to return afterwards. If you need to add additional parameters to the stack, you must also remove them after the function call to keep the sp correct.

Calling Assembly Functions from C
First, declare your functions in a C header file so that the C/C++ code can find the prototype.

/** This is a simple function which just returns 0 */
int function(int r0, int r1, int r2, int r3);

Second, define the function symbol in the assembly file. Declaring it .global allows the linker to find it and resolve the reference from the C file.

.global function
function:
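When the caller is C++ rather than C, the prototype usually also needs C linkage so that the compiler does not mangle the name. The header below is a minimal sketch of that idea; the file name `function.h` and the include guard are hypothetical, and the C body at the bottom is only a stand-in so the sketch builds without the assembly file.

```c
/* function.h: hypothetical header name for the prototype above. */
#ifndef FUNCTION_H
#define FUNCTION_H

#ifdef __cplusplus
extern "C" {   /* keep the symbol unmangled so it matches ".global function" */
#endif

int function(int r0, int r1, int r2, int r3);

#ifdef __cplusplus
}
#endif

#endif /* FUNCTION_H */

/* Stand-in body so this sketch links without the assembly file.
 * In a real build, remove it and link the assembled object instead. */
int function(int r0, int r1, int r2, int r3)
{
    (void)r0; (void)r1; (void)r2; (void)r3;
    return 0;
}
```

A caller then simply includes the header and calls `function(1, 2, 3, 4)`; the linker resolves the symbol to whichever object file defines it.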

EABI Calling conventions
The EABI spec (http://en.wikipedia.org/wiki/Application_binary_interface#EABI) defines how functions are called, how stacks are used, which registers do what, etc. This allows assembly and C to link together successfully (even across different compilers which support EABI). The calling conventions can be found at http://en.wikipedia.org/wiki/Calling_convention#ARM. The EABI standard dictates that the ARM stack be "Full Descending", which means that stores decrement the stack pointer beforehand and loads increment it afterward. You can use the actual addressing modes "DB" and "IA", or just "FD", on the store and load instructions.

Prolog
The prolog typically saves the state of the registers r4 through r11 (you can save as many as you need, but those are the typical ones). The "!" on sp also writes the decremented address back to the stack pointer (sp).

stmdb sp!, {r4-r11} /* Push 8 "longs" on the stack and subtracts sp beforehand */

If there are additional parameters on the stack, you can reference them after the stmdb instruction, but you'll need to offset the sp by the appropriate amount. On entry, parameter 5 is at [sp], so every register pushed in the prolog moves it one "long" further up. This *assumes* that you use {r4-r11}, i.e. 8 registers.

ldr r4, [sp, #(4*8)] /* This loads parameter 5 which is 8 "longs" "up" on the stack now */
ldr r5, [sp, #(4*9)] /* This loads parameter 6 which is 9 "longs" "up" on the stack now */
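The offsets can be sanity-checked with a small C helper. This is a sketch under the assumption (taken from the ARM procedure call standard) that the fifth parameter sits at [sp] when the function is entered, so each register pushed in the prolog moves it one word further up. The name `stack_param_offset` is hypothetical.

```c
#include <assert.h>

/* Bytes per stack slot on 32-bit ARM. */
#define WORD_SIZE 4

/*
 * Byte offset from sp of stack-passed parameter `param_number` (5, 6, ...)
 * after the prolog has pushed `saved_regs` registers.
 * Assumption: parameter 5 is at [sp] on function entry.
 */
int stack_param_offset(int param_number, int saved_regs)
{
    assert(param_number >= 5 && saved_regs >= 0);
    return WORD_SIZE * (saved_regs + (param_number - 5));
}
```

For example, `stack_param_offset(5, 8)` gives the offset to use for parameter 5 after saving {r4-r11}, and passing 9 as the second argument covers the case where lr is saved as well.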

Epilog
The epilog restores the previous register set from the stack back to the registers and updates the sp value.

ldmia sp!, {r4-r11}

Return
The return places the return value into r0 and moves the lr (the return address) into the pc. This will cause the next instruction fetched to be the instruction after the call to the function.

mov r0, #0
mov pc, lr

Optimized Return
You can reduce your code size by also popping the LR from the stack back into the PC, which also acts as the "return" statement. Here I use the "FD" stack mode.

stmfd sp!, {r4-r11, lr}   @ stack save + return address
...
ldr r4, [sp, #(4*9)]      @ parameter 5 is at [sp] on entry; we're saving 9 ints now, so use 9 as the offset
...
ldmfd sp!, {r4-r11, pc}   @ stack restore + return

Register Renaming
With the Gas style assemblers, you can rename registers to aid in readability.

name .req register

Example:

pixels .req r0
width .req r1
height .req r2

mul pixels, width, height

Complete Listing
.global function
function:
    @ prolog
    stmdb sp!, {r4-r11}
    ...
    @ epilog
    ldmia sp!, {r4-r11}
    @ return value goes into r0, here it's zero
    mov r0, #0
    mov pc, lr

Defining Strings
The assembler allows you to define strings, including special characters such as \n, in this format:

.global final_message
final_message:
    .string "Sorry for the Inconvenience\n"

Use a label before the string in order to reference it.

Defining Constants
The GNU assembler takes constants in the form of .equ symbol, value

Such that you could do this (capitalization is optional):

.equ ANSWER_TO_LIFE_UNIVERSE_EVERYTHING, 42

Defining Data Arrays
When you need to define large static arrays of data (tables, precomputed values, multiple constants, etc.) you can use a data section to do this. This is not quite the same as the .data section (which can be static data or functions).

.global my_array
my_array:
    .long 127
    .long 28
    .long 94
    .long 23

This symbol can then be used to load these values into registers for use in calculations, etc.

ldr r4, =my_array
ldr r5, [r4, #0x0]
ldr r6, [r4, #0x4]
ldr r7, [r4, #0x8]
ldr r8, [r4, #0xC]
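Because the table is exported with .global, C code can reach the same data. The sketch below mirrors the byte-offset loads in C; the array defined here is only a stand-in with the same values (a real build would declare it `extern` and let the linker resolve it), and `load_at_byte_offset` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the assembly table; in a real build this would be
 * `extern const int32_t my_array[4];` resolved by the linker. */
const int32_t my_array[4] = {127, 28, 94, 23};

/* Load one 32-bit value at a byte offset from the table base,
 * mirroring `ldr rN, [r4, #offset]`. */
int32_t load_at_byte_offset(const int32_t *base, size_t byte_offset)
{
    int32_t value;
    memcpy(&value, (const unsigned char *)base + byte_offset, sizeof value);
    return value;
}
```

Each .long occupies 4 bytes, which is why the offsets step by 0x4 and why byte offset 0xC corresponds to index 3.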

Types
Each directive takes zero or more expressions, separated by commas.

.byte 247                 /* is 8 bit  */
.hword 2098               /* is 16 bit (on ARM, .word is 32 bit) */
.long 10238476            /* is 32 bit */
.quad 23487928374         /* is 64 bit */
.octa 928374928734982734  /* is 128 bit */
.float 3.141528           /* is 32 bit IEEE floating point */

.byte 0xEF, 0xBE, 0xAD, 0xDE /* Byte sequence 0xDEADBEEF in LITTLE ENDIAN */
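To see why that byte order yields 0xDEADBEEF, the following C sketch rebuilds the 32-bit value the way a little-endian load would, with the lowest address holding the least significant byte. The names `le_bytes_to_u32` and `deadbeef_bytes` are hypothetical and chosen for illustration only.

```c
#include <stdint.h>

/* The same four bytes as the .byte line, lowest address first. */
const uint8_t deadbeef_bytes[4] = {0xEF, 0xBE, 0xAD, 0xDE};

/* Assemble four bytes stored lowest-address-first into a 32-bit value,
 * the way a little-endian 32-bit load would see them. */
uint32_t le_bytes_to_u32(const uint8_t bytes[4])
{
    return (uint32_t)bytes[0]
         | ((uint32_t)bytes[1] << 8)
         | ((uint32_t)bytes[2] << 16)
         | ((uint32_t)bytes[3] << 24);
}
```

The shifts make the result independent of the host's own byte order, so the check behaves the same on a development PC and on the target.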

Defining Macros
The GNU assembler also allows macros which can be used to simplify some assembly routines.

.macro name operand [,operand,...]
    [instructions]
.endm

Here's an example that does a 4 value average:

@ avg = (a+b+c+d)/4;
.macro average avg,sum,a,b,c,d
    add \sum, \a, \b
    add \sum, \c, \sum
    add \sum, \d, \sum
    mov \avg, \sum, lsr #2
.endm
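A C model of the macro can help when checking results on a host machine. This sketch assumes unsigned 32-bit operands, which matches the logical shift (lsr) used for the divide; `average4` is a hypothetical name.

```c
#include <stdint.h>

/* C model of the macro: three adds, then a logical shift right by 2.
 * Like the macro, it truncates, and the 32-bit sum can wrap around. */
uint32_t average4(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t sum = a + b;   /* add \sum, \a, \b   */
    sum = c + sum;          /* add \sum, \c, \sum */
    sum = d + sum;          /* add \sum, \d, \sum */
    return sum >> 2;        /* mov \avg, \sum, lsr #2 */
}
```

Note that the shift rounds down, so the average of 1, 2, 3 and 4 comes out as 2 rather than 2.5.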

Odds 'n' Ends
You should begin your assembly file with .text and finish it with .end.

.text
...
.end

Enabling NEON
If you are assembling ARMv7 instructions with NEON, you must say so in the Makefile in the AFLAGS, as -march=armv7-a and -mfpu=neon. You can also state so in the assembly file as:

.arch armv7-a
.fpu neon

Register Usage
There's a good table reference for which registers are used for what in GCC (during inline assembly at least) at, under "Register Usage".

= Hardware Considerations =

When programming in NEON, there are several considerations as to how to craft the subroutine.

Stack Direction
This affects the prolog and epilog (stack push and pop). Know whether it should be a descending stack (EABI compliant) or some other variant:

stmdb sp!, {r4-r12, lr}   @ push + save return
... subroutine
ldmia sp!, {r4-r12, pc}   @ pop + return

Loops
The most efficient loops on an ARM are not the traditional "for" loops but decrementing branch loops.

i .req rX
limit .req rY

    mov i, limit
label:
    ...             @ code
    subs i, i, #1
    bgt label

For limit > 0, this behaves the same as this for loop:

for (i = limit; i > 0; i--) {}

This loop works by setting the status register which the "bgt" uses. The "subs" is effectively a "sub"+"cmp" in one instruction.
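One subtlety is that the assembly form tests the condition after the body, so it is really a do-while. The C sketch below counts iterations for both shapes so the difference can be checked on a host; both function names are hypothetical.

```c
/* Count iterations of the decrementing loop, written as a do-while to
 * mirror the assembly: the body runs first, then "subs" + "bgt". */
int count_down_iterations(int limit)
{
    int iterations = 0;
    int i = limit;
    do {
        iterations++;      /* loop body */
        i -= 1;            /* subs i, i, #1 */
    } while (i > 0);       /* bgt label */
    return iterations;
}

/* The for-loop form from the text, for comparison. */
int count_for_iterations(int limit)
{
    int iterations = 0;
    for (int i = limit; i > 0; i--) {
        iterations++;
    }
    return iterations;
}
```

The two agree for any positive limit, but with a limit of zero the do-while shape still runs the body once, so guard the loop if the count can be zero.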

Prefetch
Prefetch is probably the single most important part of NEON optimization. You *must* use the "pld" instruction to get any significant speed up. You must know how many prefetches can be issued before the prefetch queue is full or capped. The table above gives some ideas about this. Additionally, you should know the architecture's cache line size so that the data is fetched contiguously.

.equ L2_CACHE_LINE_SIZE, 64   @ on A15

pld [r0]
pld [r0, #L2_CACHE_LINE_SIZE]
... repeat until capped
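The same idea can be tried from C before committing to assembly. The sketch below uses `__builtin_prefetch`, a GCC/Clang builtin that is only a hint to the hardware; how it is lowered on a given ARM target is an assumption to verify in the generated code. The 64-byte line size and the name `sum_with_prefetch` are likewise assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumed line size; check your core's manual */

/* Sum a byte buffer, hinting the line after the current one.
 * __builtin_prefetch is a GCC/Clang builtin; it is only a hint. */
uint32_t sum_with_prefetch(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        /* Every CACHE_LINE_SIZE bytes, request the next chunk if in range. */
        if (i % CACHE_LINE_SIZE == 0 && i + CACHE_LINE_SIZE < len) {
            __builtin_prefetch(data + i + CACHE_LINE_SIZE);
        }
        sum += data[i];
    }
    return sum;
}
```

The result is the same with or without the hint; only the timing changes, so measure on the target before keeping it.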

Stalls
When using intermediate registers, the usage of these registers should be staggered or separated by some instructions if possible to prevent stalls being introduced in the pipeline. Instead of this:

vadd.u8 d0, d1, d2
vadd.u8 d3, d0, d4
vadd.u8 d5, d6, d7
vadd.u8 d8, d5, d9

Here d0 and d5 are used directly after being computed. It would be better to rearrange the operations to put some distance between the write and the following read of these registers to prevent pipeline stalls.

vadd.u8 d0, d1, d2
vadd.u8 d5, d6, d7
vadd.u8 d3, d0, d4
vadd.u8 d8, d5, d9

This is a simple trick which speeds up computation.

Write Combiner
One trick used to get better write performance is the "write combiner". This allows smaller writes of a few bytes to be aggregated into larger writes which are more efficient on the bus. The combined write is typically 128 _bits_ or 16 bytes (one Q register or two D registers). This means you may unroll your loops and do two or more computations per loop iteration to aggregate enough write-data to fill the combiner. There are some limitations and restrictions to using this, so read the manual.

= Reference =

DVP YUV NEON Color Convert Functions
 * Texas Instruments has a "Distributed Vision Processing" framework for OMAP4 which contains some NEON code for doing some color convert functions.
 * Texas Instruments also has a blitting API which contains some NEON acceleration: Bltsville