Assembly Optimizations for OMAP4

= Architecture =

1.) OMAP4 chips have a dual-core ARM Cortex-A9 processor as the main processing unit.
2.) Main memory is accessed through a dual-channel controller, which means it can support 2 transactions at once.
3.) The A9 L1 cache is XX KB (per core).
4.) The A9 L2 cache is 32 KB (shared between cores), with random eviction.
5.) 4-deep preload pipeline.

= Rules of Assembly Optimization =

1.) Learn the assembly language you are using.
2.) Know the hardware you are using.

= Optimization Strategies =

How to convert your C code to optimized assembly is the topic of many, many books. Here is a concise description of some methods and tricks to use when converting C code for OMAP4.

Data Parallelism
A 64-bit NEON D register holds 8 byte-sized elements, so a single NEON instruction normally processes 8 data units at once, though some instructions handle fewer. A plain C for loop that implements memset would look something like this:

    void memset(void *ptr, int data, size_t size)
    {
        /* size_t index avoids a signed/unsigned comparison with size */
        for (size_t i = 0; i < size; i++) {
            ((uint8_t *)ptr)[i] = (uint8_t)data;
        }
    }

The NEON equivalent operates on the data in units of 8 unsigned characters, which means that the size must be a non-zero multiple of 8.

    @ PROLOG and EPILOG are entry/exit macros defined elsewhere (not shown here)
        .global memset_neon
    memset_neon:
        ptr  .req r0
        data .req r1
        size .req r2
        PROLOG r0,r2
        vdup.8 d0, r1           @ Put the data value to set in d0 from r1 (truncated to 8 bits)
    memset_neon_loop:
        vst1.8 {d0}, [r0]!      @ Store 8 values at once, and advance the pointer by 8
        subs   r2, r2, #8       @ Reduce size by 8
        bgt    memset_neon_loop @ Loop if more bytes are left
        EPILOG r0,r2
        .unreq ptr
        .unreq data
        .unreq size

In this manner, 8 units are operated on in parallel.
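
Because the block routine only works for sizes that are a multiple of 8, callers with arbitrary sizes need some way to handle the leftover bytes. One possible approach, sketched here in portable C, is to split the request into a bulk part and a scalar tail. The names fill_blocks_of_8 and memset_any_size are hypothetical and not part of the original code; fill_blocks_of_8 is only a stand-in for memset_neon so that the sketch compiles on any platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for memset_neon (hypothetical name): writes 8 bytes
 * per iteration and, like the assembly version, requires size to be a
 * non-zero multiple of 8. */
static void fill_blocks_of_8(uint8_t *ptr, uint8_t value, size_t size)
{
    uint8_t block[8];

    assert(size != 0 && size % 8 == 0);
    memset(block, value, sizeof block);     /* same role as vdup.8 d0, r1 */
    do {
        memcpy(ptr, block, sizeof block);   /* same role as vst1.8 {d0}, [r0]! */
        ptr += 8;
        size -= 8;
    } while (size > 0);
}

/* Hypothetical wrapper: bulk part in blocks of 8, leftover bytes one at a time. */
void memset_any_size(void *ptr, int data, size_t size)
{
    uint8_t *p = (uint8_t *)ptr;
    size_t bulk = size & ~(size_t)7;        /* largest multiple of 8 <= size */
    size_t i;

    if (bulk != 0)
        fill_blocks_of_8(p, (uint8_t)data, bulk);
    for (i = bulk; i < size; i++)           /* at most 7 tail bytes */
        p[i] = (uint8_t)data;
}
```

On the target, the call to fill_blocks_of_8 would be replaced by a call to memset_neon; the tail loop never runs more than 7 times, so its cost is small for large buffers.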

Thread Parallelism
Since OMAP4 is dual core and has a dual-channel memory controller, you can write thread-parallel operations in which each thread runs NEON-optimized code, one thread per core, with each thread receiving full memory access speed. However, because the L2 cache is shared between the cores, there is less of a guarantee that data will stay in the cache for the same length of time.
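
As a rough illustration of the one-thread-per-core idea, the following C sketch splits a fill operation into two halves using POSIX threads. It is a minimal sketch under several assumptions: the names fill_job, fill_worker and memset_two_threads are hypothetical, the standard memset stands in for a NEON-optimized routine, and the threads are not pinned to cores, so the scheduler decides where each half actually runs.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Work description for one thread (hypothetical helper type). */
struct fill_job {
    uint8_t *ptr;
    int      data;
    size_t   size;
};

static void *fill_worker(void *arg)
{
    struct fill_job *job = arg;

    /* Stand-in for a NEON-optimized routine such as memset_neon. */
    memset(job->ptr, job->data, job->size);
    return NULL;
}

/* Fills the buffer with two threads. Returns 0 on success, or the
 * pthread error code if the second thread could not be created or joined. */
int memset_two_threads(void *ptr, int data, size_t size)
{
    uint8_t *p = ptr;
    size_t half = (size / 2) & ~(size_t)7;  /* keep the first half a multiple of 8 */
    struct fill_job first  = { p, data, half };
    struct fill_job second = { p + half, data, size - half };
    pthread_t thread;
    int err;

    err = pthread_create(&thread, NULL, fill_worker, &second);
    if (err != 0)
        return err;
    fill_worker(&first);                    /* calling thread handles the first half */
    return pthread_join(thread, NULL);
}
```

Thread creation has a fixed cost, so a split like this is only likely to pay off for large buffers; the break-even point would have to be measured on the device.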

Preloading
If you are going to operate on a large array, it helps to preload that data into the L2 cache using the "pld" instruction. The instruction is a hint: it starts fetching the data from main memory and populating the cache before the code actually needs it.
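
From C, a preload hint can be requested without writing assembly by using the GCC/Clang builtin __builtin_prefetch, which the compiler typically turns into a "pld" on ARM targets that support it. The sketch below sums a byte array while hinting data a fixed distance ahead. The function name sum_bytes_with_preload and the PREFETCH_AHEAD distance of 256 bytes are assumptions chosen for illustration; a good distance has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed preload distance in bytes; tune by measurement on the target. */
#define PREFETCH_AHEAD 256

/* Sums all bytes in data, issuing a preload hint once per 32 bytes
 * (the Cortex-A9 cache line size) for data PREFETCH_AHEAD bytes ahead. */
uint32_t sum_bytes_with_preload(const uint8_t *data, size_t size)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < size; i++) {
        /* Only hint addresses that are still inside the array. */
        if ((i & 31) == 0 && i + PREFETCH_AHEAD < size)
            __builtin_prefetch(data + i + PREFETCH_AHEAD);
        sum += data[i];
    }
    return sum;
}
```

Since the preload is only a hint, the result of the function is the same with or without it; only the timing changes.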

= References =

ARM Info Center - NEON Code Reference for the RVDS compiler. GCC's assembly syntax is very similar (though not all mnemonics are supported).

NEON Reference in PDF form through Google Docs: