L3 bus monitoring SW tool

How-to is readable, will only add 1 graph. Technical details need some cleaning/refactoring.

= Statistic collectors overview =

The L3 NOC Statistic collectors block is a HW IP that computes traffic statistics. It relies on HW probes located on EMIF (external memory) or on initiators (DSS, IVA, ISS, ...). It can be programmed in 2 different ways:
 * CCS+XDS560pod: within a user defined window, periodically reporting to the user through the DEBUGSS interface
 * SW tool: as 32-bit free-running accumulation counters that can be read from MPU on OMAP4470/5430

The SW tool leverages only the EMIF HW probes, so it cannot monitor traffic that does not reach EMIF, i.e. L3 traffic such as register reads/writes or direct traffic from one HW IP to another (MPU-ABE)... Fortunately, most of the traffic goes through external memory.

"How-to" describes the SW tool and its usage. Further technical details are available in the last sections.



= SW tool how-to =

WARNINGS

 * OMAP4/OMAP5 ES1.0 HW bug: prevents correct reading of the 32kHz synctimer => on OMAP5, use option -n ("nosleep" of domain) to get correct results. OMAP4 already has a work-around, OMAP5 ES1.0 will have one in the near future, and ES2.0 will no longer have the issue
 * Tool runs in userspace, so it can be pre-empted at any time
   * this is only relevant to -a 1 mode, where the capture delay can be as small as a few hundred us
   * timestamp and registers may not be read at the same time, leading to glitches in results. No info is lost, one capture is simply not accurate; be careful with non-sustained peaks -> post-processing mitigates that by merging this capture with the next capture
   * tool may sometimes not wake up regularly; delays up to 2ms were seen, rarely, on 2 consecutive points -> no info is lost here either, only accuracy
 * HW IP reset
   * if you wrongly set --overflow_delay or -o -t, 1 counter will saturate and an ERROR message is printed. Redo the capture
 * Potential HW bug
   * in rare cases, a register can take the values A - 2X, A - X, A, A - 1000 (instead of A + X), A + 2X. No info is lost in the end, but A - 1000 is obviously wrong
   * tool fixes the issue by forcing "A - 1000" to A
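The "A - 1000" fix can be pictured with the minimal Python sketch below. It is not the tool's code: the function name and the sample values are made up for illustration, and it assumes the samples all come from the same window between two HW IP resets, where an accumulating counter should never go backwards.

```python
def clamp_glitches(samples):
    """Return a copy of samples where any value lower than its predecessor
    is replaced by that predecessor (the "A - 1000" -> A fix)."""
    fixed = []
    for value in samples:
        if fixed and value < fixed[-1]:
            value = fixed[-1]  # glitch: an accumulating counter cannot decrease
        fixed.append(value)
    return fixed


# Hypothetical values with A = 5000 and X = 300: the 4th read is the bogus "A - 1000"
reads = [4400, 4700, 5000, 4000, 5600]
print(clamp_glitches(reads))
```

The glitched capture reports no traffic and the next capture reports the traffic of both periods, which is why no info is lost.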

Definition and filtering of counter
The HW IP has 8 configurable counters. Each counter can track EMIF traffic:
 * on EMIF1 or EMIF2
 * for all DMM traffic, for DMM traffic of 1 specific initiator or for non DMM MPU traffic (MA_MPU, see above figure)
 * for Read transactions, Write transactions or Read+Write transactions
 * the last 2 counters can only track Read+Write transactions of all DMM traffic or non DMM MPU traffic on EMIF1 or EMIF2
 * monitoring 1 initiator with full granularity requires 4 counters: W EMIF1, W EMIF2, R EMIF1, R EMIF2
 * by default the SW tool uses 4 counters. If you configure counter N, the tool will capture counters 0 to N (but never fewer than 4 counters)
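The "counters 0 to N, but not less than 4" rule can be written down as a small Python sketch. The function name is hypothetical and this only restates the rule above, it is not taken from the tool's source.

```python
def captured_counters(configured):
    """Given the counter indexes explicitly configured by the user,
    return the indexes the tool would capture under the stated rule."""
    highest = max(configured, default=-1)
    # Counters 0..highest are captured, with a minimum of 4 counters (0..3)
    return list(range(max(4, highest + 1)))


print(captured_counters([0, 5]))  # configuring counters 0 and 5
print(captured_counters([]))      # nothing configured: default 4 counters
```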

Default mode = accumulation mode 2

 * omapconf trace bw -h -> look at the list of initiators at the beginning of the help (ma_mpu, alldmm, dss, iva, gpu_p1, ...) and at the -n option for OMAP5
 * omapconf trace bw -> monitors all "DMM" EMIF traffic, i.e. all EMIF traffic except MPU (MA_MPU) direct paths to EMIF

Counter: 0 Master: alldmm  Transaction: w Probe: emif1
Counter: 1 Master: alldmm  Transaction: w Probe: emif2
Counter: 2 Master: alldmm  Transaction: r Probe: emif1
Counter: 3 Master: alldmm  Transaction: r Probe: emif2
time: 823237498 823204726 32772 -> 0.00 0.00 141.92 141.92
time: 823270313 823237498 32815 -> 0.00 0.00 141.90 141.90
...


 * Format: "time: <timestamp> <previous timestamp> <difference> -> Write_EMIF1 Write_EMIF2 Read_EMIF1 Read_EMIF2", throughput in MB/s
 * Example here is "homescreen-no scroll" on Tablet2, so only DSS contributes. The 1280x800 framebuffer goes to 1 DSS pipeline: (141.92 + 141.92) MB/s / (1280 x 800 pixels x 4 bytes for RGB32) = 69.3Hz
 * HW IP is reset every second to avoid register overflow (this is extremely fast, less than 1 us so no real info is lost)
 * Read_EMIF1 = Read_EMIF2 here as traffic is well interleaved over EMIF1 and EMIF2
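The refresh-rate arithmetic of the homescreen example can be reproduced with a short Python sketch. The function name is hypothetical; note that the 69.3Hz figure is only obtained when 1 MB is taken as 10^6 bytes.

```python
def refresh_rate_hz(throughputs_mb_s, width, height, bytes_per_pixel):
    """Estimate how many full frames per second the measured traffic represents."""
    total_bytes_per_s = sum(throughputs_mb_s) * 1e6  # 1 MB taken as 10^6 bytes
    return total_bytes_per_s / (width * height * bytes_per_pixel)


# Write_EMIF1, Write_EMIF2, Read_EMIF1, Read_EMIF2 from the capture above
rate = refresh_rate_hz([0.00, 0.00, 141.92, 141.92], 1280, 800, 4)
print(f"{rate:.1f} Hz")
```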

Basic method

 * omapconf trace bw -m dss -> restricts all counters to the DSS master/initiator and uses 4 counters to track Write_EMIF1, Write_EMIF2, Read_EMIF1 and Read_EMIF2 throughput in MB/s
 * omapconf trace bw -m ma_mpu -> MPU "non DMM" traffic. See other sections for an explanation of the MPU direct memory path to EMIF; almost all MPU traffic goes there
 * omapconf trace bw -m gpu_p1 (gpu_p2) -> SGX1 P1 (SGX P2). Some HW IPs have 2 memory ports. Because the last 2 counters can't filter, we must capture SGX in 2 passes or monitor R+W transactions. Note that interleaving on EMIF1 and EMIF2 is generally OK, so after a preliminary check you may approximate the throughput as 2 * P1 instead of P1 + P2

Flexible method

 * omapconf trace bw --m0 dss --m5 dss -> counters 0 and 5 will filter DSS traffic (this also forces counters 0, 1, 2, 3, 4, 5 to be captured)

Filter transaction

 * omapconf trace bw --tr r+w (or --tr3 r+w) -> all counters (respectively counter 3) will monitor R+W transactions

Filter EMIF (probe)

 * omapconf trace bw -p emif1 (or --p7 emif2) -> all counters will monitor EMIF1 (respectively, counter 7 will monitor EMIF2)

Tune delay

 * omapconf trace bw -d 2000 -> capture every 2 seconds


 * omapconf trace bw -d 100 -> every 100ms


 * omapconf trace bw -d 0.3 -> every 300us. Do not try it in the default mode: the tracing cost makes it too intrusive, so you have to choose another accumulation mode (see accumulation mode 1)

Examples
 * alldmm and ma_mpu, full granularity: 4 counters capture alldmm R and W on EMIF1/EMIF2, 4 counters capture ma_mpu R on EMIF1/EMIF2 and R+W on EMIF1/EMIF2 (so you can compute W)

omapconf trace bw --m4 ma_mpu --m5 ma_mpu --m6 ma_mpu --m7 ma_mpu

 * R+W, alldmm, ma_mpu and other initiators

omapconf trace bw --tr r+w --m0 dss --m1 dss --m2 gpu_p1 --m3 gpu_p1 --m4 ma_mpu --m5 ma_mpu --m6 alldmm --m7 alldmm

 * we know traffic is well balanced so we track only 1 emif (emif1 or 2), except for GC320:

omapconf trace bw --tr r+w --m0 dss --m1 gpu_p1 --m2 gpu_p2 --m3 iva --m4 bb2d_p1 --m5 bb2d_p1 --m6 ma_mpu --m7 alldmm

 * full flexibility

omapconf trace bw --m0 dss --tr0 r --p0 emif2 --m1 gpu_p1 --tr1 w --p1 emif1 --m2 gpu_p2 --tr2 r+w --p2 emif2 --m3 iva --tr3 w --p3 emif1

Methodology
For a fairly regular (repeatable) use case, the suggested approach is:
 * omapconf trace bw -> all traffic (except MPU)
 * for each initiator, omapconf trace bw -m <initiator>, until the sum of the throughputs reaches the "all" throughput
 * do not forget -m ma_mpu, which comes on top of the "all" traffic

Example: video playback ...
 * omapconf trace bw, Ctrl-C
 * omapconf trace bw -m dss, Ctrl-C
 * omapconf trace bw -m iva, Ctrl-C
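To compare such successive captures, the "time:" lines of each log can be averaged. Below is a minimal Python sketch; it assumes the default-mode line format shown earlier in this how-to, and the function names are hypothetical (this is not the BWstats_ccsv5.py script).

```python
import re

# "time: <3 integers> -> <one throughput per captured counter, in MB/s>"
LINE_RE = re.compile(r"time:\s+(\d+)\s+(\d+)\s+(\d+)\s+->\s+(.*)")


def parse_capture(text):
    """Return one list of per-counter throughputs (MB/s) per "time:" line."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows.append([float(v) for v in match.group(4).split()])
    return rows


def average_total(text):
    """Average, over all captures, of the sum of all counters (MB/s)."""
    rows = parse_capture(text)
    if not rows:
        raise ValueError("no 'time:' line found")
    return sum(sum(row) for row in rows) / len(rows)


sample = """time: 823237498 823204726 32772 -> 0.00 0.00 141.92 141.92
time: 823270313 823237498 32815 -> 0.00 0.00 141.90 141.90"""
print(round(average_total(sample), 2))
```

Running it on the "all" log and on each -m log shows how much of the "all" throughput the per-initiator captures already explain.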

Accumulation mode 1
At a very high capture rate, dumping immediately is too intrusive. The tool can instead store register values + timestamp in an array and dump the results at the end of the test. This is "accumulation mode 1", which always requires setting the number of iterations. You may also tune the HW IP auto-reset delay/threshold.

Tune number of iterations
 * omapconf trace bw -a 1 -d 0.3 -i 100000

After 100000 iterations, i.e. 100000 * 0.3ms = 30s, the trace displays:

0,0,0,S,,SDRAM,,0,EMIF 0:Wr:All Initiators,T,V,77,,,,0, -> Write_EMIF1, timestamp 77 ticks 32kHz
0,0,0,S,,SDRAM,,0,EMIF 1:Wr:All Initiators,T,V,77,,,,0, -> Write_EMIF2
0,0,0,S,,SDRAM,,52992,EMIF 0:Rd:All Initiators,T,V,77,,,,0, -> Read_EMIF1
0,0,0,S,,SDRAM,,53120,EMIF 1:Rd:All Initiators,T,V,77,,,,0, -> Read_EMIF2
0,0,0,S,,SDRAM,,0,EMIF 0:Wr:All Initiators,T,V,89,,,,0, -> timestamp 89, i.e. 12 ticks = 366us later
0,0,0,S,,SDRAM,,0,EMIF 1:Wr:All Initiators,T,V,89,,,,0,
0,0,0,S,,SDRAM,,59008,EMIF 0:Rd:All Initiators,T,V,89,,,,0,
0,0,0,S,,SDRAM,,59008,EMIF 1:Rd:All Initiators,T,V,89,,,,0,
...

The max number of iterations is 1000000. You can interrupt the tool with Ctrl-C; it will then dump the current content of the array. So the "iterations" option could be removed in the future: you can simply always pass 1000000 and use Ctrl-C.

The format is compatible with the CCS output format (for which a custom post-processing script was written).
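For quick checks without the post-processing script, the mode 1 lines can be split on commas. The Python sketch below is only an illustration: the field positions (value in field 7, counter name in field 8, 32kHz timestamp in field 11) are inferred from the sample trace above, not from a format specification, and the function name is hypothetical.

```python
TICK_HZ = 32768  # 32kHz timer; consistent with "12 ticks = 366us" above


def parse_mode1_line(line):
    """Extract counter name, value and timestamp from one trace line."""
    fields = line.split(",")
    return {
        "counter": fields[8],      # e.g. "EMIF 0:Rd:All Initiators"
        "value": int(fields[7]),   # counter value as dumped by the tool
        "ticks": int(fields[11]),  # 32kHz timestamp
    }


sample = [
    "0,0,0,S,,SDRAM,,52992,EMIF 0:Rd:All Initiators,T,V,77,,,,0,",
    "0,0,0,S,,SDRAM,,59008,EMIF 0:Rd:All Initiators,T,V,89,,,,0,",
]
first, second = (parse_mode1_line(line) for line in sample)
elapsed_us = (second["ticks"] - first["ticks"]) / TICK_HZ * 1e6
print(first["counter"], second["value"], round(elapsed_us))
```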

Tune HW IP reset threshold
HW counters keep accumulating and saturate at 2^32 without wrapping back to 0. The only SW solution is to stop/start the IP. To limit intrusiveness, the SW tool keeps this deliberately simple: the user must tune when the reset happens.

By default, the HW is reset every second. A reset takes less than 31us, i.e. the error is about 0.003%

--overflow_delay method
This option simply changes the delay between auto-resets of the HW IP

counter threshold -o -t method

 * Example: omapconf trace bw -o 3 -t 4000000000 -o 2 -t 1000000000
 * user chooses which counters to monitor (2 at most)
 * user chooses the threshold that triggers the reset. For example, you know Read_EMIF2 is 400MB/s max. If capturing every 2s, set the threshold to 2^32 - 400MB * 2 - "good margin"

Either you know the throughput of your use case well, or you first run 'omapconf trace bw' to find the biggest contributor and its threshold
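The threshold formula above can be evaluated with a few lines of Python. This is a sketch only: the function name is hypothetical, the 100 MB margin is an arbitrary value chosen for illustration, and 1 MB is taken as 10^6 bytes.

```python
COUNTER_MAX = 2**32  # 32-bit counters saturate here


def reset_threshold(max_mb_per_s, capture_period_s, margin_bytes):
    """2^32 - (worst-case bytes accumulated during one capture period) - margin."""
    worst_case = int(max_mb_per_s * 1e6 * capture_period_s)
    return COUNTER_MAX - worst_case - margin_bytes


# Read_EMIF2 at 400MB/s max, capturing every 2s, with an arbitrary 100 MB margin
threshold = reset_threshold(400, 2, 100_000_000)
print(threshold)
```

The result is the value to pass to -t for the counter selected with -o.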

Post-processing/Visualization

 * sudo apt-get install python-matplotlib
 * git clone git://gitorious.tif.ti.com/omap-video-perf/runperf.git, in instrumentation/bandwidth, look for BWstats_ccsv5.py

The example below is DSS bandwidth during Wifi Display. "Write" corresponds to Writeback pipeline activity. The low peaks at 80MB/s correspond to the end of a frame, i.e. VSYNC, so you can derive the display refresh rate from them.



= HW IP technical details =

Statistics collectors are used for target load and master latency monitoring (SDRAM and LAT0/LAT1 collectors). The SW tool currently exposes only the SDRAM collector.

target load / EMIF probes
These probes are attached to EMIF. MPU transactions are not routed through the L3 NOC due to latency constraints, so separate probes monitor the two kinds of memory paths:
 * 1 probe monitors DMM traffic to/from EMIF1 and 1 to/from EMIF2 (emif1/2_probe in above figure)
 * 1 probe monitors MPU direct memory access to/from EMIF1 and 1 to/from EMIF2 (ma_mpu1/2_probe, ma=memory adaptor)

"DMM" probes can filter per initiator and per types of events. SW tool only handles "payload" event to get BW throughput.

master latency / L3 master probe
These probes are attached to the most critical L3 masters and are the only probes that can compute traffic latency. They otherwise have the same features as the probes above, such as payload monitoring. They are not leveraged by the SW tool.

= SW tool principles =

"EMIF" probes counters
Traffic events monitored by "EMIF" probes (target load) are accumulated into counters. The counters offer the following granularity:
 * monitor all initiators or only 1 initiator
 * monitor Write, Read or Write+Read traffic
 * monitor various events

There are 8 counters, but only 6 can filter. The SW tool currently monitors Read and Write traffic separately on EMIF1 and EMIF2, with 4 counters

Counters capture principle/MPU intrusiveness
Tool configures 4 counters to monitor Write EMIF1 / Write EMIF2 / Read EMIF1 / Read EMIF2. Counters will all monitor "all DMM" traffic, traffic from only 1 initiator or MA_MPU traffic.

Main loop will do:
 * read all counters and timestamp with 32kHz -> dump immediately on terminal or store in an array to dump at end of test
 * sleep X ms or us

Besides being readable by most HW IPs, the 32kHz timestamp costs only 1 register read, rather than a system call plus a register read

The SW tool reads various registers from userspace, which is intrusive for MPU processing. Pre-emption can occur, so there is no guarantee that timestamp and counters are read at the same time
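How two consecutive reads of the main loop turn into a throughput can be sketched in Python as follows. This is not the tool's implementation: the function name is hypothetical, and it assumes the counters accumulate payload bytes, that no HW IP reset happened between the two reads, and that 1 MB is 10^6 bytes.

```python
TICK_HZ = 32768  # 32kHz timer used for timestamps


def throughput_mb_s(prev, curr):
    """prev and curr are (ticks, [counter values]) tuples from two consecutive reads.
    Returns one throughput in MB/s per counter."""
    prev_ticks, prev_counters = prev
    curr_ticks, curr_counters = curr
    elapsed_s = (curr_ticks - prev_ticks) / TICK_HZ
    return [(c - p) / elapsed_s / 1e6 for p, c in zip(prev_counters, curr_counters)]


# Hypothetical reads taken exactly 1s apart (32768 ticks)
before = (1000, [0, 0, 0, 0])
after = (1000 + 32768, [0, 0, 141_920_000, 141_920_000])
print(throughput_mb_s(before, after))
```

If the timestamp and the counters are not read at the same instant, elapsed_s no longer matches the counter deltas, which is the source of the glitches mentioned in the warnings.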

= Choose between CCS and SW tool =
 * CCS:
   * accuracy below 300us
   * no MPU intrusiveness
   * full flexibility of configuration of each HW counter
   * no need to reset HW IP to prevent saturation
   * official maintenance
   * OMAP4430/4460 support
 * SW tool:
   * no set-up, no XDS560 pod, no removal of resistors on the board
   * control the tool and fetch the log directly from the platform, easy automation
   * 32kHz timestamping, easy correlation to clock_gettime(CLOCK_MONOTONIC) and gettimeofday
   * unlimited capture time for accumulation mode 2, several minutes for accumulation mode 1