Distributed Vision Processing

= Glossary =
 * IPC - Inter-processor Communication - the generic concept of run-time communication between two or more hardware cores.
 * DVP - Distributed Vision Processing
 * Machine or Computer Vision Kernel - a function or method that operates on input data to produce output data which is specific to Machine or Computer Vision.
 * Kernel Enum - The enumerated value which is used to identify a particular machine vision kernel.
 * Node - a Kernel Enum plus its associated parameters, contained in a single data structure.
 * Section - a linear series of one or more Nodes.
 * Order - a set of one or more Sections which are intended to execute in parallel.
 * Graph - a set of Orders which execute in series.
 * Manager - a shared object which contains either the logic to access remote cores which execute kernels or the local kernels themselves. Each Manager holds a table listing its supported kernels and the relative load values used to calculate the Estimated Load for each kernel it manages. Multiple Managers may support the same kernel. Prioritization is discussed later.
 * Boss - the top level manager which maintains knowledge of the entire DVP system and decides which kernels run on which Manager. The structure of the Graph dictates the order of execution. The Boss executes on the ARM core (A9 on OMAP4).
 * HLOS - High Level Operating System - Windows, Linux, QNX, etc.
 * SOSAL - Simple Operating System Abstraction Layer. This layer is used to abstract the various supported HLOSes.
 * SIMCOP - the Image Accelerators in the OMAP4/5 ISP.
 * Tesla - The mini C64T DSP in OMAP4/5.
 * Binder - the IPC mechanism for CPU process to process calls and Java App to Java App in Android.
 * JNI - Java Native Interface - the mechanism in which Java code can call C/C++ code.
 * Syslink - the IPC mechanism for OMAP 4/5/6 core to core calls.
 * Daemon - a "privileged" program which runs some service or feature for the device.

= Design =

Rationale
The design rationale of DVP is to create a systematic way to process machine vision kernels across multiple cores in a heterogeneous computing system like the OMAP4430, leveraging specialized hardware which can greatly accelerate specific machine vision kernels. DVP is a generic framework of kernels, but is not a generic computation language like OpenCL. Each kernel is precompiled for its desired core and is accessible as a Node in the DVP Graph.

Why not OpenMAX?
OpenMAX is not the right interface for vision. DVP needs more capabilities and lower overhead than OMX can provide (as little as one IPC per Graph, rather than one per message). OMX is specific to media codecs, not vision kernels, which are conceptually closer to single function calls.

Manager Prioritization
DVP internally prioritizes some hardware blocks over others because of performance advantages inherent in their hardware designs. If multiple Managers support the same kernel, DVP determines internally whose kernel is called. DVP uses the Load Balancing information as a second level of decision making.

Currently the prioritization for OMAP4 is: simcop > dsp > cpu

This means that if a Kernel "A" is implemented on all Managers, DVP prefers to execute it on the highest priority Manager, working its way down the priority list only when the Core that the Manager works on has exhausted its resources.
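The selection rule described above can be sketched in C. This is a minimal illustration, not the Boss's actual code: the `sketch_manager_t` record, its fields, and the notion of a fixed `capacity` per core are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical Manager record; names and fields are illustrative only. */
typedef struct {
    const char *name;      /* e.g. "simcop", "dsp", "cpu" */
    unsigned    priority;  /* lower value = preferred Manager */
    unsigned    load;      /* current estimated load on the core */
    unsigned    capacity;  /* assumed maximum load the core can accept */
    int         supports;  /* nonzero if the Manager implements the kernel */
} sketch_manager_t;

/*
 * Returns the index of the highest priority Manager that supports the
 * kernel and still has room for kernel_load, or -1 if none qualifies.
 */
int sketch_pick_manager(const sketch_manager_t *m, size_t count,
                        unsigned kernel_load)
{
    int best = -1;
    assert(m != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (!m[i].supports)
            continue;                       /* kernel not exported here */
        if (m[i].load + kernel_load > m[i].capacity)
            continue;                       /* core is exhausted */
        if (best < 0 || m[i].priority < m[best].priority)
            best = (int)i;
    }
    return best;
}
```

With three Managers ordered simcop, dsp, cpu, a kernel falls through to the dsp only when the simcop entry lacks headroom, and to the cpu only when both of the others are unavailable.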

Multiple DVP Instances
DVP can execute multiple Graphs in parallel within the same process or across multiple processes. The Estimated Load Table is in a semaphore protected piece of shared memory so that multiple processes can utilize the DVP system at once. Each process gets its own instance of the Managers, but there is only one Estimated Load Table.

Graphs
Machine Vision Kernels can be called in "bulk" by formatting a Graph which indicates the exact set of kernels to call, in what order, and their associated parameters.

In this example, we have Kernels A,B,C,D,E,F,G,H which have Nodes a,b,c,d,e,f,g,h respectively. In an unoptimized example, Nodes a through h can be called in series (in a single Section). The Graph is then:

 a -> b -> c -> d -> e -> f -> g -> h

In an optimized version the programmer may have discovered that some Nodes do not depend on previous Nodes and can be reordered and run in parallel to gain performance. In this example, 'b', 'd' and 'e' depend on 'a', 'c' on 'b', 'f' on 'e', 'g' on 'c' and 'f', and 'h' on 'g'. There are now five Sections, 'a', 'bc', 'd', 'ef' and 'gh', and three Orders: 'a' has order 0; 'bc', 'd' and 'ef' have order 1; and 'gh' has order 2.

         |-> b -> c -> |
        /|             |\
       a |-> d         | g -> h
        \|             |/
         |-> e -> f -> |
Order: 0    1           2

In this example b, d and e depend on a but not on each other. This Graph will execute 'a' first and wait until completion, then will concurrently launch 'bc', 'd' and 'ef' (potentially in parallel on an SMP system). When those have completed, it will launch 'gh' and wait for completion.
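The example Graph can be written down as data. The sketch below uses hypothetical types (`sketch_node_t`, `sketch_section_t`, `sketch_graph_t`), not the real DVP structures, and it runs the Sections of one Order one after another where DVP would launch them concurrently. It only records the sequence in which Nodes would be issued.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the DVP API. */
typedef struct {
    char name;                      /* stands in for a Kernel Enum + parameters */
} sketch_node_t;

typedef struct {
    const sketch_node_t *nodes;     /* linear series of Nodes */
    size_t               num_nodes;
    unsigned             order;     /* Sections sharing an order may run in parallel */
} sketch_section_t;

typedef struct {
    const sketch_section_t *sections;
    size_t                  num_sections;
} sketch_graph_t;

/* Records Node names in issue order; returns the number of Nodes recorded. */
size_t sketch_graph_execute(const sketch_graph_t *g, char *trace, size_t trace_len)
{
    size_t n = 0;
    unsigned max_order = 0;
    assert(g != NULL && trace != NULL);

    for (size_t s = 0; s < g->num_sections; s++)
        if (g->sections[s].order > max_order)
            max_order = g->sections[s].order;

    /* Orders execute in series; each one completes before the next starts. */
    for (unsigned order = 0; order <= max_order; order++) {
        for (size_t s = 0; s < g->num_sections; s++) {
            const sketch_section_t *sec = &g->sections[s];
            if (sec->order != order)
                continue;
            for (size_t i = 0; i < sec->num_nodes && n + 1 < trace_len; i++)
                trace[n++] = sec->nodes[i].name;
        }
    }
    if (trace_len > 0)
        trace[n] = '\0';
    return n;
}

/* Builds the five Sections 'a', 'bc', 'd', 'ef', 'gh' from the text. */
size_t sketch_run_example(char *trace, size_t trace_len)
{
    static const sketch_node_t a[]  = { {'a'} };
    static const sketch_node_t bc[] = { {'b'}, {'c'} };
    static const sketch_node_t d[]  = { {'d'} };
    static const sketch_node_t ef[] = { {'e'}, {'f'} };
    static const sketch_node_t gh[] = { {'g'}, {'h'} };
    static const sketch_section_t sections[] = {
        { a, 1, 0 }, { bc, 2, 1 }, { d, 1, 1 }, { ef, 2, 1 }, { gh, 2, 2 }
    };
    const sketch_graph_t g = { sections, 5 };
    return sketch_graph_execute(&g, trace, trace_len);
}
```

Because the executor follows only the order values the programmer supplied, a Section given the wrong order would run at the wrong time without complaint.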

Allocating Memory
DVP supports multiple memory types for Images and Buffers.
 * Plain Virtual Memory
 * 1D TILER Cached Memory
 * 1D TILER Uncached Memory
 * 2D TILER Uncached Memory
 * Display Buffers (2D TILER Uncached) Memory

On systems where the TILER API is exposed to DVP directly it will use it to allocate 1D/2D memory. On Host systems, only plain virtual memory is supported.

Section Completion Callbacks
After each section completes execution a callback is issued to the client to notify them of completion. This callback has several features.
 * The Callback is internally semaphore protected so that the client does not need to implement a protection mechanism.
 * The Callback is intended to allow the client to do any necessary operations between sections of a DVP Graph. These could be:
   * Private Algorithms performed on the output of the previous section
   * Logical operations to determine if future sections need to run. If not, future sections can be marked as "skip" so they will not be executed. The Client would be responsible for clearing the "skip" flag before the next Graph execution.
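The callback and "skip" behaviour might look like the following sketch. The types and function names are hypothetical, the kernel execution is a placeholder, and the semaphore that DVP holds around the callback is omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical Section record; not the DVP structure. */
typedef struct {
    int  id;
    bool skip;       /* set by the client to suppress execution */
    bool executed;   /* set by the executor when the Section has run */
} sketch_cb_section_t;

/* Client callback, invoked after Section `completed` finishes. */
typedef void (*sketch_section_done_fn)(sketch_cb_section_t *sections,
                                       size_t count, size_t completed,
                                       void *cookie);

/* Runs every Section that is not marked "skip"; returns how many ran. */
size_t sketch_run_sections(sketch_cb_section_t *sections, size_t count,
                           sketch_section_done_fn done, void *cookie)
{
    size_t ran = 0;
    assert(sections != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (sections[i].skip)
            continue;
        sections[i].executed = true;   /* placeholder for kernel execution */
        ran++;
        if (done != NULL)
            done(sections, count, i, cookie);
    }
    return ran;
}

/* Example client logic: if the first Section found nothing, skip the rest. */
void sketch_skip_rest_if_empty(sketch_cb_section_t *sections, size_t count,
                               size_t completed, void *cookie)
{
    const int *features_found = cookie;
    if (completed == 0 && *features_found == 0)
        for (size_t i = 1; i < count; i++)
            sections[i].skip = true;
}

/* The client clears the flags itself before the next Graph execution. */
void sketch_clear_skips(sketch_cb_section_t *sections, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        sections[i].skip = false;
        sections[i].executed = false;
    }
}
```

Here `cookie` points at an integer the client owns; a real client would more likely pass a pointer to its own state structure.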

Latency
One of the biggest challenges of using heterogeneous multi-cores is the latency involved in IPC. This is minimized by offloading as many tasks as can be sent at once in a single transmission. In the context of DVP, this means sending as many kernels as possible to a remote core at once. In some cases, such as SIMCOP, this may not be possible because the number of supported kernels is low. The DSP, however, can process many different types of kernels and is a good candidate for offloading a wide range of tasks until it is fully utilized. The effect of this optimization is to coalesce as many operations for the same core into a single Section as possible. Sections are analyzed to see how many Nodes ahead of the current Node can be sent together to the appropriate remote core. In this manner, entire Sections can be offloaded to the remote cores, greatly reducing the local load and minimizing per-Node latency.
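The look-ahead can be illustrated with a small sketch. It assumes each Node in a Section has already been assigned a target core (the `sketch_core_t` enum is hypothetical) and counts how many consecutive Nodes could travel in one transmission.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical core identifiers for illustration. */
typedef enum { SKETCH_CORE_CPU, SKETCH_CORE_DSP, SKETCH_CORE_SIMCOP } sketch_core_t;

/*
 * Returns how many consecutive Nodes, beginning at `start`, target the
 * same core and could therefore be sent together.
 */
size_t sketch_batch_length(const sketch_core_t *targets, size_t count, size_t start)
{
    size_t len = 0;
    assert(targets != NULL && start < count);
    while (start + len < count && targets[start + len] == targets[start])
        len++;
    return len;
}

/* Counts the dispatches needed for a Section when batches are coalesced. */
size_t sketch_count_dispatches(const sketch_core_t *targets, size_t count)
{
    size_t dispatches = 0;
    for (size_t i = 0; i < count; i += sketch_batch_length(targets, count, i))
        dispatches++;
    return dispatches;
}
```

For a Section whose Nodes target DSP, DSP, DSP, CPU, SIMCOP, SIMCOP, this yields three dispatches instead of six, so the IPC cost is paid once per batch rather than once per Node.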

Local Optimization
Each Manager can locally optimize Graph performance, beyond what the Boss may understand. For example, the CPU Manager may have some specialized assembly routines to do an optimized version of a kernel if the right conditions are met (specific parameters, combining kernels with subsequent kernels, etc). Each Manager can and must make these determinations internally. These optimized kernels should only be used if the overhead of checking for and running the optimization is greatly outweighed by the MHz savings. Programmers of custom Managers should carefully weigh optimization checks.

Dependencies
DVP on all OMAP4 platforms (Android/QNX/etc.) depends on the Syslink driver (info at Syslink Project) and TILER memory allocator.

ICS and JB
Android ICS release changes the IPC mechanism to the Ducati and Tesla cores to use an interface called RPMSG which itself is built upon the VirtIO framework for kernel level virtualization. Underneath all the layers is still the Mailbox HW driver.

ICS also adds the ION memory manager, which implements a unified method of allocating 1D/2D buffers in the system.

Extending DVP
It is relatively straightforward to extend DVP to provide private implementations of some kernels. The Kernel Enum list has a definition for a "user" enum base which can be used to create custom kernel enums. These must simply not conflict with existing enums in the system. The Boss will scan all Managers for exported kernel enums and will execute the kernels on those Managers given the prioritization of the Managers. If no Manager supports a kernel except the new extended Manager, then prioritization is not an issue. Prioritization is only considered when two or more Managers contain a kernel enum.

Compiling
Each new DVP Manager simply needs to implement the existing DVP Manager API (or duplicate the DVP CPU Manager code and replace the switch statement with your own enums and code).
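A custom Manager built around a switch statement could be shaped roughly as follows. Everything named here is hypothetical: the value of `SKETCH_KERNEL_USER_BASE`, the two example kernels, and the node layout are assumptions for illustration, and the real DVP Manager API has its own entry points and structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "user" enum base; DVP defines the real one. */
enum {
    SKETCH_KERNEL_USER_BASE    = 0x1000,
    SKETCH_KERNEL_MY_INVERT    = SKETCH_KERNEL_USER_BASE + 1,
    SKETCH_KERNEL_MY_THRESHOLD
};

/* Hypothetical Node: a kernel enum plus its parameters. */
typedef struct {
    int      kernel;
    uint8_t *data;    /* image data, processed in place */
    size_t   size;    /* number of bytes in data */
    uint8_t  param;   /* threshold level, unused by the invert kernel */
} sketch_my_node_t;

/* The kernel table a Boss would scan to learn what this Manager exports. */
static const int sketch_my_kernels[] = {
    SKETCH_KERNEL_MY_INVERT, SKETCH_KERNEL_MY_THRESHOLD
};

int sketch_my_manager_supports(int kernel)
{
    for (size_t i = 0; i < sizeof sketch_my_kernels / sizeof sketch_my_kernels[0]; i++)
        if (sketch_my_kernels[i] == kernel)
            return 1;
    return 0;
}

/* Executes Nodes in order; stops at the first unsupported kernel. */
size_t sketch_my_manager_execute(sketch_my_node_t *nodes, size_t count)
{
    size_t done = 0;
    assert(nodes != NULL || count == 0);
    for (size_t n = 0; n < count; n++) {
        sketch_my_node_t *node = &nodes[n];
        switch (node->kernel) {
        case SKETCH_KERNEL_MY_INVERT:
            for (size_t i = 0; i < node->size; i++)
                node->data[i] = (uint8_t)(255 - node->data[i]);
            break;
        case SKETCH_KERNEL_MY_THRESHOLD:
            for (size_t i = 0; i < node->size; i++)
                node->data[i] = (node->data[i] >= node->param) ? 255 : 0;
            break;
        default:
            return done;   /* not one of ours */
        }
        done++;
    }
    return done;
}
```

Returning the count of processed Nodes lets the caller see where execution stopped if a Node with a foreign enum is handed to this Manager.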

Loading
On HLOS platforms with scandir and fnmatch implemented, the DVP Boss will dynamically load any shared object with the appropriate name (" /dvp_kgm_XXXX.so"). While this might seem dangerous from a security point of view, the Boss will specifically load only from the system library paths, which must themselves be compromised in order to breach security.
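The discovery step can be sketched with the POSIX `scandir` and `fnmatch` calls. The glob pattern below is an assumption derived from the naming shown above, and the directory to scan is left to the caller. A real loader would call `dlopen()` on each match instead of merely counting them.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fnmatch.h>
#include <stdlib.h>

/* Assumed glob for Manager libraries; the exact prefix is a guess. */
#define SKETCH_MANAGER_PATTERN "*dvp_kgm_*.so"

/* Returns nonzero if the file name looks like a Manager shared object. */
int sketch_name_is_manager_lib(const char *name)
{
    assert(name != NULL);
    return fnmatch(SKETCH_MANAGER_PATTERN, name, 0) == 0;
}

/* Filter callback in the form scandir() expects. */
static int sketch_manager_filter(const struct dirent *entry)
{
    return sketch_name_is_manager_lib(entry->d_name);
}

/* Counts matching libraries in `dir`; returns -1 if it cannot be scanned. */
int sketch_count_manager_libs(const char *dir)
{
    struct dirent **names = NULL;
    int n = scandir(dir, &names, sketch_manager_filter, alphasort);
    if (n < 0)
        return -1;
    for (int i = 0; i < n; i++)
        free(names[i]);    /* a real loader would dlopen() the entry first */
    free(names);
    return n;
}
```

Restricting `dir` to the system library paths is what provides the security property described above; the pattern match alone offers none.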

Memory Allocation and Usage
DVP allows the programmer to allocate memory in many formats, depending on the local HLOS. When DVP is running on a system with a TILER, DVP can allocate 1D cached, 1D uncached and 2D uncached tiled memory. Otherwise, most allocations are made via malloc, calloc or memalign. The RPC layer of DVP understands the cache issues associated with remote core execution and works to keep buffers consistent after Section executions.

No Data Dependencies
DVP does not assume that it knows better than the programmer. It will execute Nodes in the order that the programmer gave it. DVP allows the programmer to assemble and execute a Graph regardless of how the data dependencies work out. This means that the programmer may be able to construct a bad Graph (one with incorrect dependencies). The onus of correct behaviour is left to the programmer. The trade-off here is code complexity and run-time overhead versus development-time overhead. If a Manager determines that the kernels can be done in a more efficient manner, it may do so. An example of this is a combined kernel which takes 1 input and produces 3 outputs which would normally be computed individually.

No reordering of Graph Sections/Nodes/Kernels
DVP has been designed thus far to assume that the programmer is the best optimizer, not a complex graph dependency system.

Camera Considerations
Machine Vision has a fundamentally different approach to camera control than Human Vision does. Typically sensor tuning and camera controls are designed with Human Vision consumption in mind and nothing else. Machine Vision does not care about aesthetically pleasing images. Machine Vision "care-abouts" are more varied and are driven by the needs of the Machine Vision algorithms being used. To that end, cameras which need to enable Machine Vision need the following functionality:
 * Manual White Balance Control - Enable/Disable of AWB, plus areas of disinterest.
 * Manual Exposure Rates
 * Predictable AWB/AE change periods.
 * Variable Frame Rates - more than the fixed values; 5 to 60 fps may be needed to catch fast or slow movements.
 * Variety of Image sizes - Machine Vision sometimes needs images of multiple dimensions, from QQVGA to SVGA+ sizes.
 * Color Formats - Typically Human Vision needs RGB color space, while Machine Vision tends to operate in YUV with some side channels using RGB.

= Implementation =

Languages
DVP is implemented in C with some C99 extensions. GCC and Microsoft's CL compiler can both compile the majority of DVP. DVP does contain some NEON ARM assembly (see Writing ARM Assembly) which is in the AT&T assembly style.

DVP contains other components which are implemented in C++ (VisionCam/VisionEngine). These are convenience classes used to simplify usage of DVP within a HLOS environment.

Supported HLOS

 * Froyo and Gingerbread Android - using Android makefiles.
 * QNX - using Concerto
 * Ubuntu Linux - "Host" Build where the reference "C" versions of the kernels are used. Using Concerto build.
 * Windows NT - "Host" Build where the reference "C" versions of the kernels are used. Using Concerto build.

Android Specific Issues

 * Android Froyo and Gingerbread do not support a generic shmget (shared memory allocation) and thus a "work-around" was used which allows multiple DVP clients to allocate from a shared memory area via a native system service, "shm_service" which is implemented in the SOSAL.
 * Android does not allow access to the OMX-CAMERA unless you are a "root" privileged process (due to the need to read DCC files which are marked as root privilege among other reasons).
 * Once you access the OMX-CAMERA in root mode, you need a Binder interface for callers to use to access your Daemon's features.

OMAP4

 * SIMCOP
 * C64T
 * A9 (NEON) (multithreaded)
 * OpenCL (1)
 * Cloud (2)

OMAP5

 * SIMCOP
 * C64T
 * A15 (NEON) (multithreaded)
 * M4 (NEON) (multithreaded)
 * OpenCL (1)
 * Cloud (2)

PC (Ubuntu/Windows)

 * CPU (multithreaded)
 * OpenCL (1)
 * Cloud (2)

1: On platforms which enable OpenCL. 2: On platforms with network connectivity and when an EC2 RPC is implemented.

= Supported Kernels =

DVP has some algorithm kernels which will be released "openly" with DVP.
 * YUV - a set of NEON accelerated image processing functions.
 * IMGFILTER - a 3x3 image convolution library written in NEON.
 * OCL - A demonstration of calling kernels from OpenCL.
 * DSPLIB - a generic set of DSP compute functions.
 * VRUN - a subset of vision kernels which are accelerable by the OMAP ISS.

= Other Components =

SOSAL
SOSAL is a very simple operating system abstraction layer plus a design pattern library, which together allow for rapid development. It contains:
 * OS: Threads, Mutex, Semaphore, Events, Sockets
 * Patterns: List, Hash, Queue, Ring Buffer, RPC over Sockets

Display
Display is a critical piece of development code which allows for programmers to see the images coming from the camera or the output from kernels. Supported Display Techs are:
 * V4L2 (Output)
 * LibScreen
 * GTK

VisionCam
VisionCam is a C++ Wrapper around the OMX-Camera interface which aims to simplify the OMX interface style down to the bare minimum needed to enable Machine Vision applications.

Subclasses
VisionCam has several subclasses which allow for various stages of development. They include:
 * UVCVisionCam - wraps the V4L2 capture interface for USB Video Camera (UVC) drivers on Ubuntu PCs. This is used for early development.
 * SocketVisionCam - exposes the VisionCam interface over RPC over Sockets to the Host PC (Ubuntu or Windows). This is used for later development which still relies on PC code.
 * OMXVisionCam - on production OMAP code bases it uses the OMX-Camera interface.
 * FileVisionCam - reads RAW image data from a file, either an .avi or .yuv/.rgb file. Used for testing.

Using VisionCam in Socket Mode
On the Android device (Blaze/Tablet/etc), run:
 $ vcam_server

On the Host, build DVP using the instructions below for your platform. Connect your platform to the PC via microUSB. Then execute:
 $ adb forward tcp:8501 tcp:8501
 $ adb forward tcp:8502 tcp:8502

To get a single (front) camera image:
 $ vcam_simple -t 3 --name localhost -w 160 -h 120 -c NV12 -s 1

To get the stereo (front) camera image (Top-Bottom) on Blaze:
 $ vcam_simple -t 3 --name localhost -w 160 -h 240 -c NV12 -s 2 -tb

VisionEngine
VisionEngine is a utility C++ class used to implement Machine Vision applications. It contains a thread and references to VisionCam and DVP.

Base Class Features

 * Multiple Graphs Supported
 * Multiple Camera Ports Supported
 * Default Dequeue implements a Frame Dropper to keep current.
 * DelayCameraSettings implements a frame delayed focus control loop.

Dual Port Support
The VisionEngine (on latest develop, post RLS_1.80) supports Multiple Camera Ports and Multiple Graphs. Each port may be associated with multiple graphs. When GraphUpdate receives a VisionCamFrame with a specific port, the subclass must update the appropriate graphs using the m_correlation variable.