## Introduction
GPU-accelerated computing, introduced in the 2000s, helped start the AI revolution in 2011 and is one of the main trends nowadays. The performance of GPUs and dedicated AI accelerators (called NPUs (neural processing units), TPUs (tensor processing units), etc.) increases significantly faster than the performance of CPUs. These GPUs/NPUs are now equipped with special instructions and extended capabilities to run various sophisticated algorithms.
Until ~2012 OpenCV was a purely CPU library, even though special optimizations using parallel loops and vector instructions were actively added. That CPU-based acceleration direction is still relevant, see #25019. We then introduced CUDA-based acceleration modules in OpenCV, which have since moved to opencv_contrib. In OpenCV 3.0 we also introduced the OpenCL-based Transparent API (T-API).
Besides using CUDA and OpenCL to accelerate basic functionality, we also added CUDA- and OpenCL-based backends to our deep learning inference module (OpenCV DNN), introduced in 2015. OpenCV DNN also includes backends that use other standard or proprietary acceleration APIs: a Vulkan-based backend, a CANN-based backend for Huawei Ascend, a TimVX/OpenVX-based backend for the Amlogic NPU, etc.
There are several serious problems with the current approach that we want to solve in OpenCV 5.0, namely:
- For each acceleration API we added a dedicated matrix/image/tensor type: `UMat` for OpenCL, `GpuMat` for CUDA, `AclMat` for CANN, etc. This is very inconvenient for OpenCV itself and for user applications.
- OpenCL is still supported by many vendors, but unfortunately it did not become a universal API (a lingua franca) for all accelerators. On mobile GPUs, e.g. in Android phones, most vendors prefer to offer Vulkan in addition to, or instead of, OpenCL. NPU vendors all introduce their own APIs. Even Apple, who created OpenCL, has deprecated it and now promotes the proprietary Metal API.
- Another problem with OpenCL is that it's not portable. A kernel developed for one OpenCL platform may not run on another OpenCL platform due to a different amount of local memory, custom OpenCL extensions used in the code, etc. And even if it runs, the performance is usually sub-optimal, sometimes much slower than a specially-tuned variant.
- CUDA is a more solid framework, and there are far fewer problems with code fragmentation, since it's a one-vendor solution, but it has its own disadvantages:
  - A huge overhead in terms of consumed disk space. Whereas OpenCL support in OpenCV adds just a few megabytes to the binary size, equivalent CUDA acceleration adds hundreds of megabytes. A possible workaround is to include only PTX code into the OpenCV backend and use the CUDA JIT compiler with an on-disk cache (as in T-API) to produce machine code for the concrete host GPU, and to use external libraries (like NVidia NPP, cuDNN) to accelerate basic functionality and deep learning kernels.
  - OpenCV with OpenCL support enabled can run on machines without OpenCL: the OpenCL runtime is detected and loaded on the fly. OpenCV built with CUDA support, on the other hand, requires the CUDA runtime to be installed. That is, the dependency, once it's created, is not optional.
  - The same is true for the experimental CANN-based acceleration in opencv_contrib.
- The current model is that all acceleration backends must be put into the OpenCV repository, and OpenCV must be explicitly built with those accelerators enabled. There is no way for vendors to develop and maintain those backends by themselves and to build and distribute them separately from OpenCV, e.g. as add-on Python modules.
- There is no well-defined methodology in OpenCV (besides the T-API case) on how to develop kernels that have non-CPU accelerated branches. Usually we create separate entries for the CUDA-accelerated algorithm, the Ascend-accelerated algorithm, etc., which is inconvenient for users.
- There is no well-defined methodology in OpenCV (including the T-API case) on how to develop high-level algorithms that will automatically get GPU/NPU-accelerated once the underlying kernels are CPU/GPU-accelerated.
## Proposal for OpenCV 5.0
- Operations derived from the same base class and following a certain protocol:
  - All non-CPU acceleration kernels will follow a stable, versioned API specification that we will prepare for 5.0 and extend in further releases (see the Python array API standard or the ONNX specification as good examples of what can be put into such a specification).
  - Each operator will have its own unique name, like "org.opencv.add". Users and vendors may add their own extensions. For faster access to common operations there will probably be some IDs as substitutions for the textual names.
  - Each operator will be derived directly or indirectly from a standard base class (which may be called `cv::hal::BaseOp`, for example) and will have to implement a certain protocol: shape & type inference, evaluation of the scratch buffer size, asynchronous execution, etc. A rough sketch of such a class is given right after this list item.
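For illustration only, such a base class could look roughly like the sketch below. Everything here except `cv::UMat` is an assumption made for the sketch (including the method names and the `Stream` type discussed in the Streams section below), not a final API:

```cpp
// Hypothetical sketch; the actual OpenCV 5.0 HAL API may differ.
#include <opencv2/core.hpp>
#include <string>
#include <vector>

namespace cv { namespace hal {

class Stream;  // proposed stream/queue type, see the Streams section below

class BaseOp
{
public:
    virtual ~BaseOp() {}

    // unique textual name, e.g. "org.opencv.add"
    virtual std::string name() const = 0;

    // infer output shapes/types from input shapes/types and operation parameters
    virtual void inferShapesAndTypes(const std::vector<std::vector<int>>& inputShapes,
                                     const std::vector<int>& inputTypes,
                                     std::vector<std::vector<int>>& outputShapes,
                                     std::vector<int>& outputTypes) const = 0;

    // how much scratch memory the operation needs for the given inputs (0 if none)
    virtual size_t scratchBufferSize(const std::vector<std::vector<int>>& inputShapes) const
    { return 0; }

    // schedule the operation asynchronously on the given stream;
    // return false if this configuration is not supported, so the caller can fall back to CPU
    virtual bool run(const std::vector<UMat>& inputs, std::vector<UMat>& outputs,
                     UMat& scratch, Stream& stream) = 0;
};

}} // namespace cv::hal
```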
- Data structures:
  - Each operator will take zero or more 'tensors' and scalars on input and produce zero or more 'tensors' and scalars on output. To represent those 'tensors' we plan to extend the `cv::UMat` data type.
  - In OpenCV 3.x and 4.x all `UMat` instances were wrappers on top of OpenCL buffers when an OpenCL runtime engine is detected, or all wrappers on top of system memory buffers when no OpenCL is used. In OpenCV 5.0 different instances of `UMat` may be allocated and handled by different backends, for example:

    ```cpp
    using namespace cv;

    // capture a video frame stored in an OpenGL texture (perhaps we don't need to know)
    UMat frame;
    vidcap >> frame;

    // convert the frame from OpenGL or whatever representation to a CUDA buffer
    // stored on the 0-th NVidia GPU installed in the system.
    // in the general case transferring data from one space to another will be done via system memory,
    // but some backends will provide more efficient mechanisms
    UMat cuda_frame = frame.upload(Device::NVGPU_(0));
    UMat cuda_processed_frame;

    // filter the frame; the result will be placed onto the same device
    GaussianBlur(cuda_frame, cuda_processed_frame, Size(11, 11), 3, 3);

    // retrieve the result as cv::Mat for further custom processing.
    Mat result = cuda_processed_frame.getMat(Access::READ_WRITE);
    ...
    ```
- Devices, Allocators:
  - There will be a base class `Device`, and some basic non-CPU HAL functions and classes, including memory allocators, will take `Device*` as a parameter. `nullptr` could be used as an equivalent of the 'CPU device'. For other cases there will be helper functions that return proper pointers, e.g. `Device::NVGPU_(device_index)`, `Device::GPU_(device_index)` (any GPU), `Device::defaultAccelerator()` (any GPU or NPU or CPU), etc.
  - Each `UMat` instance could return the device where it's located.
  - For each HAL backend there will be a singleton class derived from `UMatAllocator`. It can allocate memory on the specified device, deallocate memory, upload a memory block to the specified device, download a memory block to system memory, transfer memory to another device (via an intermediate `download()` or directly, if possible), copy memory within the same device, initialize a memory block with the specified scalar value, and map/unmap memory to/from system memory if the device supports zero copy (if not, a physical copy is made).
  - It will be possible to request the memory allocator to be used for a certain device. A rough sketch of the `Device`/`UMatAllocator` interfaces is given right after this list.
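As an illustration, the `Device` and `UMatAllocator` interfaces could look roughly as follows; all method names and signatures in this sketch are assumptions derived from the description above, not an existing API:

```cpp
// Hypothetical sketch; the method set and signatures only illustrate the proposal above.
#include <cstddef>
#include <string>

namespace cv {

class Device
{
public:
    virtual ~Device() {}
    virtual std::string name() const = 0;          // e.g. "NVIDIA GeForce RTX ..."
    virtual std::string backendName() const = 0;   // e.g. "cuda", "opencl"

    // helper functions mentioned in the proposal; nullptr stands for the CPU device
    static Device* NVGPU_(int device_index);
    static Device* GPU_(int device_index);         // any GPU
    static Device* defaultAccelerator();           // any GPU or NPU, or CPU as a last resort
};

// One singleton per HAL backend
class UMatAllocator
{
public:
    virtual ~UMatAllocator() {}

    // allocate/deallocate memory on the given device
    virtual void* allocate(size_t size, Device* device) = 0;
    virtual void deallocate(void* handle, Device* device) = 0;

    // data transfers: system memory <-> device, device <-> device, within a device
    virtual void upload(const void* src, void* dst, size_t size, Device* device) = 0;
    virtual void download(const void* src, void* dst, size_t size, Device* device) = 0;
    virtual void copy(const void* src, void* dst, size_t size, Device* device) = 0;

    // initialize a memory block with the given scalar value
    virtual void setTo(void* handle, size_t size, const void* scalar, int type, Device* device) = 0;

    // zero-copy mapping to system memory if supported; otherwise a physical copy is made
    virtual void* map(void* handle, size_t size, Device* device) = 0;
    virtual void unmap(void* handle, Device* device) = 0;
};

} // namespace cv
```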
- Streams:
  - If possible, acceleration backends should be able to run operations asynchronously, as it's done now in Transparent API. That is, `GaussianBlur()` in the example above and most other functions return control back to the user once the operation is scheduled for execution. Operations that return memory buffers in system memory (like `UMat::getMat()`) shall insert a synchronization point before returning the buffer.
  - Typically, asynchronous execution is done using streams, a.k.a. queues in OpenCL terminology. A stream/queue is a placeholder for new tasks that one wants to execute sequentially. Note that in the proposed API (just like in Transparent API) users don't have to deal with streams directly.
    - For each Device instance, each CPU thread must provide a thread-local default Stream where all tasks are put.
    - Different devices, like an Intel iGPU via OpenCL and a pair of NVidia dGPUs via CUDA, may work in parallel, since they use different streams.
  - Different subsequent operations may reuse the same scratch buffer. If a scratch buffer needs to be reallocated (increased), a synchronization point must be inserted to make sure that all previous operations that may use the same scratch buffer have finished.
  - Temporary UMat's need to be protected. That is, some complex functions may use temporary UMat's:

    ```cpp
    void foo(const UMat& input, UMat& output)
    {
        UMat temp1, temp2;
        op1(input, temp1);
        op2(temp1, temp2);
        op3(temp2, output);
    }
    ```

    Since all the operations `op*` may be asynchronous, `foo()` may finish before all of the `op*` finish, and so `temp1` and `temp2`, being local variables, may be destructed while the operations are still running or waiting for execution in a stream. To protect them from premature release, each backend that provides asynchronous execution must increment the reference counters of all UMat arguments of each operation that is put into the stream, catch the event when each operation finishes and then decrement the reference counters back. This mechanism is already implemented in Transparent API; a minimal sketch of it is given right after this list.
  - There should probably be a special 'graph mode' flag for fast, light-weight scheduling of operations without extra protection of temporary matrices. This will be useful for OpenCV DNN backends.
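As an illustration of the reference-counting mechanism described above, a backend's enqueue path could look roughly like this. This is a sketch only; `enqueueOp`, `Stream::enqueue`/`onComplete` and the retain/release helpers are assumed names, not an existing API:

```cpp
// Sketch of keeping UMat arguments alive while an asynchronous operation runs (hypothetical names).
void enqueueOp(cv::hal::BaseOp& op, const std::vector<cv::UMat>& args, Stream& stream)
{
    // 1. Increment the reference counters so the buffers survive even if the caller's
    //    local UMat variables (like temp1/temp2 above) go out of scope.
    for (const cv::UMat& a : args)
        retainUMat(a);

    // 2. Schedule the operation asynchronously in the per-thread default stream.
    stream.enqueue(op, args);

    // 3. When the backend signals completion, decrement the counters back;
    //    the last release may actually free the temporary buffers.
    stream.onComplete([args]() {
        for (const cv::UMat& a : args)
            releaseUMat(a);
    });
}
```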
- We want to ensure that different backends can be built separately from OpenCV and then dynamically connected to it. In order to do that, the non-CPU HAL API should be put into a separate header that will not use C++ classes (well, there can be pure abstract C++ classes, pointers to which are returned by `extern "C"` functions). A possible shape of such a plugin interface is sketched below.
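Under the assumption that each backend is shipped as a shared library exporting a single `extern "C"` factory function, the plugin interface could be sketched like this (the `CvHalBackend` and `cvHalGetBackend` names are purely illustrative):

```cpp
// Hypothetical plugin header shared between OpenCV and separately-built backends.
#define CV_HAL_ABI_VERSION 1

// Pure abstract interface: safe to pass across a shared-library boundary.
struct CvHalBackend
{
    virtual ~CvHalBackend() {}
    virtual const char* name() const = 0;        // e.g. "opencl", "cuda", "cann"
    virtual int abiVersion() const = 0;          // must match CV_HAL_ABI_VERSION on load
    virtual bool supports(int op_id) const = 0;  // e.g. CV_HAL_ID_FOO from the dispatcher below
};

// The only exported symbol; OpenCV loads the plugin with dlopen()/LoadLibrary()
// and resolves this function by name, so no C++ name mangling is involved.
extern "C" CvHalBackend* cvHalGetBackend();
```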
- Some acceleration backends, like OpenCL or Vulkan, support JIT compilation, which adds a lot of flexibility and more opportunities for efficient graph fusion, and also lets us decrease the footprint dramatically (we don't have to put all precompiled kernels into the binary; we may request the backend to compile the necessary kernels on the fly and store them in an on-disk cache). There will be a dedicated API to allow such on-the-fly compilation; a possible shape of it is sketched below.
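For example, such an API could let OpenCV (or a user) hand the backend a kernel source string and get back a compiled, disk-cached kernel handle. The sketch below uses assumed names (`compileKernel`, `KernelHandle`) purely for illustration:

```cpp
// Illustrative sketch of an on-the-fly compilation entry point; not an actual OpenCV API.
namespace cv { namespace hal {

class Backend;         // backend interface, see the dispatcher example below
struct KernelHandle;   // opaque handle owned by the backend

// Ask the backend to compile a kernel from source, or fetch it from its on-disk cache.
// 'source' is OpenCL C, GLSL/SPIR-V, etc., depending on the backend;
// 'build_options' may carry type/size specializations produced by graph fusion.
KernelHandle* compileKernel(Backend* backend,
                            const char* kernel_name,
                            const char* source,
                            const char* build_options);

}} // namespace cv::hal
```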
- All OpenCV functions for which there are corresponding non-CPU HAL entries, e.g. `cv::gemm()`, shall be modified to use a generalized dispatcher scheme, something like:

  ```cpp
  void cv::foo(InputArray a, InputArray b, OutputArray c, const FooParams& params)
  {
      c.fitSameMemory(a, a.size, a.type);
      // get the default backend for the device where 'a' is located
      hal::Backend* backend = a.backend();
      if (backend && backend->supports(CV_HAL_ID_FOO))
      {
          CV_Assert(a.sameMemorySpace(b)); // some functions may support mixed-space ops as well
          // retrieve the kernel pointer for the concrete set of input parameters.
          // Backends with JIT support may generate such a kernel on the fly.
          // OpenCV DNN, other OpenCV components or user applications with a 'state' may
          // retrieve those kernels once and store them for faster access.
          auto hal_foo_ptr = backend->get_foo_kernel(a.type(), b.type(), params);
          if (hal_foo_ptr && backend->run(hal_foo_ptr, {a, b, c}, params))
              return;
      }
      // fallback to CPU
      auto a_ = a.getMat(), b_ = b.getMat(), c_ = c.getMat();
      ...
  }
  ```
For OpenCV 5.0 the minimum plan is to introduce the non-CPU HAL API, probably as a draft specification, and to implement at least one backend, most probably an OpenCL-based one. After that, in 5.x we can create more backends, for example a CUDA backend.