Find contours speedup #26834

chacha21 · 2025-01-23T20:51:20Z

This Pull request is still incomplete.
It is an attempt, as suggested by #26775, to restore lost speed when migrating findContours() implementation from C to C++

The patch adds an "Arena" (a pool) of pre-allocated memory so that contours points (and TreeNodes) can be picked from the Arena.
The code of findContours() is mostly unchanged, the arena usage being implicit through a utility class Arena::Item that provides C++ overloaded operators and construct/destruct logic.

As mentioned in #26775, the contour points are allocated and released in order, and can be represented by ranges of indices in their arena. No range subset will be released and drill a hole, that's why the internal representation as a range of indices makes sense.

The TreeNodes use another Arena class that does not comply to that range logic.

Currently, there is a significant improvement of the run-time on the test mentioned in #26775, but it is still far from the findContours_legacy() performance.

I agree to contribute to the project under Apache 2 License.
To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
The PR is proposed to the proper branch
There is a reference to the original bug report and related work
There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
The feature is well documented and sample code can be built with the project CMake

mshabunin · 2025-01-25T20:39:17Z

I've tested "updated code" (after) VS "current 4.x" (new) VS "4.9" (old). Also profiled with callgrind to check number of memory allocations. Each measurment took 50 iterations using reproducer from the original issue.

Name of Test	before	new	after	new vs before (x-factor)	after vs before (x-factor)
findContours_big::TestFindContours	41.837	75.655	61.264	0.55	0.68
malloc calls	200K	23M	13M

Profiling with callgrind shows that updated code still have numerous malloc calls. I've been able to reduce this number even more (by ~3M or by ~6M, I'm not sure) by disabling index_mapping access in the cv::contourTreeToResults fuinction and get performance ~45ms - comparable with old variant.

Perhaps we should use some storage with blocks allocation in your arena, something similar to old C-API MemStorage. Thus we will avoid vector reallocations which happens currently.

chacha21 · 2025-01-25T22:11:39Z

Perhaps we should use some storage with blocks allocation in your arena, something similar to old C-API MemStorage. Thus we will avoid vector reallocations which happens currently.

contourTreeToResults() is called once, so there is certainly no improvement by pre-allocating the memory needed for index_mapping. A stack allocation would help, but the current test case has a tree.size() equal to 127384 elements, which is too large for that.

the stack does not use a growing capacity, vector should perform less allocations

Avoid allocation in TreeIterator by using a specific storage using an arena No visible improvements on the test case

The values stored in the `levels` member of TreeIterator seems to be mostly contiguous indices. Instead of allocating each index, we can just use ranges. It saves a lot of allocations. Still no visible performance improvement on test case.

mshabunin · 2025-01-26T13:58:45Z

contourTreeToResults() is called once, so there is certainly no improvement by pre-allocating the memory needed for index_mapping

Did you measure performance difference with index_mapping update on your platform? For me it gave 60 ms -> 45 ms improvement on Linux/GCC.

Perhaps we should use some storage with blocks allocation in your arena, something similar to old C-API MemStorage. Thus we will avoid vector reallocations which happens currently.

This comment was related to points stored in contours, not to TreeIterator. I believe TreeIterator doesn't have big impact on performance.

chacha21 · 2025-01-26T17:02:53Z

On my machine (Win7, VS2019)

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms

Perhaps we should use some storage with blocks allocation in your arena, something similar to old C-API MemStorage. Thus we will avoid vector reallocations which happens currently.
This comment was related to points stored in contours, not to TreeIterator. I believe TreeIterator doesn't have big impact on performance.

OK, I misunderstood as "preallocate index_mapping memory".

On my host machine, it is still not so close to legacy

mshabunin · 2025-01-26T19:42:15Z

Please take a look at this commit: 5f35ea3
This is block storage which I was writing about. Though I'm not sure how to make it more concise.
On my machine it works for 42 ms VS 40 ms for legacy function. Number of malloc calls is 275K VS 185K. There is still some room for improvement (for example, perhaps we dont need to store reference to storage in each contour, but add is an argument in corresponing methods), but overall it look much better.

chacha21 · 2025-01-26T20:58:03Z

Please take a look at this commit: 5f35ea3

~~[EDIT]:test_impgproc fails with this commit~~
Ok, it fails because the modifications are not backported to approximateChainTC89

What I understand :

this commit is about replacing the Arena (which has a policy of "fixed size, then dynamic allocation when static capacity is reached") by a structure performing incremental dynamic allocations by blocks.
I can see that you tested a static size of 600000 bytes for the arena, because of the current test case generating a high number of points. Of course, it would not be a good idea to allocate that much memory for the standard case. Hence it seems doomed, and the Block storage might be better in general.
the Block storage is not conceptually very different from a vector. It performs dynamic allocations as it grows.
the difference is that a vector does not assume that realloc() might work, and will always copy all its previous content when resized, while the block storage will just add new blocks
your BlockStorage is similar to the vectorWithArena that I tried previously for TreeIterator::levels (but vectorWithArena allocated blocks from an arena instead of the standard new)
for the current test case, requiring a high amount of memory, a small arena with a fixed stack-allocated buffer won't help, but for the general case it could still be useful for the general case.

~~A will test the performance with your commit on my host machine ASAP.~~

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms
Block storage for contour points + index mapping as vector + TreenodeIterator::levels as ranges : ~142ms

I think that I will try to evolve to a BlockStorageWithArena for the general case
Another idea : actually, I am not sure that contour points should be stored as int2(x,y), but rather int(offset). The width information that allows offset->(x,y) can be stored once in the context. I am curious if the saved memory will result in a speed up. There will be an overhead when converting to contour results, a mere memcopy being impossible after that.

mshabunin · 2025-01-27T09:35:46Z

As I understand this algorithm requires memory storage with :

O(1) push_back
O(1) random access
elements for each contour are added sequentially
minimum number of allocations and deallocations

std::vector satisfies 1-3, but not 4 - due to reallocations.
std::deque should work like this (1-3), but it seems that it uses quite small blocks (especially in MSVC - 16 bytes or 1 element), thus 4 is not satisfied as well

I thought about using std::deque with custom allocator and larger blocks, but it seems it'll make things too complicated.
Do we really need arena in this case? Wouldn't it add extra overhead for keeping items?

I am not sure that contour points should be stored as int2(x,y), but rather int(offset)

Not sure if I understand. Wouldn't offset need to be 2D?

chacha21 · 2025-01-27T10:56:21Z

As I understand this algorithm requires memory storage with : [...]
* std::vector satisfies 1-3, but not 4 - due to reallocations.
* std::deque should work like this (1-3), but it seems that it uses quite small blocks (especially in MSVC - 16 bytes or 1 element), thus 4 is not satisfied as well
I thought about using std::deque with custom allocator and larger blocks, but it seems to make things too complicated.

I think you are right. We should not rely on a structure which inner tuning depends heavily on the toolchain. Your BlockStorage let us control the amount of memory reliably.

Do we really need arena in this case? Wouldn't it add extra overhead for keeping items?

The idea of the arena is just to provide fast memory on demand, I don't think that my ArenaStackBuffer has much overhead.
However, even better, I think that you could just tune your BlockStorage with one static buffer holding the first blocks (that will be allocated on stack if the BlockStorage itself is allocated on stack before creating the ContourScanner). When more blocks are needed, the dynamic allocation will occur.
No need to delegate that to an arena.

I am not sure that contour points should be stored as int2(x,y), but rather int(offset)
Not sure if I understand. Wouldn't offset need to be 2D?

Yes, but as long as we keep track of width, offset = y*width+x, so (x, y) = (offset%width, offset/width). This is just a trick to save half the memory used to store one point (at the cost of the offset <-> (x,y) conversion overhead here and there)
(and we ignore int overflow for very large images)

removed all arenas in favor of BlockStorage added static blocks to BlockStorage added range iteration for block storage to optimize copy data out of it make BlockStorage allocated on stack before ContourScanner instanciation tested storage of points as int(offset), but it was slower on test cases

chacha21 · 2025-01-27T14:00:56Z

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms
Block storage for contour points + index mapping as vector + TreenodeIterator::levels as ranges : ~142ms
no arena + enhanced Block storage + index mapping as vector + TreenodeIterator::levels as ranges : ~139ms

a std::vector<char> might not be optimal to store the codes (for instance uit always allocate 16 bytes, even empty), so I tried different things. Nothing was better so far.

even unused and empty, std::vector<schar> codes allocates memory using a specific ContourCodesStorage similar to ContourPointsStorage help saving that allocation

chacha21 · 2025-01-27T19:26:25Z

original 4.11 legacy : ~133ms
original 4.11 : ~210ms
my initial PR : ~173ms
PR+index mapping as vector : ~149ms
PR+index mapping as vector + TreenodeIterator::levels as ranges : ~147ms
Block storage for contour points + index mapping as vector + TreenodeIterator::levels as ranges : ~142ms
no arena + enhanced Block storage + index mapping as vector + TreenodeIterator::levels as ranges : ~139ms
no arena + enhanced Block storage + index mapping as vector + TreenodeIterator::levels as ranges + codes also stored in a BlockStorage: ~133ms

removed harmless debugging code, raising warning with some compilers

chacha21 · 2025-01-28T06:24:08Z

I ran standard perf tests to see if there are any regression.
It's hard to compare old vs new perstats for perf_impgproc(--gtest_filter=*indContours*), since the number of samples per test varies from run to run.
What's the good practice ? (I can't run python)

mshabunin · 2025-01-29T12:30:23Z

Number of samples is chosen from the predefined range (10-100 by default) taking into account relative performance measurment deviation (median VS mean) and overall test execution time (should not exceed 3 sec by default). Maybe something else.

Use following options to tune the parameters:

--perf_min_samples / --perf_force_samples to modify range
--perf_max_deviation - to change deviation threshold
--perf_time_limit - to change execution time limit

The simplest variant would be to set fixed number of samples using the same value of min_samples and max_samples.

BTW, I used this basic test to reproduce issue from the original ticket and compare legacy and new implementations:

#include "perf_precomp.hpp"
#include "opencv2/imgproc/detail/legacy.hpp"
namespace opencv_test { namespace {
PERF_TEST(TestFindContours, findContours_big)
{
    Mat img = imread(getDataPath("cv/imgproc/contours.png"), cv::IMREAD_GRAYSCALE);
    cv::Mat mask;
    inRange(img, 110, 120, mask);
    std::vector<std::vector<cv::Point>> v_cont;
    TEST_CYCLE() cv::findContours(mask, v_cont, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    SANITY_CHECK_NOTHING();
}
PERF_TEST(TestFindContours, findContours_big_old)
{
    Mat img = imread(getDataPath("cv/imgproc/contours.png"), cv::IMREAD_GRAYSCALE);
    cv::Mat mask;
    inRange(img, 110, 120, mask);
    std::vector<std::vector<cv::Point>> v_cont;
    TEST_CYCLE() cv::findContours_legacy(mask, v_cont, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    SANITY_CHECK_NOTHING();
}
} } // namespace

chacha21 · 2025-01-29T13:06:05Z

Ok, I ran perf_test for *indContours* with current 4.x vs current PR
There is probably no significant difference, but PR timings are almost always slightly better.
Sometimes (it differs from run to run), some tests might have <= 2ms regression, but it's probably noise.
In my opinion, this is an overall improvement.

chacha21 · 2025-01-31T06:43:00Z

There is still some room to improve the capabilities of BlockStorage, but it won't change anything for the current test case. So I wonder if I should commit it or if it is just noise at this point.
The idea is that a referenced part of a block storage (a pair <first,last>) could be ask to be "released", giving a chance to really free it if it happens to be the tail of the block storage.

in that case, the block storage can just size -= (last-first)
but it is unclear if the block storage must free such released memory (if a block can be freed) or keep it as a remaining capacity (certainly the best choice)
I assume that adding size_t capacity() const, void resize() {dangerous if referenced areas}, void shrink_to_fit() are standards that one could expect
the "release part" operation would be if (last == storage.size()) storage.resize(storage.size()-(last-first))

There is also the temptation (bait ?) to mimic an allocator by

adding an std::pair<size_t, size_t> alloc(size) that would usually append memory like the actual push_back()
adding a bool release(const std::pair<size_t, size_t>&) that could use the "release tail if possible", no-op otherwise
adding the ability to also mark released areas at the head (like a double-ended buffer), so that future allocations that would fit could reuse that memory (storage start)...(head)(data)...(data)(tail)...(storage end)

And also

add a writeTo(size_t first, size_t size, const T* data) that would handle the RangeIterator logic instead of making it public.
add a readFrom(size_t first, size_t size, T* data)

That's not a big work, but does it belong to that PR ?

mshabunin · 2025-02-02T11:28:39Z

I'm observing slight performance degradation in the orignal scenario with current version: PR ~ 48 ms, Legacy ~ 42 ms (Linux, GCC 11, x86_64). Is it OK on Windows? Could you please check?

There is still some room to improve the capabilities of BlockStorage, but it won't change anything for the current test case.

I believe we don't need to add functionality we don't use right now. It can be useful to move block memory storage to separate file(s). Though eventually we might need to move it to the core module (private interface, like BufferArea), extend functionality and cover with tests.

chacha21 · 2025-02-04T15:41:48Z

I'm observing slight performance degradation in the orignal scenario with current version: PR ~ 48 ms, Legacy ~ 42 ms (Linux, GCC 11, x86_64). Is it OK on Windows? Could you please check?

I just performed the test :

On the machine where I developed the PR, Windows 7, vs2019

cv::findContours_legacy() : ~133ms
current PR cv::findContours() : ~137ms

On another Windows 10, vs2022

cv::findContours_legacy() : ~66ms
current PR cv::findContours() : ~68ms

mshabunin · 2025-02-19T18:48:05Z

modules/imgproc/src/contours_common.hpp

    Tree<T>& tree;
+    vectorOfRanges<int> levels;


I'm observing better performance with std::stack (though less stable):

On x86_64 (Core i5-11600, Linux):

stack: samples=100 mean=44.01 median=42.88 min=40.59 stddev=2.44 (5.6%)

vectorOfRanges: samples=100 mean=46.97 median=46.97 min=46.65 stddev=0.16 (0.3%)

On AArch64 (Rockchip RK3588):

stack: samples=100 mean=79.23 median=79.02 min=78.35 stddev=1.27 (1.6%)

vectorOfRanges: samples=100 mean=89.51 median=89.52 min=88.90 stddev=0.21 (0.2%)

I propose to leave the stack version. Also it is better for simplicity.

modules/imgproc/src/contours_common.hpp

-class BlockStorgae has its own file -TreeIterator::levels reverted to std::stack (get rid of vectorOfRanges) -ContourPointsStorage and ContourCodesStorage are template specialization of the same class ContourDataStorage -ContourPointsStorage and ContourCodesStorage us a static size of 0 for their inner BlockStorage I have run the tstandard findContours perf_tests to check that all those changes were not a slowdown for the average case (because the current use case of the PR must not take precedence). It's ok.

mshabunin

Looks good to me. With minor cosmetic comments.

modules/imgproc/src/contours_common.hpp

gather preliminary work into one commit

085e8e3

chacha21 force-pushed the findContours_speedup branch from bb242cc to 085e8e3 Compare January 23, 2025 20:55

better C++ grammar for gcc/clang

9278de7

asmorkalov changed the title ~~Find contours speedup~~ WIP: Find contours speedup Jan 24, 2025

asmorkalov requested a review from mshabunin January 24, 2025 06:09

asmorkalov added the category: imgproc label Jan 24, 2025

Do not use map for contour indexing

4dd8408

chacha21 added 5 commits January 25, 2025 23:25

use vector instead of stack

4a341bc

the stack does not use a growing capacity, vector should perform less allocations

changed an assert

c3347a4

try to minimize allocations

a75ad49

Avoid allocation in TreeIterator by using a specific storage using an arena No visible improvements on the test case

fixed typo not detected by MSVC

e8bcc40

fixed move

73bba4d

Merge branch '4.x' into findContours_speedup

1ca692a

chacha21 added 3 commits January 27, 2025 15:03

typo not detected at compile time

c0b747a

use abstraction for Contour::codes

b9e2403

a std::vector<char> might not be optimal to store the codes (for instance uit always allocate 16 bytes, even empty), so I tried different things. Nothing was better so far.

use BlockStorage for codes

1875ef6

even unused and empty, std::vector<schar> codes allocates memory using a specific ContourCodesStorage similar to ContourPointsStorage help saving that allocation

remove unused debugging code

ce01235

removed harmless debugging code, raising warning with some compilers

mshabunin reviewed Feb 19, 2025

View reviewed changes

chacha21 added 3 commits February 20, 2025 12:14

Merge remote-tracking branch 'upstream/master' into findContours_speedup

9938b26

typo

e6f4e8a

mshabunin approved these changes Mar 11, 2025

View reviewed changes

modules/imgproc/src/contours_common.hpp Show resolved Hide resolved

modules/imgproc/src/contours_common.hpp Outdated Show resolved Hide resolved

indentation

fd90ee4

asmorkalov assigned mshabunin Mar 12, 2025

asmorkalov added this to the 4.12.0 milestone Mar 12, 2025

asmorkalov changed the title ~~WIP: Find contours speedup~~ Find contours speedup Mar 12, 2025

asmorkalov merged commit d83df66 into opencv:4.x Mar 12, 2025
27 of 28 checks passed

asmorkalov mentioned this pull request Apr 29, 2025

5.x merge 4.x #27265

Merged

Uh oh!

Find contours speedup #26834

Find contours speedup #26834

Uh oh!

Conversation

chacha21 commented Jan 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mshabunin commented Jan 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chacha21 commented Jan 25, 2025

Uh oh!

mshabunin commented Jan 26, 2025

Uh oh!

chacha21 commented Jan 26, 2025

Uh oh!

mshabunin commented Jan 26, 2025

Uh oh!

chacha21 commented Jan 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mshabunin commented Jan 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chacha21 commented Jan 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chacha21 commented Jan 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chacha21 commented Jan 27, 2025

Uh oh!

chacha21 commented Jan 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mshabunin commented Jan 29, 2025

Uh oh!

chacha21 commented Jan 29, 2025

Uh oh!

chacha21 commented Jan 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mshabunin commented Feb 2, 2025

Uh oh!

chacha21 commented Feb 4, 2025

Uh oh!

mshabunin Feb 19, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mshabunin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chacha21 commented Jan 23, 2025 •

edited

Loading

mshabunin commented Jan 25, 2025 •

edited

Loading

chacha21 commented Jan 26, 2025 •

edited

Loading

mshabunin commented Jan 27, 2025 •

edited

Loading

chacha21 commented Jan 27, 2025 •

edited

Loading

chacha21 commented Jan 27, 2025 •

edited

Loading

chacha21 commented Jan 28, 2025 •

edited

Loading

chacha21 commented Jan 31, 2025 •

edited

Loading