CARVIEW |
Navigation Menu
-
-
Notifications
You must be signed in to change notification settings - Fork 56.2k
Find contours speedup #26834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Find contours speedup #26834
Conversation
bb242cc
to
085e8e3
Compare
I've tested "updated code" (after) VS "current 4.x" (new) VS "4.9" (old). Also profiled with callgrind to check number of memory allocations. Each measurment took 50 iterations using reproducer from the original issue.
Profiling with callgrind shows that updated code still have numerous malloc calls. I've been able to reduce this number even more (by ~3M or by ~6M, I'm not sure) by disabling Perhaps we should use some storage with blocks allocation in your arena, something similar to old C-API |
|
the stack does not use a growing capacity, vector should perform less allocations
Avoid allocation in TreeIterator by using a specific storage using an arena No visible improvements on the test case
The values stored in the `levels` member of TreeIterator seems to be mostly contiguous indices. Instead of allocating each index, we can just use ranges. It saves a lot of allocations. Still no visible performance improvement on test case.
Did you measure performance difference with
This comment was related to points stored in contours, not to TreeIterator. I believe TreeIterator doesn't have big impact on performance. |
On my machine (Win7, VS2019) original 4.11 legacy : ~133ms
OK, I misunderstood as "preallocate index_mapping memory". On my host machine, it is still not so close to legacy |
Please take a look at this commit: 5f35ea3 |
What I understand :
original 4.11 legacy : ~133ms I think that I will try to evolve to a BlockStorageWithArena for the general case |
As I understand this algorithm requires memory storage with :
I thought about using std::deque with custom allocator and larger blocks, but it seems it'll make things too complicated.
Not sure if I understand. Wouldn't offset need to be 2D? |
I think you are right. We should not rely on a structure which inner tuning depends heavily on the toolchain. Your BlockStorage let us control the amount of memory reliably.
The idea of the arena is just to provide fast memory on demand, I don't think that my ArenaStackBuffer has much overhead.
Yes, but as long as we keep track of width, offset = y*width+x, so (x, y) = (offset%width, offset/width). This is just a trick to save half the memory used to store one point (at the cost of the offset <-> (x,y) conversion overhead here and there) |
removed all arenas in favor of BlockStorage added static blocks to BlockStorage added range iteration for block storage to optimize copy data out of it make BlockStorage allocated on stack before ContourScanner instanciation tested storage of points as int(offset), but it was slower on test cases
original 4.11 legacy : ~133ms |
a std::vector<char> might not be optimal to store the codes (for instance uit always allocate 16 bytes, even empty), so I tried different things. Nothing was better so far.
even unused and empty, std::vector<schar> codes allocates memory using a specific ContourCodesStorage similar to ContourPointsStorage help saving that allocation
original 4.11 legacy : ~133ms |
removed harmless debugging code, raising warning with some compilers
I ran standard perf tests to see if there are any regression. |
Number of samples is chosen from the predefined range (10-100 by default) taking into account relative performance measurment deviation (median VS mean) and overall test execution time (should not exceed 3 sec by default). Maybe something else. Use following options to tune the parameters:
The simplest variant would be to set fixed number of samples using the same value of BTW, I used this basic test to reproduce issue from the original ticket and compare legacy and new implementations: #include "perf_precomp.hpp"
#include "opencv2/imgproc/detail/legacy.hpp"
namespace opencv_test { namespace {
PERF_TEST(TestFindContours, findContours_big)
{
Mat img = imread(getDataPath("cv/imgproc/contours.png"), cv::IMREAD_GRAYSCALE);
cv::Mat mask;
inRange(img, 110, 120, mask);
std::vector<std::vector<cv::Point>> v_cont;
TEST_CYCLE() cv::findContours(mask, v_cont, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
SANITY_CHECK_NOTHING();
}
PERF_TEST(TestFindContours, findContours_big_old)
{
Mat img = imread(getDataPath("cv/imgproc/contours.png"), cv::IMREAD_GRAYSCALE);
cv::Mat mask;
inRange(img, 110, 120, mask);
std::vector<std::vector<cv::Point>> v_cont;
TEST_CYCLE() cv::findContours_legacy(mask, v_cont, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
SANITY_CHECK_NOTHING();
}
} } // namespace |
Ok, I ran perf_test for |
There is still some room to improve the capabilities of BlockStorage, but it won't change anything for the current test case. So I wonder if I should commit it or if it is just noise at this point.
There is also the temptation (bait ?) to mimic an allocator by
And also
That's not a big work, but does it belong to that PR ? |
I'm observing slight performance degradation in the orignal scenario with current version: PR ~ 48 ms, Legacy ~ 42 ms (Linux, GCC 11, x86_64). Is it OK on Windows? Could you please check?
I believe we don't need to add functionality we don't use right now. It can be useful to move block memory storage to separate file(s). Though eventually we might need to move it to the core module (private interface, like BufferArea), extend functionality and cover with tests. |
I just performed the test : On the machine where I developed the PR, Windows 7, vs2019
On another Windows 10, vs2022
|
Tree<T>& tree; | ||
vectorOfRanges<int> levels; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm observing better performance with std::stack
(though less stable):
On x86_64 (Core i5-11600, Linux):
- stack:
samples=100 mean=44.01 median=42.88 min=40.59 stddev=2.44 (5.6%)
- vectorOfRanges:
samples=100 mean=46.97 median=46.97 min=46.65 stddev=0.16 (0.3%)
On AArch64 (Rockchip RK3588):
- stack:
samples=100 mean=79.23 median=79.02 min=78.35 stddev=1.27 (1.6%)
- vectorOfRanges:
samples=100 mean=89.51 median=89.52 min=88.90 stddev=0.21 (0.2%)
I propose to leave the stack version. Also it is better for simplicity.
-class BlockStorgae has its own file -TreeIterator::levels reverted to std::stack (get rid of vectorOfRanges) -ContourPointsStorage and ContourCodesStorage are template specialization of the same class ContourDataStorage -ContourPointsStorage and ContourCodesStorage us a static size of 0 for their inner BlockStorage I have run the tstandard findContours perf_tests to check that all those changes were not a slowdown for the average case (because the current use case of the PR must not take precedence). It's ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. With minor cosmetic comments.
This Pull request is still incomplete.
It is an attempt, as suggested by #26775, to restore lost speed when migrating
findContours()
implementation from C to C++The patch adds an "Arena" (a pool) of pre-allocated memory so that contours points (and TreeNodes) can be picked from the Arena.
The code of
findContours()
is mostly unchanged, the arena usage being implicit through a utility class Arena::Item that provides C++ overloaded operators and construct/destruct logic.As mentioned in #26775, the contour points are allocated and released in order, and can be represented by ranges of indices in their arena. No range subset will be released and drill a hole, that's why the internal representation as a range of indices makes sense.
The TreeNodes use another Arena class that does not comply to that range logic.
Currently, there is a significant improvement of the run-time on the test mentioned in #26775, but it is still far from the
findContours_legacy()
performance.Patch to opencv_extra has the same branch name.