[GSoC] High Level API and Samples for Scene Text Detection and Recognition #17570
Conversation
```cpp
}
if (maxLoc > 0) {
    char currentChar = vocabulary[maxLoc - 1];
    if (currentChar != decodeSeq[-1])
```
decodeSeq[-1] is illegal (index -1 is out of bounds); maybe use decodeSeq.back() instead?
Fixed it.
I got the right result with index -1, so I thought a negative index might be supported now. :(
Thank you.
samples/dnn/scene_text_detection.cpp
Outdated
```cpp
"{ help h       |   | Print help message. }"
"{ inputImage i |   | Path to an input image. Skip this argument to capture frames from a camera. }"
"{ device d     | 0 | camera device number. }"
"{ modelPath mp |   | Path to a binary .onnx file contains trained DB detector model.}"
```
just curious -- is there a pretrained onnx model (link) ?
Maybe I can upload the models to Google Drive.
I am not sure about it, and I will ask my mentor.
modules/dnn/test/test_model.cpp
Outdated
```diff
@@ -370,6 +389,21 @@ TEST_P(Test_Model, Segmentation)
     testSegmentationModel(weights_file, config_file, inp, exp, norm, size, mean, scale, swapRB);
 }

+TEST_P(Test_Model, SceneTextRec)
+{
+    std::string imgPath = _tf("welcome.png");
```
Where is this file?
I have already put the data into opencv/opencv_extra, but it has not been merged yet: opencv/opencv_extra#773.
The name of the image has been changed to "text_rec_test.png", and I will push again when the data is ready.
Thanks for your review.
Please push the new name. You don't need to wait for the merge. See this step in the build pipeline.
Thank you.
Sadly, I still get an error in "Linux x64 Debug", but I find that test_dnn in "Linux x64" passed. I am not sure about the difference between these two builds.
Is there any detailed log showing which line throws the error? I can only get "error: (-215: Assertion failed) dims <= 2 in function 'at' thrown in the test body." in https://pullrequest.opencv.org/buildbot/builders/precommit_linux64_no_opt/builds/24676/steps/test_dnn/logs/stdio
I think I have tested the API successfully, and you can see more information in https://github.com/HannibalAPE/opencv/blob/text_det_recog_demo/doc/tutorials/dnn/dnn_scene_text_det_and_rec/scene_text_recognition.markdown
I still get an error in "Linux x64 Debug", but I find that the test_dnn in "Linux x64" passed. I am not sure about the difference between these two tests.
Obviously, it is "Debug" mode (with extra checks).
@alalek Thanks for your reminder.
@vpisarev Looking at the TODO list in the description, I suppose we could not fit everything into a single PR.
@HannibalAPE, I'm now trying to run the code. None of the models you provided can be read with the version of OpenCV that is in text_det_recog_demo. BTW, the hash sum for onnx/models/DB_IC15_resnet50.onnx does not match either when I run the download_models.py script. Could you please check that the models are imported correctly? Make sure that in CMake the version of protobuf from OpenCV is used, not the system version.
Is it not right?
@vpisarev Hi Vadim, I have tested it on other Linux-based systems (e.g. Ubuntu 18.04 and 16.04). On both of them, my models can be imported and output the right results. Can you provide some information about your issue so I can reproduce it?
@HannibalAPE, thank you. I've downloaded the latest DB_IC15_resnet50.onnx and it works well. It's noticeably slower than EAST detector, but the results are definitely better! However, with the model DB_TD500_resnet50.onnx it still crashes with the following message:
I'm using macOS 10.15.6, Xcode 11.6. Protobuf is 3.5.1. Will try it on Linux tomorrow or on Friday. Some other complaints: as I said, the speed is not that good. I tried to play with the "--inputWidth" and "--inputHeight" parameters.
Can you please check if inputWidth/inputHeight work at all?
The text recognition sample says it supports live video capture from a camera, but if I run it without an image, it complains that the image is not set. From the code I can conclude that it does not support live video capture. Can you modify the scene_text_detection sample to support recognition as well? It would be a very useful demonstration of how to use detection and recognition together.
@vpisarev It may be caused by the wrong input size. For DB_TD500_resnet50.onnx, the image shape should be set to 736x736, which is mentioned in both the tutorial and the sample.
These two parameters are actually prepared for different models, because these models are trained on different benchmarks.
Currently, it only supports a predefined shape. I will try to update it to support dynamic shapes, but I think there will be a drop in accuracy, because the inference shape would not be consistent with the training shape.
This sample and some new tutorials will be pushed this week.
Thank you for contribution!
```cpp
#include <iostream>
#include <fstream>
```
Create a standalone .cpp file for the example code and embed only snippets into the documentation.
You can adapt the sample file below or create a new one.
The PNG lossless format is not necessary here (files are large for real-world images). Try to reduce the image size using the
Sample code for tutorials should go into samples/cpp/tutorial_code/...
Looks good to me!
Please take a look at the comments left above.
samples/dnn/scene_text_detection.cpp
Outdated
```cpp
} else {
    // Open an image file
    CV_Assert(parser.has("inputImage"));
    Mat frame = imread(parser.get<String>("inputImage"));
```
Please use samples::findFile() for a better file-searching experience.
I have updated all imread calls with samples::findFile().
@dkurt Please take a look at the public API proposals.
```cpp
/**
 * @brief Given the @p input frame, create input blob, run net and return recognition result.
 * @param[in] frame: The input image.
 * @param[in] decodeType: The decoding method of translating the network output into string.
```
Which are possible values?
It only supports "CTC-greedy" now, and I will add more decoding methods in the future, such as beam search.
Users should know the possible options, so either it must be documented or we can add an enum.
```cpp
 * @param[in] binThresh: The threshold of the binary map.
 * @param[in] polyThresh: The threshold of text polygons.
 * @param[in] unclipRatio: The unclip ratio of the detected text region.
 * @param[in] maxCandidates: The max number of the output polygons.
```
Why we need to limit this number?
In some difficult cases, the output map of the network is full of small noise, and maxCandidates can avoid wasting inference time.
@alalek
@HannibalAPE Thank you for contribution!
I will update the introduced public APIs and push them here by the end of this week.
Update: Done
@alalek Thank you! I learned a lot with your help.
```diff
@@ -52,7 +52,7 @@ int main(int argc, char** argv)
     // Load vocabulary
     CV_Assert(!vocPath.empty());
     std::ifstream vocFile;
-    vocFile.open(vocPath);
+    vocFile.open(samples::findFile(vocPath));
```
I wonder whether I need to add samples::findFile() whenever I open files? Do I need to change TextRecognitionModel recognizer(modelPath) into TextRecognitionModel recognizer(samples::findFile(modelPath))? Does it slow things down?
samples::findFile() helps to use a file by name (instead of full path) from the <opencv>/samples/data location.
The model file is not stored there (and there are no plans to put it there due to its size).
I pushed the updated public API for text detection / recognition tasks. Please take a look and check the samples / documentation (probably I missed something).
modules/dnn/src/model.cpp
Outdated
```cpp
{
    CV_TRACE_FUNCTION();
    std::vector< std::vector<Point> > contours = detectTextContours(frame);
    confidences = std::vector<float>(contours.size(), 1.0f);
```
Is there any confidence scoring in the DB detection algo?
You can regard the return of contourScore() as a kind of confidence, but it is only used to filter out some bad detection results. It does not have the same meaning as in general object detection algorithms.
@alalek Since the ONNX importer now supports models with dynamic input shapes, I will try to generate new DB models and update them this week. I will also check the samples and tutorials this week.
```cpp
 * @return array with text detection results
 */
CV_WRAP
std::vector<cv::RotatedRect> detect(InputArray frame) const;
```
Actually RotatedRect doesn't work well with perspective transformations (like this one).
Perhaps we need 4 points in the API with a strong order (bl, tl, tr, br, according to targetVertices), which should be used with getPerspectiveTransform() to get more accurate results.
I will try to update the API this week.
I cannot open the above link. Can you share it via Google Drive?
It is similar to side boxes of cube from here.
However, DL models may not work with perspective transformations (EAST output doesn't know anything about that); they just detect rotated text.
Any thoughts?
It depends on the DL model. Some methods can output irregular quadrilaterals.
From my side, the perspective transformation is a temporary replacement, or just one choice for the four-point outputs (rotated boxes and irregular quadrilaterals).
There is a popular and fast text recognition algorithm, ASTER, which adopts a Thin Plate Spline (TPS) transformation in its rectification network (like this).
I am not sure whether the TPS transformation is implemented in OpenCV, maybe not? You can refer to this.
By the way, ASTER is also an algorithm from our lab, and we would be glad to contribute it to OpenCV.
However, there are some things to do before that:
- support the TPS transformation
- update LSTM in modules/dnn/src/onnx/onnx_importer.cpp: we need to set these parameters non-zero, but that is not supported now
- support the GRU layer
- ...
I plan to work on it after this PR.
Updated TextDetectionModel API:
- added quadrangle support with a strong requirement on the order of returned points
- dropped .detectTextContours() from TextDetectionModel_DB (replaced by quadrangles)
Examples:
```bash
example_dnn_scene_text_recognition -mp=path/to/crnn_cs.onnx -i=path/to/an/image -rgb=1 -vp=/path/to/alphabet_94.txt
example_dnn_scene_text_detection -mp=path/to/DB_TD500_resnet50.onnx -i=path/to/an/image -ih=736 -iw=736
```
-mp=path/to/DB_TD500_resnet50.onnx -ih=736 -iw=736
Please check the model parameters here (and above, near the model download links).
This set performs better: -ih=736 -iw=1280 (on IC15/test_images/img_5.jpg).
BTW, it makes sense to put some defaults into the TextDetectionModel_DB ctor.
Actually, DB_TD500_resnet50.onnx and DB_IC15_resnet50.onnx are prepared for different datasets (i.e. TD500 and IC15) respectively, aiming for better performance on each benchmark. The recommended settings do not cover the above case (using DB_TD500_resnet50.onnx on IC15 images).
If needed, I can train a new model on the datasets combined. What is your opinion?
Thanks for the explanation. That makes sense.
I can train a new model
This can be an activity after this PR is merged.
Is there a tutorial on training custom models and converting existing models to onnx?
```cpp
 * Each result is quadrangle's 4 points in this order:
 * - bottom-left
 * - top-left
 * - top-right
 * - bottom-right
```
Added strong requirements for results to avoid "points reordering" in sample code.
```cpp
CV_WRAP
void detect(
        InputArray frame,
        CV_OUT std::vector< std::vector<Point> >& detections,
```
Any thoughts about Point vs Point2f?
I think Point is okay.
```cpp
void setNMSThreshold(float nmsThreshold_) { nmsThreshold = nmsThreshold_; }
float getNMSThreshold() const { return nmsThreshold; }

// TODO: According to article EAST supports quadrangles output: https://arxiv.org/pdf/1704.03155.pdf
```
You can see in Table 3 that the performance of QUAD is worse than RBOX, and the authors do not provide official code or models.
Some good re-implementations of EAST only support RBOX:
TF: https://github.com/argman/EAST
PyTorch: https://github.com/songdejia/EAST
LGTM
@HannibalAPE @bhack Thank you for contribution!
[GSoC] High Level API and Samples for Scene Text Detection and Recognition
* APIs and samples for scene text detection and recognition
* update APIs and tutorial for Text Detection and Recognition
* API updates: (1) put decodeType into struct Voc (2) optimize the post-processing of DB
* sample update: (1) add transformation into scene_text_spotting.cpp (2) modify text_detection.cpp with API update
* update tutorial
* simplify text recognition API, update tutorial
* update impl usage in recognize() and detect()
* dnn: refactoring public API of TextRecognitionModel/TextDetectionModel
* update provided models, update opencv.bib
* dnn: adjust text rectangle angle
* remove points ordering operation in model.cpp
* update gts of DB test in test_model.cpp
* dnn: ensure to keep text rectangle angle - avoid 90/180 degree turns
* dnn(text): use quadrangle result in TextDetectionModel API
* dnn: update Text Detection API (1) keep points' order consistent with (bl, tl, tr, br) in unclip (2) update contourScore with boundingRect
Merge with extra: opencv/opencv_extra#773
High-Level API and Samples for Scene Text Detection and Recognition
This is my project in GSoC 2020: OpenCV Text/Digit detection & recognition projects.
Short Video: https://drive.google.com/file/d/1IlGpRRhPCifC9TRzuhq0_G1P6MkP33BJ/view?usp=sharing
For more information:
https://github.com/HannibalAPE/opencv/blob/text_det_recog_demo/doc/tutorials/dnn/dnn_text_spotting/dnn_text_spotting.markdown
TODO LIST:
- Scene Text Recognition:
  - Samples
  - High-Level API (CRNN)
- Scene Text Detection:
  - Samples
  - High-Level API (DB & EAST)
- Scene Text Spotting:
  - Samples
- Document:
  - Tutorials
Pull Request Readiness Checklist
But the patch to opencv_extra does not have the same branch name; see as follows.