speech recognition sample #20291

spazewalker · 2021-06-21T18:09:24Z

GSoC 2021: Speech Recognition using OpenCV AudioIO

Project details

Mentor: @l-bat
Project proposal: https://summerofcode.withgoogle.com/projects/#5148521881141248

PR details

Creating ONNX model

NVIDIA trained Jasper using FP16 precision, but OpenCV needs FP32, so the ONNX model's graph has to be changed. This is done with the script convert_jasper_to_FP32.py. The pre-trained converted ONNX model can be found here. The original pre-trained model by NVIDIA can be found here.
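The conversion script is not reproduced in this thread. As a rough illustration of one part of such a conversion, the sketch below widens a buffer of FP16 values to FP32 using only the Python standard library. It assumes the weights are available as a little-endian byte buffer (the layout ONNX uses for `raw_data`). The function name `widen_fp16_to_fp32` is hypothetical and is not taken from convert_jasper_to_FP32.py. A full conversion would also have to update tensor data types and any cast nodes in the graph, which is not shown here.

```python
import struct


def widen_fp16_to_fp32(raw_fp16: bytes) -> bytes:
    """Re-encode a little-endian FP16 buffer as a little-endian FP32 buffer.

    Hypothetical helper for illustration only; not part of the PR.
    """
    if len(raw_fp16) % 2 != 0:
        raise ValueError("FP16 buffer length must be a multiple of 2 bytes")
    count = len(raw_fp16) // 2
    # 'e' is IEEE 754 half precision, 'f' is single precision.
    values = struct.unpack(f"<{count}e", raw_fp16)
    return struct.pack(f"<{count}f", *values)


# Example: three FP16 weights (6 bytes) become three FP32 weights (12 bytes).
weights_fp16 = struct.pack("<3e", 1.0, 0.5, -2.0)
weights_fp32 = widen_fp16_to_fp32(weights_fp16)
```

Every FP16 value is exactly representable in FP32, so this direction of the conversion loses no precision.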

Usage

usage: speech_recognition.py [-h] --input_audio INPUT_AUDIO [--show_spectrogram] [--model MODEL] [--output OUTPUT] [--backend {0,2,3}] [--target {0,1,2}]
This script runs Jasper Speech recognition model
optional arguments:
  -h, --help            show this help message and exit
  --input_audio INPUT_AUDIO
                        Path to input audio file. OR Path to a txt file with relative path to multiple audio files in different lines (default: None)
  --show_spectrogram    Whether to show a spectrogram of the input audio. (default: False)
  --model MODEL         Path to the onnx file of Jasper. default="jasper.onnx" (default: jasper.onnx)
  --output OUTPUT       Path to file where recognized audio transcript must be saved. Leave this to print on console. (default: None)
  --backend {0,2,3}     Select a computation backend: 0: automatically (by default) 2: OpenVINO Inference Engine 3: OpenCV Implementation (default: 0)
  --target {0,1,2}      Select a target device: 0: CPU target (by default) 1: OpenCL 2: OpenCL FP16 (default: 0)
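The command line above could be declared roughly as follows. This is a minimal sketch, not the code from the PR: the helper names `build_parser`, `parse_listing` and `resolve_inputs` are hypothetical, the backend and target values are plain integers copied from the usage text, and resolving listed paths relative to the directory of the txt file is an assumption.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Declare the options shown in the usage text (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="This script runs Jasper Speech recognition model",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("--input_audio", type=str, required=True,
                        help="Path to an audio file, or to a txt file listing audio files.")
    parser.add_argument("--show_spectrogram", action="store_true",
                        help="Whether to show a spectrogram of the input audio.")
    parser.add_argument("--model", type=str, default="jasper.onnx",
                        help="Path to the onnx file of Jasper.")
    parser.add_argument("--output", type=str,
                        help="File to save the transcript to; omit to print on console.")
    # Integers mirror the usage text; the diff in this thread uses cv.dnn constants.
    parser.add_argument("--backend", type=int, choices=(0, 2, 3), default=0,
                        help="Computation backend.")
    parser.add_argument("--target", type=int, choices=(0, 1, 2), default=0,
                        help="Target device.")
    return parser


def parse_listing(text: str, base_dir: str = "") -> list:
    """Turn the contents of a listing file into paths, one per non-blank line."""
    return [os.path.join(base_dir, line.strip())
            for line in text.splitlines() if line.strip()]


def resolve_inputs(input_audio: str) -> list:
    """Return the audio paths named by --input_audio."""
    if not input_audio.lower().endswith(".txt"):
        return [input_audio]
    with open(input_audio, encoding="utf-8") as listing:
        # Assumption: listed paths are relative to the listing file's directory.
        return parse_listing(listing.read(), os.path.dirname(input_audio))
```

With this sketch, `build_parser().parse_args(["--input_audio", "clip.wav"])` yields the defaults shown in the usage text, and `resolve_inputs` returns a one-element list for a single audio file.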

Todo

Use AudioIO instead of soundfile.
Check performance.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

I agree to contribute to the project under Apache 2 License.
To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
The PR is proposed to proper branch
There is reference to original bug report and related work
There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
The feature is well documented and sample code can be built with the project CMake

force_builders=Docs

samples/dnn/speech_recognition.py

l-bat · 2021-06-23T07:26:14Z

Please add description at the beginning of sample as in

opencv/samples/dnn/human_parsing.py

Lines 2 to 40 in b74aae6

    
    '''
    You can download the converted pb model from https://www.dropbox.com/s/qag9vzambhhkvxr/lip_jppnet_384.pb?dl=0
    or convert the model yourself.
    Follow these steps if you want to convert the original model yourself:
        To get original .meta pre-trained model download https://drive.google.com/file/d/1BFVXgeln-bek8TCbRjN6utPAgRE0LJZg/view
        For correct convert .meta to .pb model download original repository https://github.com/Engineering-Course/LIP_JPPNet
        Change script evaluate_parsing_JPPNet-s2.py for human parsing
        1. Remove preprocessing to create image_batch_origin:
            with tf.name_scope("create_inputs"):
            ...
        Add
            image_batch_origin = tf.placeholder(tf.float32, shape=(2, None, None, 3), name='input')
        2. Create input
            image = cv2.imread(path/to/image)
            image_rev = np.flip(image, axis=1)
            input = np.stack([image, image_rev], axis=0)
        3. Hardcode image_h and image_w shapes to determine output shapes.
           We use default INPUT_SIZE = (384, 384) from evaluate_parsing_JPPNet-s2.py.
            parsing_out1 = tf.reduce_mean(tf.stack([tf.image.resize_images(parsing_out1_100, INPUT_SIZE),
                                                    tf.image.resize_images(parsing_out1_075, INPUT_SIZE),
                                                    tf.image.resize_images(parsing_out1_125, INPUT_SIZE)]), axis=0)
           Do similarly with parsing_out2, parsing_out3
        4. Remove postprocessing. Last net operation:
            raw_output = tf.reduce_mean(tf.stack([parsing_out1, parsing_out2, parsing_out3]), axis=0)
           Change:
            parsing_ = sess.run(raw_output, feed_dict={'input:0': input})
        5. To save model after sess.run(...) add:
            input_graph_def = tf.get_default_graph().as_graph_def()
            output_node = "Mean_3"
            output_graph_def = tf.graph_util.convert_variables_to_constants(sess, input_graph_def, output_node)
            output_graph = "LIP_JPPNet.pb"
            with tf.gfile.GFile(output_graph, "wb") as f:
                f.write(output_graph_def.SerializeToString())'
    '''

How to get FP32 ONNX model from pre-trained model
Provide link to the converted model

samples/dnn/speech_recognition.py

l-bat · 2021-06-28T06:37:42Z

samples/dnn/speech_recognition.py

+if __name__ == '__main__':
+
+    # Computation backends supported by layers
+    backends = (cv.dnn.DNN_BACKEND_DEFAULT, cv.dnn.DNN_BACKEND_OPENCV)


Could you try forward net with OpenVINO (cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)?

spazewalker · 2021-06-28

I tried. It gave this error: error: (-213:The function/feature is not implemented) Unknown backend identifier in function 'cv::dnn::dnn4_v20210301::wrapMat'

samples/dnn/speech_recognition.py

l-bat · 2021-06-29T08:24:31Z

samples/dnn/speech_recognition.py

+
+    parser = argparse.ArgumentParser(description='This script runs Jasper Speech recognition model',
+                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+    parser.add_argument('--input_audio', type=str, help='Path to input audio file.')


Do we need to specify supported audio formats?

I think we need to add required=True

spazewalker · 2021-06-29

Eventually we need to use AudioIO, so should I list the formats supported there? I suppose mp3, wav and mp4 are supported.
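If the sample were to validate its input by file extension, a check along these lines would be one option. This is only a sketch: the extension set is taken from the guess in the comment above and has not been verified against AudioIO, and the names `ASSUMED_AUDIO_EXTENSIONS` and `has_assumed_audio_extension` are hypothetical.

```python
import os

# Assumption: taken from the guess in the discussion, not from a verified list.
ASSUMED_AUDIO_EXTENSIONS = frozenset({".mp3", ".wav", ".mp4"})


def has_assumed_audio_extension(path: str) -> bool:
    """Return True if the file extension is in the assumed set (case-insensitive)."""
    extension = os.path.splitext(path)[1].lower()
    return extension in ASSUMED_AUDIO_EXTENSIONS
```

An extension check only filters obvious mistakes; whether a file can actually be decoded depends on the audio backend in use.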

samples/dnn/speech_recognition.py

alalek · 2021-08-22T13:25:38Z

@spazewalker Could you please check if approach from #20558 works for this case?

spazewalker · 2021-08-22T14:35:37Z

@spazewalker Could you please check if approach from #20558 works for this case?

@alalek Just tested it. It works for this case.

alalek · 2021-09-27T20:53:00Z

Let's merge it with the soundfile workaround.

@spazewalker Please mark the PR as "Ready for review" if it is ready for merging.

alalek · 2021-10-02T23:07:56Z

"Ready for review"

@spazewalker Ping. Or let us know if you want to improve something else.

spazewalker · 2021-10-03T03:55:23Z

@alalek I'm actually waiting for #19721 to get merged. I think videoio would replace the soundfile.

alalek

Thank you 👍

speech recognition sample

* speech recognition sample added.(initial commit)
* fixed typos, removed plt
* trailing whitespaces removed
* masking removed and using opencv for displaying spectrogram
* description added
* requested changes and add opencl fp16 target
* parenthesis and halide removed
* workaround 3d matrix issue
* handle multi channel audio
  support for multiple files at once
* suggested changes
  fix whitespaces

speech recognition sample added.(initial commit)

8f3f246

l-bat added the GSoC label Jun 21, 2021

l-bat reviewed Jun 21, 2021

fixed typos, removed plt

cc5c445

spazewalker changed the title ~~speech recognition sample added.(initial commit)~~ speech recognition sample Jun 22, 2021

trailing whitespaces removed

0825f1f

l-bat added the category: dnn label Jun 23, 2021

masking removed and using opencv for displaying spectrogram

b74aae6

description added

28a3269

l-bat reviewed Jun 28, 2021

requested changes and add opencl fp16 target

e42d86c

l-bat reviewed Jun 29, 2021

l-bat reviewed Jun 29, 2021

parenthesis and halide removed

010cc97

l-bat reviewed Jun 29, 2021

spazewalker added 2 commits July 14, 2021 01:01

Merge branch 'opencv:master' into master

b5a8d00

Merge branch 'opencv:master' into master

d57bde7

l-bat reviewed Aug 12, 2021

l-bat reviewed Aug 20, 2021

spazewalker marked this pull request as ready for review August 22, 2021 16:57

spazewalker marked this pull request as draft August 22, 2021 16:58

spazewalker added 3 commits August 22, 2021 22:35

workaround 3d matrix issue

5e9f115

handle multi channel audio

62099d5

support for multiple files at once

suggested changes

5ddfa7d

Co-authored-by: Liubov Batanina <piccione-mail@yandex.ru> fix whitespaces

alalek approved these changes Oct 3, 2021

spazewalker marked this pull request as ready for review October 3, 2021 06:33

alalek merged commit 4938765 into opencv:master Oct 4, 2021

alalek mentioned this pull request Oct 15, 2021

(5.x) Merge 4.x #20886

Merged
