Speechcatcher is a Python utility for interfacing with Speechcatcher ESPnet2 models. You can transcribe media files or use the utility for live transcription. All models are trained end-to-end with punctuation: the ASR model outputs fully punctuated text directly, with no separate punctuation reconstruction step. Speechcatcher runs fast on CPUs and does not need a GPU to transcribe your audio.
Speechcatcher supports German, English, and Spanish ASR. More models and languages will follow - stay tuned!
- 10/12/2025. New in version 0.5.0: English and Spanish model support! Use `-m en_streaming_transformer_l` for English or `-m es_streaming_transformer_l` for Spanish. New experimental native decoder implementation with `--decoder native`. Configurable logging with `--log-level`. Thanks to Wordcab Inc. sponsorship for making this release possible!
- 8/21/2025. New in version 0.4.2: Python 3.13 compatibility; made speechcatcher_server compatible with the new websockets>=14 API (see also https://websockets.readthedocs.io/en/stable/howto/upgrade.html).
- 1/7/2025. New in version 0.4.1: New and improved dynamic endpointing, improved error messages.
- 8/19/2024. New in version 0.4.0: Speechcatcher now has a websocket server (speechcatcher_server) for live transcription.
- 6/25/2024. New in version 0.3.2: Speechcatcher is now Python 3.12 compatible! Under certain conditions, some input files would produce an error on the last segment; this is fixed in this version.
- 12/15/2023. New in version 0.3.1: Support for timestamps on the token level. Speechcatcher now uses its own espnet_streaming_decoder instead of espnet directly, to make dependencies leaner and to enable token timestamps with streaming models. Speechcatcher no longer requires a full ESPnet installation. It also uses a forked version of espnet_model_zoo, so that model downloads are only checked online if a local cached copy isn't available.
Install portaudio and a few other dependencies. On Mac:

```
brew install portaudio ffmpeg git git-lfs
```

On Linux (Ubuntu 24.04):

```
sudo apt-get install portaudio19-dev python3.12-dev ffmpeg libhdf5-dev git git-lfs build-essential
```

On Linux (Fedora):

```
sudo dnf install portaudio-devel python3 python3-pip python3-devel ffmpeg hdf5-devel git git-lfs gcc gcc-c++ make automake autoconf
```
For a system-wide, global installation, simply do:

```
pip3 install git+https://github.com/speechcatcher-asr/speechcatcher
```

If you prefer an installation in a virtual environment, create one first, for example with Python 3.12:

```
virtualenv -p python3.12 speechcatcher_env
```

Note: if you get "-bash: virtualenv: command not found", install virtualenv through pip:

```
sudo pip3 install virtualenv
```

Activate it:

```
source speechcatcher_env/bin/activate
```

If you want a CPU-only version of speechcatcher, install a CPU-only pytorch version first:

```
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```

Then install speechcatcher:

```
pip3 install git+https://github.com/speechcatcher-asr/speechcatcher
```
After you have successfully installed speechcatcher, you can decode any media file with:

```
speechcatcher media_file.mp4
```
By default, this uses the German `de_streaming_transformer_xl` model. To use a different model, use the `-m` flag:

```
# Transcribe an English audio file
speechcatcher -m en_streaming_transformer_l media_file.mp4

# Transcribe a Spanish audio file
speechcatcher -m es_streaming_transformer_l media_file.mp4

# Use a smaller but faster German model
speechcatcher -m de_streaming_transformer_m media_file.mp4
```
To transcribe data live from your microphone:

```
speechcatcher -l
```

or with English:

```
speechcatcher -m en_streaming_transformer_l -l
```

or to launch a Vosk-compatible websocket server for live transcription on ws://localhost:2700/:

```
speechcatcher_server --vosk-output-format --port 2700
```

or with Spanish:

```
speechcatcher_server --model es_streaming_transformer_l --vosk-output-format --port 2700
```
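To try the server out, here is a minimal client sketch using the websockets library. It assumes the server speaks the usual Vosk websocket protocol (binary audio chunks in, JSON partial/final results out) and that the input stream is raw 16 kHz mono s16le audio (see the server's --format option below); the file name and chunk size are just illustrative.

```python
# Minimal client sketch for the Vosk-compatible websocket server.
# Assumptions: server at ws://localhost:2700/, raw 16 kHz mono s16le
# input, and the standard Vosk protocol (binary chunks in, JSON out).
import asyncio
import json

import websockets  # pip3 install websockets

async def transcribe(pcm_path):
    async with websockets.connect("ws://localhost:2700/") as ws:
        # Announce the sample rate of the raw audio we are about to send.
        await ws.send(json.dumps({"config": {"sample_rate": 16000}}))
        with open(pcm_path, "rb") as f:
            while chunk := f.read(8000):  # ~0.25 s of 16 kHz s16le audio
                await ws.send(chunk)
                print(json.loads(await ws.recv()))  # partial results
        await ws.send(json.dumps({"eof": 1}))  # signal end of stream
        print(json.loads(await ws.recv()))  # final result

if __name__ == "__main__":
    asyncio.run(transcribe("speech_16khz_mono.pcm"))
```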
All required model files are downloaded automatically and placed into a ".cache" directory.
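If you want the models stored somewhere else, the --cache-dir option (see the option reference below) overrides the download location; the path here is just an example:

```
speechcatcher --cache-dir /data/speechcatcher_models media_file.mp4
```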
To use speechcatcher in your own Python script, import the speechcatcher package and call the recognize function from a '__main__'-guarded block. Here is a complete example that reads a 16 kHz mono WAV file and outputs the recognized text:
```python
from speechcatcher import speechcatcher
import numpy as np
from scipy.io import wavfile

# You need to run speechcatcher in a '__main__' guarded block:
if __name__ == '__main__':
    # You can select from German, English, or Spanish models.
    # Available models:
    # German (default): 'de_streaming_transformer_m', 'de_streaming_transformer_l', 'de_streaming_transformer_xl'
    # English: 'en_streaming_transformer_m', 'en_streaming_transformer_l'
    # Spanish: 'es_streaming_transformer_m', 'es_streaming_transformer_l'
    short_tag = 'de_streaming_transformer_xl'  # German (default, best accuracy)
    # short_tag = 'en_streaming_transformer_l'  # English (large model)
    # short_tag = 'es_streaming_transformer_l'  # Spanish (large model)
    speech2text = speechcatcher.load_model(speechcatcher.tags[short_tag])

    # Load a 16 kHz mono WAV file; speech is a numpy array of dtype np.int16
    # (16-bit audio with a 16 kHz sampling rate).
    wav_file = 'input.wav'
    rate, audio_data = wavfile.read(wav_file)
    speech = audio_data.astype(np.int16)

    print(f"Sample Rate: {rate} Hz")
    print(f"Audio Shape: {audio_data.shape}")

    complete_text, paragraphs = speechcatcher.recognize(speech2text, speech, rate, quiet=True, progress=False)

    # complete_text is a string with the complete decoded text
    print(complete_text)
    # -> Faust. Eine Tragödie von Johann Wolfgang von Goethe. Zueignung. Ihr naht euch wieder, schwankende Gestalten...

    # paragraphs contains a list of paragraphs with additional information, such as start and end position,
    # tokens and token_timestamps (upper bound, in seconds)
    print(paragraphs)
    # -> [{'start': 0, 'end': 44.51, 'text': 'Faust. Eine Tragödie von Johann Wolfgang von Goethe. Zueignung. Ihr naht euch wieder, schwankende Gestalten...', 'tokens': ['▁F', 'aus', 't', '.', '▁Ein', 'e', '▁Tra', 'g', 'ö', 'di', 'e', '▁von', '▁Jo', 'ha', 'n', 'n', '▁Wo', 'l', 'f', 'gang', '▁von', '▁G', 'o', 'et', 'he', '.', '▁Zu', 'e', 'ig', 'n', 'ung', '.', '▁I', 'hr', '▁', 'na', 'ht', '▁euch', '▁wieder', ',', '▁sch', 'wa', 'n', 'ken', 'de', '▁Ge', 'st', 'al', 'ten', '.', ... ],
    # 'token_timestamps': [1.666, 2.333, 2.333, 3.0, 3.0, 3.0, 3.0, 3.0, 3.666, 4.333, 4.333, 4.333, 5.0, 5.0, 5.0, 5.0, 5.0, 5.666, 5.666, 5.666, 6.333, 6.333, 6.333, 7.0, 7.666, 7.666, 7.666, 7.666, 8.333, 9.666, 9.666, 9.666, 9.666, 9.666, 9.666, 10.333, 10.333, 11.0, 11.666, 11.666, 11.666, 11.666, 11.666, 12.333, 12.333, 12.333, 13.0, 13.666, 14.333, 14.333, 14.333, 14.333, 14.333, 14.333, 14.333, ... ]}, ... ]
```
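Because every paragraph carries start and end times, it is easy to derive subtitles from the output. The following is a small sketch that turns the paragraphs list from the example above into an SRT file; srt_timestamp and write_srt are our own illustrative helpers, not part of the speechcatcher API.

```python
# Sketch: turn the paragraphs returned by speechcatcher.recognize()
# into an SRT subtitle file. These helpers are illustrative and not
# part of the speechcatcher package.
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 00:00:44,510."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def write_srt(paragraphs, path='output.srt'):
    with open(path, 'w', encoding='utf-8') as f:
        for i, p in enumerate(paragraphs, start=1):
            f.write(f"{i}\n{srt_timestamp(p['start'])} --> {srt_timestamp(p['end'])}\n")
            f.write(p['text'].strip() + '\n\n')

write_srt(paragraphs)
```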
| Acoustic model | Training data (hours) | Tuda-de-raw test WER (without LM) | CER |
|---|---|---|---|
| de_streaming_transformer_m | 13k | 11.57 | 3.38 |
| de_streaming_transformer_l | 13k | 9.65 | 2.76 |
| de_streaming_transformer_xl | 26k | 8.5 | 2.44 |
Note: Tuda-de-raw results are based on raw, lowercased Tuda-de test utterances without the normalization step. They may not be directly comparable to regular Tuda-de results.
| Acoustic model | Training data (hours) | Test WER | CER |
|---|---|---|---|
| en_streaming_transformer_m | 35k | TBD | TBD |
| en_streaming_transformer_l | 35k | TBD | TBD |
| Acoustic model | Training data (hours) | Test WER | CER |
|---|---|---|---|
| es_streaming_transformer_m | 35k | TBD | TBD |
| es_streaming_transformer_l | 35k | TBD | TBD |
The full set of options of the speechcatcher command line utility:

```
Speechcatcher utility to decode speech with speechcatcher espnet models.

positional arguments:
  inputfile             Input media file

options:
  -h, --help            show this help message and exit
  -l, --live-transcription
                        Use microphone for live transcription
  -t MAX_RECORD_TIME, --max-record-time MAX_RECORD_TIME
                        Maximum record time in seconds (live transcription).
  -m MODEL, --model MODEL
                        Choose a model. German: de_streaming_transformer_{m,l,xl}. Spanish: es_streaming_transformer_{m,l}.
                        English: en_streaming_transformer_{m,l}. Or provide a HuggingFace model ID or URL.
  -d DEVICE, --device DEVICE
                        Computation device. Either 'cpu' or 'cuda'. Note: Mac M1 / mps support isn't available yet.
  --lang LANGUAGE       Explicitly set language, default is empty = deduce language from model tag
  -b BEAMSIZE, --beamsize BEAMSIZE
                        Beam size for the decoder
  --decoder {native,espnet}
                        Decoder implementation: "espnet" (default) or "native" (experimental)
  --fp16                Use FP16 (half precision) for faster inference. Only supported with the native decoder.
  --disable-bbd         Disable Block Boundary Detection (BBD). Only applies to the native decoder.
                        BBD prevents repetition but may cause early stopping with subword tokenization
                        (default: enabled to match ESPnet).
  --quiet               No partial transcription output when transcribing a media file
  --no-progress         Show no progress bar when transcribing a media file
  --no-exception-on-overflow
                        Do not abort live recognition when encountering an input overflow.
  --save-debug-wav      Save recording to debug.wav, only applicable to live decoding
  --num-threads NUM_THREADS
                        Set number of threads used for intraop parallelism on CPU in pytorch.
  --cache-dir CACHE_DIR
                        Directory where model downloads are cached.
  -n NUM_PROCESSES, --num-processes NUM_PROCESSES
                        Set number of processes used for processing long audio files in parallel
                        (the input file needs to be long enough). If set to -1, use multiprocessing.cpu_count() divided by two.
  --chunk-length CHUNK_LENGTH
                        Number of raw audio samples per chunk for streaming processing (default: 8192)
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set logging level (default: ERROR). Use WARNING to see ESPnet N-best warnings.
```
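The flags compose as you would expect. For example, to transcribe a file with the large English model, the experimental native decoder in half precision, and more verbose logging:

```
speechcatcher -m en_streaming_transformer_l --decoder native --fp16 --log-level INFO media_file.mp4
```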
And the options of the speechcatcher_server command:

```
Speechcatcher WebSocket Server for streaming ASR

options:
  -h, --help            show this help message and exit
  --host HOST           Host for the WebSocket server
  --port PORT           Port for the WebSocket server
  --model {de_streaming_transformer_m,de_streaming_transformer_l,de_streaming_transformer_xl,es_streaming_transformer_m,es_streaming_transformer_l,en_streaming_transformer_m,en_streaming_transformer_l}
                        Model to use for ASR. German: de_streaming_transformer_{m,l,xl}.
                        Spanish: es_streaming_transformer_{m,l}. English: en_streaming_transformer_{m,l}.
  --device {cpu,cuda}   Device to run the ASR model on ('cpu' or 'cuda')
  --beamsize BEAMSIZE   Beam size for the decoder
  --cache-dir CACHE_DIR
                        Directory for model cache
  --format {wav,mp3,mp4,s16le,webm,ogg,acc}
                        Audio format for the input stream
  --pool-size POOL_SIZE
                        Number of speech2text instances to preload
  --vosk-output-format  Enable Vosk-like output format
  --finalize-update-iters FINALIZE_UPDATE_ITERS
                        Number of iterations with no new update from the ASR until an utterance is finalized.
  --max_partial_iters MAX_PARTIAL_ITERS
                        Maximum number of iterations until utterance finalization is forced.
```
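For example, to run the server with the German XL model on port 2700 and four preloaded speech2text instances:

```
speechcatcher_server --model de_streaming_transformer_xl --port 2700 --pool-size 4
```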
Speechcatcher models are trained using Whisper large as a teacher model.
See speechcatcher-data for code and more info on replicating the training process.
If you use speechcatcher models in your research, for now just cite this repository:
```bibtex
@misc{milde2023speechcatcher,
  author = {Milde, Benjamin},
  title = {Speechcatcher},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/speechcatcher-asr/speechcatcher}},
}
```
Speechcatcher was graciously funded by Media Tech Lab by Media Lab Bayern (@media-tech-lab).
