Commit 5d0b064

Add speech recognition C++ example (#1538)
1 parent fad19fa
14 files changed: +774 −1 lines

.gitmodules

Lines changed: 3 additions & 0 deletions

```diff
@@ -2,3 +2,6 @@
 	path = third_party/kaldi/submodule
 	url = https://github.com/kaldi-asr/kaldi
 	ignore = dirty
+[submodule "examples/libtorchaudio/simplectc"]
+	path = examples/libtorchaudio/simplectc
+	url = https://github.com/mthrok/ctcdecode
```
examples/libtorchaudio/.gitignore

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,3 +1,4 @@
 build
 data/output.wav
-data/pipeline.zip
+*.zip
+output
```
examples/libtorchaudio/CMakeLists.txt

Lines changed: 2 additions & 0 deletions

```diff
@@ -14,4 +14,6 @@ message("libtorchaudio CMakeLists: ${TORCH_CXX_FLAGS}")
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

 add_subdirectory(../.. libtorchaudio)
+add_subdirectory(simplectc)
 add_subdirectory(augmentation)
+add_subdirectory(speech_recognition)
```
examples/libtorchaudio/README.md

Lines changed: 2 additions & 0 deletions

````diff
@@ -1,6 +1,7 @@
 # Libtorchaudio Examples

 * [Augmentation](./augmentation)
+* [Speech Recognition with wav2vec2.0](./speech_recognition)

 ## Build

@@ -14,6 +15,7 @@ It is currently not distributed, and it will be built alongside with the applica
 The following commands will build `libtorchaudio` and applications.

 ```bash
+git submodule update
 mkdir build
 cd build
 cmake -GNinja \
````
examples/libtorchaudio/build.sh

Lines changed: 1 addition & 0 deletions

```diff
@@ -8,6 +8,7 @@ build_dir="${this_dir}/build"
 mkdir -p "${build_dir}"
 cd "${build_dir}"

+git submodule update
 cmake -GNinja \
 -DCMAKE_PREFIX_PATH="$(python -c 'import torch;print(torch.utils.cmake_prefix_path)')" \
 -DBUILD_SOX=ON \
```

examples/libtorchaudio/simplectc

Submodule simplectc added at b1a30d7
Lines changed: 6 additions & 0 deletions

```cmake
add_executable(transcribe transcribe.cpp)
add_executable(transcribe_list transcribe_list.cpp)
target_link_libraries(transcribe "${TORCH_LIBRARIES}" "${TORCHAUDIO_LIBRARY}" "${CTCDECODE_LIBRARY}")
target_link_libraries(transcribe_list "${TORCH_LIBRARIES}" "${TORCHAUDIO_LIBRARY}" "${CTCDECODE_LIBRARY}")
set_property(TARGET transcribe PROPERTY CXX_STANDARD 14)
set_property(TARGET transcribe_list PROPERTY CXX_STANDARD 14)
```
Lines changed: 187 additions & 0 deletions

# Speech Recognition with wav2vec2.0

This example demonstrates how you can use torchaudio's I/O features and models to run speech recognition in a C++ application.
**NOTE**
This example uses the `"sox_io"` backend for loading audio, which does not work on Windows. To make it work on
Windows, you need to replace the part that loads the audio and converts it to a Tensor object.
## 1. Create a transcription pipeline TorchScript file

We will create a TorchScript file that performs the following steps:

1. Load audio from a file.
1. Pass the audio to an encoder, which produces a sequence of probability distributions over the labels.
1. Pass the encoder output to a decoder, which generates transcripts.

To build the encoder, we borrow pre-trained weights published by `fairseq` and/or Hugging Face Transformers, then convert them to `torchaudio`'s format, which supports TorchScript.
### 1.1. From `fairseq`

You can download pre-trained weights from the [`fairseq` repository](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec). Here, we will use the `Base / 960h` model. You also need to download [the letter dictionary file](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#evaluating-a-ctc-model).

For the decoder part, we use [simple_ctc](https://github.com/mthrok/ctcdecode), which also supports TorchScript.
```bash
mkdir -p pipeline-fairseq
python build_pipeline_from_fairseq.py \
    --model-file "wav2vec_small_960.pt" \
    --dict-dir <DIRECTORY_WHERE_dict.ltr.txt_IS_FOUND> \
    --output-path "./pipeline-fairseq/"
```

The above command should create the following TorchScript object files in the output directory.

```
decoder.zip  encoder.zip  loader.zip
```

* `loader.zip` loads an audio file and generates a waveform Tensor.
* `encoder.zip` receives the waveform Tensor and generates a sequence of probability distributions over the labels.
* `decoder.zip` receives the probability distributions over the labels and generates a transcript.
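The three modules compose into a single pipeline. Here is a minimal Python sketch of the data flow; the `run_pipeline` helper is hypothetical (with the real artifacts, each module would be obtained via `torch.jit.load`):

```python
# Hypothetical helper sketching how the three TorchScript modules chain.
# With the real artifacts: loader = torch.jit.load("pipeline-fairseq/loader.zip"), etc.
def run_pipeline(loader, encoder, decoder, audio_path):
    waveform = loader(audio_path)   # audio file path -> waveform Tensor
    emission = encoder(waveform)    # waveform -> probability distributions over labels
    return decoder(emission)        # probability distributions -> transcript
```

This is the same flow that the C++ applications in this example implement.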
### 1.2. From Hugging Face Transformers

[Hugging Face Transformers](https://huggingface.co/transformers/index.html) and the [Hugging Face Model Hub](https://huggingface.co/models) provide `wav2vec2.0` models fine-tuned on a variety of datasets and languages.

We can also import a model published on the Hugging Face Hub and run it in our C++ application.
In the following example, we will try the German model ([facebook/wav2vec2-large-xlsr-53-german](https://huggingface.co/facebook/wav2vec2-large-xlsr-53-german/tree/main)) on the [VoxForge German dataset](http://www.voxforge.org/de/downloads).
```bash
mkdir -p pipeline-hf
python build_pipeline_from_huggingface_transformers.py \
    --model facebook/wav2vec2-large-xlsr-53-german \
    --output-path ./pipeline-hf/
```

The resulting TorchScript object files should be the same as in the `fairseq` example.

## 2. Build the application

Please refer to [the top-level README.md](../README.md).
## 3. Run the application

Now we run the C++ application [`transcribe`](./transcribe.cpp) with the TorchScript objects we created in Step 1.1 and an input audio file.

```bash
../build/speech_recognition/transcribe ./pipeline-fairseq ../data/input.wav
```
This will output something like the following.

```
Loading module from: ./pipeline/loader.zip
Loading module from: ./pipeline/encoder.zip
Loading module from: ./pipeline/decoder.zip
Loading the audio
Running inference
Generating the transcription
I HAD THAT CURIOSITY BESIDE ME AT THIS MOMENT
Done.
```

## 4. Evaluate the pipeline on the LibriSpeech dataset

Let's evaluate the word error rate (WER) of this application using the [LibriSpeech dataset](https://www.openslr.org/12).
### 4.1. Create a list of audio paths

To keep our C++ code simple, we will first parse the LibriSpeech dataset to get a list of audio paths.

```bash
python parse_librispeech.py <PATH_TO_YOUR_DATASET>/LibriSpeech/test-clean ./flist.txt
```

The list should look like the following:

```bash
head flist.txt

1089-134691-0000 /LibriSpeech/test-clean/1089/134691/1089-134691-0000.flac HE COULD WAIT NO LONGER
```
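Each line carries a whitespace-separated utterance ID, audio path, and reference transcript. A minimal Python sketch of parsing one line (the helper name and the exact field layout are assumptions inferred from the sample above):

```python
def parse_flist_line(line):
    # Assumed layout: "<utterance-id> <audio-path> <TRANSCRIPT ...>"
    utt_id, path, *words = line.split()
    return utt_id, path, " ".join(words)

sample = ("1089-134691-0000 "
          "/LibriSpeech/test-clean/1089/134691/1089-134691-0000.flac "
          "HE COULD WAIT NO LONGER")
print(parse_flist_line(sample)[2])  # -> HE COULD WAIT NO LONGER
```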
### 4.2. Run the transcription

[`transcribe_list`](./transcribe_list.cpp) processes the input file list, feeding the audio paths one by one to the pipeline, then generates a reference file and a hypothesis file.

```bash
../build/speech_recognition/transcribe_list ./pipeline-fairseq ./flist.txt <OUTPUT_DIR>
```
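The reference and hypothesis files follow sclite's `trn` convention: one utterance per line, transcript first, utterance ID in parentheses last. A small illustrative sketch of producing such a line (the helper is ours, not part of the example code):

```python
def to_trn_line(utt_id, transcript):
    # sclite "trn" format: "<TRANSCRIPT> (<utterance-id>)"
    return f"{transcript} ({utt_id})"

print(to_trn_line("1089-134691-0000", "HE COULD WAIT NO LONGER"))
# -> HE COULD WAIT NO LONGER (1089-134691-0000)
```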
### 4.3. Score WER

You need `sclite` for this step. You can download the code from the [SCTK repository](https://github.com/usnistgov/SCTK).

```bash
# in the output directory
sclite -r ref.trn -h hyp.trn -i wsj -o pralign -o sum
```

The WER can be found in the resulting `hyp.trn.sys` file. Check the row that starts with `Sum/Avg`; the first column of its third block is `100 - WER`.
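For reference, WER is derived from the substitution, deletion, and insertion counts relative to the number of reference words; a small sketch (the helper name is ours):

```python
def wer(num_correct, num_sub, num_del, num_ins):
    # The reference length is correct + substituted + deleted words.
    num_ref = num_correct + num_sub + num_del
    return 100.0 * (num_sub + num_del + num_ins) / num_ref

print(wer(7, 1, 0, 0))  # -> 12.5
```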
In our test, we got the following results.

| model | Fine Tune | test-clean | test-other |
|:-----------------------------------------:|----------:|:----------:|:----------:|
| Base<br/>`wav2vec_small_960` | 960h | 3.1 | 7.7 |
| Large<br/>`wav2vec_big_960` | 960h | 2.6 | 5.9 |
| Large (LV-60)<br/>`wav2vec2_vox_960h_new` | 960h | 2.9 | 6.2 |
| Large (LV-60) + Self Training<br/>`wav2vec_vox_960h_pl` | 960h | 1.9 | 4.5 |
You can also check the `hyp.trn.pra` file to see what errors were made.

```
id: (3528-168669-0005)
Scores: (#C #S #D #I) 7 1 0 0
REF:  there is a stone to be RAISED heavy
HYP:  there is a stone to be RACED  heavy
Eval:                        S
```
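The `#C #S #D #I` counts come from a minimum-edit-distance alignment of the reference and hypothesis word sequences. A self-contained Python sketch of such an alignment (a simplified stand-in for what sclite does internally, not its actual implementation):

```python
def align_counts(ref, hyp):
    """Count (#C, #S, #D, #I) via a Levenshtein alignment over word lists."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimal edit cost aligning ref[:i] with hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(diag, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrack, classifying each aligned position as C/S/D/I.
    c = s = d = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                c += 1  # correct word
            else:
                s += 1  # substitution
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            d += 1      # deletion (word in REF missing from HYP)
            i -= 1
        else:
            ins += 1    # insertion (extra word in HYP)
            j -= 1
    return c, s, d, ins

ref = "there is a stone to be raised heavy".split()
hyp = "there is a stone to be raced heavy".split()
print(align_counts(ref, hyp))  # -> (7, 1, 0, 0)
```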
## 5. Evaluate the pipeline on the VoxForge dataset

Now we use the pipeline we created in Step 1.2, this time with a German-language dataset from VoxForge.

### 5.1. Create a list of audio paths

Download an archive from http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/ and extract it to your local file system, then run the following to generate the file list.

```bash
python parse_voxforge.py <PATH_TO_YOUR_DATASET> > ./flist-de.txt
```
The list should look like the following:

```bash
head flist-de.txt
de5-001 /datasets/voxforge/de/guenter-20140214-afn/wav/de5-001.wav ES SOLL ETWA FÜNFZIGTAUSEND VERSCHIEDENE SORTEN GEBEN
```
### 5.2. Run the application and score WER

This process is the same as in the LibriSpeech example; we just use the pipeline with the German model and the file list of the German dataset. Refer to the corresponding section in the LibriSpeech evaluation.

```bash
../build/speech_recognition/transcribe_list ./pipeline-hf ./flist-de.txt <OUTPUT_DIR>
```
Then

```bash
# in the output directory
sclite -r ref.trn -h hyp.trn -i wsj -o pralign -o sum
```

You can find the details of the evaluation result in the `hyp.trn.pra` file.
```
id: (guenter-20140214-afn/mfc/de5-012)
Scores: (#C #S #D #I) 4 1 1 0
REF:  die ausgaben kÖnnen gigantisch STEIGE N
HYP:  die ausgaben kÖnnen gigantisch ****** STEIGEN
Eval:                                D      S
```
