Merge release r1.20.0 to main (#7167)

ericharper · artbataev · karpnv · jubick1337 · commit caff0c921cd5 · 2023-08-07T17:55:50.000-07:00
* update package info
  Signed-off-by: ericharper <complex451@gmail.com>

* Add ASR with TTS Tutorial. Fix enhancer usage. (#6955)
  * Add ASR with TTS Tutorial
  * Fix enhancer usage
  Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* install_bs (#7019)
  Signed-off-by: Nikolay Karpov <karpnv@gmail.com>

* Fix typo and branch in tutorial (#7048)
  Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* fix syntax error introduced in PR-7079 (#7102)
  * fix syntax error introduced in PR-7079
    Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
  * fixes for pr review
    Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
  ---------
  Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>

* fix links for TN (#7117)
  Signed-off-by: Evelina <ebakhturina@nvidia.com>

* update branch (#7135)
  Signed-off-by: ericharper <complex451@gmail.com>

* Fixed main and merging this to r1.20 (#7127)
  * Fixed main and merging this to r1.20
    Signed-off-by: Taejin Park <tango4j@gmail.com>
  * Update vad_utils.py
    Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
  ---------
  Signed-off-by: Taejin Park <tango4j@gmail.com>
  Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
  Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>

* update branch
  Signed-off-by: ericharper <complex451@gmail.com>

* fix version
  Signed-off-by: ericharper <complex451@gmail.com>

* resolve conflict the other way
  Signed-off-by: ericharper <complex451@gmail.com>

* keep both
  Signed-off-by: ericharper <complex451@gmail.com>

* revert keep both
  Signed-off-by: ericharper <complex451@gmail.com>

---------
Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>
Signed-off-by: Nikolay Karpov <karpnv@gmail.com>
Signed-off-by: Alexandra Antonova <antonova_sasha@list.ru>
Signed-off-by: Evelina <ebakhturina@nvidia.com>
Signed-off-by: Taejin Park <tango4j@gmail.com>
Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Vladimir Bataev <vbataev@nvidia.com>
Co-authored-by: Nikolay Karpov <karpnv@gmail.com>
Co-authored-by: bene-ges <antonova_sasha@list.ru>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Co-authored-by: Taejin Park <tango4j@gmail.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Signed-off-by: jubick1337 <mattyson.so@gmail.com>
diff --git a/Dockerfile b/Dockerfile
@@ -94,7 +94,7 @@ COPY . .
 
 # start building the final container
 FROM nemo-deps as nemo
-ARG NEMO_VERSION=1.20.0
+ARG NEMO_VERSION=1.21.0
 
 # Check that NEMO_VERSION is set. Build will fail without this. Expose NEMO and base container
 # version information as runtime environment variable for introspection purposes
diff --git a/nemo/collections/asr/parts/utils/vad_utils.py b/nemo/collections/asr/parts/utils/vad_utils.py
@@ -732,7 +732,7 @@ def generate_vad_segment_table(
     vad_pred_filepath_list = [os.path.join(vad_pred_dir, x) for x in os.listdir(vad_pred_dir) if x.endswith(suffixes)]
 
     if not out_dir:
-        out_dir_name = "seg_output_"
+        out_dir_name = "seg_output"
         for key in postprocessing_params:
             out_dir_name = out_dir_name + "-" + str(key) + str(postprocessing_params[key])
 
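Note on the `vad_utils.py` hunk above: the loop always prepends `"-"` before each parameter, so the old prefix `"seg_output_"` appears to have produced directory names containing `_-`. The sketch below reproduces only the string-building lines shown in the hunk. The helper name `build_out_dir_name` and the parameter values are hypothetical, introduced for illustration, and are not part of NeMo.

```python
def build_out_dir_name(prefix, postprocessing_params):
    """Mirror the loop from the hunk: append "-<key><value>" for every parameter."""
    out_dir_name = prefix
    for key in postprocessing_params:
        out_dir_name = out_dir_name + "-" + str(key) + str(postprocessing_params[key])
    return out_dir_name


# Hypothetical parameter values, chosen only to show the naming pattern.
params = {"onset": 0.5, "offset": 0.3}

old_name = build_out_dir_name("seg_output_", params)  # prefix before this commit
new_name = build_out_dir_name("seg_output", params)   # prefix after this commit

print(old_name)  # seg_output_-onset0.5-offset0.3
print(new_name)  # seg_output-onset0.5-offset0.3
```

Under these assumptions, the only difference is the stray underscore before the first hyphen. Any script that looks for previously generated `seg_output_-...` directories would likely need to be updated to the new name.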
diff --git a/nemo/package_info.py b/nemo/package_info.py
@@ -14,7 +14,7 @@
 
 
 MAJOR = 1
-MINOR = 20
+MINOR = 21
 PATCH = 0
 PRE_RELEASE = 'rc0'
 
diff --git a/tutorials/asr/Offline_ASR.ipynb b/tutorials/asr/Offline_ASR.ipynb
@@ -655,4 +655,4 @@
       "outputs": []
     }
   ]
-}
+}
diff --git a/tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb b/tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb
@@ -934,7 +934,9 @@
         "id": "9T3CZcCAmxCz"
       },
       "source": [
-        "Now we have a folder with generated audios `audio/*.wav` and a nemo manifest with json records like `{\"audio_filepath\": \"audio/0.wav\", \"text\": \"no renal auditory or vestibular toxicity was observed\", \"orig_text\": \"No renal, auditory, or vestibular toxicity was observed.\"}`."
+        "Now we have a folder with generated audios `audio/*.wav` and a nemo manifest with json records like `{\"audio_filepath\": \"audio/0.wav\", \"text\": \"no renal auditory or vestibular toxicity was observed\", \"orig_text\": \"No renal, auditory, or vestibular toxicity was observed.\"}`.",
+        "\n",
+        "Note that TTS model may mispronounce some unknown words, for example, abbreviations like `tRNAs`."
       ]
     },
     {
diff --git a/tutorials/tools/CTC_Segmentation_Tutorial.ipynb b/tutorials/tools/CTC_Segmentation_Tutorial.ipynb
@@ -280,7 +280,7 @@
         "* `max_length` argument - max number of words in a segment for alignment (used only if there are no punctuation marks present in the original text. Long non-speech segments are better for segments split and are more likely to co-occur with punctuation marks. Random text split could deteriorate the quality of the alignment.\n",
         "* out-of-vocabulary words will be removed based on pre-trained ASR model vocabulary, and the text will be changed to lowercase \n",
         "* sentences for alignment with the original punctuation and capitalization will be stored under  `$OUTPUT_DIR/processed/*_with_punct.txt`\n",
-        "* numbers will be converted from written to their spoken form with `num2words` package. For English, it's recommended to use NeMo normalization tool use `--use_nemo_normalization` argument (not supported if running this segmentation tutorial in Colab, see the text normalization tutorial: [`https://github.com/NVIDIA/NeMo-text-processing/blob/r1.19.0/tutorials/Text_(Inverse)_Normalization.ipynb`](https://colab.research.google.com/github/NVIDIA/NeMo-text-processing/blob/r1.19.0/tutorials/Text_(Inverse)_Normalization.ipynb) for more details). Even `num2words` normalization is usually enough for proper segmentation. However, it does not take audio into account. NeMo supports audio-based normalization for English, German and Russian languages that can be applied to the segmented data as a post-processing step. Audio-based normalization produces multiple normalization options. For example, `901` could be normalized as `nine zero one` or `nine hundred and one`. The audio-based normalization chooses the best match among the possible normalization options and the transcript based on the character error rate. See [https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py](https://github.com/NVIDIA/NeMo-text-processing/blob/r1.19.0/nemo_text_processing/text_normalization/normalize_with_audio.py) for more details.\n",
+        "* numbers will be converted from written to their spoken form with `num2words` package. For English, it's recommended to use NeMo normalization tool use `--use_nemo_normalization` argument (not supported if running this segmentation tutorial in Colab, see the text normalization tutorial: [`https://github.com/NVIDIA/NeMo-text-processing/blob/main/tutorials/Text_(Inverse)_Normalization.ipynb`](https://colab.research.google.com/github/NVIDIA/NeMo-text-processing/blob/main/tutorials/Text_(Inverse)_Normalization.ipynb) for more details). Even `num2words` normalization is usually enough for proper segmentation. However, it does not take audio into account. NeMo supports audio-based normalization for English, German and Russian languages that can be applied to the segmented data as a post-processing step. Audio-based normalization produces multiple normalization options. For example, `901` could be normalized as `nine zero one` or `nine hundred and one`. The audio-based normalization chooses the best match among the possible normalization options and the transcript based on the character error rate. See [https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py](https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py) for more details.\n",
         "\n",
         "### Audio preprocessing:\n",
         "* non '.wav' audio files will be converted to `.wav` format\n",
