Commit 09f39c3

erastorgueva-nv authored and jubick1337 committed
NFA subtitle file config - specify colors and vertical alignment (#7160)
* allow specifying colors of text in ASS subtitle file
  Signed-off-by: Elena Rastorgueva <[email protected]>
* specify vertical_alignment instead of marginv in ass_file_config
  Signed-off-by: Elena Rastorgueva <[email protected]>
* add documentation of CTMFileConfig and ASSFileConfig to NFA README
  Signed-off-by: Elena Rastorgueva <[email protected]>
---------
Signed-off-by: Elena Rastorgueva <[email protected]>
Signed-off-by: jubick1337 <[email protected]>
1 parent 2807c2b commit 09f39c3

3 files changed: +83 -16 lines changed

tools/nemo_forced_aligner/README.md

Lines changed: 14 additions & 0 deletions
@@ -82,12 +82,26 @@ Each CTM file will contain lines of the format:
 `<utt_id> 1 <start time in seconds> <duration in seconds> <text, i.e. token/word/segment>`.
 Note the second item in the line (the 'channel ID', which is required by the CTM file format) is always 1, as NFA operates on single channel audio.

+### `CTMFileConfig` parameters
+The `CTMFileConfig` (which is passed into the main NFA config) has the following parameters:
+* `remove_blank_tokens`: bool (default `False`) to specify if the token-level CTM files should have the timestamps of the blank tokens removed.
+* `minimum_timestamp_duration`: float (default `0`) to specify the minimum duration that will be applied to all timestamps. If any line in the CTM has a duration lower than this, it will be enlarged from the middle outwards until it meets the `minimum_timestamp_duration`, or reaches the beginning or end of the audio file. Note that using a non-zero value may cause timestamps to overlap.
+
 # Output ASS file format
 NFA will produce the following ASS files, which you can use to generate subtitle videos:
 * ASS files with token-level highlighting will be at `<output_dir>/ass/tokens/<utt_id>.ass`,
 * ASS files with word-level highlighting will be at `<output_dir>/ass/words/<utt_id>.ass`.
 All words belonging to the same segment will appear at the same time in the subtitles generated with the ASS files. If you find that your segments are not the right size, you can set `ass_file_config.resegment_text_to_fill_space=true` and specify a value for `ass_file_config.max_lines_per_segment`.

+### `ASSFileConfig` parameters
+The `ASSFileConfig` (which is passed into the main NFA config) has the following parameters:
+* `fontsize`: int (default value `20`) which will be the font size of the text.
+* `vertical_alignment`: string (default value `center`) to specify the vertical alignment of the text. Can be one of `center`, `top`, `bottom`.
+* `resegment_text_to_fill_space`: bool (default value `False`). If `True`, the text will be resegmented such that each segment will not take up more than (approximately) `max_lines_per_segment` when the ASS file is applied to a video.
+* `max_lines_per_segment`: int (default value `2`) which specifies the number of lines per segment to display. This parameter is only used if `resegment_text_to_fill_space` is `True`.
+* `text_already_spoken_rgb`: List of 3 ints (default value is [49, 46, 61], which makes a dark gray). The RGB values of the color that will be used to highlight text that has already been spoken.
+* `text_being_spoken_rgb`: List of 3 ints (default value is [57, 171, 9], which makes a dark green). The RGB values of the color that will be used to highlight text that is being spoken.
+* `text_not_yet_spoken_rgb`: List of 3 ints (default value is [194, 193, 199], which makes a light gray). The RGB values of the color that will be used to highlight text that has not yet been spoken.

 # Output JSON manifest file format
 A new manifest file will be saved at `<output_dir>/<original manifest file name>_with_output_file_paths.json`. It will contain the same fields as the original manifest, and additionally:
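
The `minimum_timestamp_duration` behaviour documented in the README ("enlarged from the middle outwards" until the minimum is met or an audio boundary is reached) can be illustrated with a minimal sketch. The function name `enforce_minimum_duration`, its signature, and the choice to simply clamp at the boundaries are assumptions made for illustration; this is not NFA's actual implementation.

```python
def enforce_minimum_duration(t_start, t_end, minimum_duration, audio_duration):
    """Hypothetical helper: widen [t_start, t_end] around its midpoint."""
    duration = t_end - t_start
    if duration >= minimum_duration:
        return t_start, t_end  # already long enough, leave untouched
    # Add the same amount of padding on each side of the original interval
    pad = (minimum_duration - duration) / 2
    # Clamp to the audio boundaries (assumption: a clamped side is not
    # compensated for by extending the other side)
    new_start = max(0.0, t_start - pad)
    new_end = min(audio_duration, t_end + pad)
    return new_start, new_end


# Two short, adjacent timestamps: after widening, the first one ends
# later than the second one starts, so the two intervals overlap.
first = enforce_minimum_duration(1.00, 1.02, 0.2, 10.0)
second = enforce_minimum_duration(1.03, 1.05, 0.2, 10.0)
overlap = first[1] > second[0]
print(first, second, overlap)
```

The last lines show why the README warns that a non-zero value may cause timestamps to overlap: each interval is widened independently of its neighbours.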

tools/nemo_forced_aligner/align.py

Lines changed: 20 additions & 1 deletion
@@ -110,12 +110,15 @@ class CTMFileConfig:
 @dataclass
 class ASSFileConfig:
     fontsize: int = 20
-    marginv: int = 20
+    vertical_alignment: str = "center"
     # if resegment_text_to_fill_space is True, the ASS files will use new segments
     # such that each segment will not take up more than (approximately) max_lines_per_segment
     # when the ASS file is applied to a video
     resegment_text_to_fill_space: bool = False
     max_lines_per_segment: int = 2
+    text_already_spoken_rgb: List[int] = field(default_factory=lambda: [49, 46, 61])  # dark gray
+    text_being_spoken_rgb: List[int] = field(default_factory=lambda: [57, 171, 9])  # dark green
+    text_not_yet_spoken_rgb: List[int] = field(default_factory=lambda: [194, 193, 199])  # light gray


 @dataclass
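
The new RGB fields use `field(default_factory=...)` rather than a plain list default. The sketch below shows why, using a hypothetical stand-in class `ColorConfigSketch` (not part of NFA): the standard `dataclasses` module rejects a bare list default, and a factory gives every instance its own list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColorConfigSketch:
    # Writing `text_already_spoken_rgb: List[int] = [49, 46, 61]` here would
    # raise ValueError when the class is created, because dataclasses refuse
    # mutable defaults. default_factory builds a fresh list per instance.
    text_already_spoken_rgb: List[int] = field(default_factory=lambda: [49, 46, 61])


first = ColorConfigSketch()
second = ColorConfigSketch()
first.text_already_spoken_rgb[0] = 255  # mutate one instance only

print(first.text_already_spoken_rgb)
print(second.text_already_spoken_rgb)
```

Because each instance owns its list, changing the color on one config object does not leak into another.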
@@ -180,6 +183,22 @@ def main(cfg: AlignmentConfig):
     if cfg.ctm_file_config.minimum_timestamp_duration < 0:
         raise ValueError("cfg.minimum_timestamp_duration cannot be a negative number")

+    if cfg.ass_file_config.vertical_alignment not in ["top", "center", "bottom"]:
+        raise ValueError("cfg.ass_file_config.vertical_alignment must be one of 'top', 'center' or 'bottom'")
+
+    for rgb_list in [
+        cfg.ass_file_config.text_already_spoken_rgb,
+        cfg.ass_file_config.text_being_spoken_rgb,
+        cfg.ass_file_config.text_not_yet_spoken_rgb,
+    ]:
+        if len(rgb_list) != 3:
+            raise ValueError(
+                "cfg.ass_file_config.text_already_spoken_rgb,"
+                " cfg.ass_file_config.text_being_spoken_rgb,"
+                " and cfg.ass_file_config.text_not_yet_spoken_rgb all need to contain"
+                " exactly 3 elements."
+            )
+
     # Validate manifest contents
     if not is_entry_in_all_lines(cfg.manifest_filepath, "audio_filepath"):
         raise RuntimeError(
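
One way to keep this kind of check from accidentally testing the same field several times is to validate a name-to-value mapping, so that the error message names the offending field. The sketch below is a hypothetical helper, `validate_rgb_lists`, and is not code from this commit; the 0 to 255 range check is an extra assumption that the commit itself does not enforce.

```python
def validate_rgb_lists(named_rgb_lists):
    """Hypothetical helper: check every named RGB list, one by one."""
    for name, rgb_list in named_rgb_lists.items():
        if len(rgb_list) != 3:
            raise ValueError(f"{name} must contain exactly 3 elements, got {len(rgb_list)}")
        for value in rgb_list:
            # Assumption beyond the commit: each channel is an int in 0..255
            if not isinstance(value, int) or not 0 <= value <= 255:
                raise ValueError(f"{name} has an invalid channel value: {value!r}")


# Passes silently with the default colors from ASSFileConfig
validate_rgb_lists(
    {
        "text_already_spoken_rgb": [49, 46, 61],
        "text_being_spoken_rgb": [57, 171, 9],
        "text_not_yet_spoken_rgb": [194, 193, 199],
    }
)
```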

tools/nemo_forced_aligner/utils/make_ass_files.py

Lines changed: 49 additions & 15 deletions
@@ -32,6 +32,7 @@
 PLAYERRESY = 288
 MARGINL = 10
 MARGINR = 10
+MARGINV = 20


 def seconds_to_ass_format(seconds_float):
@@ -56,6 +57,11 @@ def seconds_to_ass_format(seconds_float):
     return srt_format_time


+def rgb_list_to_hex_bgr(rgb_list):
+    r, g, b = rgb_list
+    return f"{b:02x}{g:02x}{r:02x}"  # zero-pad each channel to two hex digits
+
+
 def make_ass_files(
     utt_obj, output_dir_root, ass_file_config,
 ):
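
ASS color overrides expect hex digits in blue-green-red order, which is why the helper reverses the RGB list. The sketch below is a standalone illustration; the function name `rgb_to_ass_color_tag` is hypothetical. It pads every channel to two hex digits so that a small value such as 9 becomes `09`, and the default colors then reproduce the literals that this commit removes (`3d2e31`, `09ab39`, `c7c1c2`).

```python
def rgb_to_ass_color_tag(rgb_list):
    """Hypothetical helper: build an ASS color override tag from [r, g, b]."""
    r, g, b = rgb_list
    # ASS stores colors as BBGGRR; :02x keeps a leading zero for values < 16
    hex_bgr = f"{b:02x}{g:02x}{r:02x}"
    return r"{\c&H" + hex_bgr + r"&}"


for rgb in ([49, 46, 61], [57, 171, 9], [194, 193, 199]):
    print(rgb, rgb_to_ass_color_tag(rgb))
```

Without the zero padding, `[57, 171, 9]` would yield the five-digit string `9ab39`, which no longer encodes the intended green.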
@@ -107,7 +113,7 @@ def resegment_utt_obj(utt_obj, ass_file_config):
     approx_chars_per_line = (PLAYERRESX - MARGINL - MARGINR) / (
         ass_file_config.fontsize * 0.6
     )  # assume chars 0.6 as wide as they are tall
-    approx_lines_per_segment = (PLAYERRESY - ass_file_config.marginv) / (
+    approx_lines_per_segment = (PLAYERRESY - MARGINV) / (
         ass_file_config.fontsize * 1.15
     )  # assume line spacing is 1.15
     if approx_lines_per_segment > ass_file_config.max_lines_per_segment:
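
The two estimates in this hunk are simple arithmetic and can be tried in isolation. In the sketch below, `PLAYERRESY`, `MARGINL`, `MARGINR` and `MARGINV` use the values shown in this diff; the value of `PLAYERRESX` is not visible in the diff, so the horizontal resolution is passed in as a parameter. The function names are hypothetical.

```python
PLAYERRESY = 288  # from the diff
MARGINL = 10  # from the diff
MARGINR = 10  # from the diff
MARGINV = 20  # from the diff


def approx_chars_per_line(play_res_x, fontsize):
    # Assumes each character is 0.6 times as wide as it is tall
    return (play_res_x - MARGINL - MARGINR) / (fontsize * 0.6)


def approx_lines_per_segment(fontsize):
    # Assumes a line spacing of 1.15
    return (PLAYERRESY - MARGINV) / (fontsize * 1.15)


# With the default fontsize of 20, roughly 11.65 lines fit vertically,
# which is more than the default max_lines_per_segment of 2.
print(approx_lines_per_segment(20))
print(approx_chars_per_line(400, 20))  # 400 is an arbitrary example width
```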
@@ -183,17 +189,30 @@ def make_word_level_ass_file(
         "BorderStyle": "1",
         "Outline": "1",
         "Shadow": "0",
-        "Alignment": "2",
+        "Alignment": None,  # will specify below
         "MarginL": str(MARGINL),
         "MarginR": str(MARGINR),
-        "MarginV": str(ass_file_config.marginv),
+        "MarginV": str(MARGINV),
         "Encoding": "0",
     }

+    if ass_file_config.vertical_alignment == "top":
+        default_style_dict["Alignment"] = "8"  # text will be 'center-justified' and in the top of the screen
+    elif ass_file_config.vertical_alignment == "center":
+        default_style_dict["Alignment"] = "5"  # text will be 'center-justified' and in the middle of the screen
+    elif ass_file_config.vertical_alignment == "bottom":
+        default_style_dict["Alignment"] = "2"  # text will be 'center-justified' and in the bottom of the screen
+    else:
+        raise ValueError(f"got an unexpected value for ass_file_config.vertical_alignment")
+
     output_dir = os.path.join(output_dir_root, "ass", "words")
     os.makedirs(output_dir, exist_ok=True)
     output_file = os.path.join(output_dir, f"{utt_obj.utt_id}.ass")

+    already_spoken_color_code = r"{\c&H" + rgb_list_to_hex_bgr(ass_file_config.text_already_spoken_rgb) + r"&}"
+    being_spoken_color_code = r"{\c&H" + rgb_list_to_hex_bgr(ass_file_config.text_being_spoken_rgb) + r"&}"
+    not_yet_spoken_color_code = r"{\c&H" + rgb_list_to_hex_bgr(ass_file_config.text_not_yet_spoken_rgb) + r"&}"
+
     with open(output_file, 'w') as f:
         default_style_top_line = "Format: " + ", ".join(default_style_dict.keys())
         default_style_bottom_line = "Style: " + ",".join(default_style_dict.values())
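
The same `if`/`elif` chain appears in both the word-level and the token-level function. As a sketch of an alternative (not what the commit does), the mapping from `vertical_alignment` to the ASS `Alignment` code could live in one lookup table. The names `VERTICAL_ALIGNMENT_TO_ASS` and `ass_alignment_code` are hypothetical; the codes 8, 5 and 2 are the ones used in this diff.

```python
# Alignment codes taken from the diff: horizontally centered text placed at
# the top (8), middle (5) or bottom (2) of the screen.
VERTICAL_ALIGNMENT_TO_ASS = {"top": "8", "center": "5", "bottom": "2"}


def ass_alignment_code(vertical_alignment):
    """Hypothetical helper: translate a config value into an ASS Alignment code."""
    try:
        return VERTICAL_ALIGNMENT_TO_ASS[vertical_alignment]
    except KeyError:
        raise ValueError(
            f"got an unexpected value for vertical_alignment: {vertical_alignment!r}"
        ) from None


print(ass_alignment_code("center"))
```

A single table would keep the two ASS writers in sync if another alignment option were added later.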
@@ -225,7 +244,7 @@ def make_word_level_ass_file(
                         words_in_first_segment.append(word_or_token)
                 break

-        text_before_speech = r"{\c&c7c1c2&}" + " ".join([x.text for x in words_in_first_segment]) + r"{\r}"
+        text_before_speech = not_yet_spoken_color_code + " ".join([x.text for x in words_in_first_segment]) + r"{\r}"
         subtitle_text = (
             f"Dialogue: 0,{seconds_to_ass_format(0)},{seconds_to_ass_format(words_in_first_segment[0].t_start)},Default,,0,0,0,,"
             + text_before_speech.rstrip()
@@ -247,16 +266,16 @@ def make_word_level_ass_file(
                     text_before = " ".join([x.text for x in words_in_segment[:word_i]])
                     if text_before != "":
                         text_before += " "
-                    text_before = r"{\c&H3d2e31&}" + text_before + r"{\r}"
+                    text_before = already_spoken_color_code + text_before + r"{\r}"

                     if word_i < len(words_in_segment) - 1:
                         text_after = " " + " ".join([x.text for x in words_in_segment[word_i + 1 :]])
                     else:
                         text_after = ""
-                    text_after = r"{\c&c7c1c2&}" + text_after + r"{\r}"
+                    text_after = not_yet_spoken_color_code + text_after + r"{\r}"

-                    aligned_text = r"{\c&H09ab39&}" + word.text + r"{\r}"
-                    aligned_text_off = r"{\c&H3d2e31&}" + word.text + r"{\r}"
+                    aligned_text = being_spoken_color_code + word.text + r"{\r}"
+                    aligned_text_off = already_spoken_color_code + word.text + r"{\r}"

                     subtitle_text = (
                         f"Dialogue: 0,{seconds_to_ass_format(word.t_start)},{seconds_to_ass_format(word.t_end)},Default,,0,0,0,,"
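
To make the three-color highlighting easier to follow, here is a minimal sketch that builds the text of one subtitle line for a given word index. The helper `highlight_word` is hypothetical, and the order in which the three parts are joined (already spoken, being spoken, not yet spoken) is an assumption, since the rest of the `subtitle_text` expression is cut off in this hunk.

```python
RESET = r"{\r}"  # ASS override that resets to the default style


def highlight_word(words, word_i, already_code, being_code, not_yet_code):
    """Hypothetical helper: color one segment around the word being spoken."""
    text_before = " ".join(words[:word_i])
    if text_before != "":
        text_before += " "
    text_after = " ".join(words[word_i + 1 :])
    if text_after != "":
        text_after = " " + text_after
    return (
        already_code + text_before + RESET
        + being_code + words[word_i] + RESET
        + not_yet_code + text_after + RESET
    )


line = highlight_word(
    ["hello", "big", "world"],
    1,
    already_code=r"{\c&H3d2e31&}",
    being_code=r"{\c&H09ab39&}",
    not_yet_code=r"{\c&Hc7c1c2&}",
)
print(line)
```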
@@ -307,17 +326,30 @@ def make_token_level_ass_file(
         "BorderStyle": "1",
         "Outline": "1",
         "Shadow": "0",
-        "Alignment": "2",
+        "Alignment": None,  # will specify below
         "MarginL": str(MARGINL),
         "MarginR": str(MARGINR),
-        "MarginV": str(ass_file_config.marginv),
+        "MarginV": str(MARGINV),
         "Encoding": "0",
     }

+    if ass_file_config.vertical_alignment == "top":
+        default_style_dict["Alignment"] = "8"  # text will be 'center-justified' and in the top of the screen
+    elif ass_file_config.vertical_alignment == "center":
+        default_style_dict["Alignment"] = "5"  # text will be 'center-justified' and in the middle of the screen
+    elif ass_file_config.vertical_alignment == "bottom":
+        default_style_dict["Alignment"] = "2"  # text will be 'center-justified' and in the bottom of the screen
+    else:
+        raise ValueError(f"got an unexpected value for ass_file_config.vertical_alignment")
+
     output_dir = os.path.join(output_dir_root, "ass", "tokens")
     os.makedirs(output_dir, exist_ok=True)
     output_file = os.path.join(output_dir, f"{utt_obj.utt_id}.ass")

+    already_spoken_color_code = r"{\c&H" + rgb_list_to_hex_bgr(ass_file_config.text_already_spoken_rgb) + r"&}"
+    being_spoken_color_code = r"{\c&H" + rgb_list_to_hex_bgr(ass_file_config.text_being_spoken_rgb) + r"&}"
+    not_yet_spoken_color_code = r"{\c&H" + rgb_list_to_hex_bgr(ass_file_config.text_not_yet_spoken_rgb) + r"&}"
+
     with open(output_file, 'w') as f:
         default_style_top_line = "Format: " + ", ".join(default_style_dict.keys())
         default_style_bottom_line = "Style: " + ",".join(default_style_dict.values())
@@ -360,7 +392,9 @@ def make_token_level_ass_file(
             )  # replace underscores used in subword tokens with spaces
             token.text_cased = token.text_cased.replace(SPACE_TOKEN, " ")  # space token with actual space

-        text_before_speech = r"{\c&c7c1c2&}" + "".join([x.text_cased for x in tokens_in_first_segment]) + r"{\r}"
+        text_before_speech = (
+            not_yet_spoken_color_code + "".join([x.text_cased for x in tokens_in_first_segment]) + r"{\r}"
+        )
         subtitle_text = (
             f"Dialogue: 0,{seconds_to_ass_format(0)},{seconds_to_ass_format(tokens_in_first_segment[0].t_start)},Default,,0,0,0,,"
             + text_before_speech.rstrip()
@@ -391,16 +425,16 @@ def make_token_level_ass_file(
                 for token_i, token in enumerate(tokens_in_segment):

                     text_before = "".join([x.text_cased for x in tokens_in_segment[:token_i]])
-                    text_before = r"{\c&H3d2e31&}" + text_before + r"{\r}"
+                    text_before = already_spoken_color_code + text_before + r"{\r}"

                     if token_i < len(tokens_in_segment) - 1:
                         text_after = "".join([x.text_cased for x in tokens_in_segment[token_i + 1 :]])
                     else:
                         text_after = ""
-                    text_after = r"{\c&c7c1c2&}" + text_after + r"{\r}"
+                    text_after = not_yet_spoken_color_code + text_after + r"{\r}"

-                    aligned_text = r"{\c&H09ab39&}" + token.text_cased + r"{\r}"
-                    aligned_text_off = r"{\c&H3d2e31&}" + token.text_cased + r"{\r}"
+                    aligned_text = being_spoken_color_code + token.text_cased + r"{\r}"
+                    aligned_text_off = already_spoken_color_code + token.text_cased + r"{\r}"

                     subtitle_text = (
                         f"Dialogue: 0,{seconds_to_ass_format(token.t_start)},{seconds_to_ass_format(token.t_end)},Default,,0,0,0,,"
