![[1.Parsing.mp4]]
![[2.Full.Show.mp4]]
Client calls POST /api/upload with the docx as form data
- the docx is converted to text
- newlines are minimised
- a record is inserted into the `shows` table
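A minimal sketch of this upload flow, assuming python-docx for extraction; `insert_show` is a hypothetical stand-in for the real `shows` insert:

```python
# Minimal sketch of the upload endpoint, assuming python-docx for text
# extraction. insert_show is a hypothetical stand-in for the real DB insert.
import re
from io import BytesIO

from docx import Document
from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/api/upload")
async def upload(file: UploadFile):
    doc = Document(BytesIO(await file.read()))
    text = "\n".join(p.text for p in doc.paragraphs)
    text = re.sub(r"\n{2,}", "\n", text)  # minimise runs of newlines
    show_id = insert_show(text)  # hypothetical helper: inserts a row into shows
    return {"show_id": show_id}
```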
Client calls POST /process/{showId}?provider=openai&model=gpt-4-0125-preview
- happens in [[backend/llm_parsing.py]]
- Uses an LLM to parse the non-standard document format into JSON that the application can use (a sketch of the call follows the lists below)
Properties extracted into the JSON (a hypothetical example follows the list):
- dialogue lines
    - the emotion that each line is spoken with - this can then be fed into the [[#3. Generative audio - TTS generation]] stage to give the correct emotion to a given line
- sound effects
    - a background sound
        - can be mixed in whilst dialogue continues
    - transitory samples
        - that should occur in between speech lines
- characters
    - names
    - voice characteristics that should be used for each
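A hypothetical example of the parsed shape (field names are illustrative, not the app's actual schema):

```python
# Hypothetical example of the parsed structure; all field names are
# illustrative, not the application's actual schema.
parsed_script = {
    "characters": [
        {"name": "NARRATOR", "voice": "deep, measured, mid-50s"},
    ],
    "lines": [
        {"character": "NARRATOR", "text": "It was a dark night.", "emotion": "ominous"},
    ],
    "sound_effects": [
        {"description": "door slams shut", "type": "transient", "after_line": 0},
        {"description": "rain on a tin roof", "type": "background", "from_line": 0},
    ],
}
```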
The best current LLM models can be used:
- OpenAI's ChatGPT
- Mistral
- DeepSeek
- xAI's Grok
We can also execute this process locally with Ollama:
POST /process/{showId}?provider=ollama&model=mistral-nemo
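A rough sketch of the parsing call, assuming the openai SDK and JSON-mode output (prompt wording and function name are illustrative, not the code in [[backend/llm_parsing.py]]):

```python
# Rough sketch of the LLM parsing step; the prompt and parse_script are
# illustrative, not the actual code in backend/llm_parsing.py.
import json
from openai import OpenAI

client = OpenAI()

def parse_script(script_text: str, model: str = "gpt-4-0125-preview") -> dict:
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},  # force valid JSON output
        messages=[
            {"role": "system", "content": (
                "Parse this radio-show script into JSON with keys: "
                "characters, lines, sound_effects."
            )},
            {"role": "user", "content": script_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

For the ollama provider, the same client can point at Ollama's OpenAI-compatible endpoint, e.g. `OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")`.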
Client calls POST /analyze-timing/{show_id}
Adding timestamp data to the formatted JSON script
- TTS vocals are concatenated into a single audio file
- vocal lines, along with the sound effects inserted between them, are given time positions
- this is so we can accurately place or mix in sound-effect audio samples relative to their position in the script, whilst preserving:
    - background sounds / continuous ambience that should not be interrupted
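A simplified sketch of this pass with pydub (paths and dict keys are illustrative):

```python
# Simplified timing pass: concatenate per-line TTS clips and record where
# each line starts and ends. Paths and dict keys are illustrative.
from pydub import AudioSegment

def analyze_timing(line_wavs: list[str]) -> tuple[AudioSegment, list[dict]]:
    combined = AudioSegment.empty()
    timings = []
    for path in line_wavs:
        clip = AudioSegment.from_wav(path)
        start_ms = len(combined)  # len() of an AudioSegment is its duration in ms
        combined += clip
        timings.append({"file": path, "start_ms": start_ms, "end_ms": start_ms + len(clip)})
    return combined, timings
```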
> [!info] Improvement
> Timing of line delivery is important in drama, and silence conveys its own meaning; we could intelligently insert silence or make pause lengths adjustable.
Client calls POST /generate-audio/{show_id}?type=dialogue
- Generation happens in [[backend/audio_generation/tts.py]]
- iterates over the lines and, using each character's chosen voice id, renders the audio with TTS models
- I've chosen ElevenLabs to build this rapidly, but models such as Fish-TTS, or local generation with Kokoro-82M, are other options.
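A sketch of the per-line loop against ElevenLabs' REST text-to-speech endpoint; `previous_text` / `next_text` are real request fields (see the notes further down), but the surrounding structure is illustrative:

```python
# Sketch of the per-line TTS call against ElevenLabs' REST API; the
# surrounding structure is illustrative, not the app's real code.
import requests

API_KEY = "..."  # ElevenLabs API key

def render_line(text: str, voice_id: str, prev: str = "", nxt: str = "") -> bytes:
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": API_KEY},
        json={
            "text": text,
            "model_id": "eleven_multilingual_v2",
            # previous_text / next_text give the model surrounding context,
            # improving tone continuity across consecutive lines
            "previous_text": prev,
            "next_text": nxt,
        },
    )
    resp.raise_for_status()
    return resp.content  # mp3 bytes by default
```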
> [!info] Improvement
> For more fine-grained customisation, voice actors could record only 10 seconds of audio, and then style cloning or fine-tuning of existing TTS models could be used. This could be taken further to convey an actor's happy, sad, or angry voice even more efficiently.
Client calls POST /generate-audio/{show_id}?type=sfx
Logic in [[backend/audio_generation/sfx.py]]
Generative text-to-audio models can produce effectively infinite variations of any sound effect referenced in the script.
There is a distinction between transient audio effects (bangs, a door closing) and long-running background ambience (environmental ambience, field recordings).
Generation sources:
- locally via Meta's AudioCraft - cheap, but slow, and generations are often off the mark
- ElevenLabs' sound effects API - best all-round
- freesound.org - highest realism, though commercial copyright sometimes applies
Post-processing:
- normalisation is applied
- crossfading for long-running ambience samples
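A sketch of that post-processing with pydub (the 2-second crossfade is an assumed value):

```python
# Post-processing sketch: normalise a sample, then loop long-running
# ambience with crossfades. The 2000 ms crossfade is an assumed value.
from pydub import AudioSegment
from pydub.effects import normalize

def prepare_ambience(path: str, target_ms: int) -> AudioSegment:
    seg = normalize(AudioSegment.from_file(path))
    out = seg
    while len(out) < target_ms:
        out = out.append(seg, crossfade=2000)  # smooth join between repeats
    return out[:target_ms]
```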
Client calls POST /generate-audio/{show_id}?type=music
An optional ambience that can be subtly layered into the background
- currently sourced from freesound.org, but suno.ai could be used
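Layering the bed could look something like this with pydub (the -18 dB reduction is an assumed value, tuned by ear):

```python
# Sketch of subtly layering a music bed under the dialogue mix; the
# -18 dB reduction is an assumed value, tune to taste.
from pydub import AudioSegment

def add_music_bed(mix: AudioSegment, music: AudioSegment) -> AudioSegment:
    bed = music - 18  # duck the music well below the dialogue level
    return mix.overlay(bed, loop=True)  # loop the bed for the full mix length
```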
- Up to this stage we have been building each audio block in a modular fashion, for robustness and editing granularity
- once the user is happy with the audio in the UI, we can call another endpoint in [[backend/main.py]] to stitch all these audio components together into a single audio file
- we can additionally write metadata to this file
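A sketch of that stitch-and-export step with pydub (tag values are illustrative; pydub forwards `tags` to ffmpeg as metadata):

```python
# Sketch of the final stitch: concatenate the ordered audio blocks and
# export one file with metadata. Tag values are illustrative.
from pydub import AudioSegment

def stitch(blocks: list[AudioSegment], out_path: str) -> None:
    final = AudioSegment.empty()
    for block in blocks:
        final += block
    # pydub passes tags through to ffmpeg, which writes them as metadata
    final.export(out_path, format="mp3", tags={"title": "Full Show", "artist": "radio-show-gen"})
```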
Possible improvements:
- websockets for audio-processing feedback - these are long-running tasks
- an audio fingerprint inserted into the file, so we can trace who is using the product in the wild and be protected against potential copyright issues
- a custom AI model trained on a radio-show dataset
    - we could infer the sound effects and tone from the broader context of the show
    - right now the generative audio lacks this broader context
Database:
- simple, lightweight database
- JSON columns used - fine while the product is at MVP stage, but it would be better to move to a more thought-out relational schema
- audio blobs stored in the database, but disk folders or S3 would be another option
Frontend:
- modern React framework with SSR
- server endpoint extracts the docx to txt
- wavesurfer.js used for audio waveform rendering
- shadcn/ui components
Backend:
- FastAPI
    - standard for ML
- audio processing with `pydub`
- use `previous_text`, `next_text` to improve [[#3. Generative audio - TTS generation]] tone / continuity / flow
- parsing docx to timestamped JSON [[#1. Script parsing]]
- Dialogue list
- Character list
- SFX list
- cached LLM responses
- Changing character voices
- multi-language support - choose language (English version for now)
- Re-render each sound effect with AudioCraft or ElevenLabs
- Re-render (a long-running task, then a UI update)
- Re-render music button in the UI chooses a different track
- AI generated sound effects [[AudioCraft by Meta (MusicGen)]]
- use [[uv - built in rust]]
- TTS audio generation for each character line - a list of .wav files
    - with the ElevenLabs API
- Audio stitching
    - TTS rendered in order
    - sound effects rendered in order
    - background music