From Script to Speech: Behind the Scenes of Text-to-Speech Development
That familiar voice you hear every morning on the subway, or the announcements on your flight: where do they come from, and how are they created? In many cases, these voices aren't human; they're AI-generated synthetic speech. These sophisticated systems rely on extensive recording and training to sound as natural as your next-door neighbour. In this article, we'll take you behind the scenes of text-to-speech (TTS) technology, exploring the intricate process of recording voice samples and training AI to process natural language so that these synthetic voices sound lifelike.
To get a glimpse into the process of going from script to speech, I asked CereProc’s Director of Professional Services, Graham Leary, to answer some questions.
What are the fundamental steps of developing TTS technology?
1. Create a script- This can be generated from our data sources, freely available resources or text provided by CereProc's customers. It is important that the script is phonetically balanced, which means that it tries to include all the different sound variations of a language in the sentences. We strive to find suitable text that gives good phonetic coverage for a language and covers any additional domains or genres we wish to add.
2. Get a voice talent- At CereProc, we work with voice talents who can help us record over 10 hours of speech. This is essential as it forms the structure for our synthetic voice, and it provides the data to train our model. Consistency is key in this step; the recordings’ sound, volume and speech rate must be uniform throughout.
3. Language model- Each language we create a voice for has its own language model. Here, we input text that is then converted into phonetic representations. This basically involves the model taking the words, expanding them, and looking each one up in a lexicon to extract or predict its pronunciation. This phonetic data is what the language model uses to synthesize the speech. We also make sure to process and optimize all the data during this step.
4. Build the voice- With the script, recordings and language model, we can actually start building the voice. At CereProc we can build two different types of synthetic voice. The most advanced are CereWave AI voices, which use deep neural networks (DNNs) to create an AI model of the speaker’s voice. This involves learning the patterns of the speaker’s speech and reproducing them when synthesizing text. We can also build older-style unit selection voices, in which the output consists of tiny fragments of the original voice recordings.
5. Ready for launch! When the product is finalized and has undergone quality controls, it is launched and made available on all platforms listed on cereproc.com.
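To make steps 1 and 3 a little more concrete, here is a minimal Python sketch of how text might be expanded and looked up in a lexicon, and how sentences might be chosen for phonetic coverage. It is an illustration only, not CereProc's implementation: the tiny lexicon, the expansion table, the phoneme symbols and the function names (normalize, to_phonemes, select_script) are all assumptions made up for this example, and the greedy selection is just one simple way to approximate a phonetically balanced script.

```python
import re

# Toy pronunciation lexicon (hypothetical entries, ARPAbet-like symbols).
LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
    "dog": ["D", "AO", "G"],
    "dogs": ["D", "AO", "G", "Z"],
    "ran": ["R", "AE", "N"],
    "two": ["T", "UW"],
}

# Hypothetical expansion table for tokens that are not plain words.
EXPANSIONS = {"2": "two"}


def normalize(text):
    """Lowercase the text, split it into tokens, and expand digits into words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [EXPANSIONS.get(tok, tok) for tok in tokens]


def to_phonemes(text):
    """Look up each word in the lexicon; unknown words are flagged, not guessed."""
    phonemes = []
    for word in normalize(text):
        phonemes.extend(LEXICON.get(word, ["<UNK:" + word + ">"]))
    return phonemes


def known_phonemes(text):
    """Set of distinct phonemes in the text, ignoring unknown-word markers."""
    return {p for p in to_phonemes(text) if not p.startswith("<UNK")}


def select_script(candidates, max_sentences):
    """Greedily pick the sentences that add the most phonemes not yet covered."""
    covered, script = set(), []
    remaining = list(candidates)
    while remaining and len(script) < max_sentences:
        best = max(remaining, key=lambda s: len(known_phonemes(s) - covered))
        gain = known_phonemes(best) - covered
        if not gain:
            break  # no remaining sentence adds a new phoneme
        script.append(best)
        covered |= gain
        remaining.remove(best)
    return script, covered


if __name__ == "__main__":
    print(to_phonemes("2 dogs ran"))
    script, covered = select_script(
        ["The cat sat", "The dog ran", "The cat"], max_sentences=2
    )
    print(script, sorted(covered))
```

A production system would work at a finer level than single phonemes (for example, sounds in context) and would predict pronunciations for words missing from the lexicon rather than flagging them, but the overall flow of expanding text, looking up words and measuring coverage is the same idea.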
These are the five essential steps of creating a synthetic voice. However, my questions did not stop here. I was curious to know more about the challenges that developers encounter during this extensive process.
So, what are the main challenges when it comes to training an AI model to understand natural language?
One significant challenge is voice adaptation, which involves using models to enhance the sound of the voice while retaining the character of the main speaker's voice. This often requires input from other speakers' data to maintain consistency. Another issue is that with a small amount of data, creating a good-sounding voice can result in a loss of character, making the voice sound more generic and robotic.
What about when it comes to capturing different nuances of language?
This is an interesting question. Capturing the nuances of jokes and irony is difficult when building a synthetic voice. Timing is crucial here, and it is something that we humans have learned naturally since we started talking. Some methods, such as vocal puppetry, can help reproduce these subtleties. In vocal puppetry, a human provides the template for the speech, and the synthesized voice follows it.
Finally, why is CereProc at the forefront of developing TTS?
Well, first of all, CereProc is a research-led company, constantly pushing the boundaries of science and technology. Our commitment to innovation ensures that we are always advancing our capabilities. Additionally, we place a strong emphasis on creating high-quality products and consistently work to implement improvements across our entire product range. This dedication to research and quality sets us apart in the field of text-to-speech development.
This article has taken us through the five essential steps of building synthetic speech. From crafting balanced scripts to hiring talented voice actors, the creation of synthetic voices truly blends technology with artistry. In this landscape, CereProc stands out for its commitment to innovation and to continually pushing the boundaries of text-to-speech.