Q: Can CereProc create a voice from my speech similar to the one created for Roger Ebert?
A: CereProc offer voice creation as a service. Customers interested in having a voice created from their recorded speech should first read the following points:
1. Quantity of audio
As of March 2010, CereProc are able to offer two different types of voices: unit selection voices and HTS voices. Unit selection voices are the type of voices you are able to hear in the demo bar at the top of this page and can be built with 4 hours of recorded speech. HTS is a different voice-building technology that can build a voice with 40 minutes of speech data and compensate for the lesser quantity of data.
2. Recording quality
The recorded speech should be of the highest quality, preferably studio-recorded. Audio with background noise, music, other people talking, etc. would have an extremely negative impact on the quality of a CereProc voice, and such audio would have to be rejected.
3. Transcription coverage
The customer should have the recorded speech transcribed. When building the voice for Roger Ebert, his DVD commentaries were transcribed by professional transcribers. CereProc would recommend using a professional transcriber to do this work.
4. Consistent speech style
When recording a voice, CereProc try to ensure that the speaker maintains a consistent speaking style. Constant sharp changes in pitch, speed, volume, distance between speaker and microphone, or large changes in speaking style will have a negative impact on the voice and may result in the audio having to be rejected.
5. HTS voice development
CereProc are continually working to improve their voice-building technology. As mentioned above, CereProc offer HTS voices. These have a different sound quality to our unit selection voices, but can be built with a lot less data. Some examples of HTS voices can be heard below:
6. Voice creation costs
Voice creation is time consuming and therefore expensive. CereProc would be very reluctant to go ahead with a voice without guaranteed quality audio data and transcriptions.
7. Customers with no recordings, insufficient recordings or recordings of insufficient quality
Unfortunately, CereProc is unable to build a voice without a sufficient quantity of high quality speech recordings. For customers in this position, CereProc is able to offer a number of cost-effective off-the-shelf voices. These are available from the CereProc Online Store. More accents and different languages will be released in the coming months.
Q: I wish to record my own voice with a view to having CereProc build a voice from my recordings in future. How should I go about it?
A: For customers wishing to record an archive of their own voice, CereProc recommends installing Audacity to record speech. It's free, available for Windows, Mac OS X and Linux, and can be downloaded from here:
For voice recording, a good quality microphone should be used, and the recordings should take place in a quiet environment, preferably a recording studio. A quiet room in a house should be okay, as long as there is no background noise, e.g. TV, radio, music, traffic noises, etc. Customers should record themselves reading online newspaper articles, one at a time.
Good quality news sources are best. For those in the UK, the BBC and the Guardian are good sources; for those in the US, it's better to use US sources, such as the New York Times, Chicago Sun-Times, Washington Post, etc. The speaker should try to read a variety of topics, e.g. general news, politics, business, international news, a bit of sport, a bit of weather, etc., and some articles on topics of personal interest, e.g. cinema, as this will give better genre coverage. Articles should be recorded one at a time, and each wave file given a sensible name, e.g.
abc_nyt_20100309_001.wav (abc = speaker initials; nyt = New York Times, or change these to indicate where the article came from; 20100309 = today's date; 001 = article no. 1)
In addition, the text of the article should be saved into a text file and given a matching name. E.g. for abc_nyt_20100309_001.wav, the text read by the speaker should be saved into a file called abc_nyt_20100309_001.txt
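For customers comfortable with a little scripting, the naming scheme above can be automated. The following Python sketch is only an illustration and not a CereProc tool: the function names recording_stem and unpaired_files are hypothetical, and it assumes all wav and txt files sit together in one directory.

```python
from datetime import date
from pathlib import Path


def recording_stem(initials, source, day, article_no):
    """Build a file name stem such as 'abc_nyt_20100309_001'.

    initials   -- speaker initials, e.g. 'abc'
    source     -- short code for where the article came from, e.g. 'nyt'
    day        -- a datetime.date for the recording day
    article_no -- running article number for that day
    """
    return f"{initials.lower()}_{source.lower()}_{day:%Y%m%d}_{article_no:03d}"


def unpaired_files(directory):
    """Return the stems that have a .wav but no .txt, or a .txt but no .wav."""
    directory = Path(directory)
    wavs = {p.stem for p in directory.glob("*.wav")}
    txts = {p.stem for p in directory.glob("*.txt")}
    # Symmetric difference: stems present in exactly one of the two sets.
    return sorted(wavs ^ txts)
```

Calling recording_stem("abc", "nyt", date(2010, 3, 9), 1) gives the stem used in the example above; add ".wav" or ".txt" as needed. Running unpaired_files on the recordings directory from time to time helps catch a recording whose text was never saved.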
The more articles recorded the better - customers should aim to get at least 40 mins of recordings to build an HTS voice, or 4 hours+ for a unit selection voice.
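To keep track of progress towards these targets, the total length of the recordings can be added up with a short script. This is a rough sketch using Python's standard wave module; the function names are hypothetical, and it assumes the files were exported as uncompressed PCM WAV (for example 16-bit), since the wave module cannot read other WAV encodings.

```python
import wave
from pathlib import Path

HTS_MINUTES = 40                  # target quoted above for an HTS voice
UNIT_SELECTION_MINUTES = 4 * 60   # 4 hours for a unit selection voice


def wav_seconds(path):
    """Length of one PCM WAV file in seconds (frames / frames per second)."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def total_minutes(directory):
    """Total length, in minutes, of every .wav file in the directory."""
    seconds = sum(wav_seconds(p) for p in Path(directory).glob("*.wav"))
    return seconds / 60


def progress(minutes):
    """Compare a running total against the two targets."""
    if minutes >= UNIT_SELECTION_MINUTES:
        return "enough for a unit selection voice"
    if minutes >= HTS_MINUTES:
        return "enough for an HTS voice"
    return "keep recording"
```

The total counts everything in each file, including pauses and any mistakes, so it is best treated as an upper bound on the amount of usable speech.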
Most important of all, customers should try to keep the recordings consistent, so that each recording session sounds identical (or as close to identical as possible) to the previous sessions. This means always using the same room to make the recordings, making sure the speaker's position in the room is the same, always using the same microphone, and ensuring that the distance between the microphone and the speaker and the speaker's sitting position are always the same.
The speaker should try to read at a measured pace, without speeding up or slowing down. Big changes in pitch and volume should also be avoided. The speaker's tone should be neutral and the speaker should not put too much interpretation on what (s)he is reading - keep it like a newsreader. Disfluencies, such as "erm", "um", etc. should be kept to an absolute minimum. The fundamental point to remember is that when a CereProc voice synthesises a word or phrase, it can select units of speech from any of the recordings used to build the voice, so if there are big differences between recordings, this has an extremely negative impact on the quality of the voice.
Customers should save both the wav files and the txt files into a directory on their computer and build up the quantity of files over a period of time. The files should be periodically backed up. CereProc rarely records a speaker for sessions longer than three hours, including breaks, so CereProc recommends frequent, short recording sessions.
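One simple way to make the periodic backup is to zip the whole recordings directory under a dated name. The sketch below is only one possible approach, using Python's standard shutil module; the function name backup_recordings and the archive name "recordings_YYYYMMDD.zip" are assumptions for illustration, and the backup directory should be somewhere outside the recordings directory, ideally on a different disk.

```python
import shutil
from datetime import date
from pathlib import Path


def backup_recordings(recordings_dir, backup_dir, day=None):
    """Zip everything in recordings_dir into backup_dir; return the archive path.

    The archive is named recordings_YYYYMMDD.zip, so a backup made on a
    different day does not overwrite an earlier one.
    """
    day = day or date.today()
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    base = backup_dir / f"recordings_{day:%Y%m%d}"
    # make_archive appends the ".zip" extension and returns the full file name.
    return Path(shutil.make_archive(str(base), "zip", root_dir=str(recordings_dir)))
```

Running this after each recording session keeps a dated copy of both the wav files and the matching txt files.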