CereProc Research and Development

CereProc is committed to developing innovative new text-to-speech technologies. We believe that speech synthesis could, and should, be used more widely than today. We believe that creating characterful, engaging, emotional voices is the key to making this happen.

One of our current areas of development focuses on improving techniques for emotional text to speech.

Emotional Tags

Introduction

Human language has an extremely rich vocabulary to describe emotion. To make sense of this, researchers tend to describe emotional states along two continua: Positive-Negative and Active-Passive. Some emotions are also closely tied to content or dialogue context (such as surprise, disappointment or pride).

CereVoice simulates emotion across what we term an emotional continuum (see Figure 1). CereVoice also has a separate system for improving the clarity of sections of speech so that information can be given as effectively as possible (see Speaking Clearly below).

The Emotional Continuum

CereVoice uses two separate techniques to simulate emotional states. The first is to select a tense or calm voice quality. This corresponds closely to the perception of negative and positive emotional states (although it also has some active/passive effect). The second is to use digital signal processing (DSP) techniques to shift the speech towards active or passive states. Active states involve faster speech rate, higher volume and higher pitch. Passive states involve slower speech rate, lower volume and lower pitch.

The stronger the emotion, the harder it can be to simulate from pre-recorded speech. This is because the DSP 'tricks' used to simulate emotions begin to introduce artifacts. For example, slowing the speech rate and lowering the pitch of speech with a negative voice quality will make the voice sound sad. The slower the speech, the sadder it sounds, until a point is reached where the speech sounds artificially slowed down and unnatural.
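The two techniques and the limit described above can be sketched roughly as follows. This is only an illustration, not the CereVoice implementation: the names `EmotionSettings`, `settings_for` and `MAX_SHIFT`, the -1 to +1 coordinate ranges, and the 20% limit are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical limit: larger DSP changes are assumed to sound artificial.
MAX_SHIFT = 0.2


@dataclass(frozen=True)
class EmotionSettings:
    voice_quality: str   # "tense" or "calm" recorded voice quality
    rate_scale: float    # multiplier applied to speech rate
    volume_scale: float  # multiplier applied to volume
    pitch_scale: float   # multiplier applied to pitch


def settings_for(evaluation: float, activation: float) -> EmotionSettings:
    """Map a point in activation-evaluation space to illustrative settings.

    evaluation: -1.0 (negative) to +1.0 (positive)
    activation: -1.0 (passive) to +1.0 (active)
    """
    # Technique 1: pick the voice quality from the evaluation axis.
    quality = "tense" if evaluation < 0 else "calm"
    # Technique 2: shift rate, volume and pitch together along the
    # activation axis, clamped so the change never exceeds MAX_SHIFT.
    clamped = max(-1.0, min(1.0, activation))
    scale = 1.0 + clamped * MAX_SHIFT
    return EmotionSettings(quality, scale, scale, scale)
```

Under these assumptions, `settings_for(-0.5, -1.0)` selects the tense voice quality with slower, quieter, lower-pitched speech, which corresponds to the sad example in the text. Requests beyond the limit are clamped rather than honoured.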

Thus, although we can simulate quite a wide variation in the underlying emotion of our voices, we do not have complete control. We cannot make our voice sound furious or in agony.

Figure 1 shows the emotional space, termed an activation-evaluation space. The grey bands show the emotions that we can simulate with CereVoice.

Figure 1:

Table 1 shows how various emotions can be arranged in the evaluation/activation space continuum. The '+' sign means a more extreme value. The (+Content) means that the emotion will be simulated if appropriate content is used.

Table 1:
Active Negative:
  ++ Angry
  ++ Frightened/Scared/Panicked
  + Tense/Frustrated/Stressed/Anxious
  Authoritative/Proud(+Content)

Active Positive:
  ++ Happy
  + Upbeat/Surprised(+Content)/Interested(+Content)

Passive Negative:
  ++ Sad
  Disappointed(+Content)/Bored

Passive Positive:
  + Relaxed
  Concerned/Caring
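
As a rough illustration, Table 1 can be held as a small lookup structure. The names `Emotion`, `TABLE_1` and `find_emotion` are hypothetical. The sketch also assumes that a '+' prefix applies to every emotion on its line, that (+Content) applies only to the emotion it directly follows, and that Authoritative/Proud belongs to the Active Negative quadrant.

```python
from typing import NamedTuple, Optional, Tuple


class Emotion(NamedTuple):
    name: str
    strength: int        # number of '+' signs in Table 1 (0, 1 or 2)
    needs_content: bool  # True where Table 1 marks the emotion (+Content)


# Quadrants are keyed by (activation, evaluation).
TABLE_1 = {
    ("active", "negative"): [
        Emotion("angry", 2, False),
        Emotion("frightened", 2, False),
        Emotion("scared", 2, False),
        Emotion("panicked", 2, False),
        Emotion("tense", 1, False),
        Emotion("frustrated", 1, False),
        Emotion("stressed", 1, False),
        Emotion("anxious", 1, False),
        Emotion("authoritative", 0, False),
        Emotion("proud", 0, True),
    ],
    ("active", "positive"): [
        Emotion("happy", 2, False),
        Emotion("upbeat", 1, False),
        Emotion("surprised", 1, True),
        Emotion("interested", 1, True),
    ],
    ("passive", "negative"): [
        Emotion("sad", 2, False),
        Emotion("disappointed", 0, True),
        Emotion("bored", 0, False),
    ],
    ("passive", "positive"): [
        Emotion("relaxed", 1, False),
        Emotion("concerned", 0, False),
        Emotion("caring", 0, False),
    ],
}


def find_emotion(name: str) -> Optional[Tuple[Tuple[str, str], Emotion]]:
    """Return (quadrant, entry) for an emotion name, or None if not listed."""
    wanted = name.strip().lower()
    for quadrant, emotions in TABLE_1.items():
        for emotion in emotions:
            if emotion.name == wanted:
                return quadrant, emotion
    return None
```

A lookup such as `find_emotion("Proud")` returns the quadrant together with the strength and content flag. An emotion outside the simulated range, such as "furious", returns `None`.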

Table 2 has links to audio files that demonstrate the use of the CereVoice system across this emotional continuum. DSP changes cannot always give the same range of variation in each section. For example, the voice can sound relaxed (positive passive), but it cannot be made to sound relaxed to different degrees.

Table 2:

Active Negative:
  ++ You never listen to anything I say
  + You're driving too fast
  There are no gas stations nearby

Active Positive:
  ++ What a lovely day
  + That's an interesting idea
  Traffic is moving smoothly today

Passive Negative:
  + I feel sad
  I feel down

Passive Positive:
  I feel relaxed

CereVoice Emotional Tags

CereProc provide a set of emotional tags that bundle together the voice quality and DSP approaches described above, for example:

<voice emotion="cross">
I'm in a bad mood so don't come anywhere near me.
</voice>
<voice emotion="happy">
I'm in a good mood so let's go out and do something fun.
</voice>
<voice emotion="sad">
I'm in a sad mood so I can't be bothered to do anything, poor me.
</voice>
<voice emotion="calm">
I'm in a relaxed mood so I can handle anything today.
</voice>
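
Markup like the examples above could be generated programmatically. The following is a minimal sketch: `emotion_tag` and `KNOWN_EMOTIONS` are hypothetical names, and the list of emotions contains only the four values shown above, which may not be the full set CereVoice accepts.

```python
from xml.sax.saxutils import escape, quoteattr

# Only the emotion values that appear in the examples above (assumption:
# the real engine may accept others).
KNOWN_EMOTIONS = ("cross", "happy", "sad", "calm")


def emotion_tag(text: str, emotion: str) -> str:
    """Wrap text in a <voice emotion="..."> element, escaping XML characters."""
    if emotion not in KNOWN_EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    return f"<voice emotion={quoteattr(emotion)}>{escape(text)}</voice>"


print(emotion_tag("What a lovely day", "happy"))
# prints: <voice emotion="happy">What a lovely day</voice>
```

Escaping the text matters when the input contains characters such as `&` or `<`, which would otherwise produce malformed markup.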

Speaking Clearly

A critical task for any speech synthesis system is to convey important information. Within CereVoice this is done by using a calm voice style, speaking all words in their full forms, and emphasising the words or phrases that contain the vital information. Emphasis has three levels.

Below are examples of this technique. The emphasised words are shown in italics:

Clear 0: The address is 10245 Flat 7 12th St.
Clear 1: The address is 10245 Flat 7 12th St.
Clear 2: The address is 10245 Flat 7 12th St.
Clear 3: The address is 10245 Flat 7 12th St.

Emphasis involves several factors: phrasing is added, amplitude is increased, speech rate is slowed and pitch is lowered.
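
The combination of factors can be sketched as a function of the emphasis level. This is an illustration only: the function name `emphasis_settings`, the 5% step per level, and the treatment of level 0 as unemphasised speech are assumptions, not values taken from CereVoice.

```python
# Hypothetical change per emphasis level (5% is an arbitrary choice).
STEP_PER_LEVEL = 0.05


def emphasis_settings(level: int) -> dict:
    """Return illustrative prosody changes for an emphasis level from 0 to 3."""
    if level not in (0, 1, 2, 3):
        raise ValueError("emphasis level must be 0, 1, 2 or 3")
    step = STEP_PER_LEVEL * level
    return {
        "add_phrase_break": level > 0,   # phrasing is added
        "amplitude_scale": 1.0 + step,   # amplitude is increased
        "rate_scale": 1.0 - step,        # speech rate is slowed
        "pitch_scale": 1.0 - step,       # pitch is lowered
    }
```

Each level moves all factors in the directions listed above, so a higher level yields louder, slower, lower-pitched speech set off by a phrase break.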

Conclusion

CereVoice uses a combination of pre-recorded speech styles and DSP techniques to simulate emotion and careful speech. The emotional continuum is much larger than can be reached with this method, and we have shown the areas that can be simulated effectively.

A critical element of careful speech is the ability to mark important, information-bearing sections of speech so that they can be manipulated individually. A great deal of clarity can be added simply by inserting short phrase breaks in appropriate places.