17/02/2014 CereProc's Chief Scientific Officer Matthew Aylett on Speech Synthesis Personification Technology in Spike Jonze's film 'Her'
The film Her explores the relationship between self, technology and love through the relationship between a lonely man and Samantha, his computerised personal assistant: a sort of super Siri with full artificial intelligence and the husky, beautiful voice of Scarlett Johansson. The film “explores the evolving nature - and the risks - of intimacy in the modern world.” Her was released in UK cinemas on Valentine's Day.
Her raises interesting questions about our relationship to technology and to each other: in particular, the idea of giving technology a personality, who controls this technology, and whether such technology is a barrier or a facilitator for human relationships.
We will not spoil your enjoyment of the film by discussing its conclusions. Instead, we will look at the technology actually available today and, as experts in speech synthesis (one area of the technology required to build Samantha), at the choices, dangers and opportunities that such technology offers.
Personification is the act of giving a non-human object human qualities or abilities. 'Personification Technologies' can be regarded as the combination of speech technology, multimodal interfaces, embodied conversational agents, knowledge representation and inference, and human language technology required to produce an agent which can be personified by a user during a task.
In Her, personification technology is used as a means to deal with some age-old themes and, by doing so, to give them context and relevance in our modern world. Perhaps the most poignant is the cautionary tale of loving something not human; the most observant concerns how technology is changing the way we interact with each other; and the most sinister considers where the power lies in controlling the technology we consume. The extent to which Personification Technologies are directly relevant to these themes varies. Although the idea originally came to Spike Jonze when he interacted with a chatbot, the advent of Apple's Siri has made the idea seem deliciously close to reality.
Indeed, from a commercial perspective, you can view Siri as being about power. It's about controlling a direct channel to the user and dominating it. Controlling such a direct channel is one way of making a lot of money: Google have their search engine, Apple have iTunes, Amazon have their store. This direct relationship with users is seen as so powerful that companies like Twitter and Facebook, with moderate advertising incomes, have been valued in the billions. Imagine how pleased Google must have been for Siri to use their search engine, and present the result to the user without the user seeing any of their ads.
As a result, large US corporates have been on a buying spree for the technology required to make such personified agents, in particular speech and language technology. While academics within speech technology are welcoming this new interest in their field of expertise, they are also seeing an aggressive recruitment drive of automatic speech recognition (ASR) researchers into industry. Apple set up its first ever R&D lab outside Cupertino in Boston, and it’s a speech lab; Amazon has bought ASR and speech synthesis companies; and Google purchased a speech synthesis company as far back as 2010. Thus, for engineers like us, the theme in Her that focuses on who controls this technology, and what their motives are for deploying it, is timely.
New approaches to capturing, sharing and manipulating information in sectors such as health care and the creative industries require computers to enter the arena of human social interaction. Users readily adopt a social view of computers, and previous research has shown how this can be harnessed in applications such as health advice, tutoring, or helping children overcome bullying. These applications depend on 'Personification Technology' in order to simulate the characteristics of a person, so that users can interact using the day-to-day strategies of human communication. This part of the application is sometimes referred to as a 'Conversational Agent' (CA), just like Samantha in Her.
Going beyond practical applications, Personification Technology can also help give meaning and control back to users. Ambiguity is a core part of the complexity and subjectivity of our lives. For thousands of years, art, music, drama and storytelling have helped us understand, come to terms with, and express the complexities of our existential experience. Technology has long played a pivotal role in this artistic process, for example the role of optics in the development of perspective in painting and drawing, or the effect of film on storytelling. However, current computer interfaces that help us understand, come to terms with, and mediate the explosion of electronic data and electronic communication that now exists are generally limited to the mundane. Whereas getting the height of Everest in metres is a trivial search request (8,848m by the way, from a Google search), googling the question ‘What is love?’ returns (in the top four) two popular newspaper articles, a YouTube video of Haddaway and a dating site. It is, of course, an unfair comparison. Google is not designed to offer responses to ambiguous questions with no definite answers. In contrast, traditional forms of art and artistic narrative have done so for centuries. Personification Technologies offer a way of bridging the gap between the information we have available and the meaning it has for us.
Technology has become part of our social fabric and, as such, it needs to be able to engender playfulness and enrich our sense of experience. In doing so it must attempt to reconnect with our human experience. When we give a child a teddy bear we don't agonise over whether it will replace the child's love for her friends and family. We happily use the bear as a personified agent, talking to it with our children and using it as a means of play and communication. Often you can hear your child talking to the bear, going over social experiences and social constraints. In this way the bear is also a social tool to help a child interpret and make sense of his or her world (as well as being cosy to cuddle).
Adding a rubber handle to a hammer doesn't make us believe we can bang in nails with our hands; it just makes the tool more suited to its role. Adding fur to a child's teddy bear doesn't make us believe it's a real bear; it just adds to the child's ability to personify the bear, build a social relationship with it, and hug it. Allowing us to talk to a computer or mobile phone, and simulating a personality, can be viewed as simply adding the 'rubber handle' to our tools. Adding such functionality to our tools can help them work in the social domain. This allows us to use these tools to help people in this domain, such as children who are suffering, or perpetrating, bullying.
The use of such techniques need not be as grandiose and all-encompassing as Samantha in Her. For example, MyMyRadio is a tool in this social sphere which tries to take users away from constant clicking and updating, and to remove that little screen between us and the beautiful world we live in. MyMyRadio takes social media and news and repackages them as tiny podcasts in between music tracks on an iPhone. It uses a characterful synthetic voice to make this feel like a personal radio station and aims to reintroduce a sense of discovery and serendipity to our experience of social media. The sense of personification in MyMyRadio, created using speech synthesis, is key to its success.
After all, computers are just tools. And just as stories are used as social tools to help us understand ourselves and our world, Personification Technologies can help computers function in this social space. They can give us more time to enjoy the world around us, give us a sense of control over the information we want to access, and help us to enjoy their use.
Bwaa Haahahahah - and build super intelligent evil robots of course...
Technology required for Her
Artificial Intelligence (AI)
There is a difference between being intelligent and simulating intelligence. Simulating intelligence within constraints is tractable. For example, if an application responds appropriately to user input, has a way of modelling the user and 'remembers' what the user is doing, then it can appear to be intelligent. For Samantha the constraints are very broad, which puts such a mimicry of intelligence way beyond current techniques. However, the context of Her is constrained: a personal assistant on a mobile phone. Providing intelligence beyond the Siri-like is therefore tractable and also scalable. A system could get smarter about, say, selecting information to present to you, knowing when to interrupt you, learning how to address a user, and responding to predictable interactions in a way that appears intelligent. Currently, if the user takes a step out of these expected contexts, the appearance of intelligence will collapse like a house of cards. A key issue is whether the user is searching for that collapse of intelligence, or is willing to play along with the nicety of a mimicked artificial intelligence. After all, people chit-chat endlessly without really needing to know about the complex internal state of a conversational partner.
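As a rough illustration of this point, here is a minimal Python sketch of an assistant that only appears intelligent inside a narrow context. Everything in it (the `UserModel` and `ConstrainedAssistant` names, the handful of supported phrases) is a hypothetical example of ours and not how Siri or any real product works. It keeps a tiny model of the user, remembers what was said, and visibly falls over as soon as the user steps outside what it expects.

```python
from dataclasses import dataclass, field


@dataclass
class UserModel:
    """Hypothetical, minimal model of the user: a name and what they have said."""
    name: str = "there"
    history: list = field(default_factory=list)


class ConstrainedAssistant:
    """Toy assistant that only handles greetings, names and simple recall."""

    def __init__(self):
        self.user = UserModel()

    def handle(self, text: str) -> str:
        lowered = text.lower().strip()
        self.user.history.append(lowered)

        # Learn how the user wants to be addressed.
        if lowered.startswith("call me "):
            name = text.strip()[len("call me "):].strip(" .!")
            self.user.name = name or "there"
            return f"Okay, {self.user.name}."

        # 'Remember' the previous utterance.
        if "what did i just say" in lowered:
            if len(self.user.history) < 2:
                return "You have not said anything yet."
            return f'You said: "{self.user.history[-2]}"'

        if lowered in {"hello", "hi"}:
            return f"Hello, {self.user.name}."

        # Outside the expected context the illusion collapses.
        return "Sorry, I can only help with greetings and names."


assistant = ConstrainedAssistant()
print(assistant.handle("Call me Theodore."))
print(assistant.handle("hello"))
print(assistant.handle("What is love?"))
```

Within its three supported interactions the toy seems attentive; the last line shows how quickly that impression disappears.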
We can regard AI as how a system should respond to input. For Her, that input is mostly the user's speech and the response is output speech. Thus we need to convert input speech into a form which models meaning and relates this to previous input and the system's internal state. For example, if a user says the same thing in different ways, the system needs to produce a similar analysis for the AI to work with. It then needs to take the output from the AI, in terms of actions, intentions and information, and produce a grammatical and correct utterance. The better the language understanding, the more appropriate the output, and the better the mimicry of intelligence. Language understanding has become sophisticated, and is often able to deal with open-domain input within a constrained set of contexts. Again, Siri is a good example of this. However, relating multiple inputs and modelling the meaning of a conversation is still limited to very constrained tasks. Less work has explored the tricks of controlling a dialogue to guide the user to tasks that the system is able to perform. Eliza, the famous 1960s chatbot that played the part of a therapist, was mostly focused on doing this. Such approaches can be effective, but can also leave a user feeling like they are playing a parlour game.
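To make the idea of 'the same thing said in different ways' concrete, here is a small Python sketch. It is a deliberately crude, keyword-based stand-in for real language understanding, and the intent names, patterns and replies are all invented for illustration. Several phrasings map to one intent label, and anything unrecognised is steered back towards tasks the system can perform, in the spirit of the dialogue-control trick described above.

```python
import re

# Hypothetical intents: each maps several surface phrasings to one label.
INTENT_PATTERNS = {
    "get_weather": [
        r"\bweather\b",
        r"\bis it (?:going to )?rain",
        r"\bdo i need an umbrella\b",
    ],
    "set_alarm": [
        r"\bwake me\b",
        r"\bset (?:an |the )?alarm\b",
    ],
}


def normalise(utterance: str) -> str:
    """Lower-case and replace punctuation with spaces to reduce surface variation."""
    return re.sub(r"[^a-z0-9 ]+", " ", utterance.lower()).strip()


def analyse(utterance: str) -> str:
    """Return the first intent whose patterns match, or 'unknown'."""
    text = normalise(utterance)
    for intent, patterns in INTENT_PATTERNS.items():
        if any(re.search(pattern, text) for pattern in patterns):
            return intent
    return "unknown"


def respond(utterance: str) -> str:
    """Turn the analysis into an utterance, guiding the user when lost."""
    intent = analyse(utterance)
    if intent == "get_weather":
        return "Let me check the forecast."
    if intent == "set_alarm":
        return "What time should I wake you?"
    # Steer the user back to something the system can actually do.
    return "I'm not sure about that. Would you like the weather, or an alarm?"


for phrase in ["What's the weather like?", "Do I need an umbrella today?", "What is love?"]:
    print(phrase, "->", analyse(phrase), "->", respond(phrase))
```

The first two phrasings produce the same analysis, while the third falls through to the guiding prompt, which is exactly where the parlour-game feeling can creep in.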
Automatic speech recognition (ASR) has come in for some hard criticism over the years. In part, this is connected with the frustration of not being able to communicate successfully with technology, and perhaps even more with the technological hell that is the automated phone system (which, incidentally, came top in Wired's 12 most annoying technologies in 2012). In Her, word recognition would not be the main problem. Given predictable interactions, and a single user's voice to model, speech recognition would perform well for neutral speech input. The real problem would be dealing with speech which is emotionally charged, and recognising the emotions in that speech. This is still a very hard problem. Some automatic phone systems will try to recognise if a customer is upset so they can deal with this more effectively, but a nuanced understanding of a specific user's emotional state is far beyond the current state of the art.
Siri is not a great advert for state-of-the-art speech synthesis (although probably a better advert than Stephen Hawking). Current commercial systems can produce excellent, natural-sounding results in neutral contexts which convey a sense of character. Copying Scarlett Johansson's voice is not the hard bit; in fact, if there were no legal implications, we could do exactly this using the audio book “The Dive from Clausen's Pier”, which she narrated back in 2002. The real challenge is producing natural-sounding emotional output. In the film trailer, the voice of Samantha is immediately emotional and intimate. Provided the emotional variation is reasonably constrained, and given that we might predict much of the emotional content in advance, we could, if we were able to record Scarlett Johansson, make a good attempt at reproducing such constrained emotional speech. (We are not sure how much Scarlett would charge for a 45-hour recording session to be used for any purpose, as her agent never got back to us.) The output would not be as flawless and would not be able to act, but the real challenge would be how to predict and control the emotional response.