Voice Assistants: Whose voices do we hear when we listen to Siri or Alexa?
Experts in telecommunications, industrial engineering, and IT have developed the artificial intelligence that gives Google Assistant, Siri, and Bixby their voices. The most common techniques used to generate responses are based on templates with variable parameters.
Having a voice assistant at home or in the office is increasingly commonplace. They come integrated with smartphones, smart speakers, and other devices such as televisions and video game consoles. Every day, millions of users listen to the voices of Siri, Alexa, or Google Assistant. Yet despite this frequent interaction, these voices remain virtual strangers to most users. Who are the masterminds who have given these assistants a voice? How are artificial voices created? How are the messages and feedback composed?
The use of artificial intelligence is the most advanced technique for "teaching" a voice assistant to talk. As Fernando Cerezal, an innovation engineer in the tech lab at BBVA Next Technologies, a BBVA company specializing in software engineering, explains: "It involves recording someone reading a given text with as much lexical variety as is reasonable. Then the recording, along with the original text, marked up to reflect elements of prosody (the branch of linguistics that deals with intonation, stress, and pronunciation), is fed to the artificial intelligence tool, which is asked to imitate the recorded voice." Through this kind of iterative training, the artificial intelligence learns to "talk" just like a person, and it can likewise learn to read texts that differ from those used to train the system.
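For illustration only, here is a minimal Python sketch (using PyTorch) of the kind of training loop Cerezal describes: a network is repeatedly asked to imitate the audio features of a recorded voice given marked-up text. Everything here is a hypothetical toy, from the model and vocabulary size to the random placeholder data; production systems train far larger architectures on large corpora of text and recordings.

```python
# Toy sketch of the training process described above: a network learns to
# map marked-up text to audio features by imitating a recorded voice.
# All data here is a random placeholder standing in for real recordings.
import torch
import torch.nn as nn

VOCAB_SIZE = 64   # characters plus prosody markup symbols (assumed)
MEL_BINS = 80     # mel-spectrogram features extracted from the recording

class ToyTTS(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, 128)
        self.rnn = nn.GRU(128, 256, batch_first=True)
        self.to_mel = nn.Linear(256, MEL_BINS)

    def forward(self, text_ids):
        h, _ = self.rnn(self.embed(text_ids))
        return self.to_mel(h)  # one mel frame per input symbol (toy alignment)

model = ToyTTS()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batch: marked-up text as symbol ids, and target mel frames
# that would come from the human recording of that same text.
text_ids = torch.randint(0, VOCAB_SIZE, (8, 50))
target_mels = torch.randn(8, 50, MEL_BINS)

for step in range(100):  # iterative training: imitate the recorded voice
    pred = model(text_ids)
    loss = nn.functional.l1_loss(pred, target_mels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```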
In other words, the timbre, intonation, and intensity of these virtual assistant voices are generated by artificial intelligence that tries to imitate a person who has previously pronounced similar texts. The process is repeated for each language. Cerezal stresses that a lot of information about pronunciation and prosody is not in the text itself. "As a simple example, in English there are many words that are written very similarly but pronounced quite differently. Intonation is also very important; in tonal languages like Chinese, it can change the meaning of a phrase," Cerezal explains.
Assistants can perform multiple tasks: checking items on the calendar, looking up facts and figures, searching for recipes, controlling what’s on TV, regulating a room’s lighting or temperature, doing the shopping, or putting on appropriate music, whether for a party or for studying. Two years ago, BBVA became one of the first banks to provide this kind of functionality. In its case, the purpose of the BBVA virtual assistant is to help customers better understand and manage their finances. Configuring account settings and performing administrative tasks in the customer area are some of the more frequent queries customers make.
According to Cerezal, the most common techniques for generating responses are template-based: a set expression is written with a parameter that can vary. This creates "the perception of personal expression, even if it's rather mechanical." Several alternative templates can also be included for the assistant to cycle through "in order to give some variability and cut down on the robotic airs."
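As a rough illustration, the template mechanism Cerezal describes might look like the following Python sketch. The template wordings and the balance_response helper are invented for the example, not any assistant's actual implementation.

```python
# Minimal sketch of template-based response generation: fixed phrasings
# with a variable slot, plus several alternatives the assistant rotates
# through to "cut down on the robotic airs". Template texts are invented.
import random

BALANCE_TEMPLATES = [
    "Your balance is {amount} euros.",
    "You currently have {amount} euros in your account.",
    "There are {amount} euros in your account right now.",
]

def balance_response(amount):
    """Fill the varying parameter into one of several set expressions."""
    template = random.choice(BALANCE_TEMPLATES)
    return template.format(amount=amount)

print(balance_response("1,250.40"))
# e.g. "You currently have 1,250.40 euros in your account."
```

Rotating among several phrasings of the same message is what provides the variability Cerezal mentions; a single fixed template would answer identically every time.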
Cerezal maintains that behind these assistants are experts in telecommunications, industrial engineering, and information technology. As an example, Amazon explains on its website that thousands of developers contributed to making Alexa functional. Among the professionals the tech giant has hired are software engineers, systems analysts, linguists, and writers.
Companies in the field are relying on these profiles to improve the voice, breadth of ability, and general performance of these assistants, simplifying, for example, life at home. The number of smart speakers worldwide keeps growing, and Statista forecasts a total of 3.25 billion virtual assistants in use in 2019, reaching 5.11 billion in 2021 and 8 billion in 2023.
The voice of the virtual assistant has evolved over the years. Cerezal points out that the earliest approach associated a sound with each syllable and then tried to compose words from them: "Because the voice generation was syllable-focused, the resulting sentences were monotone, dull, and grew boring very quickly. Furthermore, to handle words that are written the same but pronounced differently, hand-written rules were created to change the pronunciation."
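A toy Python sketch of that earlier, syllable-based approach might look like this. The syllable clips are stand-in sine tones and the rule table is invented, but the sketch shows both ideas Cerezal mentions: fixed per-syllable clips concatenated together (which made the voices monotone) and hand-written pronunciation rules for words spelled the same but said differently.

```python
# Sketch of early syllable-based synthesis: each syllable maps to one
# fixed sound clip, and hand-written rules pick the pronunciation for
# homographs. Sine tones stand in for recorded syllable clips.
import numpy as np

SAMPLE_RATE = 16000

def tone(freq, dur=0.2):
    """Stand-in for a recorded syllable: a short fixed tone."""
    t = np.linspace(0, dur, int(SAMPLE_RATE * dur), endpoint=False)
    return 0.3 * np.sin(2 * np.pi * freq * t)

SYLLABLE_BANK = {"re": tone(220), "cord": tone(260), "cord2": tone(300)}

# Hand-written rules: same spelling, different pronunciation by context.
PRONUNCIATION_RULES = {
    ("record", "noun"): ["re", "cord"],
    ("record", "verb"): ["re", "cord2"],
}

def synthesize(word, part_of_speech):
    syllables = PRONUNCIATION_RULES[(word, part_of_speech)]
    # Concatenating identical fixed clips every time is why the
    # resulting speech sounded monotone and dull.
    return np.concatenate([SYLLABLE_BANK[s] for s in syllables])

audio = synthesize("record", "verb")
```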
Cerezal goes on to explain that, over time, other improvements were introduced to eliminate manual work, producing a more flexible voice and a more natural feel. The current approach relies on deep learning, "using data that are easy to generate, so a computer learns to generate the appropriate sounds."
Cerezal believes that "there is no ideal, generic voice. There could be voices that are more appropriate in certain cases. But because third-party applications might use these assistants, and the voice has to be intelligible to everyone, a more or less neutral voice is used. This voice ends up too serious if it's asked to tell jokes, or too distant if it's asking how you feel today, but it fits well with the majority of use cases."
He also contemplates a future in which assistants change their intonation in specific circumstances so that users find them more empathetic. "As people, we humanize virtual assistants and, depending on what they tell us and how, we project a personality onto them. The way the assistant expresses itself has to be consistent with its voice," he says.
According to Lucas Menéndez, also an engineer in the tech lab at BBVA Next Technologies, the user experience challenge that virtual assistants now face is one of personality. "It's not only about equipping a voice assistant with a dynamic personality; it's about getting that personality to evolve in step with user interaction, adapting itself to the context of the interaction."