New AI Mimics Any Voice in a Matter of Minutes
The story starts out like a bad joke: Obama, Clinton and Trump walk into a bar, where they applaud Lyrebird, a new startup based in Montreal, Canada.
If the scenario seems too bizarre to be real, you’re right—it’s not. The entire recording was generated by a new AI with the ability to mimic natural conversation, at a rate much faster than any previous speech synthesizer.
Announced last week, Lyrebird’s program analyzes a single minute of voice recording and extracts a person’s “speech DNA” using machine learning. From there, it adds an extra layer of emotion or special intonation until it nails a person’s voice, tone and accent, be it Obama, Trump or even you.
While Lyrebird’s output still retains the slight but noticeable robotic buzz characteristic of machine-generated speech, some smartly placed background noise could cover up the distortion, and the recordings could pass as genuine to unsuspecting ears.
Creeped out? You’re not alone. In an era where Photoshopped images run wild and fake news swarms social media, a program that can make anyone say anything seems like a catalyst for more trouble.
Yet people are jumping on. According to Alexandre de Brébisson, a founder of the company and current PhD student at the University of Montreal, their website scored 100,000 visits on launch day, and the team has attracted the attention of “several famous investors.”
While machine-fabricated speech sounds like something straight out of a Black Mirror episode, speech synthesizers—like all technologies—aren’t inherently malicious.
For people with speech disabilities or paralysis, these programs provide a voice to communicate with. For the blind, they provide a way to tap into the vast text-based resources on paper or online. AI-based personal assistants like Siri and Cortana rely on speech synthesizers to create a more natural interface with users, while audiobook companies may one day use the technology to generate products automatically and cheaply.
“We want to improve human-computer interfaces and create completely new applications for speech synthesis,” explains de Brébisson to Singularity Hub.
Lyrebird is only the latest push in a long line of research towards natural-sounding speech synthesizers.
The core goal of these programs is to transform text into speech in real time. It’s a two-pronged problem: for one, the AI needs to “understand” the different components of the text; for another, it has to generate appropriate sounds for the input text in a way that isn’t cringe-inducing.
Analyzing text may seem like a strange way to tackle speech, but much of our intonation for words, phrases and sentences is based on what the sentence says. For example, questions usually end with a rising pitch, and words like “read” are pronounced differently depending on their tense.
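To make the idea concrete, here is a toy sketch of that text-analysis step in Python. It is only an illustration, not how any real synthesizer works: the function names and the list of cue words are made up for this example, and real systems use far richer linguistic analysis.

```python
import re

# Hypothetical cue words hinting that "read" is in the past tense.
# This is a toy heuristic and will misfire on many real sentences.
PAST_CUES = {"yesterday", "had", "has", "have", "already", "was", "were"}

def intonation(sentence):
    """Return a coarse pitch label: questions rise, everything else falls."""
    return "rising" if sentence.strip().endswith("?") else "falling"

def pronounce_read(sentence):
    """Guess whether 'read' sounds like 'reed' (present) or 'red' (past)."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    return "red" if words & PAST_CUES else "reed"

print(intonation("Are you coming?"))
print(pronounce_read("I read that book yesterday."))
print(pronounce_read("I will read it tomorrow."))
```

Even this crude version shows why the text has to be analyzed before any sound is produced: the same spelling can call for a different pitch or pronunciation depending on the rest of the sentence.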
But of the two, generating the audio output is arguably the harder task. Older synthesizers rely on algorithms to produce individual sounds, resulting in the characteristic robotic voice.
These days, synthesizers generally start with a massive database of audio recordings by actual human beings, splicing together voice segments smoothly into new sentences. While the output sounds less robotic, for every new voice—switching from female to male, for example—the software needs a new dataset of voice snippets to draw upon.
Because the voice databases need to contain every possible word the device uses to communicate with its user (often in different intonations), they’re a huge pain to construct. And if there’s a word not in the database, the device stumbles.
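A rough sketch of that splicing approach, and of how it stumbles, might look like the following. The database contents and the `splice` function are hypothetical, and file names stand in for actual audio clips.

```python
# Hypothetical voice database: each known word maps to one recorded clip.
VOICE_DB = {
    "hello": "hello.wav",
    "how": "how.wav",
    "are": "are.wav",
    "you": "you.wav",
}

def splice(sentence, db):
    """Look up a clip for each word; collect the words that have no clip."""
    clips, missing = [], []
    for raw in sentence.lower().split():
        word = raw.strip(".,!?")
        if word in db:
            clips.append(db[word])
        else:
            missing.append(word)
    return clips, missing

print(splice("Hello, how are you?", VOICE_DB))
print(splice("Hello Montreal", VOICE_DB))
```

The second call returns a clip for "hello" but flags "montreal" as missing, which is the stumble described above. A new voice would need its own copy of the whole database.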
Lyrebird takes a different approach. By listening to voice recordings, the AI learns the pronunciation of letters, phonemes and words. Like someone learning a new language, Lyrebird then uses its learned examples to extrapolate to new words and sentences, even ones it has never encountered before, and layers emotions such as anger, sympathy or stress on top.
At its core, Lyrebird is a multi-layer artificial neural network, a type of software that loosely mimics the human brain. Like their biological counterparts, artificial networks “learn” through example, tweaking the connections between each “neuron” until the network generates the correct output. Think of it as tuning a guitar.
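The "tuning" can be sketched with the smallest possible network: a single artificial neuron with one connection weight and one bias, adjusted by gradient descent. This is a generic textbook illustration, not Lyrebird's model, and the training data, learning rate and number of passes are arbitrary choices for the example.

```python
def train(examples, lr=0.1, epochs=200):
    """Nudge one weight and one bias until predictions match the targets."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in examples:
            error = (w * x + b) - target  # how far off the current output is
            w -= lr * error * x           # tweak the connection strength
            b -= lr * error               # tweak the bias
    return w, b

# Toy examples generated from the rule y = 2x + 1.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = train(examples)
print(round(w, 3), round(b, 3))
```

After enough passes the weight and bias settle close to 2 and 1, the rule behind the examples. A deep network does the same thing with millions of connections instead of two numbers.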
As with other deep learning technologies, the initial training requires hours of voice recordings and many iterations. But once trained on one person’s voice, the AI can produce a passable imitation of another voice, at thousands of sentences per second, using just a single minute of a new recording.
That’s because different voices share a lot of similar information that is already “stored” within the artificial network, explains de Brébisson. So it doesn’t need many new examples to pick up on the intricacies of another person’s speaking voice—his or her voice “DNA,” so to speak.
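One way to picture this, purely as an assumption for illustration, is a model whose main weight is shared across all voices while each speaker contributes only a small offset. The source does not describe Lyrebird's actual mechanism; the shared weight, the sample values and both function names below are invented.

```python
# Hypothetical weight shared by all voices, learned from hours of recordings.
SHARED_WEIGHT = 2.0

def adapt_offset(samples, weight=SHARED_WEIGHT):
    """Fit only a per-speaker offset from a handful of (input, output) pairs."""
    residuals = [y - weight * x for x, y in samples]
    return sum(residuals) / len(residuals)

def predict(x, offset, weight=SHARED_WEIGHT):
    """Combine the shared knowledge with the speaker-specific offset."""
    return weight * x + offset

# Three samples stand in for the "single minute" of a new voice.
new_speaker = [(1.0, 2.5), (2.0, 4.5), (3.0, 6.5)]
offset = adapt_offset(new_speaker)
print(offset, predict(4.0, offset))
```

Because the shared weight already captures what voices have in common, three samples are enough to pin down the one number that is specific to the new speaker.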
Although the generated recordings still have an uncanny-valley quality, de Brébisson stresses that it will likely fade with more training examples.
“Sometimes we can hear a little bit of noise in our samples, it’s because we trained our models on real-world data and the model is learning the background noise or microphone noise,” he says, adding that the company is working hard to remove these artifacts.
Adding little “extra” sounds, like lip smacking or taking a breath, could also make machine speech more realistic.
These “flaws” actually carry meaning and are picked up by the listener, says speech researcher Dr. Timo Baumann at Carnegie Mellon University, who is not involved with Lyrebird.
But both de Brébisson and Baumann agree that the hurdles are simple. Machines will be able to convincingly copy a human voice in real time within just a few years, they say.
De Brébisson acknowledges that mimicking someone else’s voice can be highly problematic.
Fake news is the least of it. AI-generated voice recordings could be used for impersonation, raising security and privacy concerns. Voice-based security systems would no longer be safe.
While Lyrebird is working on a “voice print” that will make it easy to tell originals from generated recordings, it’s unreasonable to expect people to look for such a mark in every recording they come across.
Then there are slightly less obvious concerns. Baumann points out that humans instinctively trust sources with a voice, especially if it’s endowed with emotion. Compared to an obvious synthetic voice, Lyrebird is much easier to connect with, like talking to an understanding friend. While these systems could help calm people down during a long wait on the phone, for example, they’re also great tools for social engineering.
People would be more likely to divulge personal information or buy things the AI recommends, says Baumann.
In a brief statement on its website, Lyrebird acknowledges these ethical concerns but also stresses that ignoring the technology isn’t the way to go. Rather, education and awareness are key, much like when Photoshop first entered the social consciousness.
“We hope that everyone will soon be aware that such technology exists and that copying the voice of someone else is possible,” they write, adding that “by releasing our technology publicly and making it available to anyone, we want to ensure that there will be no such risks.”
Lyrebird is too optimistic in completely discounting the risks. Without doubt, fake audio clips are coming, and left unchecked, they could wreak havoc. But although people are still adapting to fake images, fake news and other fabricated information that warps our reality, the discussion about alternative facts has entered the societal mainstream, and forces have begun pushing back.
Like the delicate voice-mimicking songbird it’s named after, Lyrebird is a wonder—one that we’ll have to handle with thought and care.