A new way to make AI-generated voices more expressive

Newswise – Researchers have found a way to make AI-generated voices, such as those of digital personal assistants, more expressive with a minimal amount of training. The method, which translates text to speech, can also be applied to voices that were never part of the system’s training set.

The team of computer scientists and electrical engineers at the University of California at San Diego presented their work at the ACML 2021 conference, which was held online recently.

In addition to personal assistants for smartphones, homes and cars, the method could help improve voiceovers in animated films, automatic speech translation in multiple languages, and more. The method could also help create personalized voice interfaces that empower people who have lost the ability to speak, similar to the computerized voice Stephen Hawking used to communicate, but much more expressive.

“We’ve been working in this area for quite a long time,” said Shehzeen Hussain, a Ph.D. student at the UC San Diego Jacobs School of Engineering and one of the paper’s lead authors. “We wanted to examine the challenge of not only synthesizing the speech, but adding expressive meaning to that speech.”

Existing methods fall short in one of two ways. Some systems can synthesize expressive speech for a specific speaker, but need several hours of training data from that speaker. Others can synthesize speech from just a few minutes of speech data from a speaker never encountered before, but they cannot generate expressive speech and only translate text into speech. By contrast, the method developed by the UC San Diego team is the only one that can generate, with minimal training, expressive speech for a speaker who was not part of its training set.

The researchers flagged the pitch and rhythm of speech in training samples as a proxy for emotion. This allowed their cloning system to generate expressive speech with minimal training, even for voices it had never encountered before.
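To make the idea of pitch as a measurable feature concrete, here is a minimal sketch of estimating the pitch of a short audio frame by autocorrelation. It is an illustration only and not the team's code: the article does not say how pitch was extracted, and the function name `estimate_pitch_hz`, its parameters, and the synthetic 200 Hz test tone are all assumptions made for this example.

```python
import math

def estimate_pitch_hz(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame by autocorrelation.

    Hypothetical helper for illustration; fmin and fmax bound the search
    to a rough range for human speech.
    """
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # The lag with the strongest self-similarity approximates one pitch period.
    return sample_rate / best_lag if best_lag else None

# Synthetic stand-in for a voiced speech frame: a pure 200 Hz tone.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(1024)]
pitch = estimate_pitch_hz(tone, sample_rate)
print(pitch)  # should land close to 200 Hz
```

Running this over successive frames of a recording would give a pitch contour, one simple numeric description of how a voice rises and falls over time.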

“We demonstrate that our proposed model can make a new voice express, emote, sing or copy the style of a given reference speech,” the researchers write.

Their method can learn speech directly from text, reconstruct a speech sample from a target speaker, and transfer the pitch and rhythm of speech from a different expressive speaker into cloned speech for the target speaker.

The team is aware that their work could be used to make deepfake videos and audio clips more accurate and persuasive. As a result, they plan to release their code with a watermark that will identify speech created by their method as cloned.

“Expressive voice cloning would become a threat if you could create natural intonations,” said Paarth Neekhara, the paper’s other lead author and a Ph.D. student in computer science at the Jacobs School. “The biggest challenge here is detecting these media and we’ll focus on that next.”

The method itself still needs to be improved. It is biased toward English speakers and struggles with speakers who have a strong accent.

Audio examples: https://expressivecloning.github.io/

Expressive neural voice cloning


Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley, University of California at San Diego


