Forum Innovation 2023

Vall-E

Reproduce any voice from 3 seconds of audio

Where?  China

What? 
Vall-E can simulate anyone's voice from a 3-second audio sample. The AI, developed by Microsoft, is a "neural codec language model" built on EnCodec, a neural audio codec released by Meta in October 2022. The text-to-speech model can preserve the speaker's emotional tone and acoustic environment. Vall-E (a reference to the uncanny valley?) analyzes how a person sounds, breaks that information into discrete "tokens", and draws on its training data to predict how that voice would sound speaking other phrases. The 3-second sample therefore plays the same role as the text prompt does for GPT-3, for example. Since the technology could be used for deepfakes, the researchers say a detection model could be built to identify synthesized audio, similar to the "watermark" planned for ChatGPT. Uses include speech editing as well as interfaces for retail services or gaming.
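
To make the "tokens" and "prompt" idea concrete, here is a minimal toy sketch in Python. It is not Vall-E or EnCodec code: the four-entry codebook, the frame size, the "<sep>" marker and the function names tokenize and build_prompt are all hypothetical, chosen only to illustrate how audio can be turned into discrete tokens and placed next to the text to be spoken.

```python
import math

# Hypothetical toy codebook: each value stands in for a learned codec code.
# A real neural codec learns far larger codebooks from data.
CODEBOOK = [-0.75, -0.25, 0.25, 0.75]
FRAME_SIZE = 4  # samples per frame (toy value, an assumption)


def tokenize(samples):
    """Map each full frame of samples to the index of the nearest codebook value."""
    tokens = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = samples[start:start + FRAME_SIZE]
        mean = sum(frame) / FRAME_SIZE
        # Pick the codebook entry closest to the frame's average level.
        nearest = min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - mean))
        tokens.append(nearest)
    return tokens


def build_prompt(text_tokens, voice_tokens):
    """Join the text to be spoken and the voice-sample tokens into one sequence."""
    return list(text_tokens) + ["<sep>"] + list(voice_tokens)


# A short synthetic "voice sample": one period of a sine wave.
sample = [math.sin(2 * math.pi * n / 16) for n in range(16)]
voice_tokens = tokenize(sample)
prompt = build_prompt(["hel", "lo"], voice_tokens)
print(prompt)
```

In this sketch the voice sample becomes a short list of integers, and a language model would be asked to continue the sequence with new audio tokens that a decoder turns back into sound. That last step is left out here.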