Vall-E

Name: Forum Innovation 2023
Start: 2023-04-18T09:00:00+02:00
End: 2023-04-18T17:00:00+02:00

Reproduce any voice from 3 seconds of audio

Where? China

What?

Vall-E can simulate anyone’s voice from 3 seconds of audio. The AI, developed by Microsoft, is a "neural codec language model" built on EnCodec, the audio codec Meta released in October 2022. The text-to-speech model can preserve the speaker's emotional tone and acoustic environment. Vall-E (a reference to the uncanny valley?) analyzes how a person sounds, breaks that information into discrete "tokens", and draws on its training data to predict how that voice would sound speaking other phrases. The 3-second sample therefore plays the same role as a text prompt does for GPT-3. Because it could be used for deepfakes, the researchers say a detection model could be built, similar to the "watermark" planned for ChatGPT. Possible uses include speech editing as well as interfaces for retail services or gaming.
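To make the idea of "discrete tokens" more concrete, here is a minimal toy sketch in Python. It is not Vall-E or EnCodec code: the tiny two-dimensional codebook, the made-up audio frames, and the names `nearest_code`, `quantize`, and `build_prompt` are all assumptions chosen for illustration. It only shows the general principle of mapping continuous audio features to integer tokens and placing them next to the text to be spoken, much as a prompt is handed to a language model.

```python
# Toy illustration only: this is NOT Vall-E or EnCodec code.
# The codebook, frames, and function names are hypothetical.

def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to `frame`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))

def quantize(frames, codebook):
    """Turn a list of continuous feature frames into discrete tokens."""
    return [nearest_code(f, codebook) for f in frames]

def build_prompt(text_tokens, acoustic_tokens):
    """Concatenate the text to speak with the tokens of the voice sample."""
    return list(text_tokens) + list(acoustic_tokens)

# A made-up codebook with four entries and four made-up audio frames.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.95, 0.9)]

tokens = quantize(frames, CODEBOOK)
prompt = build_prompt(["HH", "AY"], tokens)
print(tokens)   # [0, 1, 2, 3]
print(prompt)   # ['HH', 'AY', 0, 1, 2, 3]
```

In a real system the codebook would be learned and far larger, and a trained model would continue the token sequence before a decoder turns the result back into audio. The sketch covers only the tokenizing and prompting step.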

Check their website