
Microsoft’s AI speech generator achieves human parity but is too dangerous for the public


Too real: Microsoft has developed a new version of its neural codec language model Vall-E, dubbed Vall-E 2, that outperforms previous efforts in naturalness, speech robustness, and speaker similarity. It is the first system of its kind to achieve human parity on a pair of popular benchmarks, and it is apparently so realistic that Microsoft has no plans to make it available to the public.

Building on the original Vall-E's groundwork, Vall-E 2 integrates two important enhancements that greatly improve performance. Grouped code modeling organizes codec codes into groups, shortening the sequence length the model has to handle, which boosts inference speed and helps overcome the challenges associated with modeling long sequences.
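
To make the idea concrete, here is a minimal Python sketch of what grouping codec codes might look like. The function name, padding value, and group size are illustrative assumptions, not details of Microsoft's implementation.

```python
from typing import List

def group_codec_codes(codes: List[int], group_size: int = 2) -> List[List[int]]:
    """Partition a flat sequence of codec codes into fixed-size groups.

    Modeling one group per decoding step shortens the effective sequence
    length by roughly a factor of group_size, which speeds up autoregressive
    inference and eases long-sequence modeling.
    """
    # Pad so the sequence length is a multiple of the group size
    # (0 is used here as an arbitrary padding code).
    pad = (-len(codes)) % group_size
    padded = codes + [0] * pad
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]

# Example: seven codes grouped in pairs become four decoding steps instead of seven.
print(group_codec_codes([12, 5, 88, 3, 41, 7, 19], group_size=2))
```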

Meanwhile, repetition-aware sampling reworks the original nucleus sampling process to account for token repetition during decoding. Microsoft said this helps stabilize decoding and prevents the infinite-loop problem present in the original Vall-E.
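
The sketch below shows one way a repetition-aware variant of nucleus (top-p) sampling could work: draw a token with top-p sampling, and if that token already dominates the recent decoding history, fall back to sampling from the full distribution. The window size, threshold, and function names are assumptions for illustration, not Microsoft's exact procedure.

```python
import numpy as np

def nucleus_sample(probs: np.ndarray, top_p: float = 0.9) -> int:
    """Sample a token id from the smallest set of tokens whose total mass >= top_p."""
    order = np.argsort(probs)[::-1]          # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(np.random.choice(kept, p=kept_probs))

def repetition_aware_sample(probs: np.ndarray, history: list,
                            window: int = 10, max_ratio: float = 0.5) -> int:
    """Nucleus sample, but avoid tokens that dominate the recent decoding history."""
    token = nucleus_sample(probs)
    recent = history[-window:]
    if recent and recent.count(token) / len(recent) > max_ratio:
        # Too repetitive: resample from the full distribution instead,
        # which helps break potential infinite loops during decoding.
        token = int(np.random.choice(len(probs), p=probs))
    return token

# Toy usage: token 0 is highly probable but has dominated recent steps.
probs = np.array([0.6, 0.3, 0.1])
history = [0] * 8 + [1, 2]
print(repetition_aware_sample(probs, history))
```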

Microsoft put Vall-E 2 through its paces on the LibriSpeech and VCTK benchmarks, and it passed with flying colors. When Redmond claims the AI tool achieves human parity, it means Vall-E 2 scored higher than the ground-truth human recordings in robustness, similarity, and naturalness. In other words, the tool can produce natural speech that is virtually identical to that of the original speaker.

Microsoft shared dozens of Vall-E 2 samples, which can be found on the project overview page. Indeed, the Vall-E 2 samples are incredibly lifelike and all but indistinguishable from the human speaker. The AI tool even masters subtleties like putting emphasis on the right word in a sentence, just as people unconsciously do when they speak.

Microsoft said Vall-E 2 is purely a research project and added that it has no plans to incorporate the technology into a consumer product or release the tool to the general public. Redmond also noted that it carries a potential risk of misuse, such as impersonating a specific person or spoofing voice identification.

That said, the company believes it could have applications in education, translation, accessibility, journalism, authored content and chatbots, among others.

Image credit: Rootnot Creations.
