Microsoft pairs its text-to-speech AI with a denoising encoder, making the system far more data-efficient
Technology major Microsoft, with the support of Chinese researchers, has developed a text-to-speech artificial intelligence (AI) that can produce realistic speech from just 200 voice samples. In addition, the system can generate matching transcriptions, performing speech recognition as well as speech synthesis.
The AI relies in part on transformers, deep neural networks that loosely mimic the neurons in the brain. Transformers weigh each input and output on the fly, much like synaptic connections, which helps the system handle complex sentences effectively.
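The "weighing each input" that transformers perform is scaled dot-product self-attention. As a rough illustration (not the paper's actual model code), a minimal single-head version can be sketched in NumPy; the matrix names `Wq`, `Wk`, `Wv` are the standard query/key/value projections and are assumptions for this sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X
    of shape (seq_len, d_model). Returns the attended outputs and the
    attention weight matrix (each row sums to 1)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # how much each position attends to every other
    return weights @ V, weights
```

Each output position is a weighted mixture of all input positions, which is why long-range dependencies in a sentence are captured without recurrence.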
The system is also combined with a denoising auto-encoder, which makes the text-to-speech AI more data-efficient. While the outputs are not perfect, retaining a slightly robotic sound, they are remarkably accurate, with word intelligibility of nearly 100 per cent. Notably, this could make custom text-to-speech viable for small firms that cannot collect large voice datasets.
According to the paper, this is an almost-unsupervised method for text-to-speech (TTS) and automatic speech recognition (ASR), which uses only a small amount of paired text and speech data plus additional unpaired data. The method comprises several key components: a denoising auto-encoder, bidirectional sequence modelling, dual transformation, and a unified model structure that integrates them.
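The denoising auto-encoder component trains the model to reconstruct clean text or speech from a corrupted version of itself, which is how the unpaired data is exploited. A toy sketch of that objective (an illustration, not the paper's implementation; the masking probability and `mask_id` symbol are assumptions here):

```python
import numpy as np

def corrupt(tokens, mask_id, p=0.3, rng=np.random.default_rng(42)):
    """Randomly replace a fraction p of tokens with a mask symbol,
    producing the 'noisy' input a denoising auto-encoder must repair."""
    noisy = tokens.copy()
    mask = rng.random(tokens.shape) < p
    noisy[mask] = mask_id
    return noisy, mask

def reconstruction_loss(probs, targets):
    """Cross-entropy between the model's predicted token distributions
    (shape: seq_len x vocab) and the original clean tokens."""
    return -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-9))

# Toy usage: corrupt a clean token sequence, then score a hypothetical
# model output against the uncorrupted original.
clean = np.array([3, 1, 4, 1, 5, 9, 2, 6])
noisy, mask = corrupt(clean, mask_id=0)
```

Dual transformation then closes the loop between the two tasks: the TTS model synthesises speech for unpaired text to train the ASR model, and the ASR model transcribes unpaired speech to train the TTS model.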
The method achieves a word-level intelligibility rate of 99.84 per cent, a mean opinion score (MOS) of 2.68 for TTS, and a phoneme error rate (PER) of 11.7 per cent for ASR, with only 200 paired samples on the LJSpeech dataset, indicating the effectiveness of the method, the paper added.
In future work, the company plans to push toward the limit of unsupervised learning by leveraging unpaired speech and text data together with other pre-training methods. It also intends to replace the Griffin-Lim vocoder with an advanced neural vocoder such as WaveNet to improve the quality of the generated audio.
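Griffin-Lim, the vocoder the paper currently uses, turns a magnitude spectrogram back into audio by iteratively re-estimating the missing phase. A minimal sketch using SciPy's STFT routines (parameter choices like `n_fft` and the iteration count are assumptions; production systems typically use tuned hop sizes or a neural vocoder instead):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=512, n_iter=32, seed=0):
    """Recover a waveform from a magnitude spectrogram `mag`
    (shape: freq_bins x frames) via Griffin-Lim phase estimation:
    start from random phase, then alternate inverse-STFT and STFT,
    keeping the known magnitudes and updating only the phase."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=n_fft)   # waveform consistent with current phase
        _, _, Z = stft(x, nperseg=n_fft)           # re-analyse to get an updated phase
        phase = np.exp(1j * np.angle(Z))
    _, x = istft(mag * phase, nperseg=n_fft)
    return x
```

Because Griffin-Lim only approximates the true phase, it leaves audible artefacts, which is the motivation for swapping in a neural vocoder such as WaveNet.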