This blog post describes how software can "read" audiobooks and then use AI to create a digital voice. We have made a prototype voice-over for the Smart Medal app of TDE.
This prototype can be viewed here.
We used to listen to the radio and watch TV with the sound on, but in the 'screen world' we live in today, sound as a carrier of information has faded into the background. Websites and apps are dead silent.
Fortunately, the conversational revolution has added some decibels to our interaction with technology and information. Voice input leads to voice output (often still task-driven). More and more assistants are available in Dutch, although everything is still in its infancy. How cool would it be to have your own voice in a digital format? A digital voice. A vocal avatar. You could let a machine pronounce words as if you had spoken them yourself. Sounds creepy, right? Or does it? Could it, if properly applied, offer opportunities? For ALS patients, who are in danger of losing their voice as a result of this terrible muscle disease. For someone who reads aloud, when they cannot be there in person. For an organisation, which could literally design its tone of voice.
Handpicked labs, in cooperation with E-sites and TDE, is currently building a prototype for the Smart Medal app, in which your sports experience is enriched by an audio voice-over based on your running and race data.
For this, software has to generate speech from text. We first looked at existing frameworks that we could use right away. However, these frameworks did not meet our requirements.
The best-known text-to-speech systems are Google Text-to-Speech, IBM Watson and Amazon Polly. While the output of these systems sounds pretty natural, it is still very easy to hear that it is computer-generated, and for us it is important to create a naturally spoken story. In addition, only Google's TTS has a Dutch version, and it sounds a lot less natural than the English version. Being able to work with Dutch speech is a must for our use case.
A second requirement is that the voice should not be generic and unnatural, which is what all of these systems offer. In addition, we would like to have the opportunity to use the voice of a well-known sports reporter.
Based on these requirements, we researched machine learning systems with which a text-to-speech model can be trained, in order to control the naturalness and even the voice itself. Moreover, this is a very interesting technique in its own right that is worth investigating.
When examining machine learning systems, our attention was quickly drawn to Tacotron 2, "an end-to-end speech synthesis system by Google". This system meets our requirements precisely: it uses machine learning to generate natural speech from text, and it can adopt the voice found in the training data. The training data consists of audiobooks. The audio samples were very promising.
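To make "audiobooks as training data" a bit more concrete: a model like this learns from short audio clips that are each paired with the text spoken in them. The sketch below shows one way such pairs could be written down, assuming a pipe-separated layout similar to that of the public LJSpeech dataset (clip id, transcript, normalized transcript). The function names, the clip ids and the example sentence are hypothetical and only serve as illustration; this is not necessarily how our own data was prepared.

```python
import re


def normalize_text(text):
    """Collapse all whitespace runs into single spaces.

    Assumption: real pipelines usually do more here, such as
    expanding numbers and abbreviations into full words.
    """
    return re.sub(r"\s+", " ", text).strip()


def build_metadata(clips):
    """Turn (clip_id, transcript) pairs into metadata lines.

    Assumed layout (LJSpeech-style): clip_id|transcript|normalized transcript
    In this sketch both text fields get the same minimal clean-up.
    """
    lines = []
    for clip_id, transcript in clips:
        text = normalize_text(transcript)
        if not text:
            continue  # skip clips that have no usable text
        if "|" in text:
            raise ValueError(f"transcript of {clip_id} contains the separator '|'")
        lines.append(f"{clip_id}|{text}|{text}")
    return lines


if __name__ == "__main__":
    # Hypothetical clips cut from an audiobook chapter.
    example = [
        ("clip_0001", "Het was  een mooie\ndag."),
        ("clip_0002", "   "),  # empty transcript, will be skipped
    ]
    for line in build_metadata(example):
        print(line)
```

Each line then points to one audio file (for example clip_0001.wav) and tells the training process which text belongs to it.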
Ultimately, we chose Tacotron 2, a system that produces machine learning models that convert text into natural speech. Both the intonation and the voice are taken from the training data. In the end, a Dutch and an English model were trained with Tacotron 2. These models are of an acceptable level and can be trained further to adopt a new voice.
One obstacle, however, was that Google shared the ideas behind the technology, but not its actual implementation. Various companies and individuals have since written their own implementations of Tacotron 2 and released them as open source. After looking at these different implementations, their included audio samples and their communities, two implementations stood out as the best: one from NVIDIA and one from Rayhane Mamah.
Both implementations were then reviewed extensively, and the same training run was started on each. Although at first sight the NVIDIA implementation looked neater and therefore better, the output of Rayhane Mamah's implementation was considerably better. Based on this result, we chose the Tacotron 2 implementation of Rayhane Mamah.
Below you will find 6 examples. The first 3 are in English, the next 3 in Dutch. The Dutch audio was generated on the basis of 8 hours of audiobooks, read by Herman Koch.
It is therefore a digital version of the voice of Herman Koch.
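If you want to check how much audio you have actually collected for a voice, a few lines of code are enough. The sketch below adds up the length of all WAV clips in a folder using only the Python standard library. It assumes the audiobook has already been cut into uncompressed .wav files; the folder name in the usage example is hypothetical.

```python
import wave
from pathlib import Path


def clip_seconds(path):
    """Return the duration of a single WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_hours(folder):
    """Sum the durations of all .wav files directly inside a folder, in hours."""
    seconds = sum(clip_seconds(p) for p in sorted(Path(folder).glob("*.wav")))
    return seconds / 3600


if __name__ == "__main__":
    # "wavs" is a hypothetical folder holding the audiobook clips.
    print(f"{total_hours('wavs'):.2f} hours of training audio")
```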
We noticed that at a certain point we were running into the limits of Tacotron in Dutch. The quality of the Tacotron 2 audio leaves something to be desired. The pronunciation is certainly a lot more natural than that of the existing text-to-speech systems, but it is not always flawless, which is what a production environment requires. The voice, especially in the Dutch version, does not sound good enough yet either. We are still working to improve this. If there is no improvement, however, the question is whether the current models are good enough to eventually be used in an application with a clean-sounding voice. If not, we will have to conclude that the current Tacotron 2 techniques are not yet sufficiently stable and that we have to wait for an upgrade.
View a prototype of the Smart Medal app with Tacotron voice-over