Text-to-speech is a great tool and a key ingredient for enabling audio creation at scale. However, you’ll often hear people say that it still lacks a certain naturalness that human speakers have. What they allude to is, for the most part, a synthetic voice’s speech melody and intonation, otherwise known as prosody. As text-to-speech technology matures, this issue will eventually be solved by researchers.
However, prosody is not the only way to improve naturalness. The way we speak and how we translate text into speech is just as important. And this is what Aflorithmic’s Voice Intelligence is taking care of.
To explain the problem, let’s look at a concrete, very popular use case for text-to-speech: audio breaking news. An ideal application of synthetic speech, it lets news outlets convert existing written news items into up-to-date audio newscasts.
However, the way we write articles doesn’t always work when they are converted to audio. Text-to-speech models aren’t perfect either, and they might get the pronunciation of specific words wrong.
Let’s say a publisher wants to convert this piece of news to a paragraph in their audio breaking news podcast of the day:
“The Ironman World Championship took place this morning in ***St. George***, Utah. The swim took place in Sand Hollow Reservoir, where the fresh water was a chilly ***15 °C*** and gusts of wind reached almost ***100 km/h***. The Ironman ***(1978-2022)*** is widely considered one of the most difficult one-day sporting events in the world.”
In this text there are a number of abbreviations and specific formats that need to be converted into spoken text; I’ve marked them in bold italics. One would think that text-to-speech providers handle this conversion on their side, but the complicated truth is: it depends. This problem is called ‘normalization’.
To create audio at scale you need automation, and that means every voice must pronounce a given text in exactly the same way. For example, a written time span (1978-2022) should be spoken as ‘nineteen seventy eight until twenty twenty two’, not as ‘open brackets one thousand nine hundred seventy eight hyphen two thousand twenty two close brackets’.
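To make this concrete, here is a minimal sketch of what a normalization rule for year spans could look like. This is not how Voice Intelligence is implemented; it is an illustrative Python snippet (the function names and the regex are my own assumptions) showing the kind of deterministic rewrite a normalization layer has to guarantee across every voice:

```python
import re

# Words for 0-19 and the tens; enough for reading years in pairs of digits.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def two_digits(n: int) -> str:
    """Spell out a number from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")

def year_words(year: int) -> str:
    """Read a four-digit year as two pairs: 1978 -> 'nineteen seventy eight'."""
    hi, lo = divmod(year, 100)
    if lo == 0:
        return two_digits(hi) + " hundred"
    return two_digits(hi) + " " + two_digits(lo)

def normalize_year_spans(text: str) -> str:
    """Rewrite '(1978-2022)' as 'nineteen seventy eight until twenty twenty two'."""
    def repl(m: re.Match) -> str:
        return f"{year_words(int(m.group(1)))} until {year_words(int(m.group(2)))}"
    return re.sub(r"\((\d{4})-(\d{4})\)", repl, text)
```

The point is not the regex itself but that the rule runs *before* any voice sees the text, so every voice receives the same spoken form.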
However, depending on the voice provider, and even on the voice itself, you’ll get different results. When building audio news in an automated fashion this is clearly a roadblock, especially when you use different voices for your audio creation.
There are a multitude of cases and they all require a single holistic solution. Here are just a few:
- Time formats
- Date formats
- Currency formats
- Number formats (cardinal: 1, 2, …, 1,234)
- Ordinal formats (1st, 2nd, 3rd)
And even that is just the tip of the iceberg: depending on the language you’re working in, these formats change as well. For example, in German a number or money expression with decimals uses a comma instead of a period, and the currency symbol sits after the number (English: $14.99 vs. German: 14,99$).
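A locale-aware normalization rule therefore has to branch on the language. The sketch below is purely illustrative (the locale tags, function name, and spoken wordings are my assumptions, not part of any product API), but it shows why the same amount needs two different rules:

```python
import re

def spell_currency(text: str, locale: str) -> str:
    """Expand a currency amount into a speakable form, per locale (sketch)."""
    if locale == "en-US":
        # English: symbol first, decimal point -> "$14.99"
        return re.sub(r"\$(\d+)\.(\d{2})", r"\1 dollars \2 cents", text)
    if locale == "de-DE":
        # German: decimal comma, symbol after the amount -> "14,99$"
        return re.sub(r"(\d+),(\d{2})\$", r"\1 Dollar \2 Cent", text)
    return text
```

A real system would also spell out the digits themselves (as in the year-span example above) and cover many more currencies and locales; the branching structure is the point here.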
It gets even more complicated when you consider the specific purpose of the audio content. For example, the expression 15 °C will mostly be pronounced as ‘fifteen degrees’ in daily news, but a scientific magazine might want to add the word ‘celsius’ at the end, given that it might deal with kelvin or fahrenheit in the same article.
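In other words, normalization is not only locale-dependent but context-dependent. A sketch of that idea, with an assumed `style` parameter standing in for the publication context (again illustrative, not a real product interface):

```python
import re

def normalize_temperature(text: str, style: str = "news") -> str:
    """Expand '15 °C' into a spoken unit; the unit depends on context (sketch)."""
    # Daily news drops the scale; a scientific context keeps it explicit.
    unit = "degrees" if style == "news" else "degrees celsius"
    return re.sub(r"(\d+)\s*°C", lambda m: f"{m.group(1)} {unit}", text)
```

The same input token yields different spoken text depending on who is publishing, which is exactly why a one-off, per-voice fix doesn’t scale.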
I could go on and on, talking about tone of voice or black/whitelabeling but you get the point.
Using Voice Intelligence you can now access a single source of truth for normalization problems. It’s a single line of code you can add to your speech creation process within Aflorithmic’s api.audio.
Aflorithmic Labs, Ltd is a London- and Barcelona-based technology company. The api.audio platform enables fully automated, scalable audio production using synthetic media, voice cloning, and audio mastering, and delivers the result to any device, such as websites, mobile apps, or smart speakers.
With this Audio-as-a-Service, anybody can create beautiful-sounding audio, from simple text all the way to productions with music and complex audio engineering, with no previous experience required.
The team consists of highly skilled specialists in machine learning, software development, voice synthesis, AI research, audio engineering, and product development.