AI Text-To-Speech - Play.ht case
Text-To-Speech capabilities have been around for more than 10 years. I'm wondering how much the technology has improved thanks to recent advances in AI.
ChatGPT was released in November 2022 and, since then, the AI buzz has been all over the place. Many start-ups are quickly building products around ChatGPT-like technologies.
In today’s issue, I’ll introduce the startup Play.ht and I’ll share my experience using their Text-To-Speech API for a small proof of concept.
Play.ht is a startup from Delaware (US) with around 20 employees. It started back in 2016 as a Chrome extension for listening to Medium articles. It looks like in late 2017 they shifted their business to creating realistic audio content for customers’ applications, making articles accessible through audio and providing a Text to Audio editor for creating speech.
At the time of writing this issue, Play.ht has raised, on Apr 5, 2023, a total of $500K in funding over a single round. Play.ht is funded by 2 investors: Y Combinator and 500 Global.
Now that we know a bit more about the company, let’s see how their API behaves.
After setting up my account, I had access to the developer portal. Clear documentation and fast onboarding are 2 aspects I can highlight. Also, the set of voices they offer looks good enough in my opinion; most of them sound pretty natural, which is one of the main goals of this start-up. For my POC I chose “Larry”, the voice that appears in the API documentation’s examples.
The goal of my POC was to see how fast the onboarding on the API is and how realistic the resulting audio (voice) would be when converting some of my articles.
I gave the API this text from one of my newsletter issues:
10 years ago, I started to be more active in the tech events thing. Some of the biggest events I had the opportunity to attend were OpenStack Summit 2015, DockerCon Europe 2018 (2,200 attendees), and KubeCon Europe 2019 (7,700 attendees).
The first run of the application was not what I expected. You can hear it below:
You will have to listen at least a couple of times. You will notice that the AI struggles with the digits; for “2,200 attendees” the AI only recorded the word “attendees” and, actually, it looks like it gets into a kind of loop at the end of the clip.
My thought was that maybe the problem was related to the formatting of the numbers; I’m using a comma, instead of a dot, to separate the thousands.
In my second run, I changed the number format to use dots as the thousands separator. This is the result:
You can hear that, in this second run, the AI behaves a bit better. You can identify the first number, “2 point 220 attendees”, in the recording. However, it gets muddled after that, and the rest of the audio does not make any sense.
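A further variation, not tried in this POC, would be to remove the thousands separators altogether before sending the text to a Text-To-Speech API. The sketch below is a minimal Python illustration of that idea. The function name strip_thousands_separators is hypothetical, it is not part of the Play.ht API, and there is no guarantee it would fix the audio. It also assumes that a dot or comma followed by exactly three digits is a thousands separator, which is not always true (in English, "2.200" can also be a decimal).

```python
import re

# Assumption: a number written as 1-3 digits followed by one or more
# groups of "separator + exactly 3 digits" uses that separator for thousands.
# The same separator must be used in every group (backreference \1).
_THOUSANDS = re.compile(r"(?<!\d)\d{1,3}([.,])\d{3}(?:\1\d{3})*(?!\d)")


def strip_thousands_separators(text: str) -> str:
    """Return text with thousands separators removed from numbers.

    Hypothetical helper for pre-processing text before Text-To-Speech.
    """
    def _join_digits(match: re.Match) -> str:
        # Drop only the separator that was matched inside this number.
        return match.group(0).replace(match.group(1), "")

    return _THOUSANDS.sub(_join_digits, text)


if __name__ == "__main__":
    sample = (
        "DockerCon Europe 2018 (2,200 attendees), "
        "and KubeCon Europe 2019 (7,700 attendees)."
    )
    print(strip_thousands_separators(sample))
```

Years followed by a comma, such as "2015, DockerCon", are left untouched because the comma is not followed by three digits.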
It’s important to mention that:
English is not my mother tongue, so my writing may not be the best.
I don’t know if the voice I chose is US, UK, or something else, because I was not able to find “Larry” in the list of voices in Play.ht’s developer portal, even though it appears in the API docs.
After those 2 attempts, I decided to stop my POC since the results were not what I expected. Looking at their roadmap, there are a couple of items that “could” be aimed at improving audio generation from text.
I think that Play.ht offers a very good developer experience and that they are targeting a specific set of interesting use cases through ultra-realistic voices. Also, I’m very happy to see plenty of interesting things in their public roadmap. I wish them all the best on their journey.
If you are interested in the more technical side of this, I wrote an article on my blog where you can read how to set up the API and a small project that uses it.
Here are some links in case you want to know more:
Are you using any Text-To-Speech AI in your services? What was your experience? Looking forward to reading your comments!