
SitePal has always supported what we refer to as “Dynamic TTS”: speech generated from text and spoken in real time. More recently, with many customer implementations taking the form of AI-driven dialog, ‘real time’ has assumed even more importance. After all, how can you have a conversation if your counterpart takes unnaturally long to respond?
Part of the challenge is that the text to be spoken can be lengthy, and generating TTS audio takes time. The longer the text, the more time required. We address this problem by slicing the text into segments, processing them in parallel, and speaking them seamlessly in sequence. This is done automatically and transparently. The first segment is always the shortest, to minimize initial response time. Subsequent segments are longer, because we process them while the avatar is already speaking.
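To make the idea concrete, here is a minimal Python sketch of the general technique: split the text into segments with a short first segment, generate audio for all segments in parallel, and hand the clips back in speaking order. This is an illustration only, not SitePal's actual implementation. The function names, the character limits, and the `fake_synthesize` stand-in for a TTS request are all assumptions made for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(text, first_max=40, growth=2.0):
    """Split text at sentence boundaries, giving each later segment a larger budget.

    first_max (characters) and growth are illustrative values, not SitePal settings.
    A single sentence longer than the current budget is kept whole.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments, current, limit = [], "", first_max
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            # Budget exceeded: close the current segment and raise the budget
            segments.append(current)
            current = sentence
            limit = int(limit * growth)
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


def fake_synthesize(segment):
    """Hypothetical stand-in for a TTS request; returns a label instead of audio."""
    return f"audio<{segment}>"


def speak(text, synthesize=fake_synthesize):
    """Request audio for every segment at once and yield clips in speaking order."""
    segments = split_into_segments(text)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() submits all segments up front and yields results in input order,
        # so the first clip can be played while later ones are still generating
        for clip in pool.map(synthesize, segments):
            yield clip


if __name__ == "__main__":
    demo = ("Hi there. This is a longer second sentence for the demo. "
            "And a third one follows here. Final bit.")
    for clip in speak(demo):
        print(clip)
```

Because the first segment is kept short, its audio is ready quickly, and the longer segments that follow have the avatar's speaking time in which to finish generating.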
Recently we’ve gone a step further, allowing users to fine-tune the process. You now have the option to select the optimization level you prefer. There is a trade-off to consider, though, because more segments mean more stream usage.
Three optimization levels are provided. ‘Normal’, the mid-level choice, is the default and should be appropriate for most implementations. If you are using third-party TTS voices, though, in particular ElevenLabs or OpenAI voices, you may want to select the ‘High’ optimization setting. These providers currently require a bit more time to generate audio, although improvements are being made all the time and this may no longer be true a few months from now. The best way to decide is to try it. The ‘TTS Response Optimization’ setting can be found on your SitePal account’s ‘Settings’ page, and it affects all avatars published from your account.

We hope this information proves useful. Please reach out to us with any questions or comments. We’d love to hear from you.
Warm Regards,
The SitePal Team