Category: Technical Updates

In this category we will publish periodic technical updates, mainly focused on API use.

We’d like to call your attention to a recent update that lets you take advantage of the latest cutting-edge TTS voices as soon as they are made available by our 3rd party TTS partners.

SitePal currently supports TTS voices from: Google Cloud, MS-Azure, AWS Polly, Eleven Labs & Open AI.

To use these 3rd party TTS voices with the SitePal API, you need to specify the desired voice using 3 parameter values: the Engine ID (EID), the Language ID (LID), and the Voice ID (VID).

The ‘TTS Languages & Voices’ document (available on our support page) provides the IDs to use for all the TTS voices we provide or support through our partners.

We have noticed, however, that the voices available from certain 3rd party providers are updated frequently, with new voices added and older voices sometimes dropped. Consequently, our document may be out of date while we catch up. This applies specifically to voices from Google Cloud (Engine 11), MS-Azure (Engine 12), and Amazon Polly (Engine 13).

That is why we’ve introduced an alternative way of calling the SitePal API, specifically applicable to Google, Azure & Amazon voices. You now may – if you prefer – call our API with Language and Voice IDs obtained directly from the TTS voice list on the respective provider’s pages. Our API automatically recognizes which method you are using.

This allows you to use any voice offered by Google, Azure & AWS as soon as it is released, even if we have not yet configured and set it up, and even if it is not listed in our ‘TTS Languages & Voices’ doc. The supported TTS voices for each 3rd party provider can be found here:

Google Cloud Voices

https://cloud.google.com/text-to-speech/docs/voices

Azure TTS Voices

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=tts

Amazon Polly Voices

https://docs.aws.amazon.com/polly/latest/dg/voicelist.html

An example should help make this clear. Let’s say you want to use the Google voice –

en-GB-Chirp3-HD-Achernar

Following the ‘standard’ method, you would look this voice up in the voice table and identify the voice ID as 135. You could then call any SitePal API speech function (such as sayText) with the following parameter values:

EID = 11 (the Google engine ID)

LID = 1 (for English, see language table)

VID = 135

Using the direct method – you would locate the voice on Google’s list of voices on their site, and use the following values:

EID = 11

LID = en-GB

VID = en-GB-Chirp3-HD-Achernar

Both calls would produce exactly the same result.

Note: both LID & VID parameters must match in modality, either text or numeric.
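
The matching rule in the note above can be checked before a call is made. The following is a minimal sketch, not part of the SitePal API: the helper names buildVoiceParams and isNumericId are hypothetical, and the way the resulting values are passed to sayText should be taken from the SitePal API reference.

```javascript
// Hypothetical helpers (not part of the SitePal API): they check that LID and
// VID use the same modality before the values are handed to a speech function.
function isNumericId(value) {
  return typeof value === "number" || /^\d+$/.test(String(value));
}

function buildVoiceParams(eid, lid, vid) {
  const lidNumeric = isNumericId(lid);
  const vidNumeric = isNumericId(vid);
  if (lidNumeric !== vidNumeric) {
    throw new Error("LID and VID must both be numeric or both be text");
  }
  return { eid, lid, vid, method: lidNumeric ? "standard" : "direct" };
}

// Standard method: IDs from the 'TTS Languages & Voices' document.
const standard = buildVoiceParams(11, 1, 135);

// Direct method: IDs copied from Google's own voice list.
const direct = buildVoiceParams(11, "en-GB", "en-GB-Chirp3-HD-Achernar");
```

Mixing the two styles, for example a numeric LID with a text VID, makes the helper throw instead of sending an inconsistent request.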

We hope this information proves useful. Please reach out to us with any questions or comments. We’d love to hear from you.

Warm Regards,

The SitePal Team

SitePal has always supported what we refer to as “Dynamic TTS”: TTS speech generated and spoken in real time. More recently, with many customer implementations taking the form of AI-driven dialog, ‘real time’ has assumed even more importance. After all, how can you have a conversation if your counterpart takes unnaturally long to respond?

Part of the challenge is that the text to be spoken can be lengthy, and generating TTS audio takes time. The longer the text, the more time required. We address this problem by slicing the text into segments, processing them in parallel, and seamlessly speaking them in sequence. This is done automatically & transparently. The first segment is always the shortest, to minimize initial response time. Subsequent segments are progressively longer, as we process them while the avatar is already speaking.
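
To make the idea concrete, here is a simplified sketch of progressive segmentation. It is an illustration only and not SitePal's actual algorithm: the function name, the word-based budget, and the default values are all assumptions chosen for the example.

```javascript
// Illustrative only: splits text into segments whose word budgets grow, so the
// first segment is the shortest. The default budget values are made up.
function segmentProgressively(text, firstWords = 4, growth = 2) {
  const words = text.trim().split(/\s+/).filter(Boolean);
  const segments = [];
  let budget = firstWords;
  let i = 0;
  while (i < words.length) {
    segments.push(words.slice(i, i + budget).join(" "));
    i += budget;
    budget *= growth; // each later segment may be longer than the previous one
  }
  return segments;
}
```

With the defaults, a 30-word text is split into segments of 4, 8, 16 and 2 words. The short first segment can be sent for audio generation right away, while the longer ones are generated during playback.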

Recently we’ve gone a step further, allowing users to fine-tune the process. You now have the option to select the optimization level you prefer. There is a balance to be had though, because more segments means more stream usage.

Three optimization levels are provided. ‘Normal’, the mid-level choice, is the default and should be appropriate for most implementations. If you are using our 3rd party TTS voices though, in particular Eleven Labs or Open AI voices, you may want to select the ‘High’ optimization setting. These providers require a bit more time to generate audio, especially for longer texts (though improvements are being made all the time and this statement may not be true a few months hence).

Already mentioned above, but worth repeating, is that the length of the text is a major factor determining the time required to generate the audio. And the impact of the length of the text on response times is greater with some engines than with others.

The ‘High’ optimization setting improves the initial response time by segmenting the text input more aggressively. Specifically – the first segment becomes very short, and subsequent (longer) segments are also correspondingly shorter to allow for seamless playback.

But, as with all things, there is no free lunch. We mentioned that more segments mean higher stream usage. But there is another caveat. Ideally, we prefer to break our audio segments at what we call “natural break points”, such as the end of a sentence or a comma. With a more limited range in which to locate an optimal segment break point, it may not be possible in every case to select a natural one. This may result in a (barely) noticeable segment break in mid-sentence.
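
The effect of a narrower search range can be sketched as follows. This is a hypothetical illustration, not SitePal's implementation: the function pickBreakPoint, its character-based window, and the punctuation set are assumptions made for the example.

```javascript
// Illustrative only: searches backwards for a sentence end or comma between
// minLen and maxLen characters; falls back to the last space if none is found.
function pickBreakPoint(text, minLen, maxLen) {
  if (text.length <= maxLen) return { index: text.length, natural: true };
  const head = text.slice(0, maxLen);
  for (let i = head.length - 1; i >= minLen; i--) {
    if (".!?,;".includes(head[i])) return { index: i + 1, natural: true };
  }
  const space = head.lastIndexOf(" ");
  return { index: space > 0 ? space : maxLen, natural: false };
}

const line =
  "Hello there, this is a fairly long sentence without further punctuation until the end.";

// Wide range: the comma after "there" is inside the range, so the break is natural.
const wide = pickBreakPoint(line, 5, 40);

// Narrow range: no punctuation falls inside it, so the break lands mid-sentence.
const narrow = pickBreakPoint(line, 15, 40);
```

Here `wide.natural` is true and `narrow.natural` is false, which is the tradeoff described above: a tighter range leaves fewer natural break points to choose from.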

So it’s a tradeoff. The best way to decide is to try. The ‘TTS Response Optimization’ setting can be found in your SitePal Account’s ‘Settings’ page, and it affects all avatars published from your account.

We hope this information proves useful. Please reach out to us with any questions or comments. We’d love to hear from you.

Warm Regards,

The SitePal Team


Eleven Labs (EL) have just introduced a faster TTS model, Flash 2.5, which is specifically geared towards conversational agents.
See their announcement here –
https://www.youtube.com/watch?v=0YmHnkTVkFA

According to EL, speed is faster & quality is lower.

We reviewed the new model and can confirm that in the SitePal environment overall latency is reduced by 15% to 20% on average for English. For non-English languages the improvement is much greater, with average latency reduced by 50% to 60%.

As the new model is multilingual it can be used for all supported languages, which is why latency for non-English is more significantly improved. Until now, non-English input was processed using the slower multilingual model.

We were not able to audibly detect loss of quality. We concluded that the difference in quality is not meaningful in an online conversation scenario.

We have therefore modified the API to use the new Flash v2.5 model as its default model for EL, which means it will be used if you do not specify a different model when calling the API.

This update is now implemented. There is nothing you need to do if you use the default engine.

To review for yourself – check here – https://elevenlabs.io/app/speech-synthesis/text-to-speech (select model at top right)

If you prefer not to use the new default model, specify the model name in the xdata1 parameter when calling sayText or sayAI. To review this and other options for fine tuning EL audio generation, see details in the SitePal API reference. Check out the parameters for the sayText or sayAI functions & look for ‘xdata1’.

New model:
model_id=eleven_flash_v2_5

Previously used model – for English:
model_id=eleven_turbo_v2

Previously used model – for non-English:
model_id=eleven_multilingual_v2
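
As a small sketch of how these values might be assembled in code: the model names below are the ones listed above, while the helper previousModelXdata is hypothetical. How the resulting string is passed as xdata1 to sayText or sayAI is not shown here and should be checked in the SitePal API reference.

```javascript
// Model names are taken from this post; the helper itself is hypothetical.
const EL_MODELS = {
  flash: "eleven_flash_v2_5",
  turbo: "eleven_turbo_v2",
  multilingual: "eleven_multilingual_v2",
};

// Returns an xdata1 value selecting the model that was used before the update.
function previousModelXdata(isEnglish) {
  const model = isEnglish ? EL_MODELS.turbo : EL_MODELS.multilingual;
  return `model_id=${model}`;
}

const englishXdata = previousModelXdata(true);
const otherXdata = previousModelXdata(false);
```

In this sketch `englishXdata` is "model_id=eleven_turbo_v2" and `otherXdata` is "model_id=eleven_multilingual_v2".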

Eleven Labs (EL) TTS can be integrated with SitePal by adding your EL API key to your SitePal ‘Connect’ page, and is one of several 3rd party TTS providers available for use with SitePal Avatars, to complement the built in TTS voices. Using 3rd party TTS requires the Platinum Plan.

Hello SitePal Fans,

In this update we will cover –

TTS latency optimization
API access to raw audio data
JavaScript frameworks support
Eleven Labs audio: lipsync fix

TTS Latency Optimization – we’ve recently come to realize that TTS response times have deteriorated due to increasing levels of usage. Both additional load and accumulated data have had a detrimental effect over time.
We’ve just implemented an update to address the problem. With this update response times are now faster than they have ever been. In most use cases you should find that latency has been reduced by over 50%.

JavaScript Framework Examples – We’ve added new framework examples for ReactJS, NextJS, Angular and Vue. This should prove helpful as customers have repeatedly run into issues that were not covered by our initial technical example. We’ve identified a set of five examples that together cover most envisioned scenarios.

The following examples are now available for each framework:

Technical example: Dynamic TTS, page navigation, receiving callbacks
Responsive example: Responsive embed in framework
Conversation Example: Multiple avatars, the use of Portals.
AI Text Example: Using the sayAI API with text input
AI Audio Example: Using the sayAI API with audio input

All the examples come with full source code, and are available on the SitePal support page.

Access Audio Raw Data – Note: this feature is available to Platinum and Integrator Plan customers.

We’ve added a new API function ‘getAudioObject’ that enables access to the raw audio data as it is being played. To understand why this feature might be useful, let us share the scenario that brought this about.

In an AI Agent implementation our customer preferred to have an active mic at all times to allow users to interrupt the avatar. They used echo cancellation to eliminate avatar speech from the input & required access to the realtime audio data to make this work.

We’re sure customers will find other equally cool and unexpected ways to take advantage of this feature.

Eleven Labs Lip Sync Fix – due to special characteristics of Eleven Labs audio, it was not being processed correctly, causing sub-par lipsync. This problem was identified and fixed on Aug 9th.

Also, if you’ve missed it please see our recent announcement on this blog regarding Chatbase Pre-Integration. We’re very excited about it.

Of our future plans & projects perhaps worth mentioning at this time are the following –

Flutter Support – we’re looking to add Flutter support and Flutter code examples to our support materials. This should be available in the next few weeks.

Photoface 3D API – the intention is to let customers offer their own end users a Photoface-like tool for creating & using their own avatars. This feature will be available to Integrator Plan customers only. This project is in the planning phase. Please contact us if you have an interest in this capability.

That’s it for the moment. We hope that you have found this information useful. Please contact us with any comments or questions. We’d love to receive your feedback.

Warm Regards,

The SitePal Team

www.sitepal.com

“What does your site say?”