Pause
There are a few ways to introduce a pause or break and influence the rhythm and cadence of the speaker. The most consistent way is programmatically, using the syntax <break time="1.5s" />. This will create an exact and natural pause in the speech. It is not just added silence between words: the AI has an actual understanding of this syntax and will insert a natural pause.
However, since this is more than just inserted silence, how the AI handles these pauses can vary. As usual, the voice used plays a pivotal role in the output. Some voices, those trained with a few “uh”s and “ah”s in them, have been shown to sometimes insert those vocal mannerisms during the pauses, much like a real speaker might.
An example could look like this:
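The sentence below is illustrative, not taken from real training data; the only required element is the break tag itself:

```text
"Hold on, let me think about that." <break time="1.5s" /> "Alright, I think I have an answer."
```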
Please avoid using an excessive number of break tags, as that has been shown to potentially cause instability in the AI. The speech of the AI might start speeding up and become very fast, or it might introduce more noise in the audio and a few other strange artifacts. We are working on resolving this.
Alternatives
These options are inconsistent and might not always work; we recommend using the syntax above for consistency. Aside from that option, the trick that seems to provide the most consistent output is a simple dash (-) or an em-dash (—). You can even add multiple dashes, such as -- --, for a longer pause.
An ellipsis (...) can sometimes also work to add a pause between words, but it usually also adds some “hesitation” or “nervousness” to the voice that might not always fit.
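A sketch of how these alternatives might be combined in a script (the wording itself is illustrative):

```text
"It was a long day -- -- and I just need a moment... to catch my breath."
```

The dashes tend to read as a clean beat, while the ellipsis leans toward a hesitant delivery.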
Pronunciation
This feature is currently only supported by the “Eleven English V1” and “Eleven Turbo V2” models. In certain instances, you may want the model to pronounce a word, name, or phrase in a specific way. Pronunciation can be specified using standardised pronunciation alphabets. Currently we support the International Phonetic Alphabet (IPA) and the CMU Arpabet. Pronunciations are specified by wrapping words in the Speech Synthesis Markup Language (SSML) phoneme tag. To use this feature, wrap the desired word or phrase in the <phoneme alphabet="ipa" ph="your-IPA-pronunciation-here">word</phoneme> tag for IPA, or the <phoneme alphabet="cmu-arpabet" ph="your-CMU-pronunciation-here">word</phoneme> tag for CMU Arpabet. Replace "your-IPA-Pronunciation-here" or "your-CMU-pronunciation-here" with the desired IPA or CMU Arpabet pronunciation.
An example for IPA:
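The word chosen here is only an illustration; the IPA transcription for "tomato" (American pronunciation) goes in the ph attribute:

```text
<phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>
```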
Emotion
If you want the AI to express a specific emotion, the best approach is to write in a style similar to that of a book. To find good prompts to use, you can flip through some books and identify words and phrases that convey the desired emotion. For instance, you can use dialogue tags to express emotions, such as he said, confused, or he shouted angrily. These types of prompts will help the AI understand the desired emotional tone and try to generate a voiceover that accurately reflects it. With this approach, you can create highly customized voiceovers that are perfect for a variety of applications.
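Put together, a prompt using dialogue tags might look like the following (the lines themselves are illustrative); note that the tags will also be read aloud, so you may want to trim them from the generated audio afterwards:

```text
"Are you sure about that?" he said, confused.
"Don't test me!" he shouted angrily.
```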