This API provides real-time text-to-speech conversion using WebSockets: you send text messages and receive audio data back as it is generated.
The Text-to-Speech WebSockets API is designed to generate audio from partial text input while ensuring consistency throughout the generated audio. Although highly flexible, the WebSockets API isn't a one-size-fits-all solution. It's well-suited for scenarios where:
The input text is being streamed or generated in chunks.
However, it may not be the best choice when:
The entire input text is available upfront. Given that the generations are partial, some buffering is involved, which could result in slightly higher latency compared to a standard HTTP request.
You want to quickly experiment or prototype. Working with WebSockets can be harder and more complex than using a standard HTTP API, which might slow down rapid development and testing.
This is an advanced setting that most users shouldn’t need to use. It relates to our generation schedule explained here.
Use this to attempt to immediately trigger the generation of audio, overriding the chunk_length_schedule. Unlike flush, try_trigger_generation will only generate audio if our buffer contains more than a minimum threshold of characters; this is to ensure a higher-quality response from our model.
Note that overriding the chunk schedule to generate small amounts of text may result in lower-quality audio; therefore, only use this parameter if you really need text to be processed immediately. We generally recommend keeping the default value of false and adjusting the chunk_length_schedule in the generation_config instead.
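As a rough illustration, a message requesting immediate generation might look like the sketch below. It assumes an open websocket connection set up as in the full examples further down.
# Sketch only: assumes `websocket` is an open connection created as in the examples below.
await websocket.send(json.dumps({
    "text": "Hello there. ",
    "try_trigger_generation": True  # attempt generation now, subject to the minimum buffer threshold
}))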
This is an advanced setting that most users shouldn’t need to use. It relates to our generation schedule explained here.
Determines the minimum amount of text that needs to be sent and present in our buffer before audio starts being generated. This is to maximise the amount of context available to the model to improve audio quality, whilst balancing latency of the returned audio chunks.
The default value is: [120, 160, 250, 290].
This means that the first chunk of audio will not be generated until you send text that totals at least 120 characters long. The next chunk of audio will only be generated once a further 160 characters have been sent. The third audio chunk will be generated after the next 250 characters. Then the fourth, and beyond, will be generated in sets of at least 290 characters.
Customize this array to suit your needs. If you want to generate audio more frequently to optimise latency, you can reduce the values in the array. Note that setting the values too low may result in lower quality audio. Please test and adjust as needed.
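For instance, a begin-of-stream message with a custom schedule could look like the sketch below. The values shown are illustrative; the alignment example further down uses the same generation_config structure.
# Sketch only: a first message with a custom chunk schedule.
# Lower values reduce latency but may reduce audio quality.
await websocket.send(json.dumps({
    "text": " ",
    "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
    "generation_config": {
        "chunk_length_schedule": [50, 120, 160, 290]  # illustrative values
    },
    "xi_api_key": ELEVENLABS_API_KEY,
}))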
Flush forces the generation of audio. Set this value to true when you have finished sending text, but want to keep the websocket connection open.
This is useful when you want to ensure that the last chunk of audio is generated even when the length of text sent is smaller than the value set in chunk_length_schedule (e.g. 120 or 50).
To understand more about how our websockets buffer text before audio is generated, please refer to this section.
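For example, once you have finished sending a passage of text, a flush message like the sketch below forces the buffered text to be generated while leaving the connection open. This mirrors the alignment example later in this section.
# Sketch only: force generation of any buffered text without closing the connection.
await websocket.send(json.dumps({
    "text": " ",
    "flush": True
}))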
Authorization bearer token. Should be provided only in the first message if not present in the header and the XI API Key is not provided.
For best latency we recommend streaming word-by-word; this way we will start generating as soon as we reach the predefined number of un-generated characters.
A list of starting times (in milliseconds) for each character in the normalized text as it corresponds to the audio. For instance, the character ‘H’ starts at time 0 ms in the audio. Note these times are relative to the returned chunk from the model, and not the full audio response. See an example here for how to use this.
A list providing the duration (in milliseconds) for each character’s pronunciation in the audio. For instance, the character ‘H’ has a pronunciation duration of 3 ms.
The list of characters in the normalized text sequence that corresponds with the timings and durations. This list is used to map the characters to their respective starting times and durations.
A list of starting times (in milliseconds) for each character in the original text as it corresponds to the audio. For instance, the character ‘H’ starts at time 0 ms in the audio. Note these times are relative to the returned chunk from the model, and not the full audio response. See an example here for how to use this.
A list providing the duration (in milliseconds) for each character’s pronunciation in the audio. For instance, the character ‘H’ has a pronunciation duration of 3 ms.
The list of characters in the original text sequence that corresponds with the timings and durations. This list is used to map the characters to their respective starting times and durations.
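As a sketch of how these fields fit together, the snippet below pairs each character with its start time and duration from a single received message. Field names follow the descriptions above, and an open websocket connection is assumed, as in the full examples.
# Sketch only: read alignment data from one received message.
data = json.loads(await websocket.recv())
alignment = data.get("normalizedAlignment")  # or data.get("alignment") for the original text
if alignment:
    for char, start_ms, duration_ms in zip(
        alignment["chars"],
        alignment["charStartTimesMs"],
        alignment["charDurationsMs"],
    ):
        # Times are relative to this audio chunk, not the full response.
        print(f"{char!r} starts at {start_ms} ms and lasts {duration_ms} ms")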
Whether to enable request logging. If disabled, the request will not be present in history or Bigtable.
Enabled by default. Note: simple logging (aka printing) to stdout/stderr is always enabled.
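If you need to disable logging, a sketch of how this might be passed when opening the connection is shown below. Note that the query parameter name (enable_logging) is an assumption here and should be verified against the current API reference.
# Sketch only: the enable_logging query parameter name is an assumption to verify.
VOICE_ID = "voice_id_here"
uri = (
    f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream-input"
    "?model_id=eleven_turbo_v2&enable_logging=false"
)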
Whether to enable/disable parsing of SSML tags within the provided text. For best results, we recommend sending SSML tags as fully contained messages to the websockets endpoint; otherwise this may result in additional latency.
Please note that rendered text, in normalizedAlignment, will be altered in support of SSML tags. The rendered text will use a . as a placeholder for breaks, and syllables will be reported using the CMU ARPAbet alphabet where SSML phoneme tags are used to specify pronunciation.
Disabled by default.
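To illustrate the "fully contained message" recommendation, an SSML break could be sent on its own as in the sketch below. It assumes SSML parsing is enabled for the connection and that an open websocket exists as in the full examples; check the SSML documentation for the tags and attributes that are supported.
# Sketch only: send an SSML tag as a single, fully contained message.
await websocket.send(json.dumps({
    "text": '<break time="1.0s" /> ',
}))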
You can turn on latency optimizations at some cost of quality. The best possible final latency varies by model. Possible values:
0: default mode (no latency optimizations)
1: normal latency optimizations (about 50% of possible latency improvement of option 3)
2: strong latency optimizations (about 75% of possible latency improvement of option 3)
3: max latency optimizations
4: max latency optimizations, but also with text normalizer turned off for even more latency savings (best latency, but can mispronounce e.g. numbers and dates).
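A sketch of how a latency level might be selected when opening the connection is shown below; the query parameter name (optimize_streaming_latency) is an assumption to verify against the current API reference.
# Sketch only: the optimize_streaming_latency query parameter name is an assumption to verify.
VOICE_ID = "voice_id_here"
uri = (
    f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream-input"
    "?model_id=eleven_turbo_v2&optimize_streaming_latency=3"
)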
Example - Voice streaming using ElevenLabs and OpenAI
The following example demonstrates how to leverage the ElevenLabs Websockets API to stream input from OpenAI’s GPT model, while the answer is being generated, thereby minimizing the overall latency of the operation.
import asyncio
import websockets
import json
import base64
import shutil
import os
import subprocess

from openai import AsyncOpenAI

# Define API keys and voice ID
OPENAI_API_KEY = '<OPENAI_API_KEY>'
ELEVENLABS_API_KEY = '<ELEVENLABS_API_KEY>'
VOICE_ID = '21m00Tcm4TlvDq8ikWAM'

# Set OpenAI API key
aclient = AsyncOpenAI(api_key=OPENAI_API_KEY)


def is_installed(lib_name):
    return shutil.which(lib_name) is not None


async def text_chunker(chunks):
    """Split text into chunks, ensuring to not break sentences."""
    splitters = (".", ",", "?", "!", ";", ":", "—", "-", "(", ")", "[", "]", "}", " ")
    buffer = ""

    async for text in chunks:
        if buffer.endswith(splitters):
            yield buffer + " "
            buffer = text
        elif text.startswith(splitters):
            yield buffer + text[0] + " "
            buffer = text[1:]
        else:
            buffer += text

    if buffer:
        yield buffer + " "


async def stream(audio_stream):
    """Stream audio data using mpv player."""
    if not is_installed("mpv"):
        raise ValueError(
            "mpv not found, necessary to stream audio. "
            "Install instructions: https://mpv.io/installation/"
        )

    mpv_process = subprocess.Popen(
        ["mpv", "--no-cache", "--no-terminal", "--", "fd://0"],
        stdin=subprocess.PIPE, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )

    print("Started streaming audio")
    async for chunk in audio_stream:
        if chunk:
            mpv_process.stdin.write(chunk)
            mpv_process.stdin.flush()

    if mpv_process.stdin:
        mpv_process.stdin.close()
    mpv_process.wait()


async def text_to_speech_input_streaming(voice_id, text_iterator):
    """Send text to ElevenLabs API and stream the returned audio."""
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id=eleven_turbo_v2"

    async with websockets.connect(uri) as websocket:
        await websocket.send(json.dumps({
            "text": " ",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
            "xi_api_key": ELEVENLABS_API_KEY,
        }))

        async def listen():
            """Listen to the websocket for audio data and stream it."""
            while True:
                try:
                    message = await websocket.recv()
                    data = json.loads(message)
                    if data.get("audio"):
                        yield base64.b64decode(data["audio"])
                    elif data.get("isFinal"):
                        break
                except websockets.exceptions.ConnectionClosed:
                    print("Connection closed")
                    break

        listen_task = asyncio.create_task(stream(listen()))

        async for text in text_chunker(text_iterator):
            await websocket.send(json.dumps({"text": text, "try_trigger_generation": True}))

        await websocket.send(json.dumps({"text": ""}))

        await listen_task


async def chat_completion(query):
    """Retrieve text from OpenAI and pass it to the text-to-speech function."""
    response = await aclient.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': query}],
        temperature=1,
        stream=True,
    )

    async def text_iterator():
        async for chunk in response:
            delta = chunk.choices[0].delta
            if delta.content:  # skip chunks without text (e.g. the final chunk)
                yield delta.content

    await text_to_speech_input_streaming(VOICE_ID, text_iterator())


# Main execution
if __name__ == "__main__":
    user_query = "Hello, tell me a very long story."
    asyncio.run(chat_completion(user_query))
Example - Other examples for interacting with our Websocket API
Some examples for interacting with the Websocket API in different ways are provided below.
import asyncio
import websockets
import json
import base64


async def text_to_speech(voice_id):
    model = 'eleven_turbo_v2'
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}"

    async with websockets.connect(uri) as websocket:
        # Initialize the connection
        bos_message = {
            "text": " ",
            "voice_settings": {
                "stability": 0.5,
                "similarity_boost": 0.8
            },
            "xi_api_key": "api_key_here",  # Replace with your API key
        }
        await websocket.send(json.dumps(bos_message))

        # Send "Hello World" input
        input_message = {
            "text": "Hello World ",
            "try_trigger_generation": True
        }
        await websocket.send(json.dumps(input_message))

        # Send EOS message with an empty string instead of a single space
        # as mentioned in the documentation
        eos_message = {
            "text": ""
        }
        await websocket.send(json.dumps(eos_message))

        # Handle server responses and print the data received
        while True:
            try:
                response = await websocket.recv()
                data = json.loads(response)
                print("Server response:", data)

                if data["audio"]:
                    chunk = base64.b64decode(data["audio"])
                    print("Received audio chunk")
                else:
                    print("No audio data in the response")
                    break
            except websockets.exceptions.ConnectionClosed:
                print("Connection closed")
                break


asyncio.run(text_to_speech("voice_id_here"))
Example - Getting word start times using alignment values
This code example shows how the start times of words can be retrieved using the alignment values returned from our API.
import asyncio
import websockets
import json
import base64

# Define API keys and voice ID
ELEVENLABS_API_KEY = "INSERT HERE"  # <- INSERT YOUR API KEY HERE
VOICE_ID = 'nPczCjzI2devNBz1zQrb'  # Brian


def calculate_word_start_times(alignment_info):
    # Alignment start times are indexed from the start of the audio chunk that generated them
    # In order to analyse runtime over the entire response we keep a cumulative count of played audio
    full_alignment = {'chars': [], 'charStartTimesMs': [], 'charDurationsMs': []}
    cumulative_run_time = 0
    for old_dict in alignment_info:
        full_alignment['chars'].extend([" "] + old_dict['chars'])
        full_alignment['charDurationsMs'].extend([old_dict['charStartTimesMs'][0]] + old_dict['charDurationsMs'])
        full_alignment['charStartTimesMs'].extend([0] + [time + cumulative_run_time for time in old_dict['charStartTimesMs']])
        cumulative_run_time += sum(old_dict['charDurationsMs'])

    # We now have the start times of every character relative to the entire audio output
    zipped_start_times = list(zip(full_alignment['chars'], full_alignment['charStartTimesMs']))
    # Get the start time of every character that appears after a space and match this to the word
    words = ''.join(full_alignment['chars']).split(" ")
    word_start_times = list(zip(words, [0] + [zipped_start_times[i + 1][1] for (i, (a, b)) in enumerate(zipped_start_times) if a == ' ']))
    print(f"total duration:{cumulative_run_time}")
    print(word_start_times)


async def text_to_speech_alignment_example(voice_id, text_to_send):
    """Send text to ElevenLabs API and stream the returned audio and alignment information."""
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id=eleven_turbo_v2"
    async with websockets.connect(uri) as websocket:
        await websocket.send(json.dumps({
            "text": " ",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.8, "use_speaker_boost": False},
            "generation_config": {
                "chunk_length_schedule": [120, 160, 250, 290]
            },
            "xi_api_key": ELEVENLABS_API_KEY,
        }))

        async def text_iterator(text):
            """Split text into chunks to mimic streaming from an LLM or similar"""
            split_text = text.split(" ")
            words = 0
            to_send = ""
            for chunk in split_text:
                to_send += chunk + ' '
                words += 1
                if words >= 10:
                    print(to_send)
                    yield to_send
                    words = 0
                    to_send = ""
            yield to_send

        async def listen():
            """Listen to the websocket for audio data and write it to a file."""
            audio_chunks = []
            alignment_info = []
            received_final_chunk = False
            print("Listening for chunks from ElevenLabs...")
            while not received_final_chunk:
                try:
                    message = await websocket.recv()
                    data = json.loads(message)
                    if data.get("audio"):
                        audio_chunks.append(base64.b64decode(data["audio"]))
                    if data.get("alignment"):
                        alignment_info.append(data.get("alignment"))
                    if data.get("isFinal"):
                        received_final_chunk = True
                        break
                except websockets.exceptions.ConnectionClosed:
                    print("Connection closed")
                    break
            print("Writing audio to file")
            with open("output_file.mp3", "wb") as f:
                f.write(b''.join(audio_chunks))
            calculate_word_start_times(alignment_info)

        listen_task = asyncio.create_task(listen())

        async for text in text_iterator(text_to_send):
            await websocket.send(json.dumps({"text": text}))

        await websocket.send(json.dumps({"text": " ", "flush": True}))
        await listen_task


# Main execution
if __name__ == "__main__":
    text_to_send = "The twilight sun cast its warm golden hues upon the vast rolling fields, saturating the landscape with an ethereal glow."
    asyncio.run(text_to_speech_alignment_example(VOICE_ID, text_to_send))
Our websocket service incorporates a buffer system designed to optimize the Time To First Byte (TTFB) while maintaining high-quality streaming.
All text sent to the websocket endpoint is added to this buffer, and only when that buffer reaches a certain size is an audio generation attempted. This is because our model produces higher-quality audio when it has longer inputs and can deduce more context about how the text should be delivered.
The buffer ensures smooth audio data delivery and is automatically emptied with a final audio generation either when the stream is closed, or upon sending a flush command. We have advanced settings for changing the chunk schedule, which can improve latency at the cost of quality by generating audio more frequently with smaller text inputs.
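As a rough illustration of this behaviour, the sketch below simulates when generations would be attempted under the default chunk_length_schedule as text arrives. It models only the schedule described above, not the actual service.
# Illustrative model of the chunk schedule described above, not actual service code.
def simulate_buffer(messages, schedule=(120, 160, 250, 290)):
    buffered, generation = 0, 0
    for text in messages:
        buffered += len(text)
        # Threshold for the current generation; the final value repeats thereafter.
        threshold = schedule[min(generation, len(schedule) - 1)]
        if buffered >= threshold:
            print(f"Generation {generation + 1} attempted after {buffered} buffered characters")
            buffered, generation = 0, generation + 1
    if buffered:
        print(f"{buffered} characters remain buffered until a flush or the stream is closed")


simulate_buffer([
    "Hello there, this is the first message. ",
    "Here is a second message. ",
    "And a third one that finally tips the buffer over the first threshold. ",
])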