Skip to content

Commit

Permalink
Add steerable TTS guide (#1543)
Browse files Browse the repository at this point in the history
  • Loading branch information
ericning-o authored Nov 1, 2024
1 parent d6d466b commit 108f28a
Show file tree
Hide file tree
Showing 7 changed files with 218 additions and 0 deletions.
5 changes: 5 additions & 0 deletions authors.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -187,3 +187,8 @@ danial-openai:
name: "Danial Mirza"
website: "https://github.com/danial-openai"
avatar: "https://avatars.githubusercontent.com/u/178343703"

gbergengruen:
name: "Guillermo Bergengruen"
website: "https://github.com/gbergengruen"
avatar: "https://avatars.githubusercontent.com/u/140010883"
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added examples/voice_solutions/sounds/default_tts.mp3
Binary file not shown.
203 changes: 203 additions & 0 deletions examples/voice_solutions/steering_tts.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,203 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Steering Text-to-Speech for more dynamic audio generation\n",
"\n",
"Our traditional [TTS APIs](https://platform.openai.com/docs/guides/text-to-speech) don't have the ability to steer the voice of the generated audio. For example, if you wanted to convert a paragraph of text to audio, you would not be able to give any specific instructions on audio generation.\n",
"\n",
"With [audio chat completions](https://platform.openai.com/docs/guides/audio/quickstart), you can give specific instructions before generating the audio. This allows you to tell the API to speak at different speeds, tones, and accents. With appropriate instructions, these voices can be more dynamic, natural, and context-appropriate."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Traditional TTS\n",
"\n",
"Traditional TTS can specify voices, but not the tone, accent, or any other contextual audio parameters."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"client = OpenAI()\n",
"\n",
"tts_text = \"\"\"\n",
"Once upon a time, Leo the lion cub woke up to the smell of pancakes and scrambled eggs.\n",
"His tummy rumbled with excitement as he raced to the kitchen. Mama Lion had made a breakfast feast!\n",
"Leo gobbled up his pancakes, sipped his orange juice, and munched on some juicy berries.\n",
"\"\"\"\n",
"\n",
"speech_file_path = \"./sounds/default_tts.mp3\"\n",
"response = client.audio.speech.create(\n",
" model=\"tts-1-hd\",\n",
" voice=\"alloy\",\n",
" input=tts_text,\n",
")\n",
"\n",
"response.write_to_file(speech_file_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chat Completions TTS\n",
"\n",
"With chat completions, you can give specific instructions before generating the audio. In the following example, we generate a British accent in a learning setting for children. This is particularly useful for educational applications where the voice of the assistant is important for the learning experience."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import base64\n",
"\n",
"speech_file_path = \"./sounds/chat_completions_tts.mp3\"\n",
"completion = client.chat.completions.create(\n",
" model=\"gpt-4o-audio-preview\",\n",
" modalities=[\"text\", \"audio\"],\n",
" audio={\"voice\": \"alloy\", \"format\": \"mp3\"},\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": \"You are a helpful assistant that can generate audio from text. Speak in a British accent and enunciate like you're talking to a child.\",\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": tts_text,\n",
" }\n",
" ],\n",
")\n",
"\n",
"mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)\n",
"with open(speech_file_path, \"wb\") as f:\n",
" f.write(mp3_bytes)\n",
"\n",
"speech_file_path = \"./sounds/chat_completions_tts_fast.mp3\"\n",
"completion = client.chat.completions.create(\n",
" model=\"gpt-4o-audio-preview\",\n",
" modalities=[\"text\", \"audio\"],\n",
" audio={\"voice\": \"alloy\", \"format\": \"mp3\"},\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": \"You are a helpful assistant that can generate audio from text. Speak in a British accent and speak really fast.\",\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": tts_text,\n",
" }\n",
" ],\n",
")\n",
"\n",
"mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)\n",
"with open(speech_file_path, \"wb\") as f:\n",
" f.write(mp3_bytes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chat Completions Multilingual TTS\n",
"\n",
"We can also generate audio in different language accents. In the following example, we generate audio in a specific Spanish Uruguayan accent."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Había una vez un leoncito llamado Leo que se despertó con el aroma de panqueques y huevos revueltos. Su pancita gruñía de emoción mientras corría hacia la cocina. ¡Mamá León había preparado un festín de desayuno! Leo devoró sus panqueques, sorbió su jugo de naranja y mordisqueó algunas bayas jugosas.\n"
]
}
],
"source": [
"completion = client.chat.completions.create(\n",
" model=\"gpt-4o\",\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": \"You are an expert translator. Translate any text given into Spanish like you are from Uruguay.\",\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": tts_text,\n",
" }\n",
" ],\n",
")\n",
"translated_text = completion.choices[0].message.content\n",
"print(translated_text)\n",
"\n",
"speech_file_path = \"./sounds/chat_completions_tts_es_uy.mp3\"\n",
"completion = client.chat.completions.create(\n",
" model=\"gpt-4o-audio-preview\",\n",
" modalities=[\"text\", \"audio\"],\n",
" audio={\"voice\": \"alloy\", \"format\": \"mp3\"},\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": \"You are a helpful assistant that can generate audio from text. Speak any text that you receive in a Uruguayan spanish accent and more slowly.\",\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": translated_text,\n",
" }\n",
" ],\n",
")\n",
"\n",
"mp3_bytes = base64.b64decode(completion.choices[0].message.audio.data)\n",
"with open(speech_file_path, \"wb\") as f:\n",
" f.write(mp3_bytes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"The ability to steer the voice of the generated audio opens up a lot of possibilities for richer audio experiences. There are many use cases such as:\n",
"- **Enhanced Expressiveness**: Steerable TTS allows adjustments in tone, pitch, speed, and emotion, enabling the voice to convey different moods (e.g., excitement, calmness, urgency).\n",
"- **Language learning and education**: Steerable TTS can mimic accents, inflections, and pronunciation, which is beneficial for language learners and educational applications where accurate intonation and emphasis are critical.\n",
"- **Contextual Voice**: Steerable TTS adapts the voice to fit the content’s context, such as formal tones for professional documents or friendly, conversational styles for social interactions. This helps create more natural conversations in virtual assistants and chatbots.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
10 changes: 10 additions & 0 deletions registry.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -1675,3 +1675,13 @@
tags:
- evals
- completions

- title: Steering Text-to-Speech for more dynamic audio generation
path: examples/voice_solutions/steering_tts.ipynb
date: 2024-10-21
authors:
- ericning-o
- gbergengruen
tags:
- completions
- audio

0 comments on commit 108f28a

Please sign in to comment.