A Closer Look into TTS Models
In this quest, you will take a deeper dive into the Coqui TTS library, building on the foundations from Quest 1. By setting up a full TTS pipeline in Python, you’ll learn to configure key parameters like models, speakers, and languages, enabling you to produce high-quality, natural-sounding speech tailored to specific needs. You’ll also get hands-on with saving speech outputs as WAV files, exploring the diversity of Coqui’s TTS capabilities. This quest is designed to enhance your practical understanding of how to generate dynamic speech outputs while encouraging experimentation with different configurations.
Before starting this quest, make sure you have completed Quest 1, as it covers the initial setup and environment configuration required for the steps here.
For technical help on the StackUp platform & bounty-related questions, join our Discord, head to the 🏆 | bounty-help-forum channel and look for the correct thread to ask your question.
Learning Outcomes
- Set up a complete TTS pipeline in Python using Coqui TTS.
- Load different TTS models, select compatible voices, and configure localizations.
- Generate speech from text with a variety of voices and languages.
- Save speech outputs as WAV files for further analysis or use.
Tutorial Steps
Total steps: 6
-
Step 1: Environment Setup
In this step, you will continue using the environment and project folder you set up in Quest 1 to create a TTS script for speech synthesis. The core libraries required for this step were already covered in Step 4 of Quest 1. This means you should be able to use the same virtual environment without additional setup.
To begin, open your terminal and navigate to the root of your project folder from Quest 1 using the following command:
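```
cd C42-Text-to-Speech
```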
Make sure to replace C42-Text-to-Speech with the actual folder location on your machine.
Once in the project directory, you’ll verify that the virtual environment is activated and ready for the next set of TTS commands. This ensures that you have all dependencies in place for the hands-on steps that follow.
Activate the virtual environment if it isn't already active. The commands below assume the environment folder from Quest 1 is named venv; adjust the path if yours is named differently.
On macOS/Linux:
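```
source venv/bin/activate
```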
On Windows:
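```
venv\Scripts\activate
```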
With the virtual environment activated, you will need to install additional packages needed for the upcoming steps.
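Assuming the extra dependencies are pinned in the project's requirements.txt (shown in the directory breakdown below), you can install them with:

```
pip install -r requirements.txt
```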
Now, open the project in your preferred IDE. For this campaign, VSCode was used as the development environment, but feel free to use any editor you're comfortable with. Once opened, you should see the following directory structure:
The directory structure of the project.
Here's a breakdown of each item:
- output/ - This is where the synthesized audio files will be saved.
- models.py - The script for fetching available TTS models from Coqui.
- speakers.py - The script for fetching available speakers in a selected TTS model. Coqui models use a parameter called “speaker” to select which “voice” to use when generating speech.
- languages.py - The script for fetching available languages in a selected TTS model.
- tts-script.py - The main file where you will create the TTS script for speech synthesis.
- requirements.txt - This contains the dependencies needed to run the scripts.
- tts-app.py - The main file where you will create the TTS app with a Gradio Interface.
Gradio is a Python package that lets you build straightforward, approachable web interfaces for machine learning models. Quick demos are an excellent way to showcase your model, allowing users to interact with it by entering text or images and seeing the results instantly.
-
Step 2: Viewing and Selecting Models, Speakers and Languages
To generate effective text-to-speech outputs, you need to configure three key parameters: model, speaker, and language. These parameters determine how your text will be converted into speech, making them central to achieving the desired results.
Among these, the model should be your starting point, as it defines the available options for voices and localizations. Each model in Coqui TTS comes with its own set of compatible voices and languages, so selecting the right model is crucial before configuring the other two parameters.
Open models.py. Write the following code block:
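The exact code depends on your version of the TTS package. A minimal sketch, assuming a recent release where list_models() may return either a plain list or a model manager object:

```python
# models.py - print all TTS models available from Coqui
from TTS.api import TTS

# Depending on the package version, list_models() returns either a list of
# model names or a ModelManager object; handle both cases.
result = TTS().list_models()
model_names = result if isinstance(result, list) else result.list_models()
for name in model_names:
    print(name)
```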
This will print out all available models from Coqui. Save the file and run it using the following terminal command:
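```
python models.py
```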
You should get an output similar to this:
The sample output with the first model highlighted.
In this Quest, you will use the model called tts_models/multilingual/multi-dataset/xtts_v2 (marked in red in the image), as it supports both multiple languages and multiple speakers.
Features of coqui/XTTS-v2
- Supports 17 languages.
- Voice cloning with just a 6-second audio clip.
- Emotion and style transfer by cloning.
- Cross-language voice cloning.
- Multilingual speech generation.
- 24 kHz sampling rate.
For more information on the different Coqui TTS models, you can check the official repository: https://github.com/coqui-ai/TTS
Next, you’ll select the speaker for your synthesized speech.
Open speakers.py and add the following code block:
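A minimal sketch, assuming the API exposes the model's voices through a speakers attribute (as recent versions of the package do):

```python
# speakers.py - print the speakers available in the selected model
from TTS.api import TTS

# Load the XTTS v2 model chosen above; the first run downloads its files.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Multi-speaker models expose their available voices as a list.
print(tts.speakers)
```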
Save the file and run it using the following terminal command:
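```
python speakers.py
```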
You should get an output similar to this:
Sample output showing all available speakers.
The available speakers are contained in a list, meaning you can select them using index numbers, starting from 0. For this Quest, you will use Claribel Dervla, the first speaker in the list, which corresponds to index 0.
Note: There will be times when you are prompted to agree to Coqui's non-commercial terms, especially when your script involves downloading a model. The prompt will look something like this:
The Coqui non-commercial terms prompt.
To proceed, respond to the prompt by typing y, then press Return or Enter.
Next, you'll choose the language for your speech output. It’s important to mention that this parameter does not translate the input text from one language to another. Instead, the language setting adjusts the accent, pronunciation, and overall speech patterns based on the selected option.
For example, setting the language to "Spanish (LatAm)" produces speech with a Latin American Spanish accent, while "US English" will deliver the output with an American English accent. This allows the synthesized speech to sound more natural and region-specific.
Now, check the available languages. Open languages.py and write the following code block:
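A minimal sketch, assuming the supported language codes are exposed through a languages attribute:

```python
# languages.py - print the languages supported by the selected model
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Multilingual models expose their supported language codes as a list.
print(tts.languages)
```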
Save the file and run it using the following terminal command:
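```
python languages.py
```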
You should get an output similar to this:
The sample output showing all available languages.
The output confirms that coqui/xtts-v2 supports 17 languages, or localizations. They are also contained in a list, accessible through index numbers 0 to 16. You will use index 0, US English, in this Quest.
Now that you know how to view the available options for the three important parameters, you can proceed to writing the speech synthesis script.
-
Step 3: Navigating the Main File
In this step, you’ll get familiar with the structure of the main TTS script, tts-script.py, which you’ll be using throughout this quest. Understanding the layout and comments in this file will make it easier to follow the upcoming code blocks and modifications.
Open tts-script.py in your IDE. This file is designed to guide you through building a basic TTS implementation using Coqui TTS. The script is structured with specific comments labeled as TODOs to help you identify where new code blocks should be added.
Here is an overview of the TODOs:
# TODO#1 - Import the necessary libraries: This section is where you’ll import essential libraries like Coqui TTS and any other required modules for the script.
# TODO#2 - Load the TTS model: In this part, you will specify which TTS model to load from Coqui’s list of available models, setting up the foundation for generating speech.
# TODO#3 - Select the speaker and language: Here, you’ll define which speaker and language to use for generating the speech output.
# TODO#4 - Set output folder and input text: This comment will guide you to create a directory for saving the output files and define the text to be converted into speech.
# TODO#5 - Generate the speech and save it to a WAV file: This final section will involve writing code that converts the specified text into speech and saves it as a WAV file.
The tts-script.py file is arranged in a logical flow, starting from library imports and model loading to text synthesis and file output. Each TODO comment is a placeholder that helps maintain this flow, making it easier to integrate the upcoming code blocks step by step.
Tip: If you're using Visual Studio Code, you can install an extension called Todo Tree. This extension quickly searches your workspace for comment tags like TODO and FIXME.
-
Step 4: Creating a TTS Script
Locate comment # TODO#1 - Import the necessary libraries. Below this comment, write the following code block:
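```python
import os  # used later to create the output folder and build file paths

from TTS.api import TTS  # the Coqui TTS interface
```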
These imports bring in the core TTS library and the os module, which will help manage directories and files for storing outputs.
Locate comment # TODO#2 - Load the TTS model. Below this comment, write the following code block:
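A sketch that loads the model by its full name from Step 2 (the original may instead pick it by index from the models list):

```python
# Load the multilingual XTTS v2 model; it is downloaded automatically on first use.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
```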
This loads the first available model from Coqui TTS (coqui/xtts-v2), which will be used to generate speech. The TTS() function initializes the model and prepares it for use.
Locate comment # TODO#3 - Select the speaker and language. Below this comment, write the following code block:
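A sketch using the variable names referenced later in this quest, with index 0 selecting the defaults identified in Step 2:

```python
# Index 0 corresponds to "Claribel Dervla" in the speaker list
# and "en" (US English) in the language list.
selected_speaker = tts.speakers[0]
selected_language = tts.languages[0]
```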
Here, you are selecting the default speaker (Claribel Dervla) and language (en) in the loaded model. This ensures that the synthesis process will use valid parameters for generating speech.
Locate comment # TODO#4 - Set output folder and input text. Below this comment, write the following code block:
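A sketch; the quote used in the original isn't shown, so the text below is a placeholder you can replace with any sentence:

```python
# Create the output folder if it doesn't exist yet.
output_folder = "output"
os.makedirs(output_folder, exist_ok=True)

# Placeholder quote; swap in any text you want synthesized.
text = "The best way to predict the future is to invent it."
file_path = os.path.join(output_folder, "output.wav")
```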
This code ensures the output folder exists and sets the text input to be synthesized into speech. The example text is a simple quote to demonstrate TTS output.
Locate comment # TODO#5 - Generate the speech and save it to a WAV file. Below this comment, write the following code block:
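A sketch using the API's tts_to_file() helper, which synthesizes the text and writes the WAV in one call; the closing print is an assumption, added to match the success message shown below:

```python
# Synthesize the text with the selected speaker and language,
# then write the audio to output/output.wav.
tts.tts_to_file(
    text=text,
    speaker=selected_speaker,
    language=selected_language,
    file_path=file_path,
)
print(f"Speech generated successfully and saved to {file_path}")
```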
This block converts the text into speech using the selected model, speaker, and language, saving the output as a WAV file in the specified folder.
Save the file and run it using the following terminal command:
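```
python tts-script.py
```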
You should get an output similar to this:
The sample output with the success message highlighted.
Check your project folder again and look inside the output folder. You should see the newly added audio file; if you don't see it, you may need to hit refresh. You should see something like this:
The generated file.
Play the output.wav file on any audio player you have and listen to the synthesized speech.
-
Step 5: Extending Your Knowledge
Now that you’ve created a basic TTS script, try experimenting with different parameters:
- Change the Text: Modify the text variable to test how the TTS model handles various sentences.
- Select a Different Speaker: Adjust the selected_speaker variable to use other speakers available in the model by replacing 0 with other index numbers.
- Change the Language: Modify the selected_language to explore how localization affects the output.
- Vary Output File Names: Set different file_path values to save outputs with different names.
Use these tweaks to deepen your understanding of how Coqui TTS handles different inputs and to see the variety of outputs it can generate.
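For example, a hypothetical variation (the speaker index and file name here are arbitrary choices):

```python
# Generate a Spanish-accented version with a different voice.
selected_speaker = tts.speakers[3]   # any valid index works
selected_language = "es"             # Spanish localization
file_path = os.path.join(output_folder, "output_es.wav")

tts.tts_to_file(
    text=text,
    speaker=selected_speaker,
    language=selected_language,
    file_path=file_path,
)
```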
You’ve now built a complete TTS script using Coqui TTS, gaining practical experience with its core functions and parameters. This foundational knowledge sets you up for more advanced implementations, where you'll explore more diverse voice models, localization effects, and additional synthesis features.
-
Step 6: Conclusion
Congratulations on completing Quest 2! You have successfully built a TTS script using Coqui TTS, explored various models, and experimented with different speakers and languages. Here’s a quick recap of what you’ve accomplished:
- Explored TTS Models: You learned to view available models in Coqui TTS and selected one with multilingual support to produce versatile outputs.
- Configured Speakers and Languages: You identified and configured the parameters for different voices and languages, gaining insights into how they affect the quality and style of speech synthesis.
- Built a Functional TTS Script: You created a script that converts text to speech, saves it as a WAV file, and allows for further customization.
- Extended Your Knowledge: You experimented with modifying the text, speaker, language, and output file names to understand the flexibility and depth of Coqui TTS.
As you continue your journey into more advanced TTS implementations, you can apply these foundational skills to create more complex voice outputs, explore additional models, and fine-tune speech synthesis to meet diverse needs. Up next, you’ll delve into more intricate features of Coqui TTS, including real-time synthesis and user interface integration, setting the stage for more dynamic applications of TTS.
Feel free to revisit any steps in this quest to refine your skills further, and don’t hesitate to try out new ideas to see the full range of what Coqui TTS can do!