It's been exactly a decade since I started attending GeekCon (yes, a geek conference), a weekend hackathon-makeathon where all projects must be useless and just for fun, and this year there was an exciting twist: all projects had to incorporate some form of AI.
My group's project was a speech-to-text-to-speech game, and here's how it works: the user selects a character to talk to, and then says out loud whatever they want to that character. The spoken input is transcribed and sent to ChatGPT, which responds as if it were the character. The answer is then read aloud using text-to-speech technology.
Now that the game is up and running, generating laughter and fun, I've put together this handy guide to help you create a similar game of your own. Throughout the article, we will also explore the various considerations and decisions we made during the hackathon.
Do you want to see the full code? Here is the link!
Once the server is running, the user hears the app "talk": it asks them to choose the figure they want to talk to, and then they can start chatting with the selected character. Every time they want to speak out loud, they hold down a key on the keyboard while speaking. When they finish speaking (and release the key), the recording is transcribed by Whisper (a speech-to-text model from OpenAI), and the transcript is sent to ChatGPT for an answer. The response is then read aloud using a text-to-speech library, and the user listens to it.
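The recording step itself isn't covered by the snippets below, but as a rough idea, push-to-talk recording can be done along these lines. This is only a sketch, not the code from our repo, and it assumes the keyboard, sounddevice, numpy and scipy packages:

# Push-to-talk sketch (not the project's actual code); assumes the
# `keyboard`, `sounddevice`, `numpy` and `scipy` packages are installed.
import keyboard
import numpy as np
import sounddevice as sd
from scipy.io.wavfile import write

SAMPLE_RATE = 16_000

def record_while_key_held(key: str = "space", path: str = "input.wav") -> str:
    keyboard.wait(key)  # block until the user presses the key
    frames = []
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1) as stream:
        # keep pulling small chunks from the microphone while the key is held
        while keyboard.is_pressed(key):
            chunk, _ = stream.read(SAMPLE_RATE // 10)
            frames.append(chunk)
    write(path, SAMPLE_RATE, np.concatenate(frames))
    return path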
Disclaimer
Note: the project was developed on a Windows operating system and uses the pyttsx3 library, which lacks support for Apple's M1/M2 chips. Since pyttsx3 is not supported on those Macs, affected users should look into alternative text-to-speech libraries that work on macOS.
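For example, one commonly used alternative (not something we used in our project, and it needs an internet connection) is gTTS, which generates an MP3 file that you can then play with any audio player:

# Possible macOS-friendly alternative, not used in our project: gTTS sends
# the text to Google's TTS service and saves the result as an MP3 file.
from gtts import gTTS

speech = gTTS(text="Nice to meet you, let's play!", lang="en")
speech.save("response.mp3")  # play this file with any audio player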
OpenAI integration
I used two OpenAI models: Whisper for speech-to-text transcription, and the ChatGPT API to generate responses, as the selected figure, based on the user's input. Using them costs money, but the pricing is very affordable: personally, my bill is still under $1 for all my usage. To get started I made an initial deposit of $5; to date I have not exhausted it, and it will not expire for another year.
I do not receive any payment or benefit from OpenAI for writing this.
Once you get your OpenAI API key, set it as an environment variable and use it when making API calls. Make sure you don't commit your key to the codebase or push it to any public location, and don't share it in an insecure way.
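For example, something along these lines keeps the key out of the codebase (the variable name is the one used throughout the snippets below):

# Set the key once in your shell, e.g.: export OPENAI_API_KEY="sk-..."
# then read it from the environment instead of hard-coding it:
import os
import openai

openai.api_key = os.environ.get("OPENAI_API_KEY")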
Speech to text: create transcription
The speech-to-text feature was implemented using Whisper, an OpenAI model.
Below is the code snippet of the function responsible for transcription:
import asyncio
import os
from threading import Thread
from typing import Optional

import openai


async def get_transcript(audio_file_path: str,
                         text_to_draw_while_waiting: str) -> Optional[str]:
    openai.api_key = os.environ.get("OPENAI_API_KEY")
    audio_file = open(audio_file_path, "rb")
    transcript = None

    async def transcribe_audio() -> None:
        nonlocal transcript
        try:
            response = openai.Audio.transcribe(
                model="whisper-1", file=audio_file, language="en")
            transcript = response.get("text")
        except Exception as e:
            print(e)

    # pass the function and its argument separately so the thread calls it,
    # instead of calling it here and handing the result to the thread
    draw_thread = Thread(target=print_text_while_waiting_for_transcription,
                         args=(text_to_draw_while_waiting,))
    draw_thread.start()

    transcription_task = asyncio.create_task(transcribe_audio())
    await transcription_task

    if transcript is None:
        print("Transcription not available within the specified timeout.")
    return transcript
This function is marked as asynchronous (async) since the API call may take a while to return a response, and we await it to make sure the program doesn't move on until the response is received.
As you can see, the get_transcript function also invokes the print_text_while_waiting_for_transcription function. Why? Since obtaining the transcript is a time-consuming task, we wanted to keep the user informed that the program is actively processing their request and not stuck or unresponsive. As a result, this text is printed gradually while the user waits for the next step.
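The actual helper lives in the repository, but a minimal sketch of what such gradual printing could look like is:

# Minimal sketch of a gradual-printing helper; the real implementation in the
# repo may differ (e.g. in timing or stopping conditions).
import sys
import time

def print_text_while_waiting_for_transcription(text: str) -> None:
    # print one character at a time so the user sees steady progress
    for char in text:
        sys.stdout.write(char)
        sys.stdout.flush()
        time.sleep(0.05)
    print()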
String matching using FuzzyWuzzy to compare text
After transcribing the speech to text, we either used the transcript as is or tried to match it against an existing string.
The comparison use cases were: selecting a figure from a predefined list of options, deciding whether to continue playing or not, and when choosing to continue, deciding whether to choose a new figure or continue with the current one.
In such cases, we wanted to compare the transcription of the user's spoken input with the options in our lists, and therefore decided to use the FuzzyWuzzy library for string matching.
This allowed the closest option from the list to be chosen, as long as the matching score exceeded a predefined threshold.
Here is a snippet of our function:
from typing import List

from fuzzywuzzy import fuzz


def detect_chosen_option_from_transcript(
        transcript: str, options: List[str]) -> str:
    best_match_score = 0
    best_match = ""

    for option in options:
        score = fuzz.token_set_ratio(transcript.lower(), option.lower())
        if score > best_match_score:
            best_match_score = score
            best_match = option

    # only accept the best match if it is a reasonably confident one
    if best_match_score >= 70:
        return best_match
    else:
        return ""
If you want to learn more about the FuzzyWuzzy library and its functions, you can check out an article I wrote about it here.
Get response from ChatGPT
Once we have the transcript, we can send it to ChatGPT to get an answer. For each ChatGPT request, we added a message asking for a short and funny response. We also told ChatGPT which figure to pretend to be.
So our function looked like this:
import logging


def get_gpt_response(transcript: str, chosen_figure: str) -> str:
    system_instructions = get_system_instructions(chosen_figure)
    try:
        return make_openai_request(
            system_instructions=system_instructions,
            user_question=transcript).choices[0].message["content"]
    except Exception as e:
        logging.error(f"could not get ChatGPT response. error: {str(e)}")
        raise e
and the system instructions were as follows:
def get_system_instructions(figure: str) -> str:
    return f"You provide funny and short answers. You are: {figure}"
Text to speech
For the text-to-speech part, we opted for a Python library called pyttsx3. Not only was it simple to implement, it also offered several advantages: it's free, it provides two voice options (male and female), and it lets you set the speaking rate in words per minute.
When a user starts the game, they choose a character from a predefined list of options. If we couldn't find a match for what they said within our list, we would randomly select a character from our “alternate figures” list. In both lists, each character was associated with a gender, so our text-to-speech function also received the voice ID corresponding to the selected gender.
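As an illustration, the mapping could look something like this; the names, constants and values here are placeholders rather than the repo's actual lists:

# Illustrative sketch only; the real figure lists and constants live in the repo.
import random
from enum import Enum

class Gender(Enum):
    MALE = "male"
    FEMALE = "female"

WORDS_PER_MINUTE_RATE = 160  # assumed speaking rate, in words per minute

FIGURES = {"Shrek": Gender.MALE, "Oprah Winfrey": Gender.FEMALE}
ALTERNATE_FIGURES = {"Donald Trump": Gender.MALE}

def pick_fallback_figure() -> str:
    # when the transcript doesn't match anything in FIGURES,
    # fall back to a random figure from the alternate list
    return random.choice(list(ALTERNATE_FIGURES))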
This is what our text to speech feature looked like:
import pyttsx3


def text_to_speech(text: str, gender: str = Gender.FEMALE.value) -> None:
    engine = pyttsx3.init()
    engine.setProperty("rate", WORDS_PER_MINUTE_RATE)
    voices = engine.getProperty("voices")
    # the first installed voice is used for male, the second for female
    voice_id = voices[0].id if gender == "male" else voices[1].id
    engine.setProperty("voice", voice_id)
    engine.say(text)
    engine.runAndWait()
The main flow
Now that we have more or less all the pieces of our app in place, it's time to dive into the game! The main flow is described below. You may notice some functions we haven't delved into (e.g. choose_figure, play_round), but you can explore the full code by checking the repository. For the most part, these higher-level functions wrap the internal functions we covered above.
Here's a snippet of the main game flow:
import asyncio

from src.handle_transcript import text_to_speech
from src.main_flow_helpers import choose_figure, start, play_round, \
    is_another_round


def farewell() -> None:
    farewell_message = "It was great having you here, " \
                       "hope to see you again soon!"
    print(f"\n{farewell_message}")
    text_to_speech(farewell_message)


async def get_round_settings(figure: str) -> dict:
    new_round_choice = await is_another_round()
    if new_round_choice == "new figure":
        return {"figure": "", "another_round": True}
    elif new_round_choice == "no":
        return {"figure": "", "another_round": False}
    elif new_round_choice == "yes":
        return {"figure": figure, "another_round": True}


async def main():
    start()
    another_round = True
    figure = ""

    while True:
        if not figure:
            figure = await choose_figure()

        while another_round:
            await play_round(chosen_figure=figure)
            user_choices = await get_round_settings(figure)
            figure, another_round = \
                user_choices.get("figure"), user_choices.get("another_round")
            if not figure:
                break

        if another_round is False:
            farewell()
            break


if __name__ == "__main__":
    asyncio.run(main())
We had several ideas in mind that we couldn't implement during the hackathon. This was either because we didn't find an API we were happy with during that weekend or because time constraints prevented us from developing certain features. These are the paths we did not take for this project:
Match the response voice with the “real” voice of the chosen figure
Imagine if the user decided to talk to Shrek, Trump, or Oprah Winfrey. We wanted our text-to-speech library or API to articulate responses using voices that matched the chosen figure. However, we were unable to find a library or API during the hackathon that offered this feature at a reasonable cost. We are still open to suggestions if you have any =)
Let users talk to “themselves”
Another interesting idea was to ask users to provide a vocal sample of themselves speaking. We would then train a model using this sample and have all responses generated by ChatGPT read aloud in the user's own voice. In this scenario, the user could choose the tone of the responses (affirmative and supportive, sarcastic, angry, etc.), but the voice would closely resemble the user's. However, we were unable to find an API that supported this within the limitations of the hackathon.
Add an interface to our application
Our initial plan was to include a frontend component in our application. However, due to a last-minute change in the number of participants in our group, we decided to prioritize backend development. As a result, the application currently runs in the command line interface (CLI) and has no user interface.
Latency is what bothers me the most right now.
There are several components in the flow with relatively high latency that, in my opinion, slightly detract from the user experience. For example: the time between the end of the audio input and the receipt of a transcript, and the time between the user pressing the key and the system actually starting to record audio. So if the user starts speaking immediately after pressing the key, at least the first second of audio won't be recorded because of this delay.
Do you want to see the complete project? it's right here!
Also, warm credit goes to Lior Yardeni, my hackathon partner, with whom I created this game.
In this article, we learned how to create a speech-to-text-to-speech game using Python, and intertwined it with AI. We used the Whisper model by OpenAI for speech recognition, played with the FuzzyWuzzy library for text matching, leveraged ChatGPT's conversational magic through its developer API, and brought it all to life with pyttsx3 for text-to-speech. While OpenAI's services (Whisper and ChatGPT for developers) do cost money, their pricing is modest and affordable.
We hope this guide has been enlightening and motivates you to embark on projects of your own.
Cheers to coding and fun!