Transcribing pre-recorded audio files is a common task, but many speech recognition applications need to interact with a user in real time. This means capturing audio directly from a microphone and processing it as soon as the user finishes speaking. Moving from static files to a live audio stream introduces new considerations, such as managing the microphone resource, detecting when a person starts and stops speaking, and handling background noise.
In this section, you will learn how to write a Python script that listens for your voice through a microphone, captures what you say, and converts it into text in near real-time.
The SpeechRecognition library we introduced earlier can work with different audio sources. In the previous section, you used an audio file as the source. Now, you will use the microphone. To access the microphone, the SpeechRecognition library depends on another library called PyAudio. If you followed the setup instructions, PyAudio should already be installed in your environment.
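Before opening the microphone, it can help to confirm that both packages are importable. The following is a minimal sketch rather than part of the SpeechRecognition API: `missing_modules` is a hypothetical helper, and it assumes the usual import names `speech_recognition` and `pyaudio`.

```python
import importlib.util


def missing_modules(names=("speech_recognition", "pyaudio")):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("SpeechRecognition and PyAudio are both importable.")
```

If either name is reported as missing, revisit the setup instructions before continuing, since `sr.Microphone()` cannot work without PyAudio.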
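The process for real-time recognition involves a few distinct steps: create a Recognizer, open the microphone as the audio source, optionally calibrate for ambient noise, listen for a phrase, send the captured audio to a recognition service, and then either use the returned text or handle the error.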
The following diagram illustrates this workflow.
Figure: The process of capturing and transcribing live audio, including paths for successful recognition and for handling errors.
Let's write a program that performs a single transcription. It will listen for you to say something, print the result, and then exit.
First, we need to import the library and create an instance of the Recognizer class, just as before.
import speech_recognition as sr
# Create a recognizer instance
r = sr.Recognizer()
Next, we access the microphone. It is good practice to use a with statement, which automatically handles closing the microphone resource when we are done with it.
import speech_recognition as sr

# Create a recognizer instance
r = sr.Recognizer()

# Use the default microphone as the audio source
with sr.Microphone() as source:
    print("Say something!")
    # Listen for the first phrase and extract it into audio data
    audio = r.listen(source)

try:
    # Recognize speech using Google's speech recognition
    text = r.recognize_google(audio)
    print(f"You said: {text}")
except sr.UnknownValueError:
    # This error is raised when speech is unintelligible
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    # This error is raised for network-related issues
    print(f"Could not request results from Google Speech Recognition service; {e}")
When you run this script, it prints "Say something!" and then waits. The r.listen(source) method blocks while it listens to the microphone. It detects when you start speaking and, once you pause, treats the phrase as finished, stops listening, and returns the captured audio, which the script stores in the audio variable.
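How long a pause ends a phrase is governed by the recognizer's `pause_threshold` attribute, measured in seconds, and `listen()` accepts an optional `phrase_time_limit` argument that caps the length of a single phrase. The sketch below shows one way to set both. `capture_phrase` is a hypothetical helper name, and the values are illustrative rather than recommended settings.

```python
def capture_phrase(recognizer, source, pause_threshold=1.0, phrase_time_limit=10):
    """Capture one phrase with an explicit pause length and maximum duration.

    `recognizer` is expected to behave like sr.Recognizer and `source`
    like an opened sr.Microphone; both are passed in by the caller.
    """
    # Seconds of silence that mark the end of a phrase (illustrative value).
    recognizer.pause_threshold = pause_threshold
    # Stop recording after `phrase_time_limit` seconds even if speech continues.
    return recognizer.listen(source, phrase_time_limit=phrase_time_limit)


# Usage with a real microphone (requires SpeechRecognition and PyAudio):
#   import speech_recognition as sr
#   r = sr.Recognizer()
#   with sr.Microphone() as source:
#       audio = capture_phrase(r, source)
```

`listen()` also accepts a `timeout` argument. If no speech begins within that many seconds it raises `sr.WaitTimeoutError`, which the caller would need to catch.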
The try...except block is important for handling two common issues:
sr.UnknownValueError: This occurs if the recognizer was unable to understand your speech. Perhaps it was too quiet, or the sound was not recognizable as a word.
sr.RequestError: This happens if there's a problem connecting to the recognition service's API, such as a lack of internet connection or an issue with the service itself.

A significant challenge in speech recognition is distinguishing speech from ambient noise. A noisy environment, like a room with a fan or people talking in the background, can confuse the recognizer. It might mistake noise for speech or struggle to isolate the speaker's voice.
The Recognizer class includes a helpful method, adjust_for_ambient_noise(), to mitigate this. This method listens to the audio source for a short period (one second by default) to measure the level of the background noise. It then uses this measurement to set an appropriate energy threshold, so that the recognizer is better at ignoring the noise and detecting actual speech.
You should call this method once after opening the microphone source and before you start listening for speech.
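To make the idea of an energy threshold concrete, here is a simplified illustration. It is not the library's actual implementation: the helper names are hypothetical, the audio is synthetic, the input is assumed to be 16-bit little-endian mono PCM, and the 1.5 margin above the noise floor is an arbitrary choice for the example.

```python
import math
import struct


def rms_energy(pcm_bytes):
    """Root-mean-square amplitude of little-endian 16-bit mono PCM audio."""
    count = len(pcm_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm_bytes[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def is_speech(pcm_bytes, energy_threshold):
    """Treat a chunk as speech when its energy exceeds the threshold."""
    return rms_energy(pcm_bytes) > energy_threshold


def make_tone(amplitude, n_samples=1600, freq=440.0, rate=16000):
    """Synthesize a sine tone as 16-bit PCM, standing in for recorded audio."""
    values = [
        int(amplitude * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n_samples)
    ]
    return struct.pack(f"<{n_samples}h", *values)


quiet_room = make_tone(amplitude=100)     # stands in for low-level background hum
loud_voice = make_tone(amplitude=5000)    # stands in for someone speaking
threshold = rms_energy(quiet_room) * 1.5  # illustrative margin above the noise floor

print(is_speech(quiet_room, threshold))  # False
print(is_speech(loud_voice, threshold))  # True
```

Calibration plays the role of the `threshold = ...` line: it samples the room first so that the cutoff sits above the noise floor instead of at a fixed guess.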
Let's modify our script to include this improvement.
import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    # We use the adjust_for_ambient_noise method to account for noise
    print("Calibrating for ambient noise, please wait...")
    r.adjust_for_ambient_noise(source, duration=1)

    print("Say something!")
    audio = r.listen(source)

try:
    text = r.recognize_google(audio)
    print(f"You said: {text}")
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    print(f"Could not request results; {e}")
By adding r.adjust_for_ambient_noise(source, duration=1), the application becomes more reliable in typical environments. The duration parameter sets how many seconds of audio are sampled. You can increase it if the background noise varies a lot, but the default of one second is often sufficient.
Our current script stops after one utterance. To build a more interactive application, such as the voice command tool in the final exercise, the program needs to listen continuously. We can achieve this by placing our listening logic inside a while loop.
The loop will continuously listen for input, attempt to recognize it, and print the result. We can also add a specific command, like "stop," to break the loop and exit the program gracefully.
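One detail worth deciding is how the exit command is matched. A plain substring test on the transcript would also trigger on words such as "unstoppable". The following sketch matches whole words instead. `is_stop_command` and its `stop_words` parameter are hypothetical names introduced for illustration.

```python
import re


def is_stop_command(text, stop_words=("stop",)):
    """Return True if any stop word appears as a whole word in `text`."""
    # Lowercase the transcript and split it into alphabetic words.
    words = re.findall(r"[a-z']+", text.lower())
    return any(word in stop_words for word in words)


print(is_stop_command("Please stop now"))       # True
print(is_stop_command("That was unstoppable"))  # False
```

Whether whole-word matching is worth the extra code depends on the application. For a quick demo, a substring test is usually good enough.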
Here is the complete script for a continuous listener:
import speech_recognition as sr


def main():
    """
    Listens for speech in a loop and transcribes it.
    """
    r = sr.Recognizer()

    with sr.Microphone() as source:
        # Adjust for ambient noise once at the start
        print("Calibrating...")
        r.adjust_for_ambient_noise(source, duration=1)
        print("Ready to listen.")

        while True:
            print("Listening...")
            try:
                # Listen for the next phrase
                audio = r.listen(source)

                # Recognize the speech
                text = r.recognize_google(audio)
                print(f"You said: {text}")

                # Add a condition to exit the loop
                if "stop" in text.lower():
                    print("Exiting program.")
                    break

            except sr.UnknownValueError:
                # This is not an error, just silence or unintelligible speech
                # We can simply continue listening.
                pass
            except sr.RequestError as e:
                print(f"Service error; {e}")
                # If the service is down, we might want to stop.
                break


if __name__ == "__main__":
    main()
This structure forms a solid basis for an interactive speech application. The program first calibrates to the room's noise level. Then, it enters an infinite loop where it listens, transcribes, and prints the text. If it cannot understand the audio (UnknownValueError), it simply ignores it and listens again. The loop terminates only if the request to the recognition service fails (RequestError) or if the transcribed text contains "stop". Because the check is a substring test, a word such as "unstoppable" also ends the program.
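Exiting on the first RequestError is the simplest policy, but a brief network hiccup would then end the session. One alternative is to retry the request a few times with an increasing delay. The helper below is a generic sketch, not something provided by SpeechRecognition: `with_retries` and its parameters are hypothetical names, and the attempt count and delays are illustrative.

```python
import time


def with_retries(operation, retry_on, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `operation()`; on a `retry_on` exception, wait and try again.

    The delay doubles after each failure. The final failure is re-raised
    so the caller can still decide to stop.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Inside the loop, the recognition call could become (sr, r and audio as in
# the script above):
#   text = with_retries(lambda: r.recognize_google(audio), retry_on=sr.RequestError)
```

Only RequestError is passed as `retry_on` here. Resending audio that raised UnknownValueError would not help, so that exception still propagates to the existing handler.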
You now have the tools to build applications that respond to live voice input, which you will put into practice in the final exercise of this chapter.