Capturing and Transcribing Microphone Input in Real-Time

Transcribing pre-recorded audio files is a common task, but many applications need to interact with a user in real time. This means capturing audio directly from a microphone and processing it as soon as the user finishes speaking. Moving from static files to a live audio stream introduces new challenges: managing the microphone resource, detecting when a person starts and stops speaking, and handling background noise.

In this section, you will learn how to write a Python script that listens for your voice through a microphone, captures what you say, and converts it into text in near real-time.

The Microphone as an Audio Source

The SpeechRecognition library we introduced earlier can work with different audio sources. In the previous section, you used an audio file as the source. Now, you will use the microphone. To access the microphone, the SpeechRecognition library depends on another library called PyAudio. If you followed the setup instructions, PyAudio should already be installed in your environment.
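If your machine has more than one input device, you may want to choose a microphone by name instead of relying on the default. The following is a minimal sketch of that lookup. The helper name find_device_index and the example device names are assumptions made for illustration; they are not part of the SpeechRecognition library.

```python
from typing import Optional, Sequence


def find_device_index(names: Sequence[Optional[str]], keyword: str) -> Optional[int]:
    """Return the index of the first device whose name contains keyword.

    The match is case-insensitive. Returns None when nothing matches,
    in which case the caller can fall back to the default microphone.
    """
    needle = keyword.lower()
    for index, name in enumerate(names):
        # Some entries may have no name; skip them.
        if name and needle in name.lower():
            return index
    return None


# Example device list (hypothetical); real names vary by machine.
example_names = ["Built-in Microphone", "USB Audio Device", "HDMI Output"]
print(find_device_index(example_names, "usb"))      # 1
print(find_device_index(example_names, "headset"))  # None
```

With the library installed, the list of names would come from sr.Microphone.list_microphone_names(), and the resulting index can be passed as sr.Microphone(device_index=index). Passing None selects the default device.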

The process for real-time recognition involves a few distinct steps:

Access the microphone as an audio source.
Listen for a phrase. The library waits until the sound level rises above an energy threshold, then keeps recording until a pause marks the phrase as complete.
Capture the spoken audio data.
Send this captured data to a recognition engine.
Process the transcribed text or any errors that occur.

Figure: The process of capturing and transcribing live audio, including paths for successful recognition and for handling errors.

Your First Live Transcription Script

Let's write a program that performs a single transcription. It will listen for you to say something, print the result, and then exit.

First, we need to import the library and create an instance of the Recognizer class, just as before.

import speech_recognition as sr

# Create a recognizer instance
r = sr.Recognizer()

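Next, we access the microphone, listen for a phrase, and pass the captured audio to the recognizer. It is good practice to open the microphone in a with statement, which automatically releases the microphone resource when the block ends. The complete script looks like this: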

import speech_recognition as sr

# Create a recognizer instance
r = sr.Recognizer()

# Use the default microphone as the audio source
with sr.Microphone() as source:
    print("Say something!")
    # Listen for the first phrase and extract it into audio data
    audio = r.listen(source)

try:
    # Recognize speech using Google's speech recognition
    text = r.recognize_google(audio)
    print(f"You said: {text}")
except sr.UnknownValueError:
    # This error is raised when speech is unintelligible
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    # This error is raised for network-related issues
    print(f"Could not request results from Google Speech Recognition service; {e}")

When you run this script, it prints "Say something!" and then waits. The r.listen(source) method blocks while it monitors the microphone. It detects when you start speaking and when you stop. Once you pause, it stops recording and returns the captured audio, which the script stores in the audio variable.
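To build some intuition for how a phrase can be delimited by a pause, here is a simplified sketch of energy-based endpoint detection. It is an illustration only, not the library's actual implementation: it works on a list of per-chunk energy values rather than raw audio, and the names find_phrase and pause_chunks are hypothetical. In the library, the corresponding settings are the recognizer's energy_threshold and pause_threshold attributes.

```python
from typing import Optional, Sequence, Tuple


def find_phrase(
    energies: Sequence[float],
    energy_threshold: float,
    pause_chunks: int,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) chunk indices of the first phrase, or None.

    A phrase starts at the first chunk whose energy exceeds the threshold
    and ends once `pause_chunks` consecutive quiet chunks follow speech.
    `end` is exclusive and does not include the trailing quiet chunks.
    """
    start = None
    quiet_run = 0
    for i, energy in enumerate(energies):
        if start is None:
            # Still waiting for speech to begin.
            if energy > energy_threshold:
                start = i
            continue
        if energy > energy_threshold:
            quiet_run = 0  # speech resumed, so reset the pause counter
        else:
            quiet_run += 1
            if quiet_run >= pause_chunks:
                return start, i - quiet_run + 1
    if start is None:
        return None
    # The stream ended before a full pause was observed.
    return start, len(energies) - quiet_run


# Two quiet chunks, speech with one short dip, then a long pause.
energies = [10, 20, 500, 600, 50, 700, 40, 30, 20, 800]
print(find_phrase(energies, energy_threshold=300, pause_chunks=3))  # (2, 6)
```

The single quiet chunk at index 4 is shorter than the required pause, so it stays inside the phrase. This is why a brief hesitation while speaking usually does not cut a phrase short.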

The try...except block is important for handling two common issues:

sr.UnknownValueError: This occurs if the recognizer was unable to understand your speech. Perhaps it was too quiet, or the sound was not recognizable as a word.
sr.RequestError: This happens if there's a problem connecting to the recognition service's API, such as a lack of internet connection or an issue with the service itself.
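Because a request failure is often transient, one option is to retry the recognition call a few times before giving up. The sketch below is deliberately library-agnostic: call_with_retries and TransientError are hypothetical names introduced for this example, and TransientError is only a stand-in so the code runs without a network connection.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    action: Callable[[], T],
    retriable: Tuple[Type[Exception], ...],
    attempts: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(), retrying when it raises one of the retriable errors.

    The last failure is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retriable:
            if attempt == attempts:
                raise
            sleep(delay_seconds)
    raise AssertionError("unreachable")


class TransientError(Exception):
    """Stand-in for a transient failure; not part of SpeechRecognition."""


calls = {"count": 0}


def flaky() -> str:
    # Fails twice, then succeeds, to simulate a brief outage.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("service unavailable")
    return "hello world"


result = call_with_retries(flaky, (TransientError,), sleep=lambda seconds: None)
print(result, calls["count"])  # hello world 3
```

In the script above, the equivalent call would be call_with_retries(lambda: r.recognize_google(audio), (sr.RequestError,)). sr.UnknownValueError is intentionally left out of the retriable errors, since resending the same unintelligible audio is unlikely to help.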

Handling Background Noise

A significant challenge in speech recognition is distinguishing speech from ambient noise. A noisy environment, like a room with a fan or people talking in the background, can confuse the recognizer. It might mistake noise for speech or struggle to isolate the speaker's voice.

The Recognizer class includes a helpful method, adjust_for_ambient_noise(), to mitigate this. This method samples the audio source for a short period (one second by default) to measure the level of the background noise. It then sets the recognizer's energy threshold accordingly, which makes the recognizer better at ignoring the noise and detecting actual speech.

You should call this method once after opening the microphone source and before you start listening for speech.

Let's modify our script to include this improvement.

import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    # We use the adjust_for_ambient_noise method to account for noise
    print("Calibrating for ambient noise, please wait...")
    r.adjust_for_ambient_noise(source, duration=1)

    print("Say something!")
    audio = r.listen(source)

try:
    text = r.recognize_google(audio)
    print(f"You said: {text}")
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    print(f"Could not request results; {e}")

With the call to r.adjust_for_ambient_noise(source, duration=1), the application becomes more reliable in typical environments. You can increase the duration parameter if the noise level in your environment varies a lot, but the default of one second is often sufficient.
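The idea behind calibration can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's exact algorithm: the function names rms and calibrate_threshold are hypothetical, the sample values are made up, and the ratio of 1.5 is chosen only for the example.

```python
import math
from typing import Sequence


def rms(samples: Sequence[int]) -> float:
    """Root-mean-square amplitude of one chunk of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def calibrate_threshold(
    ambient_chunks: Sequence[Sequence[int]], ratio: float = 1.5
) -> float:
    """Estimate an energy threshold from chunks recorded while nobody speaks.

    The threshold is the average chunk energy scaled by `ratio`, so that
    only sound clearly louder than the background counts as speech.
    """
    if not ambient_chunks:
        raise ValueError("need at least one ambient chunk")
    average = sum(rms(chunk) for chunk in ambient_chunks) / len(ambient_chunks)
    return average * ratio


# Two chunks of background noise (made-up sample values).
quiet_room = [[3, -3, 3, -3], [5, -5, 5, -5]]
threshold = calibrate_threshold(quiet_room)
print(threshold)                    # 6.0
print(rms([10, -10]) > threshold)   # True: louder than the background
print(rms([4, -4]) > threshold)     # False: treated as noise
```

After calling adjust_for_ambient_noise() in a real script, you can print r.energy_threshold to see the value the library settled on for your room.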

Creating a Continuous Listening Loop

Our current script stops after one utterance. To build a more interactive application, such as the voice command tool in the final exercise, the program needs to listen continuously. We can achieve this by placing our listening logic inside a while loop.

The loop will continuously listen for input, attempt to recognize it, and print the result. We can also add a specific command, like "stop," to break the loop and exit the program gracefully.

Here is the complete script for a continuous listener:

import speech_recognition as sr

def main():
    """
    Listens for speech in a loop and transcribes it.
    """
    r = sr.Recognizer()

    with sr.Microphone() as source:
        # Adjust for ambient noise once at the start
        print("Calibrating...")
        r.adjust_for_ambient_noise(source, duration=1)
        print("Ready to listen.")

        while True:
            print("Listening...")
            try:
                # Listen for the next phrase
                audio = r.listen(source)

                # Recognize the speech
                text = r.recognize_google(audio)
                print(f"You said: {text}")

                # Add a condition to exit the loop
                if "stop" in text.lower():
                    print("Exiting program.")
                    break

            except sr.UnknownValueError:
                # This is not an error, just silence or unintelligible speech
                # We can simply continue listening.
                pass
            except sr.RequestError as e:
                print(f"Service error; {e}")
                # If the service is down, we might want to stop.
                break

if __name__ == "__main__":
    main()

This structure forms a solid basis for an interactive speech application. The program first calibrates to the room's noise level. Then it enters an infinite loop in which it listens, transcribes, and prints the text. If it cannot understand the audio (UnknownValueError), it ignores that phrase and listens again. The loop terminates only if the recognition request fails (RequestError) or if the transcribed text contains "stop".
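A substring test such as "stop" in text.lower() also matches longer words like "unstoppable". If you want the command to trigger only on the whole word, a small helper along these lines may be useful. The name contains_command is hypothetical, and the sketch assumes a single-word command.

```python
import re


def contains_command(text: str, command: str = "stop") -> bool:
    """Return True if `command` appears in `text` as a whole word."""
    # Split the lowercased text into words, ignoring punctuation.
    words = re.findall(r"[a-z']+", text.lower())
    return command.lower() in words


print(contains_command("please stop now"))      # True
print(contains_command("Stop."))                # True
print(contains_command("that is unstoppable"))  # False
```

In the loop, the exit condition would then read if contains_command(text): instead of the substring test.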

You now have the tools to build applications that respond to live voice input, which you will put into practice in the final exercise of this chapter.

