Last mod: 2025.01.11

Raspberry Pi - Speech to text based on Vosk

Speech-to-text functionality is becoming so important in IoT and robotics, right? It makes interacting with devices so natural - no need to press buttons or type commands anymore. You can just talk, and the system understands. It’s especially useful in smart homes and industrial settings, where quick, hands-free interaction is key. And in robotics, it adds a whole new dimension. Robots can take verbal instructions and respond, which is incredible for tasks like assisting in healthcare or even customer service. It’s like giving machines a voice and ears - it makes the whole experience so much more human-friendly!

Hardware and Software

Raspberry Pi with microphone

Microphone

The Raspbery Pi does not include a built-in microphone. For this reason, we have to use an external one connected via USB. In the example I use the CAD Audio U9 USB MiniMic.

First we need to check that the system detects the microphone:

lsusb

lsusb

The device of interest is:

Bus 001 Device 004: ID 0d8c:0139 C-Media Electronics, Inc. Multimedia Headset [Gigaware by Ignition L.P.]

Vosk installation

Prepare and run dedicated enviroment:

python3 -m venv env_vosk
source env_vosk/bin/activate

Install required libraries and Vosk:

sudo apt install portaudio19-dev
pip install pyaudio vosk

Create directory form model and app example:

mkdir RaspberryPi_SpeechToText
cd RaspberryPi_SpeechToText/

Download and unpack small English model:

wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip

Create Python file vosk_app.py:

import os
import sys
import wave
import json
import pyaudio
from vosk import Model, KaldiRecognizer


model = Model("vosk-model-small-en-us-0.15")
audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=8000)
stream.start_stream()

recognizer = KaldiRecognizer(model, 16000)

print("Please speak up...")

while True:
    data = stream.read(3000)
    if recognizer.AcceptWaveform(data):
        result = recognizer.Result()
        print(json.loads(result)["text"])

And run:

python3 vosk_app.py

Wait few secont and test, say "hello world" or other short sentence. Example result:

python3 vosk_app.py and "hello world"

Links

https://www.cadaudio.com/products/product-application/u9
https://alphacephei.com/vosk/
https://alphacephei.com/vosk/models
https://gitlab.com/dziak.tech/examples/-/tree/main/IoT/RaspberryPi_SpeechToText