Speech Recognition and Speech to Text With Whisper

Lately my son needed to prepare a short presentation for school. So we looked around the internet and found five interesting websites and two videos about the topic. The videos were about three and six minutes long. My idea was to put all this information together and get a summary of it from ChatGPT. While you can upload quite a lot of different formats and media to ChatGPT nowadays, it can get expensive if you upload a lot of audio and video for processing. So doing it locally might save some money.

But it’s not only about this use case. If you find a video tutorial on YouTube, for example, you can download it and use Whisper to get a transcription of that tutorial. Now you can put that text into your local search engine, or if you have a local AI tool like Ollama running, you can upload the text there and ask questions about it. That becomes especially interesting if you have lots of text and extracted audio from various sources and use AI to accomplish certain tasks. This makes text, audio and video easily accessible and especially searchable.

I’m using Ubuntu 24.04 for this blog post. If you have a different OS or Ubuntu version some commands might differ, but most instructions should work for other OSes too. The transcription also works without a GPU. Using only the CPU is not the fastest thing in the world but it’s quite OK. I have an Nvidia GPU which is used by Whisper automatically if detected. For this it’s normally good enough to have the drivers installed. If you haven’t done so you can follow the guidelines in my blog post Setting up environment for Stable Diffusion for image creation. The first part of that blog post describes the Nvidia driver installation on Ubuntu.
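A quick way to check whether the driver is in place is nvidia-smi, which usually ships with the Nvidia driver packages:

```bash
# Check for a usable Nvidia GPU; Whisper picks it up automatically via PyTorch/CUDA.
if command -v nvidia-smi >/dev/null 2>&1; then
    # Show GPU name and total VRAM (the default Whisper model needs roughly 6 GByte)
    nvidia-smi --query-gpu=name,memory.total --format=csv
else
    echo "No Nvidia driver found - Whisper will run on the CPU"
fi
```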

So to get the transcription of what was spoken in the videos I came across Whisper from OpenAI, as mentioned above. Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification. I only tried English and German, but at least for these two languages the result is pretty good.

Since we need a bit of Python tooling, and since I don’t want to mess with the system Python, the first tool I install is uv. For me this has become THE ultimate tool for all Python needs. It’s an extremely fast Python package and project manager - and it’s really fast 😉 Funny enough it’s written in Rust 😄 AI tooling, libraries, and so on often depend on specific Python versions. Handling different Python versions is very easy with uv, and so is managing virtual environments, installing programs and pulling dependencies. Ubuntu 24.04 uses Python 3.12 by default. Currently that’s not supported by Whisper, which only works with Python 3.9 to 3.11. This is also where uv helps a lot.

The easiest way to install uv is via pip (yeah, I know…):

bash

sudo apt install python3-pip

To install uv:

bash

pip install --break-system-packages --user uv

Why is --break-system-packages needed? Starting with Python 3.11 and PEP 668 (Marking Python base environments as “externally managed”), using pip to install Python packages only gives an error message: error: externally-managed-environment. To install Python packages you should use the OS package manager, e.g. apt install python3-xxx. But uv isn’t available in the Ubuntu 24.04 package repositories. While it’s a good thing not to mix Python packages managed by the OS with Python packages installed via pip, the error message makes no sense IMHO if pip install --user ... is used. Using --user installs everything into the $HOME/.local directory by default and nothing of the system Python is changed. For that reason I don’t get why the error appears in this case. But I guess some clever people had a reason to decide otherwise 😉

So the uv binary will end up in $HOME/.local/bin. To make the tool available we need to extend the PATH variable. E.g.:

bash

export PATH=$HOME/.local/bin:$PATH

To make that change permanent you can also put this line into $HOME/.bashrc and load the change with source $HOME/.bashrc, for example.

Now let’s create a Python virtual environment with Python 3.11 (--python 3.11) and also add some tools like pip (--seed):

bash

uv venv --python 3.11 --seed audio_extract

This will create a directory called audio_extract. Let’s enter that directory and activate the Python virtual environment (“activate” means that the python and pip binaries of that virtual environment will be used, and not e.g. the ones that were installed with the OS package manager):

bash

cd audio_extract
source bin/activate

To extract audio from a video I’ll use ffmpeg. On Ubuntu it can be installed via:

bash

sudo apt install ffmpeg
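If you already have a video file lying around locally, extracting the audio track with ffmpeg looks roughly like this (the file names here are just examples; the stream copy variant assumes the video carries an AAC audio track):

```bash
# Copy the audio stream out of a local video without re-encoding (-vn drops the video)
ffmpeg -i tutorial.mp4 -vn -acodec copy tutorial.m4a

# Or convert it to 16 kHz mono WAV, the sample rate Whisper uses internally
ffmpeg -i tutorial.mp4 -vn -ar 16000 -ac 1 tutorial.wav
```

For the YouTube video below this step isn’t strictly needed, since Whisper can read the downloaded audio file directly.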

So let’s assume you’ve found a tutorial on YouTube. To extract the audio of that video we first need to download the video. This can be done with another Python utility called yt-dlp. Let’s install it with uv:

bash

uv tool install yt-dlp

The video I want to transcribe is The Astronomer - Walt Whitman (Powerful Life Poetry).

Since we only need the audio stream of that video let’s see what formats are offered for that video:

bash

yt-dlp --list-formats https://www.youtube.com/watch?v=uhuTG3CCY-w

The list for this video is pretty long but I’m only interested in the audio-only formats:

plain

[info] Available formats for uhuTG3CCY-w:
ID      EXT   RESOLUTION FPS CH │   FILESIZE    TBR PROTO │ VCODEC           VBR ACODEC      ABR ASR MORE INFO
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
233     mp4   audio only        │                   m3u8  │ audio only           unknown             [en] Default
234     mp4   audio only        │                   m3u8  │ audio only           unknown             [en] Default
599-drc m4a   audio only      2 │  341.04KiB    31k https │ audio only           mp4a.40.5   31k 22k [en] ultralow, DRC, m4a_dash
600-drc webm  audio only      2 │  389.54KiB    35k https │ audio only           opus        35k 48k [en] ultralow, DRC, webm_dash
599     m4a   audio only      2 │  340.87KiB    31k https │ audio only           mp4a.40.5   31k 22k [en] ultralow, m4a_dash
600     webm  audio only      2 │  387.38KiB    35k https │ audio only           opus        35k 48k [en] ultralow, webm_dash
139-drc m4a   audio only      2 │  540.02KiB    49k https │ audio only           mp4a.40.5   49k 22k [en] low, DRC, m4a_dash
249-drc webm  audio only      2 │  566.72KiB    51k https │ audio only           opus        51k 48k [en] low, DRC, webm_dash
250-drc webm  audio only      2 │  742.26KiB    67k https │ audio only           opus        67k 48k [en] low, DRC, webm_dash
139     m4a   audio only      2 │  539.74KiB    49k https │ audio only           mp4a.40.5   49k 22k [en] low, m4a_dash
249     webm  audio only      2 │  564.42KiB    51k https │ audio only           opus        51k 48k [en] low, webm_dash
250     webm  audio only      2 │  738.40KiB    67k https │ audio only           opus        67k 48k [en] low, webm_dash
140-drc m4a   audio only      2 │    1.40MiB   130k https │ audio only           mp4a.40.2  130k 44k [en] medium, DRC, m4a_dash
251-drc webm  audio only      2 │    1.43MiB   132k https │ audio only           opus       132k 48k [en] medium, DRC, webm_dash
140     m4a   audio only      2 │    1.40MiB   130k https │ audio only           mp4a.40.2  130k 44k [en] medium, m4a_dash
251     webm  audio only      2 │    1.43MiB   132k https │ audio only           opus       132k 48k [en] medium, webm_dash

And I want the best quality. So I’ll choose the format with ID 251 (the last one in the list above) and download it:

bash

yt-dlp --output video.webm --format 251 https://www.youtube.com/watch?v=uhuTG3CCY-w

Now I have a file called video.webm with the audio stream.
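As a side note, instead of picking a format ID manually, yt-dlp can also select the best audio stream by itself, and can even convert it directly via ffmpeg. A sketch with the same video URL:

```bash
# Let yt-dlp pick the best audio-only format itself
yt-dlp --output video.webm --format bestaudio https://www.youtube.com/watch?v=uhuTG3CCY-w

# Or download and convert straight to mp3 (-x/--extract-audio requires ffmpeg)
yt-dlp -x --audio-format mp3 --output "video.%(ext)s" https://www.youtube.com/watch?v=uhuTG3CCY-w
```

Listing the formats first, as above, still helps if you want to control exactly which stream you get.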

Now it’s finally time to install Whisper. In this case I use pip again. Since I’ve activated the virtual environment, I’m now using the pip of that venv and not the one of the OS base installation. Everything Python related now happens within that venv. So if you mess something up, just delete the audio_extract directory and start from scratch. But now:

bash

pip install git+https://github.com/openai/whisper.git

This could take a bit depending on your Internet connection, as this will download around 4-5 GByte of Python libraries and related dependencies. Once the installation is done, let’s transcribe the audio:

bash

whisper video.webm --language English

This will again download another 1.5 GByte of data (but only during the very first run). OK, let’s see what we get:

text

[00:00.000 --> 00:18.920]  When I heard the learned astronomer, when the proofs, the figures, were ranged in columns
[00:18.920 --> 00:30.740]  before me, when I was shown the charts and diagrams to add, divide, and measure them, when
[00:30.740 --> 00:39.440]  I sitting heard the astronomer where he lectured with much applause in the lecture room, how
[00:39.440 --> 00:50.060]  soon, unaccountable, I became tired and sick, till rising and gliding out, I wandered off
[00:50.060 --> 01:01.440]  by myself, in the mystical moist night air, and from time to time, looked up in perfect
[01:01.440 --> 01:04.680]  silence at the stars.

Not bad 😉 By default Whisper uses the turbo model (see Available models and languages). That requires around 6 GByte of VRAM. So a GPU with around 8 GByte of VRAM is needed to fit the model in VRAM (if you use a GPU). I also tried the large model but the result wasn’t any better (partly even worse). To try different models just use e.g. --model medium.
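A few more knobs that the whisper command line offers (see whisper --help for the full list):

```bash
# Use a smaller model (faster, less VRAM) and write the result as a plain text
# file into a separate directory instead of printing everything to the terminal
whisper video.webm --language English --model medium \
    --output_format txt --output_dir transcripts

# Translate non-English speech to English while transcribing
whisper video.webm --language German --task translate
```

Especially --output_format is handy for the use case from the beginning: a .txt file per video can go straight into a local search engine or an AI tool like Ollama.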