Speech Recognition and Speech to Text With Whisper
Introduction
Recently my son needed to prepare a short presentation for school. So we looked around the internet and found five interesting websites and two videos about the topic. The videos were about three and six minutes long. My idea was to put all this information together and get a summary of it from ChatGPT. While you can upload quite a lot of different formats and media to ChatGPT nowadays, it can get expensive if you upload a lot of audio and video for processing. So doing it locally might save some money.
But it’s not only about this use case. If you found a video tutorial on YouTube, for example, you can download it and use Whisper to get a transcription of that tutorial. Then you can put that text into your local search engine, or if you have a local AI tool like Ollama running, you can upload the text there and ask questions about it. That becomes especially interesting if you have lots of text and extracted audio from various sources and use AI to accomplish certain tasks. This makes text, audio and video easily accessible and especially searchable.
I’m using Ubuntu 24.04 for this blog post. If you have a different OS or Ubuntu version some commands might differ, but most instructions should work for other OSes too. The transcription also works without a GPU. Using only the CPU is not the fastest thing in the world but it’s quite ok. I have an Nvidia GPU which is used by Whisper automatically if detected. For this it’s normally good enough to have the drivers installed. If you haven’t done so you can follow the guidelines in my blog post Setting up environment for Stable Diffusion for image creation. The first part of that blog post describes the Nvidia driver installation on Ubuntu.
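If you are not sure whether the Nvidia driver is already active, a quick check from the terminal is:
nvidia-smi
If the driver works, this prints the driver version and the detected GPU(s).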
So to get a transcription of what was spoken in the videos I came across Whisper from OpenAI, as mentioned above. Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification. I only tried English and German, but at least for these two languages the result is pretty good.
Install uv utility
Since we need a bit of Python tooling and since I don’t want to mess with the system Python, the first tool I install is uv. For me this has become THE ultimate tool for all Python needs. It’s an extremely fast Python package and project manager - and it’s really fast 😉 Funny enough it’s written in Rust 😄 AI tooling, libraries, and so on often depend on specific Python versions. Handling different Python versions is very easy with uv, as is managing virtual environments, installing programs and pulling dependencies. Ubuntu 24.04 uses Python 3.12 by default. Currently that’s not supported by Whisper. It only works with Python 3.9 to 3.11. This is also where uv helps a lot.
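Just to give a first impression of that: uv can list and install Python versions on its own, e.g.:
uv python list
uv python install 3.11
uv downloads standalone Python builds for this, so the system Python stays untouched.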
The easiest way to install uv is pip (yeah, I know…):
sudo apt install python3-pip
To install uv:
pip install --break-system-packages --user uv
Why is --break-system-packages needed? Starting with Python 3.11 and PEP 668 (Marking Python base environments as “externally managed”), using pip to install Python packages at first only gives you the error message error: externally-managed-environment. To install Python packages you should use the OS package manager, like apt install python3-xxx. But uv isn’t available in the Ubuntu 24.04 package repository. While it’s a good thing not to mix Python packages managed by the OS with Python packages installed via pip, the error message makes no sense if pip install --user ... is used IMHO. Using --user will install everything in the $HOME/.local directory by default and nothing of the system Python will be changed. For that reason I don’t get why this error message appears if you use pip install --user .... But I guess some clever people had a reason to decide otherwise 😉
So the uv binary will end up in $HOME/.local/bin. To make the tool available we need to extend the PATH variable. E.g.:
export PATH=$HOME/.local/bin:$PATH
To make that change permanent you can also put this line into $HOME/.bashrc and load the change with source $HOME/.bashrc.
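For example, and to verify that the shell now finds the tool:
echo 'export PATH=$HOME/.local/bin:$PATH' >> $HOME/.bashrc
source $HOME/.bashrc
uv --version
The last command should print the installed uv version.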
Install Python virtual environment
Now let’s install a Python virtual environment with Python 3.11 (--python 3.11) and also add some tools like pip (--seed):
uv venv --python 3.11 --seed audio_extract
This will create a directory called audio_extract. Let’s enter that directory and activate the Python virtual environment (“activate” means that the python and pip binaries of that virtual environment will be used and not the ones installed with the OS package manager):
cd audio_extract
source bin/activate
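To verify that the virtual environment is really active you can check which python binary is used now:
which python
python --version
The first command should print a path inside the audio_extract directory and the second one a Python 3.11.x version.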
Install ffmpeg
To extract audio from a video I’ll use ffmpeg. On Ubuntu it can be installed via
sudo apt install ffmpeg
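Whisper uses ffmpeg under the hood to decode audio anyway, so later on it can read the downloaded file directly. But ffmpeg is also handy if you want a plain audio file first, e.g. to trim or inspect it. A minimal sketch (input.mp4 and audio.wav are just placeholder names):
ffmpeg -i input.mp4 -vn -ar 16000 -ac 1 audio.wav
-vn drops the video stream, and -ar 16000 -ac 1 resamples to 16 kHz mono, which is the format Whisper works with internally anyway.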
Install yt-dlp and download video
So let’s assume you’ve found a tutorial on YouTube. To extract the audio of that video we first need to download the video. This can be done with another Python utility called yt-dlp. Let’s install it with uv:
uv tool install yt-dlp
The video I want to transcribe is The Astronomer - Walt Whitman (Powerful Life Poetry).
Since we only need the audio stream of that video let’s see what formats are offered for that video:
yt-dlp --list-formats https://www.youtube.com/watch?v=uhuTG3CCY-w
The list for this video is pretty long but I’m only interested in the audio only formats:
[info] Available formats for uhuTG3CCY-w:
ID EXT RESOLUTION FPS CH │ FILESIZE TBR PROTO │ VCODEC VBR ACODEC ABR ASR MORE INFO
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
233 mp4 audio only │ m3u8 │ audio only unknown [en] Default
234 mp4 audio only │ m3u8 │ audio only unknown [en] Default
599-drc m4a audio only 2 │ 341.04KiB 31k https │ audio only mp4a.40.5 31k 22k [en] ultralow, DRC, m4a_dash
600-drc webm audio only 2 │ 389.54KiB 35k https │ audio only opus 35k 48k [en] ultralow, DRC, webm_dash
599 m4a audio only 2 │ 340.87KiB 31k https │ audio only mp4a.40.5 31k 22k [en] ultralow, m4a_dash
600 webm audio only 2 │ 387.38KiB 35k https │ audio only opus 35k 48k [en] ultralow, webm_dash
139-drc m4a audio only 2 │ 540.02KiB 49k https │ audio only mp4a.40.5 49k 22k [en] low, DRC, m4a_dash
249-drc webm audio only 2 │ 566.72KiB 51k https │ audio only opus 51k 48k [en] low, DRC, webm_dash
250-drc webm audio only 2 │ 742.26KiB 67k https │ audio only opus 67k 48k [en] low, DRC, webm_dash
139 m4a audio only 2 │ 539.74KiB 49k https │ audio only mp4a.40.5 49k 22k [en] low, m4a_dash
249 webm audio only 2 │ 564.42KiB 51k https │ audio only opus 51k 48k [en] low, webm_dash
250 webm audio only 2 │ 738.40KiB 67k https │ audio only opus 67k 48k [en] low, webm_dash
140-drc m4a audio only 2 │ 1.40MiB 130k https │ audio only mp4a.40.2 130k 44k [en] medium, DRC, m4a_dash
251-drc webm audio only 2 │ 1.43MiB 132k https │ audio only opus 132k 48k [en] medium, DRC, webm_dash
140 m4a audio only 2 │ 1.40MiB 130k https │ audio only mp4a.40.2 130k 44k [en] medium, m4a_dash
251 webm audio only 2 │ 1.43MiB 132k https │ audio only opus 132k 48k [en] medium, webm_dash
And I want the best quality. So I’ll choose the format with ID 251 (the last one in the list above) and download it:
yt-dlp --output video.webm --format 251 https://www.youtube.com/watch?v=uhuTG3CCY-w
Now I have a file called video.webm that contains only the audio stream.
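As a side note: yt-dlp can also do the audio extraction itself (it uses ffmpeg for that). If you don’t want to pick a format ID manually, something like this should give you the best available audio directly:
yt-dlp --format bestaudio --extract-audio --output 'audio.%(ext)s' https://www.youtube.com/watch?v=uhuTG3CCY-w
Add --audio-format mp3 if you additionally want the result re-encoded as MP3.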
Install Whisper
Now it’s finally time to install Whisper. In this case I use pip again. Since I’ve activated the virtual environment I’m now using the pip of that venv and not the one of the OS base installation. Everything Python related now happens within that venv. So if you mess something up just delete the audio_extract folder and start from scratch. But now:
pip install git+https://github.com/openai/whisper.git
This could take a while depending on your Internet connection, as it will download around 4-5 GByte of Python libraries and related dependencies.
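A quick way to check that the installation worked:
whisper --help
This prints all available command line options, including the --language and --model options used below.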
Use Whisper
Now let’s feed the downloaded file to Whisper and tell it which language is spoken:
whisper video.webm --language English
This will again download another 1.5 GByte of data (but only during the very first run). Ok, let’s see what we get:
[00:00.000 --> 00:18.920] When I heard the learned astronomer, when the proofs, the figures, were ranged in columns
[00:18.920 --> 00:30.740] before me, when I was shown the charts and diagrams to add, divide, and measure them, when
[00:30.740 --> 00:39.440] I sitting heard the astronomer where he lectured with much applause in the lecture room, how
[00:39.440 --> 00:50.060] soon, unaccountable, I became tired and sick, till rising and gliding out, I wandered off
[00:50.060 --> 01:01.440] by myself, in the mystical moist night air, and from time to time, looked up in perfect
[01:01.440 --> 01:04.680] silence at the stars.
Not bad 😉 By default Whisper uses the turbo model (see Available models and languages). That requires around 6 GByte of VRAM, so a GPU with around 8 GB of VRAM is needed to fit the model in VRAM (if you use a GPU). I also tried the large model but the result wasn’t any better (partly it was even worse). To try a different model just use e.g. --model medium.
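By default Whisper writes the transcription next to the input file in several formats (.txt, .srt, .vtt, .tsv and .json). Just as an example, a few of the options combined:
whisper video.webm --language English --model medium --output_format txt --output_dir transcripts
There is also --task translate which translates the spoken text to English - quite handy for foreign language material.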