Christoph Dähne | 25.09.2022
Automate Podcast transcripts with OpenAI/Whisper
- Tech
This is part 2 of my blog post about podcast transcription with Whisper. In part 1 I used OpenAI/Whisper to transcribe one of our podcasts.
One reader reached out to me and asked: how can you also distinguish speakers?
Thanks so much for asking!
It is not possible with Whisper alone, but the question stuck with me. A colleague of mine pointed me to a tiny GitHub project consisting of brief documentation, CLI commands and code snippets. Whisper transcripts plus speaker recognition: it hits the nail on the head. Many thanks for publishing those findings!
TL;DR Use pyannote.audio for speaker diarization, then OpenAI/Whisper for transcription.
The author uses another dedicated AI for speaker diarization (who is speaking when?). Whisper still takes care of the transcription. The key idea: before running Whisper, the original audio file is split into small chunks. In each chunk you hear only one speaker. Played back in order, the chunks add up to the original podcast (mostly).
Episode 45 of our podcast is the guinea pig again: 1:27h long, two German speakers speaking … well … German.
On one of our medium-sized in-house servers, the overall process takes 12h. There is room for improvement, but I focused solely on a working prototype. I expect the performance to increase significantly with better hardware support and improvements to the scripts. The original audio file is segmented into 311 individual parts.
Here are the first minutes of the transcript. You can listen along while reading. The full transcript is available here. Though not perfect, the result is really good.
Tobias
Herzlich willkommen zu einer neuen Folge vom Sandpapier, dem regelmäßig unregelmäßigen Podcast von Sandstorm, in dem wir über unsere Herausforderungen und Themen aus unserem Alltag als Softwareagentur sprechen. Heute haben wir eine Folge geplant über das Thema Sabena, wer und was das ist, besprechen wir heute. Dazu tauschen sich heute aus die Karo, hallo Karo.
Karo
Hallo, lieber Tobi.
Tobias
Und ich, der Tobias. Genau. Wir haben… Wann hat das eigentlich angefangen? Sabena Karo. Wann hat das gestartet? Letztes Jahr.
Karo
letztes Jahr noch im Herbst. Ich habe im September angefangen und ich glaube, ich bin ziemlich früh damit reingekommen. Da haben sich alle noch gefragt, was will sie denn damit. Das könnte spätestens Oktober gewesen sein oder aller spätestens November, dass wir darüber gesprochen haben, ob wir das machen.
Tobias
Also, wir reflektieren heute mal ein Thema, ein Projekt, was wir, was jetzt ein gut ein Jahr her ist, also gut ein Jahr und so lange beschäftigt und begleitet haben.
Karo
Also gut ein Jahr und so lange beschäftigt und begleitet.
Tobias
Und es hat wieder mit unserer Nachhaltigkeitsreise zu tun. Deswegen sehen wir das Ganze hier als Fortsetzung unserer Serie zu dem Thema Nachhaltigkeit, die wir als Softwareagentur anstreben. Wir hatten ja schon die Folgen Shift Happens und die letzte Folge zum Thema war Nachhaltige Software.
Karo
Genau, mit Christoph und Sebastian. Meinst du, wir kriegen noch zusammen, was alles seit Shift Happens so passiert ist? Meinst du? Fällt dir was? Also, ich habe ein bisschen gespickt, ich muss sagen, die Zeit ist so gerannt. Das ist es.
Tobias
Also wenn ich jetzt grübe…
Karo
Ja.
Tobias
Ich würde solche Sachen wie Girl’s Day Academy wahrscheinlich dazu zählen.
Karo
Genau, die Girl’s Take Akademie haben wir gemacht.
Tobias
Mal wieder mal einen Klimastreik.
If multiple speakers talk at the same time, the algorithm is of course in trouble. Some audio segments might be assigned both to speaker 1 and to speaker 2. This might result in repetitions within the transcript. Usually no big deal, but something to keep in mind.
Currently the script assumes that there are exactly two speakers. I still need to add a parameter for that. Same goes for the length of the intro.
Last thing: avoid whitespace in the path to the input file. Just rename it.
As mentioned, I got the overall idea from github.com/Majdoddin/nlp and implemented my own version of it.
You need at least Python 3.9 on your system. Also ensure you have the following packages installed: zlib1g-dev, libffi-dev, ffmpeg, libbz2-dev, liblzma-dev and ack. Those are Debian package names; they might differ for other distributions.
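On Debian or Ubuntu that boils down to something like:

```bash
# Debian/Ubuntu package names; adjust for your distribution
sudo apt-get install zlib1g-dev libffi-dev ffmpeg libbz2-dev liblzma-dev ack
```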
I usually work with project-scoped virtual environments. You can create one and install the needed Python libraries as follows.
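A minimal sketch; I am assuming pyannote.audio from PyPI, Whisper from its GitHub repository, and pydub for the audio splitting further below:

```bash
# create and activate a project-scoped virtual environment
python3.9 -m venv .venv
source .venv/bin/activate

# diarization, audio splitting (pydub is my choice here) and transcription
pip install pyannote.audio pydub
pip install git+https://github.com/openai/whisper.git
```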
In my case, the quick start example from pyannote/speaker-diarization is already sufficient.
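A sketch along the lines of that quick start (the model requires accepting its terms of use and a free Hugging Face access token; the token string and the input file name are placeholders):

```python
# diarization along the lines of the pyannote/speaker-diarization quick start
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="YOUR_HUGGINGFACE_TOKEN",  # placeholder, get yours on huggingface.co
)
diarization = pipeline("45-sabena.wav")  # hypothetical input file name

# print one line per detected speaker turn
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s {speaker}")
```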
The result looks like this (the timestamps below are illustrative, not the actual values for episode 45):
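```
start=6.2s stop=14.4s SPEAKER_00
start=14.8s stop=15.6s SPEAKER_01
start=16.0s stop=41.3s SPEAKER_00
…
```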
This is the script that segments the original audio file into individual chunks, 311 in the case of episode 45. If you play all chunks in order, you hear the full podcast except for the intro and some moments of silence.
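Not the original script, but a sketch of the segmentation step, reusing the diarization result from above and pydub; the chunk naming follows the placeholders described below:

```python
# segment the original audio into one file per speaker turn,
# reusing the `diarization` result from the previous snippet
from pydub import AudioSegment

audio = AudioSegment.from_file("45-sabena.wav")  # hypothetical input file name

for i, (turn, _, speaker) in enumerate(
    diarization.itertracks(yield_label=True), start=1
):
    # pydub slices audio in milliseconds
    chunk = audio[int(turn.start * 1000):int(turn.end * 1000)]
    chunk.export(f"{i}_{speaker}.wav", format="wav")  # e.g. 1_SPEAKER_00.wav
```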
The result looks like this (an illustrative directory listing):
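```
1_SPEAKER_00.wav
2_SPEAKER_01.wav
3_SPEAKER_00.wav
…
311_SPEAKER_01.wav
```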
One audio file in, one transcript out… well, almost. The result in transcript.txt does not contain the speaker names yet, but only anonymous placeholders like 1_SPEAKER_00 or 12_SPEAKER_00. Note that the leading number is the number of the audio chunk and the trailing number is the number of the speaker.
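The transcription loop itself could look roughly like this (a sketch, not the original script; the model size and language setting are my assumptions):

```python
# transcribe all chunks in order and write a transcript with
# speaker placeholders like 1_SPEAKER_00
import glob
import whisper

model = whisper.load_model("medium")  # model size is an assumption

chunks = sorted(
    glob.glob("*_SPEAKER_*.wav"),
    key=lambda name: int(name.split("_")[0]),  # sort by leading chunk number
)

with open("transcript.txt", "w") as transcript:
    for chunk in chunks:
        result = model.transcribe(chunk, language="de")
        label = chunk.removesuffix(".wav")  # e.g. 1_SPEAKER_00
        transcript.write(f"{label}\n{result['text'].strip()}\n\n")
```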
You have to replace those tokens with the speakers' names by hand, using a clever search-and-replace.
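For example (assuming, hypothetically, that SPEAKER_00 turned out to be Tobias and SPEAKER_01 to be Karo):

```bash
# map the anonymous labels to speaker names; the mapping is hypothetical
sed -i -E 's/^[0-9]+_SPEAKER_00$/Tobias/; s/^[0-9]+_SPEAKER_01$/Karo/' transcript.txt
```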
Usage: ./transcribe-podcast.sh …/45-sabena.mp4
After execution I find all results in ./45-sabena.
In my first two approaches I tried to run Whisper first and then merge the transcript with the speaker diarization afterwards. I tried both the published recording and the single-speaker tracks (we record each speaker individually for our podcasts).
In both cases, the time granularity Whisper provides with the default settings does not allow distinguishing between speakers. If you have speaker changes at seconds 8 and 10, but a transcript segment like:
[00:06.240 --> 00:14.360] Welcome to today’s Podcast. I am your host Tobias. And I am Karo. Welcome Karo.
then you simply cannot tell which words belong to which speaker.
The current setup is able to transcribe our podcasts. The quality of the result is sufficient and the transcripts add value to the podcast blog articles. The performance leaves something to be desired, but I am sure we can improve it.
Thanks for reading. As usual feedback is always welcome.
All code examples in this article are licensed under the MIT license. So enjoy.