AI podcast transcripts with speaker detection

This is part 2 of my last blog post about podcast transcription with Whisper. There I used OpenAI/Whisper to transcribe one of our podcasts.
One reader reached out to me and asked: how can you also distinguish speakers?

Thanks so much for asking!

It is not possible with Whisper alone, but the question stuck with me. A colleague of mine pointed me to a tiny GitHub project consisting of brief documentation, CLI commands and code snippets. Whisper transcripts plus speaker recognition: it hits the nail on the head. Many thanks for publishing those findings.

Approach – How it works

TL;DR Use pyannote for speaker diarization first, then OpenAI/Whisper for transcription.

The author uses another dedicated AI model for speaker diarization (who is speaking when?). Whisper still takes care of the transcription. The key idea: before Whisper runs, the original audio file is split into small chunks. In each chunk you can hear only one speaker. All chunks played in order result in the original podcast (mostly).

Metrics – What happens

Episode 45 of our podcast is the guinea pig again: 1:27h long, two German speakers speaking … well … German.

On one of our medium-sized in-house servers, the overall process takes 12h. There is room for improvement, but I solely focused on a working prototype. I expect the performance to increase significantly with better hardware support and improvements to the scripts. The original audio file is segmented into 311 individual parts.

Result – The transcript

Here are the first minutes of the transcript. You can listen along while reading. The full transcript is available here. Though not perfect, the result is really good.

Herzlich willkommen zu einer neuen Folge vom Sandpapier, dem regelmäßig unregelmäßigen Podcast von Sandstorm, in dem wir über unsere Herausforderungen und Themen aus unserem Alltag als Softwareagentur sprechen. Heute haben wir eine Folge geplant über das Thema Sabena, wer und was das ist, besprechen wir heute. Dazu tauschen sich heute aus die Karo, hallo Karo.

Hallo, lieber Tobi.

Und ich, der Tobias. Genau. Wir haben… Wann hat das eigentlich angefangen? Sabena Karo. Wann hat das gestartet? Letztes Jahr.

letztes Jahr noch im Herbst. Ich habe im September angefangen und ich glaube, ich bin ziemlich früh damit reingekommen. Da haben sich alle noch gefragt, was will sie denn damit. Das könnte spätestens Oktober gewesen sein oder aller spätestens November, dass wir darüber gesprochen haben, ob wir das machen.

Also, wir reflektieren heute mal ein Thema, ein Projekt, was wir, was jetzt ein gut ein Jahr her ist, also gut ein Jahr und so lange beschäftigt und begleitet haben.

Also gut ein Jahr und so lange beschäftigt und begleitet.

Und es hat wieder mit unserer Nachhaltigkeitsreise zu tun. Deswegen sehen wir das Ganze hier als Fortsetzung unserer Serie zu dem Thema Nachhaltigkeit, die wir als Softwareagentur anstreben. Wir hatten ja schon die Folgen Shift Happens und die letzte Folge zum Thema war Nachhaltige Software.

Genau, mit Christoph und Sebastian. Meinst du, wir kriegen noch zusammen, was alles seit Shift Happens so passiert ist? Meinst du? Fällt dir was? Also, ich habe ein bisschen gespickt, ich muss sagen, die Zeit ist so gerannt. Das ist es.

Also wenn ich jetzt grübe…


Ich würde solche Sachen wie Girl’s Day Academy wahrscheinlich dazu zählen.

Genau, die Girl’s Take Akademie haben wir gemacht.

Mal wieder mal einen Klimastreik.

Limitations – Where things get tricky

If multiple speakers talk at the same time, the algorithm is, of course, in trouble. Some audio segments might be assigned to speaker 1 as well as to speaker 2. This might result in repetitions within the transcript. No big deal usually, but something to keep in mind.
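To make this concrete, here is a minimal sketch (not part of the original scripts) that finds spans where turns of two different speakers intersect in time. The (start, end, speaker) tuple format is my own assumption, loosely mirroring the diarization output shown further below.

```python
def overlapping_turns(turns):
    """turns: list of (start, end, speaker) in seconds.
    Returns (overlap_start, overlap_end, speaker_a, speaker_b) for every
    pair of turns by different speakers that intersect in time."""
    overlaps = []
    for i in range(len(turns)):
        for j in range(i + 1, len(turns)):
            (s1, e1, sp1), (s2, e2, sp2) = turns[i], turns[j]
            if sp1 != sp2 and s2 < e1 and s1 < e2:
                overlaps.append((max(s1, s2), min(e1, e2), sp1, sp2))
    return overlaps

# SPEAKER_01 starts talking before SPEAKER_00 has finished:
turns = [
    (0.0, 5.0, "SPEAKER_00"),
    (4.2, 7.0, "SPEAKER_01"),
]
print(overlapping_turns(turns))  # → [(4.2, 5.0, 'SPEAKER_00', 'SPEAKER_01')]
```

The span from 4.2s to 5.0s would be exported once per speaker, which is exactly how the duplicated passages in the transcript come about.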

Currently the script assumes that there are exactly two speakers. I still need to add a parameter for that. The same goes for the length of the intro.

One last thing: avoid whitespace in the path to the input file. Just rename it.

Code – The implementation details

As said, I got the overall idea from the GitHub project mentioned above and implemented my own version of it.

System setup

You need at least Python 3.9 on your system. Also ensure you have the following packages installed: zlib1g-dev, libffi-dev, ffmpeg, libbz2-dev, liblzma-dev and ack. Those are Debian package names; they might differ for other distributions.

I usually work in project-scoped virtual environments. You can create one and install the needed Python libraries as follows.

python3 -m venv ./venv
source venv/bin/activate
pip install git+
pip install
pip install pydub

The speaker diarization

In my case the quick start example from pyannote/speaker-diarization is already sufficient.

# see
# 1. visit and accept user conditions
# 2. visit and accept user conditions
# 3. visit to create an access token
# 4. instantiate pretrained speaker diarization pipeline
import os
import sys

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization@2.1",
    use_auth_token=os.getenv('API_ACCESS_TOKEN'))

# apply the pipeline to an audio file
diarization = pipeline(sys.argv[1], num_speakers=2)
print(str(diarization))

The result looks like

[ 00:00:00.497 --> 00:00:00.987] Z SPEAKER_01
[ 00:00:01.425 --> 00:00:01.611] A SPEAKER_00
[ 00:00:12.985 --> 00:00:13.829] AA SPEAKER_01
[ 00:00:14.605 --> 00:00:31.598] B SPEAKER_00
[ 00:00:32.475 --> 00:00:46.144] C SPEAKER_00
…

The audio segmentation

This is the script segmenting the original audio file into individual chunks, 311 of them in the case of episode 45. If you play all chunks in order, you hear the full podcast except for the intro and some moments of silence.

import re
import sys
import os
from typing import Optional, Sequence

from pydub import AudioSegment

# CLI arguments
diarization_file: str = sys.argv[1]
audio_file: str = sys.argv[2]


class SpeakerChunk:
    startMillis: int
    endMillis: int
    speaker: str

    def __init__(self, startMillis, endMillis, speaker) -> None:
        self.startMillis = startMillis
        self.endMillis = endMillis
        self.speaker = speaker

    def __str__(self):
        return str(self.startMillis) + " --> " + str(self.endMillis) + ": " + self.speaker


def readLines(path: str) -> Sequence[str]:
    """
    Returns the content of the file at the given path as an array of lines.
    """
    handle = open(path, 'r')
    lines = handle.readlines()
    handle.close()
    return lines


def parseToMillis(timeString: str) -> int:
    """
    Parses a given string in the form 00:04:34.885 to the according milliseconds.
    """
    parts = timeString.split(":")
    return (int)((int(parts[0]) * 60 * 60
                  + int(parts[1]) * 60
                  + float(parts[2])) * 1000)


def readDiarizationFile(path: str) -> Sequence[SpeakerChunk]:
    """
    Returns the content of the diarization file at the given path
    as an array of SpeakerChunks.
    """
    # example line: [ 00:00:00.497 --> 00:00:00.987] AT SPEAKER_01
    timeRegex = r'(\d{2}:\d{2}:\d{2}\.\d{3})'
    lineRegex = r'^\s*' + \
        r'\[\s*' + \
        timeRegex + \
        r'\s*-->\s*' + \
        timeRegex + \
        r'\s*\]' + \
        r'\s*\w+\s*' + \
        r'(\w+)'
    pattern = re.compile(lineRegex)
    lines = readLines(path)
    chunks = []
    for line in lines:
        match = pattern.match(line)
        if match:
            chunks.append(SpeakerChunk(
                parseToMillis(match.group(1)),
                parseToMillis(match.group(2)),
                match.group(3)))
        else:
            print("failed to parse line: " + line, file=sys.stderr)
    return chunks


def compactChunks(chunks: Sequence[SpeakerChunk]) -> Sequence[SpeakerChunk]:
    """
    Merges neighbouring chunks of the same speaker into one.
    """
    result = []
    nextResultItem = None
    for chunk in chunks:
        if nextResultItem is None:
            nextResultItem = chunk
        else:
            if nextResultItem.speaker == chunk.speaker:
                nextResultItem.endMillis = chunk.endMillis
            else:
                result.append(nextResultItem)
                nextResultItem = chunk
    result.append(nextResultItem)
    return result


chunks = compactChunks(readDiarizationFile(diarization_file))
index = 0
audio_name, audio_extension = os.path.splitext(audio_file)
audio = AudioSegment.from_wav(audio_file)
for chunk in chunks:
    index += 1
    audio[chunk.startMillis:chunk.endMillis].export(
        audio_name + "_chunk-" + str(index) + "_" + chunk.speaker + audio_extension)

The result looks like

45-sabena_chunk-1_SPEAKER_00.wav
45-sabena_chunk-2_SPEAKER_01.wav
45-sabena_chunk-3_SPEAKER_00.wav
45-sabena_chunk-4_SPEAKER_01.wav
45-sabena_chunk-5_SPEAKER_00.wav
…

Putting it all together

One audio file in, one transcript out… well, almost. The result in transcript.txt does not contain the speaker names yet, only anonymous placeholders like 1_SPEAKER_00 or 12_SPEAKER_00. Note that the leading number is the number of the audio chunk and the trailing number is the number of the speaker.
You have to replace those tokens with the speakers' names by hand with a clever search-and-replace.
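If you prefer scripting over a manual search-and-replace, a tiny helper like the following could do it. This is my own sketch, not part of the original scripts; the token format matches the placeholders described above, while the speaker names are of course yours to fill in.

```python
import re

def name_speakers(transcript: str, names: dict) -> str:
    """Replace markers like '12_SPEAKER_00' with the speaker's real name.
    Unknown speaker tokens are left untouched."""
    return re.sub(
        r'\d+_(SPEAKER_\d+)',
        lambda m: names.get(m.group(1), m.group(0)),
        transcript)

text = "1_SPEAKER_00\nWelcome!\n\n2_SPEAKER_01\nHello!"
print(name_speakers(text, {"SPEAKER_00": "Tobias", "SPEAKER_01": "Karo"}))
# → Tobias / Karo instead of the anonymous placeholders
```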

Usage: ./ …/45-sabena.mp4

After execution I find all results in ./45-sabena.

#!/usr/bin/env bash

echo "Validating CLI arguments"
AUDIO_SOURCE=$1
if [ -z "$AUDIO_SOURCE" ]
then
    echo "usage: $0 path/to/audio.mp3"
    exit 1
fi
if [ -z "$API_ACCESS_TOKEN" ]
then
    echo "Please set API_ACCESS_TOKEN to your user token"
    echo "export API_ACCESS_TOKEN=…"
    exit 2
fi

set -e

AUDIO_FILE=$(basename $AUDIO_SOURCE)
AUDIO_NAME=${AUDIO_FILE%.*}
mkdir $AUDIO_NAME || echo "folder $AUDIO_NAME already exists"
AUDIO_WAV=${AUDIO_NAME}.wav

echo "Converting $AUDIO_SOURCE to $AUDIO_WAV and removing intro"
# first 14s are the intro
ffmpeg -ss 14 -i $AUDIO_SOURCE $AUDIO_NAME/$AUDIO_WAV

pushd $AUDIO_NAME
SCRIPT_DIR=../$(dirname $0)

echo "Running on $AUDIO_WAV"
AUDIO_DIARIZATION=$AUDIO_NAME-diarization.txt
python $SCRIPT_DIR/ $AUDIO_WAV > $AUDIO_DIARIZATION

echo "Running on $AUDIO_WAV"
python $SCRIPT_DIR/ $AUDIO_DIARIZATION $AUDIO_WAV

echo "Transcribing chunks"
for chunk_file in *chunk-*.wav; do
    echo "Transcribing $chunk_file"
    whisper --language German --model medium $chunk_file > $chunk_file.std
done

echo "Merging transcript files"
ls *.wav.txt \
    | ack -o '\d+_SPEAKER_\d+' \
    | sort --human-numeric-sort \
    | xargs -Imarker bash -c 'echo "" && echo marker && cat *_chunk-marker.wav.txt' \
    | sed -E ':a;N;$!ba;s/([^\n0-9])[\n](\w)/\1 \2/g' \
    > transcript.txt

echo "Done"

Failures – What did not work

In my first two approaches I tried to use Whisper first and then merge the transcript with the speaker diarization afterwards. I tried both the public recording and the single-speaker tracks. We record each speaker individually for our podcasts.

In both cases the time granularity Whisper provides with the default settings does not allow distinguishing between speakers. If you have speaker changes at seconds 8 and 10, but the transcript looks like:

[00:06.240 --> 00:14.360] Welcome to today’s Podcast. I am your host Tobias. And I am Karo. Welcome Karo.

then it just does not work.
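To illustrate why the merge fails: the obvious strategy is to assign each Whisper segment to the speaker with the largest time overlap. The following sketch (my own illustration, with assumed speaker turn times matching the example above) shows how a segment spanning a speaker change silently swallows the other speaker's words.

```python
def dominant_speaker(segment, turns):
    """segment: (start, end) of one Whisper segment in seconds;
    turns: list of (start, end, speaker) diarization turns.
    Returns the single speaker with the largest overlap."""
    best, best_overlap = None, 0.0
    for start, end, speaker in turns:
        overlap = min(segment[1], end) - max(segment[0], start)
        if overlap > best_overlap:
            best, best_overlap = speaker, overlap
    return best

# Whisper emits one segment from 6.24s to 14.36s, but Karo only speaks
# from second 8 to second 10 (assumed turn times for illustration):
turns = [
    (6.24, 8.0, "SPEAKER_00"),
    (8.0, 10.0, "SPEAKER_01"),
    (10.0, 14.36, "SPEAKER_00"),
]
print(dominant_speaker((6.24, 14.36), turns))
# → SPEAKER_00; "And I am Karo." is misattributed to the host
```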


The current setup is able to transcribe our podcasts. The quality of the result is sufficient, and the transcripts add value to the podcast blog articles. The performance leaves something to be desired, but I am sure we can improve it.

Thanks for reading. As usual feedback is always welcome.


All code examples in this article are licensed under the MIT license. So enjoy.