In my last blog post I described an algorithm that uses Pyannote and Whisper to transcribe our podcast. Today I want to share my experience applying it to our German podcasts. All episodes are now transcribed; each required some manual work, but I'm still happy with the result.
Unsurprisingly, the most challenging parts of the podcast are the welcome, the intro section, and the goodbye, the outro. There the speakers tend to interrupt each other, speak at the same time, and switch very frequently. Sometimes a speaker talks for only a few hundred milliseconds at a time, e.g. when saying: "Hi!".
But still, the results are quite pleasing. To be perfect, a human would have to read them and correct typos, misspelled names, and the occasional wrong word, but my speaker recognition approach hardly reduces the quality of the transcript.
Due to a lack of time, I only corrected the intro and the outro and left the middle parts untouched. As I said, in the middle part speakers tend to interrupt each other less often and tend to talk for longer stretches of time.
I tweaked the prior algorithm to ignore any comments or interruption attempts where a speaker interrupts another speaker for less than one second. This is for two reasons:
First, Whisper has trouble with speech recognition for very short samples, and second, most of the time those interruptions carry little meaning. They are either a reaction to what is being said, usually serving as a cliffhanger for the following answer, or failed attempts to interrupt the current speaker. In both cases those fragments only have value while listening and hardly any while reading. When listening to the podcast, you can hear the tone and the intent of the speakers. In the transcription the actual words become more important, and interruptions become rather annoying to the reader.
You can check the results for our German podcasts yourself. Overall, I'm satisfied with them. I achieved my goal of making the podcast more accessible and hopefully easier to find via full-text search engines such as Google. Whether it actually makes a difference, I cannot say today. Nonetheless, it was a nice task to learn to work with the AI models and to tweak the small details, like the speaker recognition and the detection of interruptions described above.
As always, I want to share the code. It is licensed under MIT, so feel free to use it however you like. Note that you still need the other files from this blog post.
import re
import sys
import os
from typing import Sequence
from pydub import AudioSegment
# CLI arguments: the Pyannote diarization file and the audio file
# (or '--debug' instead of the audio file to only print the parsed chunks)
diarization_file: str = sys.argv[1]
audio_file: str = sys.argv[2]
class SpeakerChunk:
    startMillis: int
    endMillis: int
    speaker: str

    def __init__(self, startMillis, endMillis, speaker) -> None:
        self.startMillis = startMillis
        self.endMillis = endMillis
        self.speaker = speaker

    def __str__(self):
        return str(self.startMillis) + " --> " + str(self.endMillis) + ": " + self.speaker
def readLines(path: str) -> Sequence[str]:
    """
    Returns the content of the file at the given path as an array of lines.
    """
    with open(path, 'r') as handle:
        return handle.readlines()
def parseToMillis(timeString: str) -> int:
    """
    Parses a given string in the form 00:04:34.885 to the corresponding milliseconds.
    """
    parts = timeString.split(":")
    return int((int(parts[0]) * 60 * 60 + int(parts[1]) * 60 + float(parts[2])) * 1000)
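# Illustrative sanity check: 4 minutes and 34.885 seconds
# parseToMillis("00:04:34.885") == 274_885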
def readDiarizationFile(path: str) -> Sequence[SpeakerChunk]:
    """
    Returns the content of the diarization file at the given path
    as an array of SpeakerChunks.
    """
    # example line: [ 00:00:00.497 --> 00:00:00.987] AT SPEAKER_01
    timeRegex = r'(\d{2}:\d{2}:\d{2}\.\d{3})'
    lineRegex = r'^\s*' + \
        r'\[\s*' + \
        timeRegex + \
        r'\s*-->\s*' + \
        timeRegex + \
        r'\s*\]' + \
        r'\s*\w+\s*' + \
        r'(\w+)'
    pattern = re.compile(lineRegex)
    lines = readLines(path)
    chunks = []
    for line in lines:
        match = pattern.match(line)
        if match:
            chunks.append(SpeakerChunk(
                parseToMillis(match.group(1)),
                parseToMillis(match.group(2)),
                match.group(3)))
        else:
            print("failed to parse line: " + line, file=sys.stderr)
    return chunks
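# For the example line in the docstring above this yields, illustratively:
# SpeakerChunk(497, 987, "SPEAKER_01")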
def compactChunks(chunks: Sequence[SpeakerChunk]) -> Sequence[SpeakerChunk]:
    """
    Merges neighbouring chunks of the same speaker into one.
    """
    result = []
    nextResultItem = None
    for chunk in chunks:
        if nextResultItem is None:
            nextResultItem = chunk
        else:
            if nextResultItem.speaker == chunk.speaker:
                nextResultItem.endMillis = chunk.endMillis
            else:
                result.append(nextResultItem)
                nextResultItem = chunk
    # don't forget the last chunk (if there was any input at all)
    if nextResultItem is not None:
        result.append(nextResultItem)
    return result
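# Illustrative example: neighbouring chunks of SPEAKER_00 covering 0-900 ms and
# 900-4200 ms come out as a single SPEAKER_00 chunk covering 0-4200 ms.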
def filterInterrupts(chunks: Sequence[SpeakerChunk]) -> Sequence[SpeakerChunk]:
    """
    Sometimes speakers try to interrupt the current speaker. This results in meaningless
    half-spoken words. We want to filter those out if 1) they are very short and 2) another
    speaker is speaking at the same time. Whisper has trouble with those short artefacts
    and detects words randomly.
    """
    result = []
    for chunk in chunks:
        if bool(result):
            lastChunk = result[-1]
            isSameSpeaker = lastChunk.speaker == chunk.speaker
            isAfterLastChunk = lastChunk.endMillis <= chunk.startMillis
            length = chunk.endMillis - chunk.startMillis
            if isSameSpeaker or isAfterLastChunk or length >= 1_000:
                result.append(chunk)
        else:
            result.append(chunk)
    return result
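# Illustrative example: while SPEAKER_00 is still talking (say until 6000 ms),
# a SPEAKER_01 chunk from 5100 ms to 5400 ms overlaps and is shorter than one
# second, so it is dropped; the same chunk lasting 1500 ms would be kept.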
chunks = compactChunks(
    filterInterrupts(
        readDiarizationFile(diarization_file)))

index = 0
if audio_file == '--debug':
    for chunk in chunks:
        print(chunk)
else:
    audio_name, audio_extension = os.path.splitext(audio_file)
    audio = AudioSegment.from_wav(audio_file)
    for chunk in chunks:
        index += 1
        audio[chunk.startMillis:chunk.endMillis].export(audio_name + "_chunk-" + str(index) + "_" + chunk.speaker + audio_extension)
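For reference, this is roughly how I call the script (split_audio.py is just a placeholder name, use whatever you saved the file as). The first argument is the diarization output from Pyannote, the second is the episode as a WAV file, or --debug to only print the merged and filtered chunks:

python split_audio.py diarization.txt --debug
python split_audio.py diarization.txt episode.wav

The second call cuts the episode into one WAV file per speaker chunk, for example episode_chunk-1_SPEAKER_00.wav, which can then be transcribed with Whisper.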
Feel free to read the prior posts about audio transcription using Whisper:
On a side note: I have started to record and transcribe my blog posts – this one included. I had a nice walk on a snowy evening, used Whisper to transcribe the recording, and reworked the text afterwards.
As always, feedback is highly welcome. If you have any comments or suggestions, feel free to contact us. Thanks for reading.