Christoph Dähne | 29.11.2022
AI podcast transcripts with speaker detection
- Tech
In my last blog post I described an algorithm that combines Pyannote and Whisper to transcribe our podcast. Today I want to share my experience applying it to our German podcasts. All episodes are transcribed; each required some manual work, but I'm still happy with the result.
Unsurprisingly, the most challenging parts of a podcast are the welcome (the intro section) and the goodbye (the outro). There the speakers tend to interrupt each other, speak at the same time, and switch very frequently. Sometimes a speaker talks for only a few hundred milliseconds at a time, e.g. when saying: "Hi!"
Still, the results are quite pleasing. To be perfect, a human would have to proofread them and correct typos, misspelled names, and the occasional wrong word, but my speaker-recognition approach hardly reduces the quality of the transcript.
Due to a lack of time, I only corrected the intro and the outro and left the middle parts untouched. As I said, in the middle part speakers tend to interrupt each other less often and tend to talk for longer periods of time.
I tweaked the prior algorithm to ignore any comment or interruption attempt where a speaker interrupts another speaker for less than one second. This is for two reasons:
First, Whisper has trouble recognizing speech in very short samples. Second, most of the time those interruptions carry little meaning: they are either a reaction to what is being said, usually serving as a cliffhanger for the following answer, or failed attempts to interrupt the current speaker. In both cases those fragments only have value while listening and hardly any while reading. In the podcast you can listen to the tone and the intent of the speakers; in the transcription the actual words become more important, and interruptions rather annoy the reader.
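To illustrate the idea, here is a minimal sketch of that tweak. It assumes diarization output has already been reduced to a list of `(start, end, speaker)` tuples (the helper names and the exact heuristic are my own for illustration, not the code from the prior post): a turn shorter than one second that is sandwiched between turns of the same other speaker is treated as an interruption and dropped, then consecutive turns by the same speaker are merged before handing the segments to Whisper.

```python
# Sketch only: assumed input format is a list of (start, end, speaker)
# tuples derived from Pyannote's diarization output.

MIN_INTERRUPTION = 1.0  # seconds; interruptions shorter than this are dropped

def filter_short_interruptions(turns, min_duration=MIN_INTERRUPTION):
    """Drop turns shorter than min_duration that interrupt another speaker,
    i.e. that sit between two turns of the same other speaker."""
    kept = []
    for i, (start, end, speaker) in enumerate(turns):
        prev_speaker = turns[i - 1][2] if i > 0 else None
        next_speaker = turns[i + 1][2] if i + 1 < len(turns) else None
        is_interruption = (
            prev_speaker is not None
            and prev_speaker == next_speaker
            and speaker != prev_speaker
        )
        if (end - start) < min_duration and is_interruption:
            continue  # ignore the failed interruption attempt
        kept.append((start, end, speaker))
    return kept

def merge_consecutive(turns):
    """Merge adjacent turns by the same speaker into one segment."""
    merged = []
    for start, end, speaker in turns:
        if merged and merged[-1][2] == speaker:
            merged[-1] = (merged[-1][0], end, speaker)
        else:
            merged.append((start, end, speaker))
    return merged

turns = [(0.0, 5.0, "A"), (5.0, 5.4, "B"), (5.4, 12.0, "A")]
segments = merge_consecutive(filter_short_interruptions(turns))
print(segments)  # the 0.4 s interruption by B disappears: [(0.0, 12.0, 'A')]
```

Merging after filtering matters: once the short interruption is gone, the two surrounding turns of the same speaker become one longer sample, which Whisper handles much better than two fragments.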
You can check the results for our German podcasts yourself. I'm overall satisfied: I achieved my goal of making the podcast more accessible and hopefully easier to find via full-text search engines such as Google. Today I cannot say whether it makes any difference, though. Nonetheless, it was a nice task to learn to work with the AI models and to tweak the small details, such as the speaker recognition and the detection of interruptions.
As always, I want to share the code. It is licensed under MIT, so feel free to use it however you like. Note that you still need the other files from this blog post.
Feel free to read the prior posts about audio transcription using Whisper:
On a side note: I have started to record and transcribe my blog posts – this one included. I had a nice walk on a snowy evening, used Whisper to transcribe the recording, and worked it over afterwards.
As always, feedback is highly welcome. If you have any comments or suggestions, feel free to contact us. Thanks for reading.