While transcribing podcasts, I ran the transcription on a server; otherwise my laptop would have been tied up for hours at a time. A colleague of mine pointed me to whisper.cpp, a nice GitHub project. It is a C++ implementation of Whisper and is much more performant.
It can also be compiled to WASM and run directly in the browser, among other cool things. I recommend checking out the examples.
Back to local speech-to-text: I want to share a small shell script that I use as my local whisper command. You need to install ffmpeg, clone and compile whisper.cpp, and download the language model. The script prints a hint for each of the whisper.cpp steps that is still missing.
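If you prefer to do the setup in one go, the hints from the script can be collected into a small helper. This is only a sketch: the function name `setup_whisper_cpp` is made up for illustration, the default path is the one the script expects, and the `make` step is taken from the script's own hint (your whisper.cpp checkout may build differently).

```shell
#!/usr/bin/env bash
# Sketch only: paths and commands mirror the hints printed by the whisper script.
# The function is defined but not called, so sourcing this file changes nothing.
setup_whisper_cpp() {
  # Assumption: same checkout location as INSTALL_PATH in the script.
  local install_path="${1:-$HOME/src/github/ggerganov/whisper.cpp}"

  # ffmpeg is needed for the WAV conversion; only check for it, do not install it.
  if ! command -v ffmpeg >/dev/null 2>&1; then
    echo "ffmpeg not found, please install it with your package manager" >&2
    return 1
  fi

  # Clone the repository unless it is already there.
  if [ ! -d "$install_path" ]; then
    mkdir -p "$(dirname "$install_path")" || return 1
    git clone https://github.com/ggerganov/whisper.cpp.git "$install_path" || return 1
  fi

  # Build the binary and fetch the medium model, as the script's hints suggest.
  (cd "$install_path" && make && bash ./models/download-ggml-model.sh medium)
}
```

Call `setup_whisper_cpp` without arguments to use the default location, or pass another directory as the first argument (in that case, adjust `INSTALL_PATH` in the script as well).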
#!/usr/bin/env bash
set -e
_log_success() {
  printf "\033[0;32m%s\033[0m\n" "${1}"
}
_log_error() {
  printf "\033[0;31m%s\033[0m\n" "${1}"
}
INSTALL_PATH=~/src/github/ggerganov/whisper.cpp
if [ ! -d "$INSTALL_PATH" ]; then
  _log_error "Failed to locate whisper.cpp"
  echo "Please execute"
  echo "mkdir -p ~/src/github/ggerganov"
  echo "cd ~/src/github/ggerganov"
  echo "git clone https://github.com/ggerganov/whisper.cpp.git"
  exit 1
fi
BIN_PATH=$INSTALL_PATH/main
if [ ! -f "$BIN_PATH" ]; then
  _log_error "Failed to locate binary"
  echo "Please execute"
  echo "cd $INSTALL_PATH"
  echo "make"
  exit 1
fi
MODEL_PATH=$INSTALL_PATH/models/ggml-medium.bin
if [ ! -f "$MODEL_PATH" ]; then
  _log_error "Failed to locate language model"
  echo "Please execute"
  echo "cd $INSTALL_PATH"
  echo "bash ./models/download-ggml-model.sh medium"
  exit 1
fi
# the last parameter is the source file, all others are ffmpeg flags
INPUT_PATH=""
FFMPEG_FLAGS=()
if [ "$#" -gt 0 ]; then
  # guard: without arguments, "${@: -1}" would expand to the script name ($0)
  INPUT_PATH="${@: -1}"
  # keep the flags in an array so each one stays a separate, intact word
  FFMPEG_FLAGS=("${@:1:$(($#-1))}")
fi
if [ ! -f "$INPUT_PATH" ]; then
  _log_error "Input file not found at '$INPUT_PATH'"
  echo "Usage: whisper [ffmpeg flag 1] [ffmpeg flag 2] […] path/to/audio.file"
  echo "Example: whisper -ss 12 path/to/audio.file # start at second 12"
  echo "Example: whisper -ss 12 -t 30 path/to/audio.file # start at second 12 and transcribe 30 seconds"
  exit 1
fi
INPUT_FILE=$(basename "$INPUT_PATH")
INPUT_NAME=${INPUT_FILE%.*}
_log_success "Converting $INPUT_PATH …"
ffmpeg "${FFMPEG_FLAGS[@]}" -i "$INPUT_PATH" -acodec pcm_s16le -ac 1 -ar 16000 "$INPUT_NAME.wav"
_log_success "Transcribing $INPUT_PATH …"
"$BIN_PATH" -m "$MODEL_PATH" -l auto -otxt -of "$INPUT_NAME.raw" "$INPUT_NAME.wav"
_log_success
# strip leading spaces, remove bracketed markers, drop empty lines
sed -e 's/^ *//' -e 's/\[.*\]//' -e '/^$/d' "$INPUT_NAME.raw.txt" > "$INPUT_NAME.txt"
_log_success "Cleaning up …"
rm "$INPUT_NAME.wav"
rm "$INPUT_NAME.raw.txt"
_log_success "Done"
echo "See $INPUT_NAME.txt"
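To make the clean-up step a bit more tangible, here is a small sketch that runs the same three sed expressions the script uses on a made-up transcript. The sample lines are hypothetical and only stand in for what a raw output file might contain.

```shell
#!/usr/bin/env bash
# Hypothetical raw transcript: leading spaces, a bracketed marker, an empty line.
raw=$(printf '%s\n' \
  ' Hello and welcome to the show.' \
  ' [MUSIC]' \
  '' \
  '   Today we talk about whisper.cpp.')

# 1) strip leading spaces, 2) remove bracketed markers, 3) drop empty lines
cleaned=$(printf '%s\n' "$raw" | sed -e 's/^ *//' -e 's/\[.*\]//' -e '/^$/d')

printf '%s\n' "$cleaned"
# Hello and welcome to the show.
# Today we talk about whisper.cpp.
```

One caveat: the bracket pattern is greedy, so on a line with two bracketed markers everything between the first `[` and the last `]` is removed, including any text in between. For my transcripts that has been good enough, but keep it in mind if your audio produces many such markers.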
As always, the license is MIT, so feel free to use and share the script, and to send me your feedback.