All That Whispers

I wanted to summarize the various projects and links related to OpenAI's Whisper transcription software and model. Here is what I have discovered, collected in one place.

Other Implementations

Whisper.cpp

For Mac users, or anyone who doesn't have access to a CUDA GPU for PyTorch, whisper.cpp almost certainly offers better performance than the Python/PyTorch implementation. In my testing it's about 50% faster than PyTorch on the CPU.
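If you want to try it, a minimal run looks something like this (the model choice and file names are my own picks, and the commands follow the project's README at the time of writing, which may have changed since):

# build whisper.cpp from source
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

# download a ggml-format model (base.en here)
bash ./models/download-ggml-model.sh base.en

# transcribe a 16 kHz wav file; -ovtt also writes a .vtt subtitle file
./main -m models/ggml-base.en.bin -f samples/jfk.wav -ovtt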

Metal Backend. You might have heard that PyTorch supports the Apple Silicon GPU via the Metal Performance Shaders (MPS) backend, and one might think that could speed up Whisper in Python on a Mac. Alas, the MPS backend is missing at least one needed operator (aten::repeat_interleave.self_int). So, in my testing it falls back to the CPU, takes 20 times longer than the CPU backend by itself, and then … doesn't actually work. (It returned gibberish.) You can track the various yet-unimplemented PyTorch operations here.
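If you want to check whether things have improved since, the attempt looks roughly like this; torch.backends.mps.is_available, the PYTORCH_ENABLE_MPS_FALLBACK variable, and whisper.load_model are all real APIs, but the audio file name is a placeholder:

import os
# let unimplemented MPS ops fall back to the CPU instead of erroring
# (must be set before torch is imported)
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
import whisper

print(torch.backends.mps.is_available())  # is the Metal backend present?

# load the model onto the Apple Silicon GPU and transcribe
model = whisper.load_model("base", device="mps")
result = model.transcribe("episode.mp3")  # placeholder file name
print(result["text"])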

MacWhisper

Based on whisper.cpp, we have MacWhisper, a GUI application that will transcribe audio files. It doesn't yet have a lot of export options (I'm hoping for .vtt export), but it is a convenient all-in-one way to try out Whisper on a Mac.

Whisper in Hugging Face Transformers, with TensorFlow support (on macOS, the GPU sort of works!)

I was excited to see this Whisper in Transformers news because, coupled with Metal support in TensorFlow, perhaps I could finally use a GPU to do Whisper transcription. I followed these instructions from Apple and then tried to run this colab notebook locally, and it crashed with:

RuntimeError: Failed to import transformers.models.whisper.modeling_tf_whisper because of the following error (look up to see its traceback):
No module named 'keras.saving.hdf5_format'

UPDATE: The GPU! It does something! So I installed transformers from their GitHub (pip install git+https://github.com/huggingface/transformers.git), the error above went away, I did everything in a fresh new conda environment, and I got this implementation of Whisper to run on the M1 GPU via TensorFlow!
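For anyone trying to replicate this, the setup boiled down to something like the following; the tensorflow-macos and tensorflow-metal packages come from Apple's instructions, while the environment name and Python version are just my choices:

# fresh conda environment (name and version are arbitrary)
conda create -n whisper-tf python=3.10
conda activate whisper-tf

# TensorFlow plus Apple's Metal GPU plugin, per Apple's instructions
python -m pip install tensorflow-macos tensorflow-metal

# transformers from GitHub, which made the hdf5_format error go away
python -m pip install git+https://github.com/huggingface/transformers.git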

Speeds seem ironically comparable to whisper.cpp: 1m43s vs. 1m52s, and I had to cut out the first few seconds for the TF implementation because the music at the front of the episode confused the transcription. I have 8 GPU cores and 8 CPU cores, so perhaps that's not surprising? I'm still a little surprised. My test was an episode of Robot or Not. I had to convert it to a 32-bit float wav file and then pull that into a numpy array. Code below:

# ran this first to convert to 16 kHz mono 32-bit float wav:
#   ffmpeg -i robot250-newyear.mp3 -acodec pcm_f32le -ac 1 -ar 16000 output.wav

import numpy as np
from scipy.io import wavfile
from transformers import TFWhisperForConditionalGeneration, WhisperProcessor

# load the converted audio into a numpy array
_, audio = wavfile.read('output.wav')

model = TFWhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
processor = WhisperProcessor.from_pretrained("openai/whisper-medium")

# skip the first 4 seconds (64000 samples at 16 kHz): the opening music
# confused the transcription
audio = audio[64000:]

# the feature extractor wouldn't take one big array, so split into 8 chunks
audio = list(np.array_split(audio, 8))

inputs = processor.feature_extractor(audio,
                                     return_tensors="tf",
                                     sampling_rate=16_000,
                                     truncation=True).input_features
# absurdly large max_length; generation stops at the end-of-text token anyway
predicted_ids = model.generate(inputs, max_length=10000000)
processor.tokenizer.batch_decode(predicted_ids,
                                 skip_special_tokens=True,
                                 normalize=True)

I had to manually make a list of smaller arrays, as it wouldn't just work on one big numpy array. But look at the GPU go. A reliable and easy-to-install implementation that runs on the GPU will one day be pretty fast, especially on Apple Silicon Macs with large GPU core counts.

GPU History on a Mac showing lots of activity.

CoreML

Two things I stumbled upon while googling: this repository with a CoreML version of Whisper (not updated since the initial commit in September 2022), and this somewhat cryptic blog post about a similar, as-yet-unpublished CoreML effort/app.

This fork of the first repo seems the furthest along, but when I cloned the repo and tried to build it in Xcode, I did not have much success. (Specifically, it was looking for decoder and decoder_base.mlpackage files, and even when I provided both, the compiler had other complaints.)

WhisperX

I just heard of WhisperX, which offers better timestamp accuracy and a beta of speaker identification.

The latest on my search_transcripts module

My unofficial ATP transcript search continues to be updated with new episodes. I added the ability to sort results by date, not just by search-result score. I may add a slider for limiting the date range, not just the episode-number range.

My search_transcripts module can now actually be installed with pip after cloning, as I think I got the pyproject.toml file working. I also updated the code to support the sorting mentioned above, though if an episode key isn't castable to an integer, it treats all such string keys the same (any key that fails the int() cast sorts as 0).
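In code, the fallback behaves roughly like this (a minimal sketch; the function name is mine, not the module's actual API):

def episode_sort_key(episode_key):
    # hypothetical helper: integer-like keys sort numerically;
    # any key that fails the int() cast sorts as 0, so all such
    # string keys end up grouped together
    try:
        return int(episode_key)
    except (TypeError, ValueError):
        return 0

# e.g. sorted(results, key=lambda r: episode_sort_key(r["episode"]))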

Other transcript search projects

There is a similar effort for the ATP podcast called CatATP, which does not (yet) use Whisper. Similarly, David Smith is reviving his multi-podcast search page using Whisper transcriptions.

UPDATE 2023-Feb-7: Both CatATP and David Smith's page use Whisper now.

TechBeret wrote a nice blog post about the CatATP transition to Whisper.

WebAssembly

This is a wild demo: run whisper.cpp and GPT-2 in a browser using WebAssembly. The code all runs in the browser; you can talk into your microphone, it'll transcribe your speech, and GPT-2 will respond. Repository link here. Note: I should learn more about WebAssembly.

(I personally think the live transcription is cooler than whatever stochastic parroting GPT-2 produces, but YMMV as they say.)