Whisper Llamafiles
A set of llamafiles generated for the Whisper automatic speech recognition model, offering easy deployment and usage.
✨ Features
- A collection of llamafiles generated for Whisper.
- Built with the whisperfile repo, a fork of the main llamafile repo that adds support for whisper.cpp.
- Quantized llamafiles available for the multilingual Whisper models in q8 and q5k formats, alongside the original unquantized model.
📦 Installation
Prerequisites
To run the Whisper large-v3 model, first install the necessary libraries:
pip install --upgrade pip
pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
💻 Usage Examples
Running the Llamafile
chmod +x <model>.llamafile
./<model>.llamafile
Using the Model with Pipeline
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
Transcribing a Local Audio File
- result = pipe(sample)
+ result = pipe("audio.mp3")
Specifying Language and Task
To force the source language, or to translate the speech to English, pass the corresponding generate_kwargs:
result = pipe(sample, generate_kwargs={"language": "english"})
result = pipe(sample, generate_kwargs={"task": "translate"})
Getting Timestamps
result = pipe(sample, return_timestamps="word")
print(result["chunks"])
📚 Documentation
Model Details
Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2.
The models were trained on either English-only data or multilingual data. The English-only models were trained on the task of speech recognition. The multilingual models were trained on both speech recognition and speech translation.
Additional Speed & Memory Improvements
Flash Attention
If your GPU supports it, use Flash Attention 2. First, install Flash Attention:
pip install flash-attn --no-build-isolation
Then, pass use_flash_attention_2=True to from_pretrained:
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)
Torch Scaled Dot-Product Attention (SDPA)
If your GPU doesn't support Flash Attention, use BetterTransformer. First, install optimum:
pip install --upgrade optimum
Then, convert your model to a "BetterTransformer" model:
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = model.to_bettertransformer()
Fine-Tuning
The pre-trained Whisper model can be fine-tuned for better performance on certain languages and tasks. Refer to the blog post Fine-Tune Whisper with 🤗 Transformers for a step-by-step guide.
Evaluated Use
The primary users are AI researchers. However, Whisper can also be useful for developers, especially for English speech recognition. Users should perform robust evaluations before deployment.
Training Data
The models were trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2.
Performance and Limitations
The models show improved robustness and near state-of-the-art accuracy. However, they may hallucinate text and perform unevenly across languages and accents.
Broader Implications
Whisper models can improve accessibility tools but also raise dual-use concerns related to surveillance.
🔧 Technical Details
Llamafile Parameters
Each llamafile is configured with the following default parameters:
whisperfile -m $filename.bin --host 0.0.0.0 --port 51524 --convert -pc -pr
This starts a server on port 51524, converts audio files to the proper .wav format via ffmpeg, and prints/colorizes the decoded text in the terminal output.
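Once the server is running, it can also be queried programmatically. The Python sketch below is a minimal example, assuming the whisperfile server follows whisper.cpp's server API (a multipart "file" upload to the /inference endpoint) on the port configured above; the file name audio.wav and field names are illustrative, so adjust them to match your setup.
import requests  # third-party: pip install requests

# Assumption: the whisperfile server exposes whisper.cpp's /inference
# endpoint and accepts a multipart "file" upload plus form options.
with open("audio.wav", "rb") as audio:
    response = requests.post(
        "http://localhost:51524/inference",
        files={"file": audio},
        data={"response_format": "json"},
    )
response.raise_for_status()
print(response.json()["text"])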
📄 License
This project is licensed under the Apache-2.0 license.
BibTeX entry and citation info
@misc{radford2022whisper,
  doi       = {10.48550/ARXIV.2212.04356},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}