# 🎤 Whisper Large-v3: Advanced Automatic Speech Recognition
This project offers an advanced automatic speech recognition solution based on the Whisper large-v3 model. It supports a wide range of languages and provides efficient and accurate speech transcription and translation capabilities.
## 🚀 Quick Start
## ✨ Features

- Multilingual Support: Handles a wide range of languages, including English (en), Chinese (zh), German (de), Spanish (es), and Russian (ru).
- High Performance: Shows improved performance across a wide variety of languages, with a 10% to 20% reduction in errors compared to Whisper large-v2.
- Efficient Training: Trained on 5 million hours of audio (1 million hours weakly labeled plus 4 million hours pseudo-labeled), enabling strong generalization in zero-shot settings.
- Flexible Usage: Supports both speech transcription and translation, along with various decoding strategies.
## 📦 Installation
To run the Whisper large-v3 model, first install the necessary libraries:

```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```
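As a quick sanity check that the installation worked, something along these lines can help (a minimal sketch; the versions you see will differ):

```python
import torch
import transformers
import datasets

# Confirm the installed versions and whether a GPU is visible.
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```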
## 💻 Usage Examples
### Basic Usage

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# Use GPU with half precision when available, otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Wrap the model, tokenizer, and feature extractor in an ASR pipeline.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a sample from a long-form LibriSpeech validation set.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
### Advanced Usage

#### Transcribe a Local Audio File

```python
result = pipe("audio.mp3")
```

#### Transcribe Multiple Audio Files in Parallel

```python
result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
```
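When a list of files is passed, the pipeline returns a list of result dictionaries, one per file, in the same order as the inputs.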
#### Enable Decoding Strategies

```python
generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

result = pipe(sample, generate_kwargs=generate_kwargs)
```
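These values enable Whisper's temperature fallback: decoding starts greedily at temperature 0.0, and a segment is retried at the next higher temperature whenever its output fails the `compression_ratio_threshold` or `logprob_threshold` checks, while `no_speech_threshold` is used to flag segments that are likely silence.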
#### Specify Source Audio Language

```python
result = pipe(sample, generate_kwargs={"language": "english"})
```
#### Perform Speech Translation

```python
result = pipe(sample, generate_kwargs={"task": "translate"})
```
#### Predict Timestamps

Sentence-level timestamps:

```python
result = pipe(sample, return_timestamps=True)
print(result["chunks"])
```

Word-level timestamps:

```python
result = pipe(sample, return_timestamps="word")
print(result["chunks"])
```
### Using the Model + Processor API Directly

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Resample the audio column to the 16 kHz rate the feature extractor expects.
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)
```
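Here `batch_decode` returns a list with one transcription string per input sequence; with `skip_special_tokens=True` and `decode_with_timestamps=False`, both the special tokens and the timestamp markers are stripped from the decoded text.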
## 🔧 Technical Details
### Model Architecture

Whisper large-v3 has the same architecture as the previous large and large-v2 models, with two minor differences:

- The spectrogram input uses 128 Mel frequency bins instead of 80.
- A new language token was added for Cantonese (`<|yue|>`).
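Both differences can be inspected from the processor itself; the sketch below (assuming the processor from the earlier examples) is one way to check:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")

# feature_size is the number of Mel bins: 128 for large-v3 (vs. 80 for large-v2).
print(processor.feature_extractor.feature_size)

# The Cantonese language token is part of the large-v3 vocabulary.
print(processor.tokenizer.convert_tokens_to_ids("<|yue|>"))
```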
### Training Data

The model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2. It was trained for 2.0 epochs over this mixture dataset.
### Performance Improvement

The large-v3 model shows improved performance across a wide variety of languages, with a 10% to 20% reduction in errors compared to Whisper large-v2.
## 📄 License

This project is licensed under the Apache-2.0 license.
## Additional Information
### Additional Speed & Memory Improvements
#### Chunked Long-Form

You can enable the chunked long-form algorithm to transcribe long audio files more efficiently: pass the `chunk_length_s` parameter to the pipeline and set the `batch_size` for batching.
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,  # split the audio into 30-second chunks
    batch_size=16,      # transcribe 16 chunks in parallel
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
#### Torch compile

The Whisper forward pass is compatible with `torch.compile` for 4.5x speed-ups.

⚠️ Important Note: `torch.compile` is currently not fully stable and may have some compatibility issues.
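As a rough sketch of how this can be wired up, following the common static-cache pattern for compiled generation (treat the exact flags as assumptions to verify against the current transformers docs):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

torch.set_float32_matmul_precision("high")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# Use a static cache so the decoder's forward pass has fixed shapes,
# then compile it; the first few calls are slow while kernels compile.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("audio.mp3")  # run a few warm-up passes before timing
print(result["text"])
```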
### See Our Collection

- All TTS Models: Check out our collection for all our TTS model uploads.
### Unsloth Dynamic 2.0

- Superior Performance: Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quants.