Whisper
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation. It was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on over 5 million hours of labeled data, Whisper shows strong generalization ability across various datasets and domains in a zero-shot setting.
Whisper large-v3-turbo is a distilled version of Whisper large-v3. That is, it's essentially the same model, but the number of decoding layers has been reduced from 32 to 4. As a result, the model is much faster, though there's a slight degradation in quality.
Disclaimer: Part of the content for this model card was written by the 🤗 Hugging Face team, and part was copied from the original model card.
🚀 Quick Start
Prerequisites
Whisper large-v3-turbo is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time:
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
Basic Usage
The model can be used with the `pipeline` class to transcribe audio of arbitrary length:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "deepdml/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
✨ Features
- Multilingual Support: Supports multiple languages including en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, ca, nl, ar, sv, it, id, hi, fi, vi, he, uk, el, ms, cs, ro, da, hu, ta, no, th, ur, hr, bg, lt, la, mi, ml, cy, sk, te, fa, lv, bn, sr, az, sl, kn, et, mk, br, eu, is, hy, ne, mn, bs, kk, sq, sw, gl, mr, pa, si, km, sn, yo, so, af, oc, ka, be, tg, sd, gu, am, yi, lo, uz, fo, ht, ps, tk, nn, mt, sa, lb, my, bo, tl, mg, as, tt, haw, ln, ha, ba, jw, su.
- Automatic Language Prediction: Automatically predicts the language of the source audio (see the sketch after this list).
- Speech Translation: Can perform speech translation tasks, converting speech to English text.
- Timestamp Prediction: Can predict both sentence-level and word-level timestamps.
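As a quick illustration of automatic language prediction, here is a minimal sketch. It assumes the `model`, `processor`, `device`, `torch_dtype`, and `sample` objects defined in the usage examples below, and a clip shorter than 30 seconds:
# Minimal sketch: inspect the language Whisper predicts for a clip.
# Assumes model, processor, device, torch_dtype and sample from the
# usage examples below; the clip is assumed to be shorter than 30 seconds.
inputs = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).to(device, dtype=torch_dtype)
# With no language forced, generate() first predicts a language token.
pred_ids = model.generate(inputs.input_features, max_new_tokens=64)
# Decoding without skipping special tokens exposes that token, e.g. "<|en|>".
print(processor.batch_decode(pred_ids, skip_special_tokens=False)[0])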
📦 Installation
To install the necessary libraries for using Whisper large-v3-turbo, run the following commands:
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
💻 Usage Examples
Basic Usage
The basic pipeline setup is identical to the Quick Start snippet above; the examples below reuse the `pipe` and `sample` objects defined there.
Transcribing a Local Audio File
result = pipe("audio.mp3")
Transcribing Multiple Audio Files in Parallel
result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
Using Decoding Strategies
Whisper's long-form decoding heuristics (temperature fallback plus compression-ratio, log-probability, and no-speech thresholds) can be tuned through `generate_kwargs`:
generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback temperatures, tried in order
    "logprob_threshold": -1.0,  # below this average log-probability, retry with a higher temperature
    "no_speech_threshold": 0.6,  # above this no-speech probability, a segment may be treated as silence
    "return_timestamps": True,
}
result = pipe(sample, generate_kwargs=generate_kwargs)
Specifying the Source Audio Language
Whisper predicts the language of the source audio automatically. If it is known ahead of time, it can be passed as an argument:
result = pipe(sample, generate_kwargs={"language": "english"})
Performing Speech Translation
result = pipe(sample, generate_kwargs={"task": "translate"})
Predicting Sentence-Level Timestamps
result = pipe(sample, return_timestamps=True)
print(result["chunks"])
Predicting Word-Level Timestamps
result = pipe(sample, return_timestamps="word")
print(result["chunks"])
Combining Arguments
result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "french", "task": "translate"})
print(result["chunks"])
Using the Model + Processor API Directly
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "deepdml/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]
inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)
gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}
pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)
print(pred_text)
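To keep Whisper's timestamp tokens in the decoded text rather than stripping them, a small variant of the decode call (flipping the flag used above):
# Variant: retain the <|x.xx|> timestamp tokens in the transcription.
pred_text_ts = processor.batch_decode(
    pred_ids, skip_special_tokens=True, decode_with_timestamps=True
)
print(pred_text_ts)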
🔧 Technical Details
Long-Form Transcription Algorithms
Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is required:
- Sequential: Uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other.
- Chunked: Splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries.
The sequential long-form algorithm should be used in either of the following scenarios:
- Transcription accuracy is the most important factor, and speed is less of a consideration.
- You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate.
Conversely, the chunked algorithm should be used when:
- Transcription speed is the most important factor.
- You are transcribing a single long audio file.
Additional Speed & Memory Improvements
Chunked Long-Form
By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s` parameter to the `pipeline`. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long audio files, also pass the argument `batch_size`:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "deepdml/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,  # batch size for inference - set based on your device
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
Torch compile
The Whisper forward pass is compatible with `torch.compile` for 4.5x speed-ups.
Note: `torch.compile` is currently not compatible with the chunked long-form algorithm or Flash Attention 2 ⚠️
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
from tqdm import tqdm
torch.set_float32_matmul_precision("high")
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "deepdml/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
# Enable static cache and compile the forward pass
model.generation_config.cache_implementation = "static"
model.generation_config.max_new_tokens = 256
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
# 2 warmup steps
for _ in tqdm(range(2), desc="Warm-up step"):
    with sdpa_kernel(SDPBackend.MATH):
        result = pipe(sample.copy(), generate_kwargs={"min_new_tokens": 256, "max_new_tokens": 256})

# fast run
with sdpa_kernel(SDPBackend.MATH):
    result = pipe(sample.copy())
print(result["text"])
Flash Attention 2
We recommend using Flash-Attention 2 if your GPU supports it and you are not using torch.compile. To do so, first install Flash Attention:
pip install flash-attn --no-build-isolation
Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")
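To guard against unsupported hardware, availability can be checked first. Here is a small sketch using the `is_flash_attn_2_available` utility from Transformers; the fallback to SDPA is an assumption, not part of the original card:
from transformers.utils import is_flash_attn_2_available

# Prefer Flash Attention 2 when the GPU and installed packages support it;
# otherwise fall back to PyTorch SDPA (assumed fallback, see note above).
attn_implementation = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True,
    attn_implementation=attn_implementation,
)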
Torch Scaled Dot-Product Attention (SDPA)
If your GPU does not support Flash Attention, we recommend making use of PyTorch scaled dot-product attention (SDPA). This attention implementation is activated by default for PyTorch versions 2.1.1 or greater. To check whether you have a compatible PyTorch version, run the following Python code snippet:
from transformers.utils import is_torch_sdpa_available
print(is_torch_sdpa_available())
If the above returns `True`, you have a valid version of PyTorch installed and SDPA is activated by default. If it returns `False`, you need to upgrade your PyTorch version according to the official instructions.
Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying `attn_implementation="sdpa"` as follows:
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="sdpa")
📄 License
This model is licensed under the Apache 2.0 license.