🚀 Whisper
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on a large amount of labeled data, it generalizes well to many datasets and domains without fine-tuning.
🚀 Quick Start
Whisper is a powerful pre-trained model for automatic speech recognition and speech translation. It can handle both English-only and multilingual tasks, and comes in different sizes to meet various needs.
✨ Features
- Multilingual Support: Supports a wide range of languages, including English, Chinese, German, and Spanish.
- Dual-Task Training: Trained for both speech recognition and speech translation.
- Multiple Model Sizes: Available in five configurations of varying model size.
📚 Documentation
Model Details
Whisper is a Transformer-based encoder-decoder model, trained on 680k hours of labeled speech data using large-scale weak supervision.
The models are trained on either English-only or multilingual data. English-only models are trained for speech recognition, while multilingual models are trained for both speech recognition and speech translation.
Whisper checkpoints come in five sizes: tiny, base, small, medium, and large. The smallest four are available in both English-only and multilingual versions; the large checkpoints are multilingual only. All pre-trained checkpoints are available on the Hugging Face Hub:
| Size | Parameters | English-only | Multilingual |
|---|---|---|---|
| tiny | 39 M | [🔗](https://huggingface.co/openai/whisper-tiny.en) | [🔗](https://huggingface.co/openai/whisper-tiny) |
| base | 74 M | [🔗](https://huggingface.co/openai/whisper-base.en) | [🔗](https://huggingface.co/openai/whisper-base) |
| small | 244 M | [🔗](https://huggingface.co/openai/whisper-small.en) | [🔗](https://huggingface.co/openai/whisper-small) |
| medium | 769 M | [🔗](https://huggingface.co/openai/whisper-medium.en) | [🔗](https://huggingface.co/openai/whisper-medium) |
| large | 1550 M | x | [🔗](https://huggingface.co/openai/whisper-large) |
| large-v2 | 1550 M | x | [🔗](https://huggingface.co/openai/whisper-large-v2) |
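Each link above points to a checkpoint that can be loaded by its Hub identifier; English-only checkpoints carry a `.en` suffix. A minimal sketch loading the tiny English-only model:
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> # English-only checkpoints end in ".en"; drop the suffix for the multilingual version
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")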
Usage
To transcribe audio samples, use the model together with a `WhisperProcessor`. The `WhisperProcessor` pre-processes audio inputs (converting them to log-Mel spectrogram features for the model) and post-processes model outputs (converting token ids back to text).
Context tokens are used to inform the model of the task (transcription or translation). A typical sequence of context tokens might be:
<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>
These tokens can be forced or un-forced. Forced tokens control the output language and task, while un-forced tokens let the model predict them automatically.
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
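Conversely, leaving the tokens un-forced lets the model predict the language and task itself. A minimal sketch of both modes, assuming the `processor` and `model` instances created in the examples below:
>>> # force French transcription
>>> model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")
>>> # un-force the tokens so the model predicts language and task automatically
>>> model.config.forced_decoder_ids = None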
💻 Usage Examples
Basic Usage
Transcription - English to English
In this example, the context tokens are 'un-forced', and the model automatically predicts the output language (English) and task (transcribe).
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset
>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
>>> model.config.forced_decoder_ids = None
>>> # load dummy dataset and read audio files
>>> ds = load_dataset("hf - internal - testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]
>>> input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
>>> # generate token ids
>>> predicted_ids = model.generate(input_features)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.<|endoftext|>']
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']
French to French
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import Audio, load_dataset
>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")
>>> # load streaming dataset and read first audio sample
>>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
>>> input_speech = next(iter(ds))["audio"]
>>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features
>>> # generate token ids
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids)
['<|startoftranscript|><|fr|><|transcribe|><|notimestamps|> Un vrai travail intéressant va enfin être mené sur ce sujet.<|endoftext|>']
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Un vrai travail intéressant va enfin être mené sur ce sujet.']
Advanced Usage
Translation - French to English
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import Audio, load_dataset
>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
>>> # load streaming dataset and read first audio sample
>>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
>>> input_speech = next(iter(ds))["audio"]
>>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features
>>> # generate token ids
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' A very interesting work, we will finally be given on this subject.']
Evaluation
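This example evaluates whisper-small on the LibriSpeech test-clean split and reports the word error rate (WER). It assumes a CUDA-capable GPU; `_normalize` is a private tokenizer helper that applies Whisper's English text normalizer before scoring.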
>>> from datasets import load_dataset
>>> from transformers import WhisperForConditionalGeneration, WhisperProcessor
>>> import torch
>>> from evaluate import load
>>> librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to("cuda")
>>> def map_to_pred(batch):
>>>     audio = batch["audio"]
>>>     input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
>>>     batch["reference"] = processor.tokenizer._normalize(batch['text'])
>>>
>>>     with torch.no_grad():
>>>         predicted_ids = model.generate(input_features.to("cuda"))[0]
>>>     transcription = processor.decode(predicted_ids)
>>>     batch["prediction"] = processor.tokenizer._normalize(transcription)
>>>     return batch
>>> result = librispeech_test_clean.map(map_to_pred)
>>> wer = load("wer")
>>> print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
3.432213777886737
Long-Form Transcription
The Whisper model is designed for audio samples of up to 30 s. Using a chunking algorithm via the Transformers `pipeline`, it can nevertheless transcribe audio of arbitrary length. Enable chunking by setting `chunk_length_s=30` when instantiating the pipeline.
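A minimal sketch, reusing the whisper-small checkpoint and the dummy LibriSpeech sample from the examples above (the GPU device is optional):
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset
>>> # run on GPU when available, otherwise fall back to CPU
>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"
>>> pipe = pipeline("automatic-speech-recognition", model="openai/whisper-small", chunk_length_s=30, device=device)
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]
>>> # the pipeline accepts a dict with "array" and "sampling_rate" keys
>>> prediction = pipe(sample.copy())["text"]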
📄 License
This model is licensed under the Apache-2.0 license.

