Distil-whisper-large-v3-es Open-source Spanish Speech Recognition Model - Free to Assist with Accurate Spanish Recognition

Distil Whisper Large V3 Es

Developed by marianbasti

A Spanish speech recognition model distilled from Whisper v3 large model, jointly developed by SandboxAI and Universidad Nacional de Rio Negro

Speech Recognition

Transformers

SpanishOpen Source License:MIT #Spanish speech recognition #Distilled optimized model #Long audio chunk processing

Downloads 64

Release Time : 1/26/2024

Model Overview

This model is optimized for Spanish speech recognition, obtained by distilling Whisper-large-v3, improving inference speed while maintaining high accuracy

Model Features

Efficient Inference

Uses chunking algorithm for long audio processing, 9x faster than original Whisper

Speculative Decoding Support

Can serve as auxiliary model for Whisper's speculative decoding, 2x speedup

Spanish Optimization

Specially trained and optimized for Spanish speech recognition tasks

Model Capabilities

Spanish speech transcription

Long audio processing

Real-time speech recognition

Use Cases

Speech Transcription

Meeting Minutes

Automatically convert Spanish meeting recordings into text transcripts

Media Caption Generation

Automatically generate subtitles for Spanish video content

🚀 distil-whisper-large-v3-es

This repository hosts a distilled version of the Whisper v3 large model, trained on the Mozilla Common Voice dataset v16.1. It's a collaborative effort between SandboxAI and the Universidad Nacional de Rio Negro.

🚀 Quick Start

Distil-Whisper is supported in Hugging Face 🤗 Transformers from version 4.35 onwards. To run the model, first install the latest version of the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:

pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]

✨ Features

Short-Form Transcription: Can transcribe short-form audio files (< 30 seconds).
Long-Form Transcription: Uses a chunked algorithm to transcribe long-form audio files (> 30 seconds), which is 9x faster than the sequential algorithm in the Whisper paper.
Speculative Decoding: Can be used as an assistant model for Whisper in speculative decoding, achieving the same outputs as Whisper while being 2 times faster.

💻 Usage Examples

Basic Usage

Short-Form Transcription

The model can be used with the pipeline class to transcribe short-form audio files (< 30 seconds) as follows:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "marianbasti/distil-whisper-large-v3-es"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

- result = pipe(sample)
+ result = pipe("audio.mp3")

Advanced Usage

Long-Form Transcription

Distil-Whisper uses a chunked algorithm to transcribe long-form audio files (> 30 seconds). In practice, this chunked long-form algorithm is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the Distil-Whisper paper).

To enable chunking, pass the chunk_length_s parameter to the pipeline. For Distil-Whisper, a chunk length of 15 seconds is optimal. To activate batching, pass the argument batch_size:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "marianbasti/distil-whisper-large-v3-es"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])

Speculative Decoding

Distil-Whisper can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in replacement for existing Whisper pipelines, since the same outputs are guaranteed.

In the following code-snippet, we load the assistant Distil-Whisper model standalone to the main Whisper pipeline. We then specify it as the "assistant model" for generation:

from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
assistant_model_id = "marianbasti/distil-whisper-large-v3-es"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])

🔧 Technical Details

The model was trained for 60,000 optimisation steps (or around 1.47 epochs), on a single RTX3090 for ~60 hours, using the following training parameters:

--teacher_model_name_or_path "openai/whisper-large-v3"
--train_dataset_name "mozilla-foundation/common_voice_16_1"
--train_dataset_config_name "es"
--train_split_name "train"
--text_column_name "sentence"
--eval_dataset_name "mozilla-foundation/common_voice_16_1"
--eval_dataset_config_name "es"
--eval_split_name "validation"
--eval_text_column_name "sentence"
--eval_steps 10000
--save_steps 10000
--warmup_steps 500
--learning_rate 1e-4
--lr_scheduler_type "linear"
--logging_steps 25
--save_total_limit 1
--max_steps 60000
--wer_threshold 10
--per_device_train_batch_size 8
--per_device_eval_batch_size 8
--dataloader_num_workers 12
--preprocessing_num_workers 12
--output_dir "./"
--do_train
--do_eval
--gradient_checkpointing
--predict_with_generate
--overwrite_output_dir
--use_pseudo_labels "false"
--freeze_encoder
--streaming False

📚 Documentation

The distilled model performs with a 5.11% WER (10.15% orthogonal WER).

📄 License

Distil-Whisper inherits the MIT license from OpenAI's Whisper model.

📚 Citation

If you use this model, please consider citing the Distil-Whisper paper:

@misc{gandhi2023distilwhisper,
      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling}, 
      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
      year={2023},
      eprint={2311.00430},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご