Whisper Llamafiles
A set of llamafiles generated for the Whisper automatic speech recognition model, offering easy deployment and usage.
✨ Features
- A collection of llamafiles generated for Whisper.
- Built with the whisperfile repo, a fork of the main llamafile repo that adds support for whisper.cpp.
- Quantized llamafiles available for the multilingual Whisper models in q8 and q5k formats, alongside the original unquantized model.
📦 Installation
Prerequisites
To run the Whisper large-v3 model, first install the necessary libraries:
pip install --upgrade pip
pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
💻 Usage Examples
Running the Llamafile
chmod +x <model>.llamafile
./<model>.llamafile
Using the Model with Pipeline
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
Transcribing a Local Audio File
- result = pipe(sample)
+ result = pipe("audio.mp3")
Specifying Language and Task
To force the source language, or to translate the speech to English, pass the corresponding generate_kwargs:
result = pipe(sample, generate_kwargs={"language": "english"})
result = pipe(sample, generate_kwargs={"task": "translate"})
Getting Timestamps
result = pipe(sample, return_timestamps="word")
print(result["chunks"])
📚 Documentation
Model Details
Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2.
The models were trained on either English-only data or multilingual data. The English-only models were trained on the task of speech recognition. The multilingual models were trained on both speech recognition and speech translation.
Additional Speed & Memory Improvements
Flash Attention
If your GPU supports it, use Flash Attention 2. First, install Flash Attention:
pip install flash-attn --no-build-isolation
Then, pass use_flash_attention_2=True to from_pretrained:
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)
Torch Scaled Dot-Product Attention (SDPA)
If your GPU doesn't support Flash Attention, use BetterTransformer. First, install optimum:
pip install --upgrade optimum
Then, convert your model to a "BetterTransformer" model:
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = model.to_bettertransformer()
Fine-Tuning
The pre-trained Whisper model can be fine-tuned for better performance on certain languages and tasks. Refer to the blog post Fine-Tune Whisper with 🤗 Transformers for a step-by-step guide.
Evaluated Use
The primary users are AI researchers. However, Whisper can also be useful for developers, especially for English speech recognition. Users should perform robust evaluations before deployment.
Training Data
The models were trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2.
Performance and Limitations
The models show improved robustness and near state-of-the-art accuracy. However, they may hallucinate text and perform unevenly across languages and accents.
Broader Implications
Whisper models can improve accessibility tools but also raise dual-use concerns related to surveillance.
🔧 Technical Details
Llamafile Parameters
Each llamafile is configured with the following default parameters:
whisperfile -m $filename.bin --host 0.0.0.0 --port 51524 --convert -pc -pr
This starts a server on port 51524, converts audio files to the proper .wav format via ffmpeg, and prints/colorizes the decoded text in the terminal output.
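Once the server is running, it can also be queried programmatically. The Python sketch below is a minimal example, assuming the whisperfile server follows whisper.cpp's server API (a multipart "file" upload to the /inference endpoint) on the port configured above; the file name audio.wav and field names are illustrative, so adjust them to match your setup.
import requests  # third-party: pip install requests

# Assumption: the whisperfile server exposes whisper.cpp's /inference
# endpoint and accepts a multipart "file" upload plus form options.
with open("audio.wav", "rb") as audio:
    response = requests.post(
        "http://localhost:51524/inference",
        files={"file": audio},
        data={"response_format": "json"},
    )
response.raise_for_status()
print(response.json()["text"])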
📄 License
This project is licensed under the Apache-2.0 license.
BibTeX entry and citation info
@misc{radford2022whisper,
  doi       = {10.48550/ARXIV.2212.04356},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}