🚀 Kotoba-Whisper-v1.1
Kotoba-Whisper-v1.1 is a Japanese Automatic Speech Recognition (ASR) model. It builds upon kotoba-tech/kotoba-whisper-v1.0 and integrates additional post-processing stacks as a pipeline. New features include adding punctuation with punctuators. These libraries are merged into Kotoba-Whisper-v1.1 via the pipeline and are seamlessly applied to the transcription predicted by kotoba-tech/kotoba-whisper-v1.0. The pipeline was developed through a collaboration between Asahi Ushio and Kotoba Technologies.
✨ Features
- Based on the kotoba-tech/kotoba-whisper-v1.0 model.
- Integrates post-processing stacks as a pipeline, including punctuation insertion with punctuators (see the sketch after this list).
- Supports transcription of Japanese audio with or without prompts.
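The punctuation step uses the punctuators library under the hood. As a rough illustration of what that stage does on its own, here is a minimal sketch assuming the multilingual pcs_47lang model (the exact model wired into the pipeline may differ):
from punctuators.models import PunctCapSegModelONNX

# load a multilingual punctuation/segmentation model (assumption: pcs_47lang)
model = PunctCapSegModelONNX.from_pretrained("pcs_47lang")

# infer() takes a batch of unpunctuated strings and returns, for each input,
# a list of punctuated and segmented sentences
results = model.infer(["こんにちは今日はいい天気ですね"])
print(results[0])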
📦 Installation
Kotoba-Whisper-v1.1 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers.
pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install stable-ts==2.16.0
pip install punctuators==0.0.5
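To confirm that the installed Transformers version meets the 4.39 requirement, you can check it from Python:
import transformers

# Kotoba-Whisper-v1.1 requires transformers >= 4.39
print(transformers.__version__)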
💻 Usage Examples
Basic Usage
The model can be used with the pipeline class to transcribe audio files as follows:
import torch
from transformers import pipeline
from datasets import load_dataset
# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}
# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    batch_size=16,
    trust_remote_code=True,
    punctuator=True
)
# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]

# run inference
result = pipe(sample, chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
print(result)
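With return_timestamps=True, the pipeline typically returns a dictionary with the full transcription under "text" and chunk-level segments under "chunks"; a small sketch of how to inspect them:
# full transcription with punctuation applied
print(result["text"])

# each chunk carries its own text and a (start, end) timestamp pair
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])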
Advanced Usage
Transcribing a Local Audio File
To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
- result = pipe(sample, chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
+ result = pipe("audio.mp3", chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
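If the audio is already loaded in memory, for example with torchaudio (installed above), you can pass the waveform together with its sampling rate instead of a path. A minimal sketch, assuming a mono file named audio.mp3 and a torchaudio backend that can decode it; the pipeline is expected to resample to 16 kHz when needed:
import torchaudio

# torchaudio returns a (channels, samples) tensor and the sampling rate
waveform, sampling_rate = torchaudio.load("audio.mp3")

# pass the raw mono waveform and its sampling rate
result = pipe(
    {"array": waveform[0].numpy(), "sampling_rate": sampling_rate},
    chunk_length_s=15,
    return_timestamps=True,
    generate_kwargs=generate_kwargs,
)
print(result["text"])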
Deactivating the Punctuator
To deactivate the punctuator, set punctuator=False when creating the pipeline:
- punctuator=True,
+ punctuator=False,
Transcription with Prompt
Kotoba-Whisper can generate transcriptions with a prompt, as shown below:
import re
import torch
from transformers import pipeline
from datasets import load_dataset
# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}
# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    batch_size=16,
    trust_remote_code=True
)
# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")

# transcription without a prompt
text = pipe(dataset[10]["audio"], chunk_length_s=15, generate_kwargs=generate_kwargs)['text']
print(text)

# transcription with a prompt: encode the prompt and pass it via generate_kwargs
prompt = "91歳"
generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
# the prompt may be prepended to the output, so strip it from the transcription
text = re.sub(rf"\A\s*{prompt}\s*", "", text)
print(text)
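Note that prompt_ids stays inside generate_kwargs for subsequent calls; remove it again (for example with generate_kwargs.pop("prompt_ids", None)) if you want to transcribe later audio without the prompt.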
Flash Attention 2
We recommend using Flash-Attention 2 if your GPU allows for it.
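To enable it, first install Flash Attention (this assumes a CUDA GPU supported by the flash-attn package):
pip install flash-attn --no-build-isolation
Then switch the attention implementation passed to the model:
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}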
📚 Documentation
Evaluation Metrics
The following table presents the raw CER (unlike the usual CER, where punctuation is removed before computing the metric; see the evaluation script here):
Regarding the normalized CER, since the updates introduced in v1.1 are removed by the normalization, kotoba-tech/kotoba-whisper-v1.1 marks the same CER values as kotoba-tech/kotoba-whisper-v1.0.
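As an illustration of the difference between raw and normalized CER, here is a sketch using the evaluate library (not the exact evaluation script referenced above; the punctuation-stripping step is a simplified stand-in for the real normalization):
import re
import evaluate

cer = evaluate.load("cer")

references = ["今日はいい天気ですね。"]
predictions = ["今日はいい天気ですね"]

# raw CER: punctuation is kept, so the missing "。" counts as an error
print(cer.compute(predictions=predictions, references=references))

# normalized CER: strip punctuation before scoring (simplified normalization)
strip = lambda s: re.sub(r"[、。]", "", s)
print(cer.compute(predictions=[strip(s) for s in predictions], references=[strip(s) for s in references]))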
Latency
Kotoba-whisper-v1.1 improves the punctuation and the timestamps of the output from Kotoba-whisper-v1.0. However, since the punctuator and stable-ts are applied to each chunk, the timestamps have to be obtained first, which increases the latency compared with the original kotoba-whisper-v1.0. The following table compares the inference speed on transcribing 50 minutes of Japanese speech audio, where we report the average over five independent runs.
| model | return_timestamps | time (mean) |
|:---|:---:|:---:|
| kotoba-tech/kotoba-whisper-v1.0 | False | 10.8 |
| kotoba-tech/kotoba-whisper-v1.0 | True | 15.7 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) | True | 17.9 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator) | True | 17.7 |
| kotoba-tech/kotoba-whisper-v1.1 (stable-ts) | True | 16.1 |
| openai/whisper-large-v3 | False | 29.1 |
| openai/whisper-large-v3 | True | 37.9 |
See the full table here.
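A minimal sketch of how such a timing comparison can be reproduced, reusing pipe and generate_kwargs from the basic usage example above (the audio path and the number of runs are placeholders; absolute numbers depend heavily on hardware):
import time
from statistics import mean

n_runs = 5
times = []
for _ in range(n_runs):
    start = time.time()
    pipe("long_audio.mp3", chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
    times.append(time.time() - start)
print(f"mean: {mean(times):.1f}s over {n_runs} runs")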
📄 License
This project is licensed under the Apache-2.0 license.