Kotoba-Whisper-v2.1 Open-source Japanese Speech Recognition Model - Automatically Add Punctuation and Accurately Recognize Speech

Kotoba Whisper V2.1

Developed by kotoba-tech

Kotoba-Whisper-v2.1 is a Japanese automatic speech recognition (ASR) model based on Whisper, integrating an additional post-processing stack that automatically adds punctuation marks.

Speech Recognition

Transformers

JapaneseOpen Source License:Apache-2.0 #Japanese Speech Recognition #Automatic Punctuation Addition #Low-latency Inference

Downloads 2,589

Release Time : 9/17/2024

Model Overview

This model focuses on Japanese speech recognition tasks, achieving automatic punctuation addition through the integration of the punctuators library, thereby enhancing the readability of transcribed text.

Model Features

Automatic Punctuation Addition

By integrating the punctuators library, the model can automatically add punctuation marks to transcribed text, improving readability.

Optimized Japanese Recognition

Specially optimized for Japanese speech recognition, it performs excellently on multiple Japanese datasets.

Pipeline Integration

The post-processing stack is seamlessly integrated through a pipeline, simplifying the usage process.

Model Capabilities

Japanese Speech Recognition

Automatic Punctuation Addition

Batch Audio Processing

Use Cases

Speech Transcription

Meeting Minutes Transcription

Convert Japanese meeting recordings into punctuated text transcripts

CER 17.7 (CommonVoice 8 Test Set)

Media Content Subtitle Generation

Automatically generate punctuated subtitles for Japanese video content

CER 15.4 (JSUT Basic 5000 Dataset)

🚀 Kotoba-Whisper-v2.1

Kotoba-Whisper-v2.1 is a Japanese Automatic Speech Recognition (ASR) model. It builds upon kotoba-tech/kotoba-whisper-v2.0 and integrates additional post - processing stacks as a pipeline. New features include adding punctuation with punctuators. These libraries are merged into Kotoba-Whisper-v2.1 via the pipeline and will be seamlessly applied to the predicted transcription from kotoba-tech/kotoba-whisper-v2.0. The pipeline was developed through the collaboration between Asahi Ushio and Kotoba Technologies.

🚀 Quick Start

Kotoba-Whisper-v2.1 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers.

pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install stable-ts==2.16.0
pip install punctuators==0.0.5

✨ Features

Post - processing Integration: Integrates additional post - processing stacks as a pipeline, including punctuation addition.
Multiple Dataset Support: Supports multiple Japanese speech datasets for evaluation and testing.

📦 Installation

To install the necessary dependencies for running the model, use the following commands:

pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install stable-ts==2.16.0
pip install punctuators==0.0.5

💻 Usage Examples

Basic Usage

The model can be used with the pipeline class to transcribe audio files as follows:

import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v2.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    batch_size=16,
    trust_remote_code=True,
    punctuator=True
)

# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]

# run inference
result = pipe(sample, chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
print(result)

Advanced Usage

Transcribe a Local Audio File

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

- result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
+ result = pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs)

Deactivate Punctuator

To deactivate the punctuator:

-     punctuator=True,
+     punctuator=False,

Use Flash Attention 2

We recommend using Flash - Attention 2 if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao - AILab/flash - attention):

pip install flash-attn --no-build-isolation

Then pass attn_implementation="flash_attention_2" to from_pretrained:

- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}

📚 Documentation

Model Evaluation

The following table presents the raw CER (unlike usual CER where the punctuations are removed before computing the metrics, see the evaluation script here):

model	CommonVoice 8 (Japanese test set)	JSUT Basic 5000	ReazonSpeech (held out test set)
kotoba-tech/kotoba-whisper-v2.0	17.6	15.4	17.4
kotoba-tech/kotoba-whisper-v2.1	17.7	15.4	17
kotoba-tech/kotoba-whisper-v1.0	17.8	15.2	17.8
kotoba-tech/kotoba-whisper-v1.1	17.9	15	17.8
openai/whisper-large-v3	15.3	13.4	20.5
openai/whisper-large-v2	15.9	10.6	34.6
openai/whisper-large	16.6	11.3	40.7
openai/whisper-medium	17.9	13.1	39.3
openai/whisper-base	34.5	26.4	76
openai/whisper-small	21.5	18.9	48.1
openai/whisper-tiny	58.8	38.3	153.3

Regarding the normalized CER, since those updates from v2.1 will be removed by the normalization, kotoba-tech/kotoba-whisper-v2.1 marks the same CER values as kotoba-tech/kotoba-whisper-v2.0.

Latency

Please refer to the section of the latency in the kotoba-whisper-v1.1 here.

📄 License

This project is licensed under the Apache 2.0 license.

🔧 Technical Details

The model is based on the Whisper architecture and is fine - tuned on Japanese speech datasets. It integrates post - processing libraries for punctuation addition and supports Flash Attention 2 for improved performance on compatible GPUs.

Acknowledgements

OpenAI for the Whisper model.
Hugging Face 🤗 Transformers for the model integration.
Hugging Face 🤗 for the Distil - Whisper codebase.
Reazon Human Interaction Lab for the ReazonSpeech dataset.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご