🚀 Kotoba-Whisper-v1.1
Kotoba-Whisper-v1.1 is a Japanese Automatic Speech Recognition (ASR) model. It builds upon kotoba-tech/kotoba-whisper-v1.0 and integrates additional post-processing stacks as a pipeline. New features include adding punctuation with punctuators. These libraries are merged into Kotoba-Whisper-v1.1 via the pipeline and are seamlessly applied to the transcription predicted by kotoba-tech/kotoba-whisper-v1.0. The pipeline was developed through a collaboration between Asahi Ushio and Kotoba Technologies.
✨ Features
- Based on the kotoba-tech/kotoba-whisper-v1.0 model.
- Integrates post-processing stacks as a pipeline, including punctuation insertion with punctuators (see the sketch after this list).
- Supports transcription of Japanese audio with or without prompts.
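The punctuation step uses the punctuators library under the hood. As a rough illustration of what that stage does on its own, here is a minimal sketch assuming the multilingual pcs_47lang model (the exact model wired into the pipeline may differ):
from punctuators.models import PunctCapSegModelONNX

# load a multilingual punctuation/segmentation model (assumption: pcs_47lang)
model = PunctCapSegModelONNX.from_pretrained("pcs_47lang")

# infer() takes a batch of unpunctuated strings and returns, for each input,
# a list of punctuated and segmented sentences
results = model.infer(["こんにちは今日はいい天気ですね"])
print(results[0])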
📦 Installation
Kotoba-Whisper-v1.1 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers.
pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install stable-ts==2.16.0
pip install punctuators==0.0.5
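To confirm that the installed Transformers version meets the 4.39 requirement, you can check it from Python:
import transformers

# Kotoba-Whisper-v1.1 requires transformers >= 4.39
print(transformers.__version__)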
💻 Usage Examples
Basic Usage
The model can be used with the pipeline class to transcribe audio files as follows:
import torch
from transformers import pipeline
from datasets import load_dataset
# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}
# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    batch_size=16,
    trust_remote_code=True,
    punctuator=True
)
# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]

# run inference
result = pipe(sample, chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
print(result)
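With return_timestamps=True, the pipeline typically returns a dictionary with the full transcription under "text" and chunk-level segments under "chunks"; a small sketch of how to inspect them:
# full transcription with punctuation applied
print(result["text"])

# each chunk carries its own text and a (start, end) timestamp pair
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])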
Advanced Usage
Transcribing a Local Audio File
To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
- result = pipe(sample, chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
+ result = pipe("audio.mp3", chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
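If the audio is already loaded in memory, for example with torchaudio (installed above), you can pass the waveform together with its sampling rate instead of a path. A minimal sketch, assuming a mono file named audio.mp3 and a torchaudio backend that can decode it; the pipeline is expected to resample to 16 kHz when needed:
import torchaudio

# torchaudio returns a (channels, samples) tensor and the sampling rate
waveform, sampling_rate = torchaudio.load("audio.mp3")

# pass the raw mono waveform and its sampling rate
result = pipe(
    {"array": waveform[0].numpy(), "sampling_rate": sampling_rate},
    chunk_length_s=15,
    return_timestamps=True,
    generate_kwargs=generate_kwargs,
)
print(result["text"])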
Deactivating the Punctuator
To deactivate the punctuator, set punctuator=False when creating the pipeline:
- punctuator=True,
+ punctuator=False,
Transcription with Prompt
Kotoba-Whisper can generate transcriptions with a prompt, as shown below:
import re
import torch
from transformers import pipeline
from datasets import load_dataset
# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}
# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    batch_size=16,
    trust_remote_code=True
)
# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")

# transcription without a prompt
text = pipe(dataset[10]["audio"], chunk_length_s=15, generate_kwargs=generate_kwargs)['text']
print(text)

# transcription with a prompt: encode the prompt and pass it via generate_kwargs
prompt = "91歳"
generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
# the prompt may be prepended to the output, so strip it from the transcription
text = re.sub(rf"\A\s*{prompt}\s*", "", text)
print(text)
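Note that prompt_ids stays inside generate_kwargs for subsequent calls; remove it again (for example with generate_kwargs.pop("prompt_ids", None)) if you want to transcribe later audio without the prompt.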
Flash Attention 2
We recommend using Flash-Attention 2 if your GPU allows for it.
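To enable it, first install Flash Attention (this assumes a CUDA GPU supported by the flash-attn package):
pip install flash-attn --no-build-isolation
Then switch the attention implementation passed to the model:
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}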
📚 Documentation
Evaluation Metrics
The following table presents the raw CER (unlike the usual CER, where punctuation is removed before computing the metric; see the evaluation script here):
Regarding the normalized CER, since the updates introduced in v1.1 are removed by the normalization, kotoba-tech/kotoba-whisper-v1.1 marks the same CER values as kotoba-tech/kotoba-whisper-v1.0.
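As an illustration of the difference between raw and normalized CER, here is a sketch using the evaluate library (not the exact evaluation script referenced above; the punctuation-stripping step is a simplified stand-in for the real normalization):
import re
import evaluate

cer = evaluate.load("cer")

references = ["今日はいい天気ですね。"]
predictions = ["今日はいい天気ですね"]

# raw CER: punctuation is kept, so the missing "。" counts as an error
print(cer.compute(predictions=predictions, references=references))

# normalized CER: strip punctuation before scoring (simplified normalization)
strip = lambda s: re.sub(r"[、。]", "", s)
print(cer.compute(predictions=[strip(s) for s in predictions], references=[strip(s) for s in references]))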
Latency
Kotoba-whisper-v1.1 improves the punctuation and the timestamps of the output from Kotoba-whisper-v1.0. However, since the punctuator and stable-ts are applied to each chunk, the timestamps have to be obtained first, which increases the latency compared with the original kotoba-whisper-v1.0. The following table compares the inference speed on transcribing 50 minutes of Japanese speech audio, where we report the average over five independent runs.
| model | return_timestamps | time (mean) |
|:---|:---:|:---:|
| kotoba-tech/kotoba-whisper-v1.0 | False | 10.8 |
| kotoba-tech/kotoba-whisper-v1.0 | True | 15.7 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) | True | 17.9 |
| kotoba-tech/kotoba-whisper-v1.1 (punctuator) | True | 17.7 |
| kotoba-tech/kotoba-whisper-v1.1 (stable-ts) | True | 16.1 |
| openai/whisper-large-v3 | False | 29.1 |
| openai/whisper-large-v3 | True | 37.9 |
See the full table here.
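A minimal sketch of how such a timing comparison can be reproduced, reusing pipe and generate_kwargs from the basic usage example above (the audio path and the number of runs are placeholders; absolute numbers depend heavily on hardware):
import time
from statistics import mean

n_runs = 5
times = []
for _ in range(n_runs):
    start = time.time()
    pipe("long_audio.mp3", chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
    times.append(time.time() - start)
print(f"mean: {mean(times):.1f}s over {n_runs} runs")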
📄 License
This project is licensed under the Apache-2.0 license.