Kotoba Whisper V1.1

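Developed by kotoba-tech

Kotoba-Whisper-v1.1 is a Japanese automatic speech recognition model based on Whisper, with added punctuation and timestamp processing.

Speech Recognition

Transformers

Japanese  Open-source license: Apache-2.0  #Japanese-speech-recognition #automatic-punctuation #low-latency-inference

Downloads: 476

Release date: 4/29/2024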

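Model Overview

This is a Japanese automatic speech recognition (ASR) model built on the Whisper architecture. It is optimized for Japanese speech transcription and integrates punctuation insertion and timestamp processing.

Model Features

Punctuation processing

Integrates the punctuators library to add punctuation to the transcribed text automatically.

Timestamp processing

Uses the stable-ts library to improve timestamp accuracy.

Japanese optimization

Optimized specifically for Japanese speech recognition.

Efficient inference

Inference is faster than with the original Whisper model.

Model Capabilities

Japanese speech recognition

Automatic punctuation

Timestamp generation

Long-form audio processing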

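Use Cases

Speech transcription

Meeting minutes transcription

Converts recordings of Japanese meetings into punctuated text records.

Higher accuracy than the original Whisper model

Podcast transcription

Transcribes Japanese podcast content into timestamped text.

Supports long-form audio

Speech analysis

Audio content analysis

Analyzes keywords and themes in Japanese audio content.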

---
language: ja
library_name: transformers
license: apache-2.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: CommonVoice 8.0 (Test Split)
  src: >-
    https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0/resolve/main/sample.flac
- example_title: JSUT Basic 5000
  src: >-
    https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac
- example_title: ReazonSpeech (Test Split)
  src: >-
    https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test/resolve/main/sample.flac
pipeline_tag: automatic-speech-recognition
datasets:
- japanese-asr/whisper_transcriptions.reazonspeech.large
- japanese-asr/whisper_transcriptions.reazonspeech.large.wer_10.0
- japanese-asr/whisper_transcriptions.reazonspeech.large.wer_10.0.vectorized
---

Kotoba-Whisper-v1.1

Kotoba-Whisper-v1.1 is a Japanese ASR model based on kotoba-tech/kotoba-whisper-v1.0, with additional postprocessing stacks integrated as a pipeline. The new features include adding punctuation with punctuators. The postprocessing is merged into Kotoba-Whisper-v1.1 via the pipeline and is applied seamlessly to the transcription predicted by kotoba-tech/kotoba-whisper-v1.0. The pipeline was developed through a collaboration between Asahi Ushio and Kotoba Technologies.

The following table presents the raw CER. Unlike the usual CER, where punctuation is removed before computing the metric, the raw CER keeps punctuation (see the evaluation script here).

model	CommonVoice 8 (Japanese test set)	JSUT Basic 5000	ReazonSpeech (held out test set)
kotoba-tech/kotoba-whisper-v2.0	17.6	15.4	17.4
kotoba-tech/kotoba-whisper-v2.1	17.7	15.4	17
kotoba-tech/kotoba-whisper-v1.0	17.8	15.2	17.8
kotoba-tech/kotoba-whisper-v1.1	17.9	15	17.8
openai/whisper-large-v3	15.3	13.4	20.5
openai/whisper-large-v2	15.9	10.6	34.6
openai/whisper-large	16.6	11.3	40.7
openai/whisper-medium	17.9	13.1	39.3
openai/whisper-base	34.5	26.4	76
openai/whisper-small	21.5	18.9	48.1
openai/whisper-tiny	58.8	38.3	153.3

As for the normalized CER, the changes introduced in v1.1 are removed by the normalization, so kotoba-tech/kotoba-whisper-v1.1 yields the same CER values as kotoba-tech/kotoba-whisper-v1.0.
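To make the difference between raw and normalized CER concrete, here is a minimal sketch. It is not the evaluation script referenced above: the function names and the punctuation set are assumptions chosen for illustration, and CER is taken to be the character-level edit distance divided by the reference length.

```python
import re

# Assumed punctuation set for illustration; the actual evaluation script may normalize differently.
PUNCTUATION = re.compile(r"[、。，．,.!?！？「」\s]")


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, normalize: bool = False) -> float:
    """CER in percent; with normalize=True, punctuation and whitespace are stripped first."""
    if normalize:
        ref, hyp = PUNCTUATION.sub("", ref), PUNCTUATION.sub("", hyp)
    return 100.0 * edit_distance(ref, hyp) / len(ref)


reference = "81歳、力強い走りに変わってきます。"
without_punct = "81歳力強い走りに変わってきます"    # e.g. output lacking punctuation
with_punct = "81歳、力強い走りに変わってきます。"   # e.g. output after a punctuator

raw_without = cer(reference, without_punct)
raw_with = cer(reference, with_punct)
norm_without = cer(reference, without_punct, normalize=True)
norm_with = cer(reference, with_punct, normalize=True)
print(raw_without, raw_with, norm_without, norm_with)
```

In this toy case only the raw CER distinguishes the two hypotheses; after normalization both score the same, which mirrors why v1.0 and v1.1 share their normalized CER values.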

Latency

Kotoba-whisper-v1.1 improves the punctuation and timestamps of the output from Kotoba-whisper-v1.0. However, because the punctuator and stable-ts are applied to each chunk, timestamps must always be computed, which increases the latency compared with the original kotoba-whisper-v1.0. The following table compares the inference speed when transcribing 50 minutes of Japanese speech audio; each figure is the average over five independent runs.

model	return_timestamps	time (mean)
kotoba-tech/kotoba-whisper-v1.0	False	10.8
kotoba-tech/kotoba-whisper-v1.0	True	15.7
kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts)	True	17.9
kotoba-tech/kotoba-whisper-v1.1 (punctuator)	True	17.7
kotoba-tech/kotoba-whisper-v1.1 (stable-ts)	True	16.1
openai/whisper-large-v3	False	29.1
openai/whisper-large-v3	True	37.9

See the full table here.
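The relative cost of the postprocessing can be read off the table with a little arithmetic. The sketch below only restates the figures reported above; the dictionary keys and the helper name are made up for illustration.

```python
# Mean times copied from the table above (same unit as reported there).
mean_time = {
    "v1.0": 10.8,
    "v1.0 + timestamps": 15.7,
    "v1.1 punctuator + stable-ts": 17.9,
    "v1.1 punctuator": 17.7,
    "v1.1 stable-ts": 16.1,
    "large-v3": 29.1,
    "large-v3 + timestamps": 37.9,
}


def relative_overhead(slow: str, fast: str) -> float:
    """Extra time of `slow` over `fast`, as a percentage of `fast`."""
    return 100.0 * (mean_time[slow] - mean_time[fast]) / mean_time[fast]


# Cost of the full v1.1 postprocessing on top of v1.0 with timestamps.
full_overhead = relative_overhead("v1.1 punctuator + stable-ts", "v1.0 + timestamps")
# How many times faster the full v1.1 pipeline is than whisper-large-v3 with timestamps.
speedup_vs_large_v3 = mean_time["large-v3 + timestamps"] / mean_time["v1.1 punctuator + stable-ts"]

print(f"postprocessing overhead: {full_overhead:.1f}%")
print(f"speed-up over whisper-large-v3: {speedup_vs_large_v3:.2f}x")
```

By these figures the full postprocessing adds roughly 14% on top of kotoba-whisper-v1.0 with timestamps, while remaining about 2.1 times faster than openai/whisper-large-v3 with timestamps.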

Transformers Usage

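Kotoba-Whisper-v1.1 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers, along with the pinned versions of stable-ts and punctuators used by the pipeline. The examples below additionally use the datasets library to load sample audio.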

pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install stable-ts==2.16.0
pip install punctuators==0.0.5
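Before loading the model, it can help to confirm that the environment matches these requirements. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the version parsing is deliberately simple (it keeps the leading digits of each dotted part and ignores suffixes such as dev or rc).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Requirements taken from the text and install commands above.
REQUIREMENTS = {
    "transformers": ("min", "4.39"),
    "stable-ts": ("exact", "2.16.0"),
    "punctuators": ("exact", "0.0.5"),
}


def version_tuple(v: str) -> tuple:
    """Turn '4.39.0' into (4, 39, 0), keeping only the leading digits of each part."""
    parts = []
    for piece in v.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(installed: str, kind: str, wanted: str) -> bool:
    """Check an installed version against a minimum ('min') or pinned ('exact') version."""
    have, want = version_tuple(installed), version_tuple(wanted)
    return have >= want if kind == "min" else have == want


def check_environment() -> dict:
    """Map each package to 'ok', 'missing', or 'found <version>'."""
    report = {}
    for name, (kind, wanted) in REQUIREMENTS.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if satisfies(installed, kind, wanted) else f"found {installed}"
    return report


print(check_environment())
```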

Transcription

The model can be used with the pipeline class to transcribe audio files as follows:

import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    batch_size=16,
    trust_remote_code=True,
    punctuator=True,
)

# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]

# run inference
result = pipe(sample, chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
print(result)

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

- result = pipe(sample, chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
+ result = pipe("audio.mp3", chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)

To deactivate the punctuator:

-     punctuator=True,
+     punctuator=False,
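With return_timestamps=True, the standard Transformers ASR pipeline returns a dictionary with a "text" field and a "chunks" list, where each chunk carries a "timestamp" pair in seconds and its "text". The sketch below assumes the Kotoba-Whisper-v1.1 pipeline keeps that layout (worth confirming by printing `result`); the helper names and the sample data are made up for illustration.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def chunks_to_lines(result: dict) -> list:
    """Render each timestamped chunk as '[start -> end] text'."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end time can be None when the final segment is cut off.
        end_text = format_timestamp(end) if end is not None else "--:--:--.---"
        lines.append(f"[{format_timestamp(start)} -> {end_text}] {chunk['text'].strip()}")
    return lines


# Made-up example in the assumed layout; replace it with the `result` from the pipeline call.
example = {
    "text": "こんにちは。今日はいい天気ですね。",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": "こんにちは。"},
        {"timestamp": (1.5, 4.25), "text": "今日はいい天気ですね。"},
    ],
}
for line in chunks_to_lines(example):
    print(line)
```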

Transcription with Prompt

Kotoba-Whisper can also transcribe with a prompt, as shown below:

import re
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    batch_size=16,
    trust_remote_code=True
)

# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")

# --- Without prompt ---
text = pipe(dataset[10]["audio"], chunk_length_s=15, generate_kwargs=generate_kwargs)['text']
print(text)
# 81歳、力強い走りに変わってきます。

# --- With prompt ---: Let's change `81` to `91`.
prompt = "91歳"
generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
# currently the ASR pipeline prepends the prompt to the transcription, so remove it
# (re.escape keeps prompts containing regex metacharacters from being read as patterns)
text = re.sub(rf"\A\s*{re.escape(prompt)}\s*", "", text)
print(text)
# あっぶったでもスルガさん、91歳、力強い走りに変わってきます。

Flash Attention 2

We recommend using Flash-Attention 2 if your GPU allows for it. To do so, you first need to install Flash Attention:

pip install flash-attn --no-build-isolation

Then pass attn_implementation="flash_attention_2" through model_kwargs when creating the pipeline:

- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
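The choice between the two attention implementations can also be made at runtime. The sketch below is one possible way to do it, with hypothetical helper names: it falls back to sdpa when the flash_attn package is not importable. Note that an installed package does not by itself guarantee that the GPU supports Flash Attention 2, so treat this as a starting point only.

```python
import importlib.util


def choose_model_kwargs(cuda_available: bool, flash_attn_installed: bool) -> dict:
    """Pick attn_implementation following the guidance above."""
    if not cuda_available:
        return {}  # CPU: keep the default attention implementation
    if flash_attn_installed:
        return {"attn_implementation": "flash_attention_2"}
    return {"attn_implementation": "sdpa"}


def flash_attn_available() -> bool:
    """True if the flash_attn package can be imported in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


# torch is optional here so that the sketch also runs on machines without it.
try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:
    cuda_available = False

model_kwargs = choose_model_kwargs(cuda_available, flash_attn_available())
print(model_kwargs)
```

The resulting `model_kwargs` can be passed to `pipeline(...)` exactly as in the transcription examples.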