Kotoba Whisper V2.2

Developed by kotoba-tech

A Japanese automatic speech recognition model based on Whisper, with integrated speaker diarization and punctuation restoration

Speech Recognition

Transformers

Japanese

Open Source License: Apache-2.0

#Japanese speech recognition #Speaker diarization #Punctuation restoration

Downloads 22.80k

Release Time: 10/18/2024

Model Overview

Kotoba-Whisper-v2.2 is a Japanese automatic speech recognition (ASR) model built on the Whisper architecture, with added post-processing for speaker diarization and punctuation insertion.

Model Features

Speaker diarization

Uses the diarizers library to identify the different speakers and attribute each transcribed segment to one of them

Automatic punctuation

Uses the punctuators library to automatically add punctuation to the transcribed text

Efficient inference

Supports Flash Attention 2 acceleration to improve GPU inference efficiency

Model Capabilities

Japanese speech recognition

Multi-speaker separation

Automatic punctuation insertion

Long audio processing

Use Cases

Meeting minutes

Multi-speaker meeting transcription

Automatically identifies speech content from different speakers in meetings and generates punctuated text records

Can distinguish between different speakers and generate formatted meeting minutes

Interview records

Interview transcription

Converts interview recordings into text, automatically distinguishing between interviewer and interviewee speech

Generates interview records with speaker identification and punctuation

language: ja
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: Sample 1
  src: https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3

Kotoba-Whisper-v2.2

Kotoba-Whisper-v2.2 is a Japanese ASR model based on kotoba-tech/kotoba-whisper-v2.0, with additional postprocessing stacks integrated as a pipeline. The new features include (i) speaker diarization with diarizers and (ii) punctuation restoration with punctuators. The pipeline was developed through a collaboration between Asahi Ushio and Kotoba Technologies.

Transformers Usage

Kotoba-Whisper-v2.2 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers together with the diarization and punctuation dependencies.

pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install "punctuators==0.0.5"
pip install "pyannote.audio"
pip install git+https://github.com/huggingface/diarizers.git
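After installing, it can be worth confirming that the environment meets the 4.39 minimum stated above. The following is a minimal sketch using only the standard library; `version_tuple` and `meets_minimum` are hypothetical helper names introduced here, and the parsing deliberately ignores pre-release suffixes.

```python
from importlib import metadata


def version_tuple(version):
    """Turn '4.39.0.dev0' into (4, 39, 0), keeping only the leading numeric parts."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum="4.39"):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    print(meets_minimum(metadata.version("transformers")))
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```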

To load pre-trained diarization models from the Hub, you'll first need to accept the terms of use for the two diarization models that the pipeline depends on, and then log in with a Hugging Face authentication token:

huggingface-cli login

Transcription with Diarization

The model can be used with the pipeline.

Download an audio sample.

wget https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3

Run the model via pipeline.

import torch
from transformers import pipeline

# config
model_id = "kotoba-tech/kotoba-whisper-v2.2"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}


# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    batch_size=8,
    trust_remote_code=True,
)

# run inference
result = pipe("sample_diarization_japanese.mp3", chunk_length_s=15)
print(result)
>>> {
 'chunks/SPEAKER_00': [{'speaker_id': 'SPEAKER_00', 'text': '水をマレーシアから買わなくてはならないのです', 'timestamp': [22.1, 24.97]}],
 'chunks/SPEAKER_01': [{'speaker_id': 'SPEAKER_01', 'text': 'これも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども', 'timestamp': [0.03, 13.85]},
                      {'speaker_id': 'SPEAKER_01', 'text': '今は屋外の気温', 'timestamp': [5.03, 18.85]},
                      {'speaker_id': 'SPEAKER_01', 'text': '昼も夜も上がってますので', 'timestamp': [7.63, 21.45]},
                      {'speaker_id': 'SPEAKER_01', 'text': '空気の入れ替えだけではかえって人が上がってきます', 'timestamp': [9.91, 23.73]}],
 'chunks/SPEAKER_02': [{'speaker_id': 'SPEAKER_02', 'text': '愚直にやっぱりその街の良さをアピールしていくという', 'timestamp': [13.48, 22.1]},
                      {'speaker_id': 'SPEAKER_02', 'text': 'そういう姿勢が基本にあった上での', 'timestamp': [17.26, 25.88]},
                      {'speaker_id': 'SPEAKER_02', 'text': 'こういうPR作戦だと思うんですよね', 'timestamp': [19.86, 28.48]}],
 'chunks': [{'speaker_id': 'SPEAKER_00', 'text': '水をマレーシアから買わなくてはならないのです', 'timestamp': [22.1, 24.97]},
            {'speaker_id': 'SPEAKER_01', 'text': 'これも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども', 'timestamp': [0.03, 13.85]},
            {'speaker_id': 'SPEAKER_01', 'text': '今は屋外の気温', 'timestamp': [5.03, 18.85]},
            {'speaker_id': 'SPEAKER_01', 'text': '昼も夜も上がってますので', 'timestamp': [7.63, 21.45]},
            {'speaker_id': 'SPEAKER_01', 'text': '空気の入れ替えだけではかえって人が上がってきます', 'timestamp': [9.91, 23.73]},
            {'speaker_id': 'SPEAKER_02', 'text': '愚直にやっぱりその街の良さをアピールしていくという', 'timestamp': [13.48, 22.1]},
            {'speaker_id': 'SPEAKER_02', 'text': 'そういう姿勢が基本にあった上での', 'timestamp': [17.26, 25.88]},
            {'speaker_id': 'SPEAKER_02', 'text': 'こういうPR作戦だと思うんですよね', 'timestamp': [19.86, 28.48]}],
 'speaker_ids': ['SPEAKER_00', 'SPEAKER_01', 'SPEAKER_02'],
 'text/SPEAKER_00': '水をマレーシアから買わなくてはならないのです',
 'text/SPEAKER_01': 'これも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども今は屋外の気温昼も夜も上がってますので空気の入れ替えだけではかえって人が上がってきます',
 'text/SPEAKER_02': '愚直にやっぱりその街の良さをアピールしていくというそういう姿勢が基本にあった上でのこういうPR作戦だと思うんですよね'
}
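The `chunks` list in the output above is grouped by speaker rather than by time, so a readable transcript needs a sort on the timestamps. The sketch below shows one way to do that; `to_transcript` is a hypothetical helper, `sample_result` is an abbreviated stand-in for the real `result`, and the code assumes every chunk carries a numeric `[start, end]` timestamp.

```python
def to_transcript(result):
    """Return time-ordered '[start-end] speaker: text' lines from a pipeline result."""
    # Sort by start time, then end time, so speaker turns appear chronologically.
    chunks = sorted(result["chunks"], key=lambda c: tuple(c["timestamp"]))
    return [
        f"[{c['timestamp'][0]:.2f}-{c['timestamp'][1]:.2f}] {c['speaker_id']}: {c['text']}"
        for c in chunks
    ]


# Abbreviated stand-in for the `result` dictionary printed above.
sample_result = {
    "chunks": [
        {"speaker_id": "SPEAKER_00", "text": "水をマレーシアから買わなくてはならないのです", "timestamp": [22.1, 24.97]},
        {"speaker_id": "SPEAKER_01", "text": "今は屋外の気温", "timestamp": [5.03, 18.85]},
        {"speaker_id": "SPEAKER_02", "text": "こういうPR作戦だと思うんですよね", "timestamp": [19.86, 28.48]},
    ]
}

for line in to_transcript(sample_result):
    print(line)
```

With the real pipeline output, `to_transcript(result)` can be called directly on the returned dictionary.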

To activate the punctuator:

-     result = pipe("sample_diarization_japanese.mp3")
+     result = pipe("sample_diarization_japanese.mp3", add_punctuation=True)

The punctuator is applied to the text/* fields, for example:

'text/SPEAKER_00': '水をマレーシアから買わなくてはならないのです。'
'text/SPEAKER_01': 'これも先ほどがずっと言っている。自分の感覚的には大丈夫です。けれども。今は屋外の気温、昼も夜も上がってますので、空気の入れ替えだけではかえって人が上がってきます。'
'text/SPEAKER_02': '愚直にその街の良さをアピールしていくという。そういう姿勢が基本にあった上での、こういうPR作戦だと思うんですよね。'
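Once punctuation is present, the per-speaker text can be broken into sentences for display or further processing. This is a minimal sketch, not part of the pipeline; `split_sentences` is a hypothetical helper that treats 。, ! and ? as sentence endings.

```python
import re


def split_sentences(text):
    """Split punctuated Japanese text into sentences, keeping the ending mark."""
    # Each match is a run of non-terminator characters plus an optional terminator.
    return re.findall(r"[^。!?]+[。!?]?", text)


punctuated = "これも先ほどがずっと言っている。自分の感覚的には大丈夫です。けれども。"
for sentence in split_sentences(punctuated):
    print(sentence)
```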

To control the number of speakers, either fix the count or give a range:

-     result = pipe("sample_diarization_japanese.mp3")
+     result = pipe("sample_diarization_japanese.mp3", num_speakers=3)

-     result = pipe("sample_diarization_japanese.mp3")
+     result = pipe("sample_diarization_japanese.mp3", min_speakers=2, max_speakers=5)
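When the speaker settings come from user input or a config file, a small wrapper can validate them before they reach the pipeline. The helper below is hypothetical; treating a fixed count and a min/max range as mutually exclusive is an assumption of this sketch, not documented pipeline behavior.

```python
def speaker_kwargs(num_speakers=None, min_speakers=None, max_speakers=None):
    """Build keyword arguments for the pipeline call (hypothetical helper)."""
    bounds_given = min_speakers is not None or max_speakers is not None
    if num_speakers is not None and bounds_given:
        raise ValueError("pass either num_speakers or min/max bounds, not both")
    if min_speakers is not None and max_speakers is not None and min_speakers > max_speakers:
        raise ValueError("min_speakers must not exceed max_speakers")
    candidates = {
        "num_speakers": num_speakers,
        "min_speakers": min_speakers,
        "max_speakers": max_speakers,
    }
    # Only forward the options that were actually set.
    return {key: value for key, value in candidates.items() if value is not None}


print(speaker_kwargs(min_speakers=2, max_speakers=5))
# result = pipe("sample_diarization_japanese.mp3", **speaker_kwargs(num_speakers=3))
```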

Adding silence before and after the audio sometimes improves the transcription quality:

-     result = pipe("sample_diarization_japanese.mp3")
+     result = pipe("sample_diarization_japanese.mp3", add_silence_end=0.5, add_silence_start=0.5)  # add 0.5 s of silence before and after the audio
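Conceptually, this padding amounts to adding zero-valued samples at both ends of the waveform. The sketch below illustrates the idea on a plain list of mono samples; it is not the pipeline's internal implementation, and `pad_with_silence` is a hypothetical name.

```python
def pad_with_silence(samples, sampling_rate, start_s=0.5, end_s=0.5):
    """Return a copy of mono `samples` with zero-valued silence on both sides."""
    head = [0.0] * int(round(start_s * sampling_rate))
    tail = [0.0] * int(round(end_s * sampling_rate))
    return head + list(samples) + tail


one_second = [0.1] * 16000  # 1 s of audio at 16 kHz, the rate Whisper models expect
padded = pad_with_silence(one_second, sampling_rate=16000)
print(len(padded) / 16000)  # duration in seconds after padding
```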

Flash Attention 2

We recommend using Flash-Attention 2 if your GPU allows for it. To do so, you first need to install Flash Attention:

pip install flash-attn --no-build-isolation

Then pass attn_implementation="flash_attention_2" through model_kwargs, which the pipeline forwards to from_pretrained:

- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
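To avoid editing the script by hand on each machine, the choice can be made at run time. The following is a sketch under the assumption that an importable `flash_attn` package means Flash Attention 2 is usable; it does not verify that the GPU itself supports it, and `attention_kwargs` is a hypothetical helper.

```python
import importlib.util


def attention_kwargs(cuda_available):
    """Pick model_kwargs: flash_attention_2 if installed, else sdpa; empty on CPU."""
    if not cuda_available:
        return {}
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return {"attn_implementation": "flash_attention_2" if has_flash else "sdpa"}


print(attention_kwargs(cuda_available=False))
# model_kwargs = attention_kwargs(torch.cuda.is_available())
```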

Acknowledgements

OpenAI for the Whisper model.
Hugging Face 🤗 Transformers for the model integration.
Hugging Face 🤗 for the Distil-Whisper codebase.
Reazon Human Interaction Lab for the ReazonSpeech dataset.
