Anime-Whisper Open-Source Japanese Speech Recognition Model - Accurately Identify Lines from Japanese Anime Performances

Anime Whisper

Developed by litagin

A Japanese speech recognition model specialized in anime-style performance dialogue

Speech Recognition

Transformers

Japanese

Open Source License: MIT

#Anime Voice Recognition #Non-verbal Sound Capture #Script-level Transcription

Downloads 4,873

Release Time: 11/10/2024

Model Overview

Fine-tuned from kotoba-whisper-v2.0, this Japanese ASR model is optimized for anime-style speech, excelling particularly in processing non-verbal sounds and emotional expressions

Model Features

Reduced Hallucination

Significantly decreases erroneous content generation compared to similar models

Non-verbal Sound Recognition

Accurately captures pauses, laughter, shouts, breaths and other non-verbal sounds

Emotional Punctuation Generation

Punctuation naturally follows speech rhythm and emotion, achieving script-level text fluency

Anime Voice Optimization

Exceptionally high accuracy in recognizing anime-style performance dialogue

NSFW Content Processing

Specialized capability to transcribe adult-oriented audio that other models struggle with

Model Capabilities

Japanese speech recognition

Anime-style voice transcription

Non-verbal sound recognition

Emotional text generation

Use Cases

Anime Production

Anime Dubbing Transcription

Convert anime voiceovers into script-formatted text

About 21% lower average CER than whisper-large-v3 on the anime-style benchmark (13.0% vs. 16.5%)

Game Development

Visual Novel Dialogue Transcription

Automatically transcribe dialogue in Galgame content

Average CER (Character Error Rate) of 13.0%

🚀 Anime Whisper 🤗🎤📝

Anime Whisper is a Japanese speech recognition model fine-tuned specifically for anime-style acting voices. It offers high performance in that domain along with features not found in other models.

It is fine-tuned from the [kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) base model on approximately 5,300 hours (3.73 million files) of anime-style voice and script data, such as Galgame_Speech_ASR_16kHz. Although it specializes in anime acting voices, it also shows distinctive behavior and high performance on other kinds of speech.

You can try the demo here: https://huggingface.co/spaces/litagin/anime-whisper-demo

✨ Features

Anime Whisper generally has the following tendencies compared to other models:

Fewer hallucinations: It hallucinates less often during transcription.
Faithful transcription: It transcribes non-linguistic utterances such as stutters, laughter, shouts, and breaths, which other models often skip.
Appropriate punctuation: Punctuation marks such as 。、!?… are placed according to the rhythm and emotion of the speech, producing text that reads like a script.
High accuracy for anime voices: It is particularly accurate on anime-style acting voices.
Lightweight and fast: Because it is based on [kotoba-whisper](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) (a distilled version of [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)), it is lightweight and fast.
NSFW voice transcription: It can transcribe NSFW audio in an appropriate written style, which other models can rarely do.

💻 Usage Examples

Basic Usage

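import torch
from transformers import pipeline

# Generation settings: Japanese output, repetition controls left at neutral values.
generate_kwargs = {
    "language": "Japanese",
    "no_repeat_ngram_size": 0,
    "repetition_penalty": 1.0,
}

# Load the model as a Hugging Face ASR pipeline on the GPU in half precision.
pipe = pipeline(
    "automatic-speech-recognition",
    model="litagin/anime-whisper",
    device="cuda",
    torch_dtype=torch.float16,
    chunk_length_s=30.0,
    batch_size=64,
)

# Transcribe a single audio file and print the recognized text.
audio_path = "test.wav"
result = pipe(audio_path, generate_kwargs=generate_kwargs)
print(result["text"])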

Multiple files inference: To transcribe several files at once, pass a list of file paths to pipe.
Suppressing hallucinations: If repeated-text hallucinations are noticeable, set no_repeat_ngram_size to an integer of about 5-10, or set repetition_penalty to a value greater than 1.
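
The two notes above can be combined into a small helper. This is a minimal sketch, not part of the original model card: the function names `build_generate_kwargs` and `transcribe_files` are hypothetical, and `pipe` is assumed to be the pipeline object created in the basic example, which returns one `{"text": ...}` dictionary per input file when given a list.

```python
from typing import Any, Callable, Dict, List, Sequence


def build_generate_kwargs(
    no_repeat_ngram_size: int = 0,
    repetition_penalty: float = 1.0,
) -> Dict[str, Any]:
    """Return generation settings; the defaults match the basic example."""
    return {
        "language": "Japanese",
        "no_repeat_ngram_size": no_repeat_ngram_size,
        "repetition_penalty": repetition_penalty,
    }


def transcribe_files(
    pipe: Callable[..., Any],
    audio_paths: Sequence[str],
    **overrides: Any,
) -> List[str]:
    """Transcribe several files in one call and return the texts in input order."""
    generate_kwargs = build_generate_kwargs(**overrides)
    # Passing a list of paths lets the pipeline batch the files internally.
    results = pipe(list(audio_paths), generate_kwargs=generate_kwargs)
    return [item["text"] for item in results]
```

For example, `transcribe_files(pipe, ["a.wav", "b.wav"], no_repeat_ngram_size=5)` would transcribe two (hypothetical) files while blocking any 5-gram from repeating.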

📚 Documentation

Evaluation 📊

Detailed evaluation reports, observation reports, and evaluation code will be published on the [GitHub repository](https://github.com/litagin02/anime-whisper).

CER (Character Error Rate)

Evaluation data: The models were evaluated on 5 personal novel games (approximately 75k files in total) from the same anime-style dialogue domain as the training data, none of which appear in the training data.
Generation parameters: All transcriptions were generated with no_repeat_ngram_size = 5 to suppress the repeated hallucinations seen in OpenAI's Whisper series.
CER calculation: CER is computed on appropriately normalized text.
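
The exact normalization used for this evaluation is not given here, so the following is only an illustrative sketch of how a per-utterance CER can be computed. The `normalize` step (NFKC folding, then dropping everything except letters and digits) is an assumption, not the author's procedure, and the function names are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: fold width variants, then drop punctuation and spaces."""
    folded = unicodedata.normalize("NFKC", text)
    # Keep only characters whose Unicode category is a letter (L*) or a number (N*).
    return "".join(ch for ch in folded if unicodedata.category(ch)[0] in ("L", "N"))


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance, computed one row of the DP table at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)
```

With this normalization, a full-width versus half-width difference or a differing punctuation mark does not count as an error, while a hiragana versus katakana difference (ぼく vs. ボク) still does.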

Figure (figs/cer_ngram5.png): CER by model, generated with no_repeat_ngram_size = 5.

Model Name	game1	game2	game3	game4	game5	avg
[openai/whisper-large](https://huggingface.co/openai/whisper-large)	15.11	20.24	14.89	17.95	19.37	17.5
[openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2)	15.11	20.12	14.83	17.65	18.59	17.3
[openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)	14.60	18.66	14.43	17.29	17.74	16.5
[openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)	15.18	19.24	14.43	17.38	18.15	16.9
[reazon-research/reazonspeech-nemo-v2](https://huggingface.co/reazon-research/reazonspeech-nemo-v2)	23.92	25.08	20.29	25.91	22.71	23.6
[nvidia/parakeet-tdt_ctc-0.6b-ja](https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja)	17.67	20.44	15.33	19.60	19.86	18.6
[kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)	16.62	21.54	16.42	19.83	20.01	18.9
[kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0)	16.38	21.51	16.51	19.69	20.04	18.8
Anime Whisper	11.32	16.52	11.16	12.78	13.23	13.0
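
To put the table in relative terms, the short calculation below recomputes the averages from the per-game scores and derives the relative CER reduction against two baselines. The numbers are copied from the table; the helper names are illustrative only.

```python
# Per-game CER (%) copied from the table above.
scores = {
    "openai/whisper-large-v3": [14.60, 18.66, 14.43, 17.29, 17.74],
    "kotoba-tech/kotoba-whisper-v2.0": [16.38, 21.51, 16.51, 19.69, 20.04],
    "Anime Whisper": [11.32, 16.52, 11.16, 12.78, 13.23],
}


def mean(values):
    """Arithmetic mean of the per-game scores."""
    return sum(values) / len(values)


def relative_reduction(baseline, improved):
    """Percentage by which `improved` lowers the error rate of `baseline`."""
    return (baseline - improved) / baseline * 100


averages = {name: mean(values) for name, values in scores.items()}

for name in ("openai/whisper-large-v3", "kotoba-tech/kotoba-whisper-v2.0"):
    drop = relative_reduction(averages[name], averages["Anime Whisper"])
    print(f"vs {name}: {drop:.1f}% lower CER")
```

On these figures, Anime Whisper's average CER is roughly 21% lower than whisper-large-v3 and roughly 31% lower than its own base model, kotoba-whisper-v2.0.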

Bias and Other Issues 🚨

Proper nouns: Proper nouns such as character names from the visual novels in the training data are often transcribed with the kanji used in those games.
Specific words: Some words in the dataset may be transcribed differently from the norm (e.g., からだ → 身体), as may proper nouns.
Normalization effects: Because the dataset was normalized, the following rarely appear in the output:
- Consecutive vowels or long vowels: ああああーーーー
- Consecutive exclamation marks: こらーっ!!!! なにそれ!?!?!?!?
- Consecutive ellipses: …… (usually a single … is output rather than the doubled ……, which is the standard Japanese notation).
Numbers, letters, and exclamation marks: Digits, Latin letters, and exclamation marks are transcribed as half-width characters.
Ending punctuation: The sentence-final 。 is almost always omitted.
Vulgar language: Transcriptions of some vulgar words may contain masking characters such as ○.
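
When comparing your own scripts against the model's output, it can help to rewrite the script text in the same conventions first. The sketch below is an assumption about what such a step could look like, based only on the tendencies listed above; it is not the normalization used to build the dataset, and `to_model_style` is a hypothetical name. Runs of vowels or long-vowel marks are deliberately left untouched.

```python
import re

# Full-width digits, Latin letters, and the marks ！ and ？ sit exactly
# 0xFEE0 code points above their half-width (ASCII) counterparts.
_FULLWIDTH = (
    [chr(c) for c in range(0xFF10, 0xFF1A)]    # ０-９
    + [chr(c) for c in range(0xFF21, 0xFF3B)]  # Ａ-Ｚ
    + [chr(c) for c in range(0xFF41, 0xFF5B)]  # ａ-ｚ
    + ["！", "？"]
)
_TO_HALFWIDTH = {ord(ch): ord(ch) - 0xFEE0 for ch in _FULLWIDTH}


def to_model_style(text: str) -> str:
    """Rewrite script text using the output conventions described above."""
    text = text.translate(_TO_HALFWIDTH)
    # Collapse doubled (or longer) ellipses into a single …
    text = re.sub(r"…+", "…", text)
    # Collapse repeated marks: "!!!!" becomes "!", "!?!?!?" becomes "!?"
    text = re.sub(r"([!?]{1,2}?)\1+", r"\1", text)
    # Drop a sentence-final 。
    text = re.sub(r"。$", "", text)
    return text
```

For instance, `to_model_style("ごごご５０キロもないです私ー！")` yields the same string that Anime Whisper produces for that line in the examples table.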

Examples 👀

This is a comparison of transcriptions of dialogue from a novel game that is not included in the training data (also generated with no_repeat_ngram_size = 5).

Anime Whisper generally performs as well as whisper-large-v3. The examples below highlight cases where it differs markedly from the other models, especially on non-linguistic utterances and emotional voices.

Ground Truth Text	Anime Whisper	whisper-large-v3	kotoba-whisper-v2.0	reazonspeech-nemo
あわわわっ！わわわわっ！	はわわっ、わわわわっ…!	ああああああああああ	うわうわ	うわ!
そっ、そっか……。………。……そうなんだ。	そっ…そっか…そうなんだ…	そっか…そうなんだ…	そっか…そうなんだ	そっそっかあっそうなんだ。
たぶん、ぼくが勝つ、はず	たぶん、ボクが勝つ、はず	多分、僕が勝つはず。	多分僕が勝つはず	僕が勝つはず。
げ、げほっ……なんだこいつ！	げほっ、げほっ…なんだ、こいつ…	なんだ、こいつ…	なんだこいつ	フッ何だこいつ。
はっ、はい。そうです。……その、えっと。へっ、変だったでしょうか？	は、はい、そうです…その、えと…へ、変だったでしょうか…?	あ、はい、そうです。そ、えっと、へ、変だったでしょうか。	はいそうですそういと変だったでしょうか	あっはいそうですうすえっとへ変だったでしょうか?
ぶぶぶぶ豚クソがァァァ！待てコルァァァ！	ぶぶぶぶぶ、ぶたくそがー!待てごらぁぁ!	待てこらー	待てこそか	待てこら!
地面が揺れるとかありえ……ぎゃっ！	地面が揺れるとかありえ…ひゃっ!?	地面が揺れるとかありえ?	地面が揺れるとかありえ	やっ!
きゃっほう！い、いたっ、いただきまーす！	きゃっほう!い、いた、いただきまーす!	キャッホー!い、いただきます!	キャホー!いただきます!	いいたいただきます!
……っ、はぁ……わ、わたし、今日は……	んっ、はぁ…わ、私、今日は…	私、今日は…	私今日は	えっと私今日。
……ぷふっ、ンッ。かっ、かっ、かっ……ぷふっ。かっ。んふふっ。かっ、価値観	うふふっ…か、かはっ…ぷっ…はぁっ…か、価値観っ…	価値観!	価値観	ハッかちかん!
か、痒くもねぇ……こんなんんん……！	か、痒くもねえ…こんな、んんっ…!	か、回復もねぇ、こんな、うぬぅ	かかゆくもねえこんな	かゆくもねえこんなうう。
ひゃっ！や、やだ、くすぐった……や、やっ、あは、あははっ	ひゃうっ!やっ、やだっ…くすぐったっ…やっ、やっ、はんっ、あははっ!	やだ!すぐだ!	やだ	やっほ!
ふえぇ、急に止まらないでよう……	ふえぇ、急に止まらないでよぉ	おへぇ、急に止まらないでよ	おへえ急に止まらないでよ	急に止まらないでよ。
ごごご５０キロもないです私ー！	ごごご50キロもないです私ー!	50キロもないです私!	550キロもないです私	50キロもないですわたし!
いいい、すびばぜん、すびばぜーんっ	いいずびばぜんずびばぜーん!	いいいい! ズビバル10! ズビブル10!	いいズビバーテン!	すみませんすみません。
間抜けか貴様ァァァ！	間抜けか貴様ぁぁっ!	マヌケカキ様!	まぬけかきさま	抜けか貴様!
ぷ、くく……ひっ、ひいっ……	くっ…くくくっ…ぷっ…くくっ…	ご視聴ありがとうございました	フッ	フフフフ。フフフフフ。
キミは……。あっ、はっ……。最初から……あんっ、あっ、容赦がないな	君はぁ…はぁっ、はぁっ…最初から…あんっ、あっ、容赦がないなぁ…	君は……最初から容赦がないな	君は最初からあんあ容赦がないな	君は最初からうっうん容赦がないなあ。
望んでるわけ……。のっ、のっ、のっ……望んでるんです。世界が終わればいいって……強く、強くっ。はぁっ、はぁっ	望んでるわけ…の、の、の…望んでるんです…世界が終わればいいって、強く、強く…はぁっ	望んでるわけ…望んでるんです…世界が終わればいいって…強く…強く…	望んでるわけ…ののぞんでるんです世界が終わればいいって強く強く	ん?望んでるんです。世界が終わればいいって強く強く。

NSFW Examples 🫣

Please note that these examples contain adult-oriented expressions.

Panting Sounds

Ground Truth Text	Anime Whisper	whisper-large-v3	kotoba-whisper-v2.0	reazonspeech-nemo
ひっ、あっ！あぅっ、ああぁぁあぁぁぁぁぁっ！はっ、はっ、はっ、はっ、ひぁっ！	んぁっ、あっ、あっ、ああぁぁっ!あっ、はぁっ、はぁっ…んっ、ふぁああっ!	ご視聴ありがとうございました	アハハハ	うわ!
ち、ちがっ……んっ、あぁぁ……気持ちいい、わけが……あぁっ、やぁっ、待てと……んんっ、はぁ……あふぅっ……	ち、ちがっ…はぁっ、はぁっ、気持ちいい、わけがっ…あっ、やぁっ、待てとっ…んくっ、はぁ、はぁっ…	ち、ちが…気持ちいいわけが…待てと…	ちちが気持ちいいわけが待てと	ち違うはあ気持ちいいわけが待てとあっ。
あんっ！あっ、あっ……そっ、それ……あっ、はぁはぁはぁ。ンンンンッ！ぴっ、ぴりぴり、ってして……。あんっ！はぁはぁはぁ、きっ、きもち……いいです！	ふぁんっ!あっ、あぁっ!そっ、それっ…あっ、はぁっ、はぁっ…んんっ!ぴ、ぴりぴりって、して…ひぁっ!はっ、はぁ、はぁっ…!き、気持ち、いいですっ…!	それ…フィリフィリでした…気持ちいいです…	それフィリフィリフリでした	けきもしいいです!
その調子って……んんっ、こんなの、あぁっ、んっあぁん……んんっ、しょっ……あぁっ……だめ……んっ、あぁっ……	その調子って…んんっ、こんなの…はぁっ、んんっ…んっ、しょっ…はぁっ…ダメ…んっ、あっ…	その調子って…こんなの…ダメ…	その調子ってこんなの	その調子ってううんこんなのダメうん
はぁっ、あっ……んっ……くぅ、あぁっ……やぁ……それは、ん、はぁ……だめ、だ……あっ、んんっ、ふ……ひぃうっ！やめっ……ま、待ってくれ……あぁん……！	はぁっ、あっ、くぅぅっ…あっ、やっ、それはっ…はぁっ、ダメだっ…んんっ…ひぅぅんっ!やめっ…ま、待ってくれっ…あぁぁっ!	それは、ダメだ、やめ、待ってくれ	それはそれはダメだやめやめまってくれ	やめま待ってくれうう。
あは、はっ……んっ、くうっ……なん、だろこれ……気持ちいい、かも……んっ、あ、ああっ、はあっ、ふあぁ……やっ、くぅん	はぁっ、はぁっ、んっ…くぅっ…なん、だろこれ…気持ちいい、かも…んんっ、あっ、ああっ…ふぁぁっ、はやっ…んんっ…	あ、あ、あ、んっ、う、なんだろこれ、気持ちいいかも、あ、あ,あ、あ、う、うんっ	なんだろうこれ気持ちいいくも	うっなんだろうこれ。はあ気持ちいいかも。うわ!ううん。
だめ、センパイ……そんなにおち○ちん挿れたら、だめだぁっ……あっ、あぁぁぁっ……！	だめ、先輩…んっ、そんなに、おち○ちん挿れたら、だめ…はぁ、あぁぁ…っ	ダメ、先輩…そんなに陥れたらダメ…	ダメ先輩そんなに落ち入れたらダメな	ダメ先輩そんなに気入れたらダメだ。
やぁぁっ、こ、こらっ、おち○ちん、そんなに、びくびくさせないのっ……あぁっ、ひぃあぁぁっ……はぁっ、あぁっ、あぁぁぁんっ！！	ひゃんっ!こ、こらっ、おち○ちん、そんなにビクビクさせないのっ!ひぁっ、あっ、はぁっ、はぁっ!	いや、こ、こら、おじっちそんなにビクビクさせないの?いや、なにやろ	ここらじっちそんなにビクビクさせないの	もう全然そんなにビクビクさせないのうん!
やっ……あっ。……お兄ちゃんの舌が、あっ、中で、やあっ。……そんなりぐりぐりしちゃ、あっ、ふあっ。うくぅぅっ、ああっ、やあっ。	やっ、あっ、お兄ちゃんの舌が、中で…やぁっ、そんなにぐりぐりしちゃ…あっ、あっ、んっ、ふあぁっ、やぁぁっ…!	にゃー!お兄ちゃんの舌がお腹で…にゃー!そんなにグリグリした…にゃー!!	お兄ちゃんの下がお腹でニャーそんなにグリグリした	お兄ちゃんの舌がおなかでよそんなにグイグイさあぐっにゃん!
はっ、激しく……して。ンッ。あっ！はあっ、はあっ……わっ、私を……一気に……ンッ。イッ、イかせちゃってくださいッ！	は、激しく、して…んっ、あぅっ…私を、一気に…い、イかせちゃってください…!	あ、ゲンシ君、ステッ、アッ、アッ…私を一気に、行かせてあげください!	あげんしくして私は一気に行かせてください	激しく私も一輝行かせちゃってください!

Sucking Sounds

Ground Truth Text	Anime Whisper	whisper-large-v3	kotoba-whisper-v2.0	reazonspeech-nemo
れろっ、んっ……れろ、ちゅ、んちゅ	れろっ、れろっ、ちゅううっ	ううううう	わいしゅう	シュッ!
はっ、はい！んっ、れろっ、れろっ……あっ、れろっ	は、はい…っ、れろぉ…っ、れりゅっ、れりょっ…	わ、はぁい、わ、う、う、わ、へ、へ、へ	わあはい	はい。
れろっ、れろ……むふふ、ここの線なぞると反応いいね、んちゅ、ちゅうっ……ここいい？どう？	れろれろれろっ…んっ、ふふっ、ここの線なぞると反応いいね…ちゅっ、ちゅっ…ここいい?どう?	ここの線なぞると反応いいねここいい?どう?	ここの線なぞると反応いいねうんふうに	へへへここの線なぞると反応いいねここいい?どう?
あぁむ……ちゅ……れぇろれろ……ん……ん……ちゅ……れぇろ……んん……ちゅぅ……ちゅぱっ……れぇろれろ……	あむちゅっ…れろれろっ…んちゅっ、れろっ…ちゅぱちゅぷっ…れろぉっ…	アムー…	あん	おへん。
んちゅっ……れろれろ……れぇろ、ちゅっ、んれぇろれろ……ちゅっ、ちゅぱっ……	んちゅっ、れろれろっ、ちゅぱちゅぅっ…れろれろ、ちゅっ…ちゅぷっ…	お疲れ様でした	おくぬかんぱい	う。
ん……イク……ちゅるぅ……イッちゃう……ん……あぁっ、ちゅるるっ、イク……もう……らめぇ……んあぁむ……イク……イクぅぅ……	もう、イクっ…イッちゃう…んっ、んっ、じゅるるっ、イクっ、らめっ…んぁっ、イクッ、イクッ!	おーまいごーおまいごーまいごやめまいごよこー	お前	ママペイ君!
れぇろ…………んちゅ……れろれろ……ん……ちゅ……れろれろ……んれぇろれろ……ちゅ……	れろぉ…んちゅ、れろれろ…ちゅぱ…れろ、れろれちゅ…	エル…ラ…ル…ア…エル…ル…ツ…ン…エ…エル…ツ…ル…ア...エル…ル...プ…	えぇぇ
はぷっ、ちゅぷ、んん……はやく、おっきくして……ちんぽ……れろっ、ちゅ、ぴちゅ、ちゅぱっ……はやく勃起ちんぽちょうだい、勃起ちんぽ私にちょうだい	じゅぷっ、じゅぼっ!早くおっきくしてっ、ちんぽっ!んじゅるるるるるっ!はやくっ、はやく勃起ちんぽちょうらいっ、勃起ち○ぽあたしにちょうだいっ!	早く起きこして!チンポン!早く、早くポッキチンポンちょうだい! ポッキチンパン私にちょうだい!!	早く大きくしてチンポン早くポッキ全部全部私にちょうだい	早くおっきい子して。チープ!ん?早く早くボケ全部ちょうだい。ボケ全部私にちょうだい!
そっ、それじゃ……。あっ、はっ……がっ、がんばるぞ。ンッ！ああああっ！あっ、わっ、ボクも……んちっ、んむっ、んむっ、んんっ、むむっ。	そ、それじゃあ…はぁ、はぁ、が、頑張るぞ…んっ、あっ、あっ、も、ボクも…れろ、ちゅ、ちゅぱ、ちゅるるっ	それじゃあ、頑張るぞ!	それじゃあ頑張るぞ	そそれじゃあううがんばるぞ。
はむ、ちゅ、んんっ、れる……。んむっ、ふーっ、ふーっ。ここなんへ、ろうかひら？ちゅっ……じゅっ。……じゅるる。んっ、。	はむ…ちゅ、んんっ…ん、はむ…ここなんへ、どうかしら…ちゅっ、ちゅるるっ…	ここな…廊下平…	ここな廊下平	ん。ん?ここな?どうかしら。んっ。

Training Procedures 📚

Detailed training procedures, hyperparameters, and training code will be published on [GitHub](https://github.com/litagin02/anime-whisper) soon.

Data split: The last tar file of the dataset was held out as test data, and the remaining 3,735,363 files were used for training.
Initial training: The Encoder from the base model was frozen and only the Decoder was trained for several epochs.
Full-model training: The Encoder was then unfrozen and the entire model was trained for several epochs.
Model optimization: After training stopped, checkpoints saved at different points in training were averaged (merged) in an attempt to improve performance. Optuna was used to minimize CER on the benchmark data, and the result became the final model.
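
Since the training code has not been published yet, the following is only a rough sketch of two of the ideas above: toggling the encoder between frozen and trainable, and merging checkpoints by weighted averaging. Both function names are hypothetical. `set_encoder_trainable` assumes a PyTorch-style model exposing `get_encoder()`, as Hugging Face Whisper models do. `merge_checkpoints` uses plain Python lists in place of tensors to stay self-contained; the merge weights are what a tuner such as Optuna would search over, and how the author actually parameterized the merge is not stated.

```python
from typing import Dict, List, Sequence


def set_encoder_trainable(model, trainable: bool) -> None:
    """Freeze (False) or unfreeze (True) every encoder parameter."""
    for param in model.get_encoder().parameters():
        param.requires_grad = trainable


def merge_checkpoints(
    checkpoints: Sequence[Dict[str, List[float]]],
    weights: Sequence[float],
) -> Dict[str, List[float]]:
    """Weighted average of checkpoints that share the same parameter names."""
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    merged: Dict[str, List[float]] = {}
    for name in checkpoints[0]:
        # Walk the same parameter across all checkpoints, element by element.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [
            sum(w * v for w, v in zip(weights, column)) / total
            for column in columns
        ]
    return merged
```

In a two-phase run one would call `set_encoder_trainable(model, False)` before the decoder-only epochs and `set_encoder_trainable(model, True)` before full-model training; a tuning loop would then propose `weights`, merge, measure CER on held-out data, and keep the best merge.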

Environment 🖥

Hardware: The model was trained on an H100 NVL (96 GB VRAM) rented from vast.ai, over less than 3 weeks of trial and error (this figure includes early experiments that used whisper-large-v3-turbo as the base model).
Training time: The training run that actually produced this model took approximately 11.2 days on the H100 NVL. However, the checkpoints from the latter part of the run probably overfit and performed poorly on the test data, so they were not used in the final merge.

📄 License

This project is licensed under the MIT license.
