Anime Whisper
Anime Whisper is a Japanese speech recognition model specialized for anime-style acted speech. It is fine-tuned from the [kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) base model on approximately 5,300 hours (3.73 million files) of anime-style voice and script data, such as Galgame_Speech_ASR_16kHz. Although it is specialized for anime-style acted speech, it also performs well on other types of voices and shows characteristics not found in other models.
You can try the demo here: https://huggingface.co/spaces/litagin/anime-whisper-demo
Features
Anime Whisper generally has the following tendencies compared to other models:
- Fewer hallucinations: It hallucinates less during transcription.
- Faithful transcription: It accurately transcribes non-linguistic utterances such as stutters, laughter, shouts, and breaths, which other models often skip.
- Appropriate punctuation: Punctuation marks such as `、。!?…` are placed according to the rhythm and emotion of the speech, producing a natural-looking script.
- High accuracy for anime voices: It is particularly accurate on anime-style acted speech.
- Lightweight and fast: It is based on [kotoba-whisper](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) (a distilled model of [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)), so it is lightweight and fast.
- NSFW voice transcription: It can transcribe NSFW voices in an appropriate written style, which is almost impossible for other models.
Usage Examples
Basic Usage
```python
import torch
from transformers import pipeline

# Repetition controls are off by default (0 and 1.0 disable them).
generate_kwargs = {
    "language": "Japanese",
    "no_repeat_ngram_size": 0,
    "repetition_penalty": 1.0,
}

# Load the model as a chunked ASR pipeline on the GPU in half precision.
pipe = pipeline(
    "automatic-speech-recognition",
    model="litagin/anime-whisper",
    device="cuda",
    torch_dtype=torch.float16,
    chunk_length_s=30.0,
    batch_size=64,
)

audio_path = "test.wav"
result = pipe(audio_path, generate_kwargs=generate_kwargs)
print(result["text"])
```
- Multiple files inference: To transcribe multiple files at once, pass a list of file paths to `pipe`.
- Suppressing hallucinations: If repeated hallucinations are noticeable, set `no_repeat_ngram_size` (an `int`) to around 5-10, or set `repetition_penalty` to a value greater than 1, to suppress them.
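The two notes above can be combined in a small helper. This is a minimal sketch, not part of the model's API: the names `build_generate_kwargs` and `transcribe_files` are illustrative assumptions. It takes any pipeline-like callable (such as the `pipe` object from Basic Usage) and assumes that a list of paths yields one `{"text": ...}` dict per file, which is how the `transformers` ASR pipeline behaves for list inputs.

```python
from typing import Callable, Dict, List


def build_generate_kwargs(no_repeat_ngram_size: int = 0,
                          repetition_penalty: float = 1.0) -> Dict[str, object]:
    """Generation settings; the defaults leave repetition controls off."""
    return {
        "language": "Japanese",
        "no_repeat_ngram_size": no_repeat_ngram_size,
        "repetition_penalty": repetition_penalty,
    }


def transcribe_files(pipe: Callable, paths: List[str],
                     **repetition_controls) -> Dict[str, str]:
    """Transcribe several files in one call and map each path to its text.

    `pipe` is assumed to accept a list of paths plus `generate_kwargs`
    and to return one {"text": ...} dict per input, in the same order.
    """
    kwargs = build_generate_kwargs(**repetition_controls)
    results = pipe(paths, generate_kwargs=kwargs)
    return {path: result["text"] for path, result in zip(paths, results)}
```

For example, `transcribe_files(pipe, ["a.wav", "b.wav"], no_repeat_ngram_size=5)` would transcribe both files with n-gram repetition blocking enabled.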
Documentation
Evaluation
Detailed evaluation reports, observation reports, and evaluation code will be published on the [GitHub repository](https://github.com/litagin02/anime-whisper).
CER (Character Error Rate)
- Evaluation data: Evaluated on 5 personal novel games (approximately 75k files in total) that belong to the same anime-style dialogue domain as the training data but are not included in it.
- Generation parameters: Generated with `no_repeat_ngram_size=5` to suppress the repeated hallucinations seen in OpenAI's Whisper series.
- CER calculation: CER is calculated on appropriately normalized results.
Model Name | game1 | game2 | game3 | game4 | game5 | avg |
---|---|---|---|---|---|---|
[openai/whisper-large](https://huggingface.co/openai/whisper-large) | 15.11 | 20.24 | 14.89 | 17.95 | 19.37 | 17.5 |
[openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 15.11 | 20.12 | 14.83 | 17.65 | 18.59 | 17.3 |
[openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 14.60 | 18.66 | 14.43 | 17.29 | 17.74 | 16.5 |
[openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | 15.18 | 19.24 | 14.43 | 17.38 | 18.15 | 16.9 |
[reazon-research/reazonspeech-nemo-v2](https://huggingface.co/reazon-research/reazonspeech-nemo-v2) | 23.92 | 25.08 | 20.29 | 25.91 | 22.71 | 23.6 |
[nvidia/parakeet-tdt_ctc-0.6b-ja](https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja) | 17.67 | 20.44 | 15.33 | 19.60 | 19.86 | 18.6 |
[kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 16.62 | 21.54 | 16.42 | 19.83 | 20.01 | 18.9 |
[kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) | 16.38 | 21.51 | 16.51 | 19.69 | 20.04 | 18.8 |
Anime Whisper | 11.32 | 16.52 | 11.16 | 12.78 | 13.23 | 13.0 |
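For readers who want to reproduce this kind of measurement on their own data, the sketch below shows the standard definition of CER: the character-level edit (Levenshtein) distance between reference and hypothesis, divided by the reference length, expressed here as a percentage. The function names are illustrative, and this is not the author's evaluation code; in particular, the normalization step and the way per-file scores are aggregated into per-game scores are not specified in this document.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance using a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

The `avg` column in the table is consistent with the unweighted mean of the five per-game values; for Anime Whisper, (11.32 + 16.52 + 11.16 + 12.78 + 13.23) / 5 ≈ 13.0.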
Bias and Other Issues
- Proper nouns: Proper nouns such as personal names that appear in the visual novels of the training data are often transcribed with the kanji used in those games.
- Specific words: Some specific words may be transcribed in a non-standard form that follows the dataset (e.g., `からだ` → `身体`, and other proper nouns).
- Normalization effects: Due to dataset normalization, the following rarely appear in the output:
  - Consecutive vowels or long vowels: `ใใใใใผใผใผใผ`
  - Consecutive exclamation marks: `ใใใผใฃ!!!!`, `ใชใซใใ!?!?!?!?`
  - Consecutive ellipses: `……` (usually only a single `…` is output instead of the two `……` that are correct in Japanese notation).
- Numbers, Latin letters, and exclamation marks: These are transcribed as half-width characters.
- Ending punctuation: The sentence-final `。` is almost always omitted.
- Vulgar language: Transcriptions of some vulgar words may contain censoring characters such as `โ`.
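When comparing the model's output with your own reference text, it can help to normalize the reference in the same spirit as the effects listed above. The sketch below is an assumption-laden illustration, not the dataset's actual normalization: the function name `normalize_transcript` and every rule in it (collapsing runs of `ー`, of identical `!` or `?`, and of `…`, converting full-width ASCII variants to half-width, and dropping a trailing `。`) are guesses modeled on those notes.

```python
import re

# Map full-width ASCII variants (U+FF01..U+FF5E) to their half-width forms.
FULLWIDTH_TO_HALFWIDTH = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}


def normalize_transcript(text: str) -> str:
    """Apply assumed normalization rules to a Japanese transcript."""
    text = text.translate(FULLWIDTH_TO_HALFWIDTH)  # full-width digits, letters, ! and ? -> half-width
    text = re.sub(r"ー+", "ー", text)               # collapse repeated long-vowel marks
    text = re.sub(r"([!?])\1+", r"\1", text)       # collapse runs of the same mark, e.g. "!!!!" -> "!"
    text = re.sub(r"…+", "…", text)                # collapse repeated ellipses
    return text.rstrip("。")                        # drop sentence-final full stops
```

An explicit translation table is used instead of `unicodedata.normalize("NFKC", ...)` because NFKC would also expand `…` into three separate periods.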
Examples
The following compares transcriptions of dialogue from a novel game that is not included in the training data (also generated with `no_repeat_ngram_size=5`).
The results show that Anime Whisper generally performs as well as whisper-large-v3. The examples below highlight cases where it differs significantly from other models, especially for non-linguistic utterances and emotional voices.
Ground Truth Text | Anime Whisper | whisper-large-v3 | kotoba-whisper-v2.0 | reazonspeech-nemo |
---|---|---|---|---|
ใใใใใฃ๏ผใใใใใฃ๏ผ | ใฏใใใฃใใใใใใฃโฆ! | ใใใใใใใใใใ | ใใใใ | ใใ! |
ใใฃใใใฃใโฆโฆใโฆโฆโฆใโฆโฆใใใชใใ ใ | ใใฃโฆใใฃใโฆใใใชใใ โฆ | ใใฃใโฆใใใชใใ โฆ | ใใฃใโฆใใใชใใ | ใใฃใใฃใใใฃใใใชใใ ใ |
ใใถใใใผใใๅใคใใฏใ | ใใถใใใใฏใๅใคใใฏใ | ๅคๅใๅใๅใคใฏใใ | ๅคๅๅใๅใคใฏใ | ๅใๅใคใฏใใ |
ใใใใปใฃโฆโฆใชใใ ใใใค๏ผ | ใใปใฃใใใปใฃโฆใชใใ ใใใใคโฆ | ใชใใ ใใใใคโฆ | ใชใใ ใใใค | ใใไฝใ ใใใคใ |
ใฏใฃใใฏใใใใใงใใโฆโฆใใฎใใใฃใจใใธใฃใๅคใ ใฃใใงใใใใ๏ผ | ใฏใใฏใใใใใงใโฆใใฎใใใจโฆใธใๅคใ ใฃใใงใใใใโฆ? | ใใใฏใใใใใงใใใใใใฃใจใใธใๅคใ ใฃใใงใใใใใ | ใฏใใใใงใใใใใจๅคใ ใฃใใงใใใใ | ใใฃใฏใใใใงใใใใใฃใจใธๅคใ ใฃใใงใใใใ? |
ใถใถใถใถ่ฑใฏใฝใใกใกใก๏ผๅพ ใฆใณใซใกใกใก๏ผ | ใถใถใถใถใถใใถใใใใใผ!ๅพ ใฆใใใใ! | ๅพ ใฆใใใผ | ๅพ ใฆใใใ | ๅพ ใฆใใ! |
ๅฐ้ขใๆบใใใจใใใใโฆโฆใใใฃ๏ผ | ๅฐ้ขใๆบใใใจใใใใโฆใฒใใฃ!? | ๅฐ้ขใๆบใใใจใใใใ? | ๅฐ้ขใๆบใใใจใใใใ | ใใฃ! |
ใใใฃใปใ๏ผใใใใใฃใใใใ ใใพใผใ๏ผ | ใใใฃใปใ!ใใใใใใใใ ใใพใผใ! | ใญใฃใใใผ!ใใใใใ ใใพใ! | ใญใฃใใผ!ใใใ ใใพใ! | ใใใใใใ ใใพใ! |
โฆโฆใฃใใฏใโฆโฆใใใใใใไปๆฅใฏโฆโฆ | ใใฃใใฏใโฆใใ็งใไปๆฅใฏโฆ | ็งใไปๆฅใฏโฆ | ็งไปๆฅใฏ | ใใฃใจ็งไปๆฅใ |
โฆโฆใทใตใฃใใณใใใใฃใใใฃใใใฃโฆโฆใทใตใฃใใใฃใใใตใตใฃใใใฃใไพกๅค่ฆณ | ใใตใตใฃโฆใใใใฏใฃโฆใทใฃโฆใฏใใฃโฆใใไพกๅค่ฆณใฃโฆ | ไพกๅค่ฆณ! | ไพกๅค่ฆณ | ใใใใกใใ! |
ใใ็ใใใญใโฆโฆใใใชใใใโฆโฆ๏ผ | ใใ็ใใใญใโฆใใใชใใใใฃโฆ! | ใใๅๅพฉใใญใใใใใชใใใฌใ | ใใใใใใญใใใใช | ใใใใใญใใใใชใใใ |
ใฒใใฃ๏ผใใใใ ใใใใใฃใโฆโฆใใใใฃใใใฏใใใฏใฏใฃ | ใฒใใใฃ!ใใฃใใใ ใฃโฆใใใใฃใใฃโฆใใฃใใใฃใใฏใใฃใใใฏใฏใฃ! | ใใ !ใใใ ! | ใใ | ใใฃใป! |
ใตใใใๆฅใซๆญขใพใใชใใงใใโฆโฆ | ใตใใใๆฅใซๆญขใพใใชใใงใใ | ใใธใใๆฅใซๆญขใพใใชใใงใ | ใใธใๆฅใซๆญขใพใใชใใงใ | ๆฅใซๆญขใพใใชใใงใใ |
ใใใ๏ผ๏ผใญใญใใชใใงใ็งใผ๏ผ | ใใใ50ใญใญใใชใใงใ็งใผ! | 50ใญใญใใชใใงใ็ง! | 550ใญใญใใชใใงใ็ง | 50ใญใญใใชใใงใใใใ! |
ใใใใใใณใฐใใใใใณใฐใใผใใฃ | ใใใใณใฐใใใใณใฐใใผใ! | ใใใใ! ใบใใใซ10! ใบใใใซ10! | ใใใบใใใผใใณ! | ใใฟใพใใใใฟใพใใใ |
้ๆใใ่ฒดๆงใกใกใก๏ผ | ้ๆใใ่ฒดๆงใใใฃ! | ใใใฑใซใญๆง! | ใพใฌใใใใใพ | ๆใใ่ฒดๆง! |
ใทใใใโฆโฆใฒใฃใใฒใใฃโฆโฆ | ใใฃโฆใใใใฃโฆใทใฃโฆใใใฃโฆ | ใ่ฆ่ดใใใใจใใใใใพใใ | ใใ | ใใใใใใใใใใใ |
ใญใใฏโฆโฆใใใฃใใฏใฃโฆโฆใๆๅใใโฆโฆใใใฃใใใฃใๅฎน่ตฆใใชใใช | ๅใฏใโฆใฏใใฃใใฏใใฃโฆๆๅใใโฆใใใฃใใใฃใๅฎน่ตฆใใชใใชใโฆ | ๅใฏโฆโฆๆๅใใๅฎน่ตฆใใชใใช | ๅใฏๆๅใใใใใๅฎน่ตฆใใชใใช | ๅใฏๆๅใใใใฃใใๅฎน่ตฆใใชใใชใใ |
ๆใใงใใใโฆโฆใใฎใฃใใฎใฃใใฎใฃโฆโฆๆใใงใใใงใใไธ็ใ็ตใใใฐใใใฃใฆโฆโฆๅผทใใๅผทใใฃใใฏใใฃใใฏใใฃ | ๆใใงใใใโฆใฎใใฎใใฎโฆๆใใงใใใงใโฆไธ็ใ็ตใใใฐใใใฃใฆใๅผทใใๅผทใโฆใฏใใฃ | ๆใใงใใใโฆๆใใงใใใงใโฆไธ็ใ็ตใใใฐใใใฃใฆโฆๅผทใโฆๅผทใโฆ | ๆใใงใใใโฆใฎใฎใใใงใใใงใไธ็ใ็ตใใใฐใใใฃใฆๅผทใๅผทใ | ใ?ๆใใงใใใงใใไธ็ใ็ตใใใฐใใใฃใฆๅผทใๅผทใใ |
NSFW Examples
Please note that these examples contain adult-oriented expressions.
Panting Sounds
Ground Truth Text | Anime Whisper | whisper-large-v3 | kotoba-whisper-v2.0 | reazonspeech-nemo |
---|---|---|---|---|
ใฒใฃใใใฃ๏ผใใ ใฃใใใใใใใใใใใใฃ๏ผใฏใฃใใฏใฃใใฏใฃใใฏใฃใใฒใใฃ๏ผ | ใใใฃใใใฃใใใฃใใใใใใฃ!ใใฃใใฏใใฃใใฏใใฃโฆใใฃใใตใใใใฃ! | ใ่ฆ่ดใใใใจใใใใใพใใ | ใขใใใ | ใใ! |
ใกใใกใใฃโฆโฆใใฃใใใใโฆโฆๆฐๆใกใใใใใใโฆโฆใใใฃใใใใฃใๅพ ใฆใจโฆโฆใใใฃใใฏใโฆโฆใใตใ ใฃโฆโฆ | ใกใใกใใฃโฆใฏใใฃใใฏใใฃใๆฐๆใกใใใใใใใฃโฆใใฃใใใใฃใๅพ ใฆใจใฃโฆใใใฃใใฏใใใฏใใฃโฆ | ใกใใกใโฆๆฐๆใกใใใใใโฆๅพ ใฆใจโฆ | ใกใกใๆฐๆใกใใใใใๅพ ใฆใจ | ใก้ใใฏใๆฐๆใกใใใใใๅพ ใฆใจใใฃใ |
ใใใฃ๏ผใใฃใใใฃโฆโฆใใฃใใใโฆโฆใใฃใใฏใใฏใใฏใใใณใณใณใณใ๏ผใดใฃใใดใใดใใใฃใฆใใฆโฆโฆใใใใฃ๏ผใฏใใฏใใฏใใใใฃใใใใกโฆโฆใใใงใ๏ผ | ใตใใใฃ!ใใฃใใใใฃ!ใใฃใใใใฃโฆใใฃใใฏใใฃใใฏใใฃโฆใใใฃ!ใดใใดใใดใใฃใฆใใใฆโฆใฒใใฃ!ใฏใฃใใฏใใใฏใใฃโฆ!ใใๆฐๆใกใใใใงใใฃโฆ! | ใใโฆใใฃใชใใฃใชใงใใโฆๆฐๆใกใใใงใโฆ | ใใใใฃใชใใฃใชใใชใงใใ | ใใใใใใใงใ! |
ใใฎ่ชฟๅญใฃใฆโฆโฆใใใฃใใใใชใฎใใใใฃใใใฃใใใโฆโฆใใใฃใใใใฃโฆโฆใใใฃโฆโฆใ ใโฆโฆใใฃใใใใฃโฆโฆ | ใใฎ่ชฟๅญใฃใฆโฆใใใฃใใใใชใฎโฆใฏใใฃใใใใฃโฆใใฃใใใใฃโฆใฏใใฃโฆใใกโฆใใฃใใใฃโฆ | ใใฎ่ชฟๅญใฃใฆโฆใใใชใฎโฆใใกโฆ | ใใฎ่ชฟๅญใฃใฆใใใชใฎ | ใใฎ่ชฟๅญใฃใฆใใใใใใชใฎใใกใใ |
ใฏใใฃใใใฃโฆโฆใใฃโฆโฆใใ ใใใใฃโฆโฆใใโฆโฆใใใฏใใใใฏใโฆโฆใ ใใใ โฆโฆใใฃใใใใฃใใตโฆโฆใฒใใใฃ๏ผใใใฃโฆโฆใพใๅพ ใฃใฆใใโฆโฆใใใโฆโฆ๏ผ | ใฏใใฃใใใฃใใใ ใ ใฃโฆใใฃใใใฃใใใใฏใฃโฆใฏใใฃใใใกใ ใฃโฆใใใฃโฆใฒใ ใ ใใฃ!ใใใฃโฆใพใๅพ ใฃใฆใใใฃโฆใใใใฃ! | ใใใฏใใใกใ ใใใใๅพ ใฃใฆใใ | ใใใฏใใใฏใใกใ ใใใใใพใฃใฆใใ | ใใใพๅพ ใฃใฆใใใใใ |
ใใฏใใฏใฃโฆโฆใใฃใใใใฃโฆโฆใชใใใ ใใใโฆโฆๆฐๆใกใใใใใโฆโฆใใฃใใใใใใฃใใฏใใฃใใตใใโฆโฆใใฃใใใ ใ | ใฏใใฃใใฏใใฃใใใฃโฆใใ ใฃโฆใชใใใ ใใใโฆๆฐๆใกใใใใใโฆใใใฃใใใฃใใใใฃโฆใตใใใฃใใฏใใฃโฆใใใฃโฆ | ใใใใใใใใฃใใใใชใใ ใใใใๆฐๆใกใใใใใใใใ,ใใใใใใใใใฃ | ใชใใ ใใใใๆฐๆใกใใใใ | ใใฃใชใใ ใใใใใใฏใๆฐๆใกใใใใใใใ!ใใใใ |
ใ ใใใปใณใใคโฆโฆใใใชใซใใกโใกใๆฟใใใใใ ใใ ใใฃโฆโฆใใฃใใใใใใฃโฆโฆ๏ผ | ใ ใใๅ ่ผฉโฆใใฃใใใใชใซใใใกโใกใๆฟใใใใใ ใโฆใฏใใใใใโฆใฃ | ใใกใๅ ่ผฉโฆใใใชใซ้ฅใใใใใกโฆ | ใใกๅ ่ผฉใใใชใซ่ฝใกๅ ฅใใใใใกใช | ใใกๅ ่ผฉใใใชใซๆฐๅ ฅใใใใใกใ ใ |
ใใใใฃใใใใใใฃใใใกโใกใใใใใชใซใใณใใณใใใใชใใฎใฃโฆโฆใใใฃใใฒใใใใใฃโฆโฆใฏใใฃใใใใฃใใใใใใใฃ๏ผ๏ผ | ใฒใใใฃ!ใใใใใฃใใใกโใกใใใใใชใซใใฏใใฏใใใชใใฎใฃ!ใฒใใฃใใใฃใใฏใใฃใใฏใใฃ! | ใใใใใใใใใใใฃใกใใใชใซใใฏใใฏใใใชใใฎ?ใใใใชใซใใ | ใใใใใฃใกใใใชใซใใฏใใฏใใใชใใฎ | ใใๅ จ็ถใใใชใซใใฏใใฏใใใชใใฎใใ! |
ใใฃโฆโฆใใฃใโฆโฆใๅ ใกใใใฎ่ใใใใฃใไธญใงใใใใฃใโฆโฆใใใชใใใใใใใกใใใใฃใใตใใฃใใใใ ใ ใฃใใใใฃใใใใฃใ | ใใฃใใใฃใใๅ ใกใใใฎ่ใใไธญใงโฆใใใฃใใใใชใซใใใใใใกใโฆใใฃใใใฃใใใฃใใตใใใฃใใใใใฃโฆ! | ใซใใผ!ใๅ ใกใใใฎ่ใใ่ นใงโฆใซใใผ!ใใใชใซใฐใชใฐใชใใโฆใซใใผ!! | ใๅ ใกใใใฎไธใใ่ นใงใใฃใผใใใชใซใฐใชใฐใชใใ | ใๅ ใกใใใฎ่ใใใชใใงใใใใชใซใฐใคใฐใคใใใใฃใซใใ! |
ใฏใฃใๆฟใใโฆโฆใใฆใใณใใใใฃ๏ผใฏใใฃใใฏใใฃโฆโฆใใฃใ็งใโฆโฆไธๆฐใซโฆโฆใณใใใคใใใคใใใกใใฃใฆใใ ใใใ๏ผ | ใฏใๆฟใใใใใฆโฆใใฃใใใ ใฃโฆ็งใใไธๆฐใซโฆใใใคใใใกใใฃใฆใใ ใใโฆ! | ใใใฒใณใทๅใในใใใใขใใใขใโฆ็งใไธๆฐใซใ่กใใใฆใใใใ ใใ! | ใใใใใใใฆ็งใฏไธๆฐใซ่กใใใฆใใ ใใ | ๆฟใใ็งใไธ่ผ่กใใใกใใฃใฆใใ ใใ! |
Sucking Sounds
Ground Truth Text | Anime Whisper | whisper-large-v3 | kotoba-whisper-v2.0 | reazonspeech-nemo |
---|---|---|---|---|
ใใใฃใใใฃโฆโฆใใใใกใ ใใใกใ | ใใใฃใใใใฃใใกใ ใใใฃ | ใใใใใ | ใใใใ ใ | ใทใฅใ! |
ใฏใฃใใฏใ๏ผใใฃใใใใฃใใใใฃโฆโฆใใฃใใใใฃ | ใฏใใฏใโฆใฃใใใใโฆใฃใใใใ ใฃใใใใใฃโฆ | ใใใฏใใใใใใใใใใใใธใใธใใธ | ใใใฏใ | ใฏใใ |
ใใใฃใใใโฆโฆใใตใตใใใใฎ็ทใชใใใจๅๅฟใใใญใใใกใ ใใกใ ใใฃโฆโฆใใใใ๏ผใฉใ๏ผ | ใใใใใใใฃโฆใใฃใใตใตใฃใใใใฎ็ทใชใใใจๅๅฟใใใญโฆใกใ ใฃใใกใ ใฃโฆใใใใ?ใฉใ? | ใใใฎ็ทใชใใใจๅๅฟใใใญใใใใ?ใฉใ? | ใใใฎ็ทใชใใใจๅๅฟใใใญใใใตใใซ | ใธใธใธใใใฎ็ทใชใใใจๅๅฟใใใญใใใใ?ใฉใ? |
ใใใโฆโฆใกใ โฆโฆใใใใใโฆโฆใโฆโฆใโฆโฆใกใ โฆโฆใใใโฆโฆใใโฆโฆใกใ ใ โฆโฆใกใ ใฑใฃโฆโฆใใใใใโฆโฆ | ใใใกใ ใฃโฆใใใใใฃโฆใใกใ ใฃใใใใฃโฆใกใ ใฑใกใ ใทใฃโฆใใใใฃโฆ | ใขใ ใผโฆ | ใใ | ใใธใใ |
ใใกใ ใฃโฆโฆใใใใโฆโฆใใใใใกใ ใฃใใใใใใใโฆโฆใกใ ใฃใใกใ ใฑใฃโฆโฆ | ใใกใ ใฃใใใใใใฃใใกใ ใฑใกใ ใ ใฃโฆใใใใใใกใ ใฃโฆใกใ ใทใฃโฆ | ใ็ฒใๆงใงใใ | ใใใฌใใใฑใ | ใใ |
ใโฆโฆใคใฏโฆโฆใกใ ใใ โฆโฆใคใใกใใโฆโฆใโฆโฆใใใฃใใกใ ใใใฃใใคใฏโฆโฆใใโฆโฆใใใโฆโฆใใใใโฆโฆใคใฏโฆโฆใคใฏใ ใ โฆโฆ | ใใใใคใฏใฃโฆใคใใกใใโฆใใฃใใใฃใใใ ใใใฃใใคใฏใฃใใใใฃโฆใใใฃใใคใฏใใใคใฏใ! | ใใผใพใใใผใใพใใใผใพใใใใใพใใใใใผ | ใๅ | ใใใใคๅ! |
ใใใโฆโฆโฆโฆใใกใ โฆโฆใใใใโฆโฆใโฆโฆใกใ โฆโฆใใใใโฆโฆใใใใใใโฆโฆใกใ โฆโฆ | ใใใโฆใใกใ ใใใใใโฆใกใ ใฑโฆใใใใใใใกใ โฆ | ใจใซโฆใฉโฆใซโฆใขโฆใจใซโฆใซโฆใโฆใณโฆใจโฆใจใซโฆใโฆใซโฆใข...ใจใซโฆใซ...ใโฆ | ใใใ | |
ใฏใทใฃใใกใ ใทใใใโฆโฆใฏใใใใใฃใใใใฆโฆโฆใกใใฝโฆโฆใใใฃใใกใ ใใดใกใ ใใกใ ใฑใฃโฆโฆใฏใใๅ่ตทใกใใฝใกใใใ ใใๅ่ตทใกใใฝ็งใซใกใใใ ใ | ใใ ใทใฃใใใ ใผใฃ!ๆฉใใใฃใใใใฆใฃใใกใใฝใฃ!ใใใ ใใใใใใฃ!ใฏใใใฃใใฏใใๅ่ตทใกใใฝใกใใใใใฃใๅ่ตทใกโใฝใใใใซใกใใใ ใใฃ! | ๆฉใ่ตทใใใใฆ!ใใณใใณ!ๆฉใใๆฉใใใใญใใณใใณใกใใใ ใ! ใใใญใใณใใณ็งใซใกใใใ ใ!! | ๆฉใๅคงใใใใฆใใณใใณๆฉใใใใญๅ จ้จๅ จ้จ็งใซใกใใใ ใ | ๆฉใใใฃใใๅญใใฆใใใผใ!ใ?ๆฉใๆฉใใใฑๅ จ้จใกใใใ ใใใใฑๅ จ้จ็งใซใกใใใ ใ! |
ใใฃใใใใใโฆโฆใใใฃใใฏใฃโฆโฆใใฃใใใใฐใใใใณใ๏ผใใใใใฃ๏ผใใฃใใใฃใใใฏใโฆโฆใใกใฃใใใใฃใใใใฃใใใใฃใใใใฃใ | ใใใใใใใโฆใฏใใใฏใใใใ้ ๅผตใใโฆใใฃใใใฃใใใฃใใใใใฏใโฆใใใใกใ ใใกใ ใฑใใกใ ใใใฃ | ใใใใใใ้ ๅผตใใ! | ใใใใใ้ ๅผตใใ | ใใใใใใใใใใใฐใใใ |
ใฏใใใกใ ใใใใฃใใใโฆโฆใใใใฃใใตใผใฃใใตใผใฃใใใใชใใธใใใใใฒใ๏ผใกใ ใฃโฆโฆใใ ใฃใโฆโฆใใ ใใใใใฃใใ | ใฏใโฆใกใ ใใใใฃโฆใใใฏใโฆใใใชใใธใใฉใใใใโฆใกใ ใฃใใกใ ใใใฃโฆ | ใใใชโฆๅปไธๅนณโฆ | ใใใชๅปไธๅนณ | ใใใ?ใใใช?ใฉใใใใใใใฃใ |
Training Procedures
Detailed training procedures, hyperparameters, and training code will be published on [GitHub](https://github.com/litagin02/anime-whisper) soon.
- Data split: The last tar file of all the data was reserved as test data, and the remaining 3,735,363 files were used for training.
- Initial training: First, the Encoder from the base model was frozen and only the Decoder was trained for several epochs.
- Full-model training: Then, the Encoder was unfrozen and the entire model was trained for several epochs.
- Model optimization: After training was stopped, an attempt was made to improve performance by averaging (merging) the models saved between two points in training. Optuna was used to optimize the CER on the benchmark data, and the result was used as the final model.
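The merging step can be pictured as a weighted average over checkpoints. The sketch below is only an illustration under stated assumptions: parameters are represented as plain lists of floats rather than tensors, the name `merge_checkpoints` is hypothetical, and treating per-checkpoint weights as the quantity tuned against benchmark CER is a guess, since this document does not say what Optuna searched over.

```python
from typing import Dict, List, Sequence

# Toy stand-in for a model state dict: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def merge_checkpoints(checkpoints: Sequence[StateDict],
                      weights: Sequence[float]) -> StateDict:
    """Return the weighted average of several checkpoints, parameter by parameter."""
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    merged: StateDict = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [
            sum(w * v for w, v in zip(weights, values)) / total
            for values in columns
        ]
    return merged
```

With equal weights this reduces to a plain average; a hyperparameter search could then propose different weights (or a different range of checkpoints) and keep whichever merged model scores the lowest CER on held-out data.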
Environment
- Hardware: The model was trained on an H100 NVL (96 GB VRAM) rented from vast.ai. Including trial and error, the rental lasted a little under 3 weeks; this includes early runs that used whisper-large-v3-turbo as the base model.
- Training time: The training run that produced this model took approximately 11.2 days on the H100 NVL. However, the models from the latter part of the run probably performed poorly on the test data due to overfitting, so they were not used in the final merge.
License
This project is licensed under the MIT license.

