🚀 Japanese Speech Recognition Model
This project focuses on Japanese transcription, especially in the anime-adjacent domain, aiming for accurate, hallucination-free results.
🚀 Quick Start
Work in progress (WIP): frozen turbo encoder plus 2 decoder layers. Trained for 2^19 steps with a batch size of 8 (approximately 160 hours on a 3060). Almost certainly undertrained.
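To put the training budget in perspective, the short calculation below derives the implied sample count and step time from the figures quoted above. It assumes one batch per step with no gradient accumulation, which the source does not state, and the 160 hours is itself approximate.

```python
# Back-of-the-envelope numbers derived only from the figures quoted above.
# ASSUMPTION: one batch of 8 per step, no gradient accumulation.
steps = 2 ** 19              # 524,288 training steps
batch_size = 8
wall_clock_hours = 160       # approximate, as stated

samples_seen = steps * batch_size
seconds_per_step = wall_clock_hours * 3600 / steps

print(f"steps:            {steps:,}")
print(f"samples seen:     {samples_seen:,}")
print(f"seconds per step: {seconds_per_step:.2f}")
```

Under these assumptions the run saw roughly 4.2 million samples (with repetition, if the dataset is smaller) at about 1.1 seconds per step.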
✨ Features
- Japanese Transcription: Specialized in transcribing Japanese audio.
- Anime-Adjacent Focus: Targets the anime-related domain.
- No Hallucination: Strives to avoid generating false transcriptions.
- Drop-in Replacement: Can be used as a replacement in relevant scenarios (trained 50% with prompt, 25% notimestamps).
📚 Documentation
Goals
- Japanese transcription
- Focus on anime-adjacent domain
- No hallucination
- Drop-in replacement (trained 50% with prompt, 25% notimestamps)
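The "50% with prompt, 25% notimestamps" mix can be pictured with a small sketch of how a Whisper-style decoder prefix is assembled. The special tokens are the standard Whisper ones; the sampling function, its name, and the choice to draw the two options independently are assumptions made for illustration, not the author's training code.

```python
import random


def build_decoder_prefix(prompt=None, timestamps=True, language="ja"):
    """Return a Whisper-style decoder prefix as a list of token strings."""
    prefix = []
    if prompt:
        # Previous-context text is introduced by <|startofprev|>.
        prefix += ["<|startofprev|>", prompt]
    prefix += ["<|startoftranscript|>", f"<|{language}|>", "<|transcribe|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix


def sample_training_prefix(prompt_text, rng, p_prompt=0.5, p_notimestamps=0.25):
    """Hypothetical sampler for the training mix.

    ASSUMPTION: the prompt and notimestamps choices are independent;
    the source only gives the two percentages.
    """
    use_prompt = rng.random() < p_prompt
    no_timestamps = rng.random() < p_notimestamps
    return build_decoder_prefix(
        prompt_text if use_prompt else None,
        timestamps=not no_timestamps,
    )


rng = random.Random(0)
print(sample_training_prefix("previous sentence", rng))
```

Training on all of these prefix variants is what would let the model stand in for a stock Whisper checkpoint whether or not the caller supplies a prompt or requests timestamps.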
Acknowledgements
- Train sets: OOPPEENN, Reazon, Common Voice 19, 小虫哥_, deepghs
- Validation sets: simon3000, grider-withourai, kotoba-tech
- Test sets: KitsuneX07, TEDxJP
Test set
| | air | himanatsu | kanon | proseka | sakuuta | tedxjp |
|---|---|---|---|---|---|---|
| turbo_b1 | 25.8 | 60.6 | 22.5 | 13.1 | 21.1 | 10.8 |
| turbo_b5 | 20.9 | 48.3 | 19.1 | 11.8 | 18.9 | |
| turbo_b1_nt | 25.8 | 61.6 | 23.1 | 13.6 | 20.4 | |
| turbo_b5_nt | 17.1 | 25.8 | 23.5 | 9.4 | 12.5 | |
| anime_b1 | 15.9 | 20.2 | 12.8 | 8.9 | 10.9 | 41.8 |
| anime_b5 | 14.4 | 18.3 | 12.6 | 8.6 | 10.0 | |
| anime_b1_n5 | 15.0 | 18.4 | 12.7 | 8.9 | 10.1 | |
| anime_b5_n5 | 14.4 | 18.1 | 12.5 | 8.6 | 10.0 | |
| anime_b1_nt | 14.4 | 18.7 | 11.4 | 8.3 | 10.1 | |
| anime_b5_nt | 13.4 | 17.5 | 11.4 | 8.1 | 9.6 | |
| | | | | | | |
| b1 | 15.6 | 20.1 | 11.8 | 8.8 | 10.5 | 11.5 |
| b5 | 15.2 | 19.8 | 11.6 | 8.8 | 10.7 | |
| b1_nt | 15.6 | 20.1 | 11.9 | 8.7 | 10.5 | |
| b5_nt | 15.3 | 19.4 | 11.8 | 8.6 | 10.5 | |
- b1: beam_size = 1
- b5: beam_size = 5
- n5: no_repeat_ngram_size = 5
- nt: <|notimestamps|>
- Anime sets are worse compared to anime-whisper but better than turbo (out of domain).
- 273 videos from TEDxJP-10K with YouTube subtitles for long-form transcription using faster-whisper.
- Slightly worse than turbo. Kotoba/anime-whisper is not trained for long-form.
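The row suffixes map naturally onto faster-whisper decoding options. The sketch below is one way to express that mapping. The helper names `decode_options` and `transcribe` are hypothetical, and treating `nt` as `without_timestamps=True` is an assumption about how the `<|notimestamps|>` rows were produced.

```python
def decode_options(label):
    """Translate a row suffix such as 'b5_nt' into transcribe() keyword arguments."""
    options = {}
    for part in label.split("_"):
        if part == "b1":
            options["beam_size"] = 1
        elif part == "b5":
            options["beam_size"] = 5
        elif part == "n5":
            options["no_repeat_ngram_size"] = 5
        elif part == "nt":
            # ASSUMPTION: nt corresponds to decoding with <|notimestamps|>.
            options["without_timestamps"] = True
        else:
            raise ValueError(f"unknown option suffix: {part!r}")
    return options


def transcribe(model_path, audio_path, label):
    """Hypothetical helper: run faster-whisper with the options for `label`."""
    # Imported lazily so decode_options() works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_path)
    segments, _info = model.transcribe(audio_path, language="ja", **decode_options(label))
    return "".join(segment.text for segment in segments)


print(decode_options("b5_nt"))
```

Pass only the option suffix (for example `b1_n5`), not the model prefix that appears in the table rows.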
Validation set
Used only for hyperparameter optimization.
| | bluearchive | genshin5.1 | nekopara | genshin | starrail | reazon | jsut | cv8 | cv19 | jsl | loopers | tedx10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [large-v3_b1](https://huggingface.co/openai/whisper-large-v3) | 12.2 | 10.1 | 70.8 | 11.9 | 10.0 | 16.0 | 7.1 | 8.6 | 15.1 | 12.2 | | 7.7 |
| large-v3_b5 | 11.0 | 10.0 | 63.7 | 11.6 | 9.8 | 14.1 | 7.1 | 8.3 | 14.8 | 11.0 | | |
| [large-v2_b1](https://huggingface.co/openai/whisper-large-v2) | | 14.4 | 103.4 | 18.3 | 12.9 | 31.6 | 8.2 | 9.8 | 18.5 | 18.0 | | 8.0 |
| large-v2_b5 | | 12.7 | 100.9 | 16.8 | 12.9 | 28.0 | 8.0 | 9.5 | 17.5 | 16.2 | | |
| [turbo_b1](https://huggingface.co/openai/whisper-large-v3-turbo) | 12.8 | 11.1 | 72.3 | 11.6 | 11.1 | 11.6 | 7.3 | 9.6 | 17.5 | 12.0 | 28.0 | 7.9 |
| turbo_b5 | 10.4 | 10.0 | 64.3 | 12.0 | 10.2 | 10.4 | 7.2 | 9.1 | 16.6 | 10.8 | 20.2 | 8.8 |
| [kotoba-v1_b1](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 8.5 | 9.4 | 27.8 | 9.9 | 10.3 | 12.7 | 8.4 | 9.5 | 17.1 | 12.2 | | 34.9 |
| kotoba-v1_b5 | 8.4 | 9.3 | 27.8 | 9.8 | 10.3 | 12.3 | 8.3 | 9.3 | 16.7 | 12.1 | | |
| [kotoba-v2_b1](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) | 8.5 | 9.6 | 27.7 | 10.2 | 10.4 | 11.6 | 8.2 | 9.2 | 16.9 | 12.3 | | 25.3 |
| kotoba-v2_b5 | 8.6 | 9.5 | 27.7 | 10.1 | 10.5 | 11.4 | 8.2 | 9.0 | 16.6 | 12.2 | | |
| [kotoba-bi_b1](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 8.9 | 10.1 | 28.1 | 10.5 | 10.8 | 17.5 | 9.1 | 9.8 | 17.5 | 12.7 | | 27.8 |
| kotoba-bi_b5 | 8.8 | 10.0 | 28.0 | 10.5 | 10.7 | 17.1 | 9.1 | 9.6 | 17.2 | 12.6 | | |
| [anime_b1](https://huggingface.co/litagin/anime-whisper) | 7.5 | 11.5 | 24.7 | 11.0 | 11.2 | 30.1 | 8.0 | 10.0 | 19.1 | 9.0 | 18.9 | 32.0 |
| anime_b5 | 7.2 | 10.4 | 22.0 | 10.3 | 10.4 | 26.6 | 7.8 | 9.8 | 18.8 | 8.5 | 15.3 | 51.8 |
| | | | | | | | | | | | | |
| b1 | 6.9 | 6.3 | 22.8 | 6.7 | 7.4 | 16.2 | 7.1 | 8.9 | 17.1 | 8.5 | 14.7 | 8.2 |
| b5 | 7.5 | 6.2 | 22.8 | 6.6 | 7.3 | 15.7 | 7.0 | 8.7 | 17.0 | 8.5 | 14.5 | 9.1 |
- bluearchive.wiki: Beam 5 performs worse due to the extra usage of kana. Is it learned from MiHoYo games?
- genshin5.1: Trained on 5.0, new audio from 5.1, possible minor overlap.
- nekopara: Hallucination test. anime-whisper would score better if not for its increased hallucination. The OpenAI models are unusable here.
- genshin/starrail: Mostly in the train set.
- reazon: Significantly higher CER from transcribing background/secondary audio.
- jsut: Surprisingly good?
- cv8: cv19 train includes some of cv8 test.
- cv19: No contamination, struggles with accents.
- jsl: Anime set.
- loopers: Anime set, has hallucination-prone audio.
- tedxjp: 10-video subset. See comments in the test set. b1 = batched, b5 = sequential, beam_size = 1, temperature = 0, condition_on_previous_text = False.
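The tables do not name their metric, but the reazon note mentions CER, so the scores are presumably character error rates in percent. The following is a minimal sketch of CER via Levenshtein distance. Any text normalization applied before scoring (punctuation, kana/kanji variants) is not described in the source and is omitted here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(cer("こんにちは", "こんばんは"))        # two substitutions out of five characters
print(cer("はい", "はいはいはいはい"))        # repeated output inflates the score
```

Because insertions count against the reference length, a hallucinating model can exceed 100%, which is consistent with the large-v2 scores on nekopara.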