Whisper-ja-anime-v0.1 Open-source Model - Focused on Japanese Anime Speech Recognition, Accurately Adapted to Anime Audio

Whisper Ja Anime V0.1

Developed by efwkjn

A Whisper variant for Japanese speech recognition in the anime domain, tuned to the characteristics of anime audio

Speech Recognition

Safetensors

Tags: Japanese, Japanese anime only, low hallucination, timestamp support

Downloads 205

Release date: 12/15/2024

Model Overview

A Japanese automatic speech recognition model based on the Whisper architecture, trained on anime-related audio to reduce hallucinations and improve transcription accuracy

Model Features

Optimized for the anime domain

Trained on anime-related audio; on the anime test sets it reaches a lower error rate than Whisper large-v3-turbo (for example 20.1 vs 60.6 on himanatsu at beam size 1)

Anti-hallucination design

Training is aimed at reducing mis-transcriptions and hallucinated output

Flexible usage modes

Supports both timestamped and non-timestamped usage modes

Model Capabilities

Japanese speech transcription

Anime audio recognition

Long-form audio transcription

Timestamped transcription

Use Cases

Anime content production

Anime subtitle generation

Automatically generate Japanese subtitles for anime videos

On the anime test sets, the character error rate (CER) is lower than that of general-purpose Whisper models

Anime dialogue analysis

Used for content analysis and retrieval of anime dialogues

Game audio processing

Game speech transcription

Transcribe Japanese speech in games such as Genshin Impact and Honkai: Star Rail

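Scores well on the game validation sets (genshin, starrail), although the validation notes state that these are mostly in the training set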
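For the subtitle use case, timestamped segments still have to be written out in a subtitle format. The following is a minimal sketch, assuming the transcription step yields `(start_seconds, end_seconds, text)` tuples and that SubRip (`.srt`) is the target format. Both are assumptions for illustration, and `format_srt_time` and `segments_to_srt` are hypothetical helper names, not part of the model or any library.

```python
from typing import Iterable, Tuple


def format_srt_time(seconds: float) -> str:
    """Convert seconds to the SubRip timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) tuples as numbered SubRip cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_srt_time(start)} --> {format_srt_time(end)}"
        blocks.append(f"{index}\n{timing}\n{text.strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(segments_to_srt([(0.0, 1.25, "こんにちは"), (1.5, 3.0, "元気ですか")]))
```

Segments from any timestamped Whisper front end can be mapped into these tuples before calling `segments_to_srt`.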

🚀 Japanese Speech Recognition Model

This project focuses on Japanese transcription, especially in the anime-adjacent domain, aiming for accurate, hallucination-free results.

🚀 Quick Start

Work in progress: frozen turbo encoder + 2 decoder layers. Trained for 2^19 steps at a batch size of 8 (approximately 160 hours on a 3060). Almost certainly undertrained.
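To try the model, a minimal sketch using the Hugging Face `transformers` speech recognition pipeline could look like the following. The repository id `efwkjn/whisper-ja-anime-v0.1` is an assumption pieced together from the developer and model names on this page, and `build_generate_kwargs` and `transcribe` are hypothetical helper names. Treat this as an illustration, not the author's recommended setup.

```python
# Assumed repository id; verify it on the Hugging Face Hub before use.
MODEL_ID = "efwkjn/whisper-ja-anime-v0.1"


def build_generate_kwargs(beam_size: int = 1) -> dict:
    """Generation settings for Japanese transcription with a Whisper model."""
    return {"language": "ja", "task": "transcribe", "num_beams": beam_size}


def transcribe(audio_path: str, with_timestamps: bool = True, beam_size: int = 1):
    """Transcribe one audio file; downloads the model on first use."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    return asr(
        audio_path,
        return_timestamps=with_timestamps,
        generate_kwargs=build_generate_kwargs(beam_size),
    )


if __name__ == "__main__":
    # "clip.wav" is a placeholder path.
    print(transcribe("clip.wav")["text"])
```

With `with_timestamps=True` the result also carries a `chunks` list with per-segment timestamps. For clips longer than 30 seconds this pipeline generally needs timestamps enabled.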

✨ Features

Japanese Transcription: Specialized in transcribing Japanese audio.
Anime-Adjacent Focus: Targets the anime-related domain.
No Hallucination: Strives to avoid generating false transcriptions.
Drop-in Replacement: Intended as a drop-in replacement for a Whisper model (trained 50% with prompt, 25% notimestamps).

📚 Documentation

Goals

Japanese transcription
Focus on the anime-adjacent domain
No hallucination
Drop-in replacement (trained 50% with prompt, 25% notimestamps)

Acknowledgements

Train sets: OOPPEENN, Reazon, Common Voice 19, 小虫哥_, deepghs
Validation sets: simon3000, grider-withourai, kotoba-tech
Test sets: KitsuneX07, TEDxJP

Test set

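| Model | air | himanatsu | kanon | proseka | sakuuta | tedxjp |
|---|---|---|---|---|---|---|
| turbo_b1 | 25.8 | 60.6 | 22.5 | 13.1 | 21.1 | 10.8 |
| turbo_b5 | 20.9 | 48.3 | 19.1 | 11.8 | 18.9 | |
| turbo_b1_nt | 25.8 | 61.6 | 23.1 | 13.6 | 20.4 | |
| turbo_b5_nt | 17.1 | 25.8 | 23.5 | 9.4 | 12.5 | |
| anime_b1 | 15.9 | 20.2 | 12.8 | 8.9 | 10.9 | 41.8 |
| anime_b5 | 14.4 | 18.3 | 12.6 | 8.6 | 10.0 | |
| anime_b1_n5 | 15.0 | 18.4 | 12.7 | 8.9 | 10.1 | |
| anime_b5_n5 | 14.4 | 18.1 | 12.5 | 8.6 | 10.0 | |
| anime_b1_nt | 14.4 | 18.7 | 11.4 | 8.3 | 10.1 | |
| anime_b5_nt | 13.4 | 17.5 | 11.4 | 8.1 | 9.6 | |
| b1 (this model) | 15.6 | 20.1 | 11.8 | 8.8 | 10.5 | 11.5 |
| b5 (this model) | 15.2 | 19.8 | 11.6 | 8.8 | 10.7 | |
| b1_nt (this model) | 15.6 | 20.1 | 11.9 | 8.7 | 10.5 | |
| b5_nt (this model) | 15.3 | 19.4 | 11.8 | 8.6 | 10.5 | |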

b1: beam_size = 1
b5: beam_size = 5
n5: no_repeat_ngram_size = 5
nt: <|notimestamps|>
Anime sets: worse than anime-whisper but better than turbo (out of domain).
tedxjp: 273 videos from TEDxJP-10K with YouTube subtitles, used for long-form transcription with faster-whisper.
Slightly worse than turbo. Kotoba/anime-whisper is not trained for long-form.

Validation set

Used only for hyperparameter optimization.

| Model | bluearchive | genshin5.1 | nekopara | genshin | starrail | reazon | jsut | cv8 | cv19 | jsl | loopers | tedx10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [large-v3_b1](https://huggingface.co/openai/whisper-large-v3) | 12.2 | 10.1 | 70.8 | 11.9 | 10.0 | 16.0 | 7.1 | 8.6 | 15.1 | 12.2 | | 7.7 |
| large-v3_b5 | 11.0 | 10.0 | 63.7 | 11.6 | 9.8 | 14.1 | 7.1 | 8.3 | 14.8 | 11.0 | | |
| [large-v2_b1](https://huggingface.co/openai/whisper-large-v2) | | 14.4 | 103.4 | 18.3 | 12.9 | 31.6 | 8.2 | 9.8 | 18.5 | 18.0 | | 8.0 |
| large-v2_b5 | | 12.7 | 100.9 | 16.8 | 12.9 | 28.0 | 8.0 | 9.5 | 17.5 | 16.2 | | |
| [turbo_b1](https://huggingface.co/openai/whisper-large-v3-turbo) | 12.8 | 11.1 | 72.3 | 11.6 | 11.1 | 11.6 | 7.3 | 9.6 | 17.5 | 12.0 | 28.0 | 7.9 |
| turbo_b5 | 10.4 | 10.0 | 64.3 | 12.0 | 10.2 | 10.4 | 7.2 | 9.1 | 16.6 | 10.8 | 20.2 | 8.8 |
| [kotoba-v1_b1](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 8.5 | 9.4 | 27.8 | 9.9 | 10.3 | 12.7 | 8.4 | 9.5 | 17.1 | 12.2 | | 34.9 |
| kotoba-v1_b5 | 8.4 | 9.3 | 27.8 | 9.8 | 10.3 | 12.3 | 8.3 | 9.3 | 16.7 | 12.1 | | |
| [kotoba-v2_b1](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) | 8.5 | 9.6 | 27.7 | 10.2 | 10.4 | 11.6 | 8.2 | 9.2 | 16.9 | 12.3 | | 25.3 |
| kotoba-v2_b5 | 8.6 | 9.5 | 27.7 | 10.1 | 10.5 | 11.4 | 8.2 | 9.0 | 16.6 | 12.2 | | |
| [kotoba-bi_b1](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 8.9 | 10.1 | 28.1 | 10.5 | 10.8 | 17.5 | 9.1 | 9.8 | 17.5 | 12.7 | | 27.8 |
| kotoba-bi_b5 | 8.8 | 10.0 | 28.0 | 10.5 | 10.7 | 17.1 | 9.1 | 9.6 | 17.2 | 12.6 | | |
| [anime_b1](https://huggingface.co/litagin/anime-whisper) | 7.5 | 11.5 | 24.7 | 11.0 | 11.2 | 30.1 | 8.0 | 10.0 | 19.1 | 9.0 | 18.9 | 32.0 |
| anime_b5 | 7.2 | 10.4 | 22.0 | 10.3 | 10.4 | 26.6 | 7.8 | 9.8 | 18.8 | 8.5 | 15.3 | 51.8 |
| b1 (this model) | 6.9 | 6.3 | 22.8 | 6.7 | 7.4 | 16.2 | 7.1 | 8.9 | 17.1 | 8.5 | 14.7 | 8.2 |
| b5 (this model) | 7.5 | 6.2 | 22.8 | 6.6 | 7.3 | 15.7 | 7.0 | 8.7 | 17.0 | 8.5 | 14.5 | 9.1 |

bluearchive.wiki: Beam 5 performs worse because of extra kana usage. Is it learned from miHoYo games?
genshin5.1: Trained on 5.0; the audio is new in 5.1, with possible minor overlap.
nekopara: Hallucination test. anime-whisper would do better if not for its increased hallucination. The OpenAI models are unusable.
genshin/starrail: Mostly in the train set.
reazon: Significantly higher CER from transcribing background/secondary audio.
jsut: Surprisingly good?
cv8: The cv19 train set includes some of the cv8 test set.
cv19: No contamination; struggles with accents.
jsl: Anime set.
loopers: Anime set with hallucination-prone audio.
tedx10: 10-video subset of tedxjp. See the comments under the test set. For this column b1 = batched and b5 = sequential, with beam_size = 1, temperature = 0, condition_on_previous_text = False.
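The notes above compare models by character error rate (CER). As a reference for how such a score is typically computed, here is a minimal sketch: character-level edit distance divided by the reference length, expressed as a percentage. The author's exact text normalization (punctuation, kana/kanji handling, whitespace) is not stated in the source, so this sketch applies none, and `edit_distance` and `cer` are hypothetical names.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(cer("こんにちは", "こんにちわ"))    # one substitution out of five characters
    print(cer("あい", "あいあいあい"))        # repeated output inflates the score
```

Because insertions count as errors, a transcript padded with repeated or invented text can score above 100, which is how a value such as 103.4 in the nekopara column can arise.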
