Whisper Large V3 Turbo - Japanese Anime Speech
This is a speech recognition model based on OpenAI's Whisper Large V3 Turbo and fine-tuned on Japanese anime speech. It is optimized for the dialogue and expressions found in anime, with the aim of transcribing them more accurately.
Quick Start
You can use the following code to directly transcribe Japanese anime speech with this model:
from transformers import pipeline
asr = pipeline("automatic-speech-recognition", model="hhim8826/whisper-large-v3-turbo-ja")
result = asr("path/to/anime_audio.wav")
print(result["text"])
For finer control over loading and decoding, use the processor and model directly:
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import librosa
processor = AutoProcessor.from_pretrained("hhim8826/whisper-large-v3-turbo-ja")
model = AutoModelForSpeechSeq2Seq.from_pretrained("hhim8826/whisper-large-v3-turbo-ja").to("cuda")
audio_file = 'anime_audio.wav'
audio_array, sampling_rate = librosa.load(audio_file, sr=16000)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt").to("cuda")
with torch.no_grad():
    # Pass the features positionally so the call works whether the argument is named inputs or input_features
    generated_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
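The loading step above resamples the audio to 16 kHz, the rate the Whisper feature extractor expects, and `librosa.load` also downmixes to mono by default. As a minimal sketch, the following helper inspects a file before transcription using only the standard library. The name `describe_wav` is hypothetical and not part of this model or of `transformers`, and the helper assumes an uncompressed PCM WAV file.

```python
import wave

TARGET_SAMPLE_RATE = 16000  # Whisper feature extractors expect 16 kHz audio


def describe_wav(path):
    """Return basic properties of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_s": frames / sample_rate,
        # True when the loader has to resample to reach 16 kHz
        "needs_resampling": sample_rate != TARGET_SAMPLE_RATE,
        # True when the loader has to downmix to a single channel
        "needs_downmix": channels != 1,
    }
```

Calling `describe_wav("anime_audio.wav")` before the transcription code gives a quick check on how much preprocessing the loader will perform for a given file.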
⨠Features
- Optimized for Anime: Tuned for the Japanese dialogue and expressions found in anime, for more accurate transcription.
- Adapted to Anime Characteristics: Trained on the hhim8826/japanese-anime-speech-v2-split dataset, it better handles the characteristics of anime speech, including distinctive intonation, tone, and common anime terms.
Documentation
Model Details
This model is fine-tuned from openai/whisper-large-v3-turbo and is intended specifically for recognizing speech in Japanese anime. It has been trained on the hhim8826/japanese-anime-speech-v2-split dataset, enabling it to better handle the characteristics of anime speech, including distinctive intonation, tone, and common anime terms.
| Property | Details |
|---|---|
| Developer | hhim8826 |
| Model Type | Automatic Speech Recognition (ASR) |
| Language | Japanese |
| License | Apache 2.0 |
| Fine-tuned from | openai/whisper-large-v3-turbo |
Downstream Applications
This model is suitable for:
- Automatic subtitle generation for anime videos
- Analysis of anime speech content
- Research on Japanese anime dialogues
- Auxiliary tools for Japanese anime translation
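For the subtitle use case, one possible approach is to run the `transformers` pipeline with `return_timestamps=True`, which returns a `chunks` list whose items carry a `timestamp` pair and a `text` field, and then render those chunks as SRT. The sketch below covers only the rendering step. The function names `format_srt_time` and `chunks_to_srt` are hypothetical, and the chunk layout is assumed to match the pipeline output just described.

```python
def format_srt_time(seconds):
    """Convert seconds to the SRT timestamp format HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks):
    """Render timestamped chunks as an SRT subtitle string.

    Each chunk is assumed to look like
    {"timestamp": (start_seconds, end_seconds), "text": "..."}.
    """
    entries = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        if start is None or end is None or not text:
            continue  # skip chunks with missing timing or empty text
        index = len(entries) + 1  # SRT cues are numbered from 1
        entries.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    # A blank line separates consecutive cues
    return "\n".join(entries)
```

A typical call would be `chunks_to_srt(result["chunks"])`, where `result` comes from `asr("path/to/anime_audio.wav", return_timestamps=True)`. Chunks without a usable end time are dropped rather than guessed.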
Training Details
Training Data
This model is trained on the hhim8826/japanese-anime-speech-v2-split dataset, which contains speech segments from various Japanese anime and their corresponding transcriptions.
Training Process
The model is initialized from openai/whisper-large-v3-turbo and fine-tuned to adapt to the characteristics of anime speech. Training is stopped after a limited number of steps to avoid overfitting.
Training Hyperparameters
- Learning Rate: 1e-5
- Training Batch Size: 16
- Training Steps: 4000
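To put these values in context, the sketch below collects them under the keyword names used by `transformers.Seq2SeqTrainingArguments` and works out how many examples the run consumes. Treating the documented batch size as a per-device value, and assuming a single device with no gradient accumulation, are assumptions; the source does not state the training setup. `examples_seen` is a hypothetical helper.

```python
# Values taken from the hyperparameter list above.
LEARNING_RATE = 1e-5
TRAIN_BATCH_SIZE = 16
TRAIN_STEPS = 4000

# Keyword names follow transformers.Seq2SeqTrainingArguments; mapping the
# documented batch size to the per-device setting is an assumption.
training_kwargs = {
    "learning_rate": LEARNING_RATE,
    "per_device_train_batch_size": TRAIN_BATCH_SIZE,
    "max_steps": TRAIN_STEPS,
}


def examples_seen(batch_size, steps, grad_accum_steps=1, num_devices=1):
    """Number of training examples consumed over the whole run."""
    return batch_size * steps * grad_accum_steps * num_devices


# Assumes one device and no gradient accumulation (not stated in the source).
total = examples_seen(TRAIN_BATCH_SIZE, TRAIN_STEPS)
print(f"Examples processed during fine-tuning: {total}")  # 16 * 4000 = 64000
```

Under those assumptions the run processes 64,000 examples; with gradient accumulation or several devices the figure scales accordingly.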
Evaluation Results
On the anime speech test set, this model shows improvements over the original Whisper model in the following aspects:
- Better handling of anime proper nouns and special terms
- Improved ability to recognize dialogues under the interference of background music/sound effects
- More accurate handling of the intonations and speaking styles unique to anime characters
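The source does not say which metric underlies these comparisons. For Japanese, which has no whitespace word boundaries, character error rate (CER) is a common choice, so the sketch below shows one way such a comparison could be quantified on your own test clips. Both function names are hypothetical, and no text normalization (punctuation, full-width characters) is applied.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Computing `character_error_rate` for the same reference transcripts against the output of this model and of the base Whisper model gives a like-for-like number; lower is better.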
Limitations
- The model is optimized mainly for Japanese anime and may not perform as well as specialized models on other types of Japanese content.
- It may fail to recognize some very niche or specialized anime vocabulary.
- It may still have difficulty with extremely fast or indistinct dialogue.
License
This model is licensed under the Apache 2.0 license.