Fine-tuned Japanese Wav2Vec2 model for speech recognition using XLSR-53 large
This project presents a Japanese speech recognition model fine-tuned from Wav2Vec2 XLSR-53 large on multiple Japanese datasets.
Quick Start
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on Japanese, using Common Voice, JVS and JSUT. When using this model, make sure your speech input is sampled at 16 kHz.
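Because the model expects 16 kHz input, it can help to check a recording's sample rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical (they are not part of this model or of `transformers`), and the sketch assumes uncompressed PCM WAV files.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True when the file's rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

A file flagged by `needs_resampling` can be loaded with `librosa.load(path, sr=16_000)`, which resamples while loading.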
Usage Examples
Basic Usage
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "ja"
MODEL_ID = "Ivydata/wav2vec2-large-xlsr-53-japanese"
SAMPLES = 10

# Load the first SAMPLES utterances of the Common Voice Japanese test split.
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)


# Read each audio file as a 16 kHz waveform, the rate the model expects.
def speech_file_to_array_fn(batch):
    speech_array, _ = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch


test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token per frame, then decode to text.
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference: ", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
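The `torch.argmax` and `batch_decode` steps in the example amount to greedy CTC decoding: take the best token for each audio frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python. The function name `ctc_greedy_collapse`, the toy vocabulary, and the blank id of 0 are assumptions made for illustration; the real tokenizer also handles word delimiters and other special tokens.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, id_to_token, blank_id=0):
    """Collapse per-frame argmax ids into text: merge repeats, then drop blanks."""
    # Consecutive identical ids come from one token held over several frames.
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    # The blank separates genuine repeats and carries no text of its own.
    return "".join(id_to_token[i] for i in merged if i != blank_id)


# Hypothetical three-entry vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "a", 2: "b"}
frames = [1, 1, 0, 1, 2, 2, 0, 0]
print(ctc_greedy_collapse(frames, toy_vocab))  # prints "aab"
```

The blank between the two runs of `1` is what lets the output keep a doubled "a" instead of merging it into one.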
Documentation
Test Result
The following table shows the Character Error Rate (CER) of the model tested on the TEDxJP-10K dataset.
| Model | CER |
| --- | --- |
| Ivydata/wav2vec2-large-xlsr-53-japanese | 27.87% |
| jonatasgrosman/wav2vec2-large-xlsr-53-japanese | 34.18% |
| vumichien/wav2vec2-large-xlsr-japanese | 37.72% |
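CER is conventionally the character-level edit distance between prediction and reference, divided by the number of reference characters. The sketch below shows that calculation in plain Python. It is an illustration only: the evaluation script behind the figures above is not given here, so details such as text normalization and punctuation handling are unknown, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# Two of five characters differ, so the CER is 2 / 5.
print(character_error_rate(["こんにちは"], ["こんばんは"]))  # prints 0.4
```

Summing edits and characters over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the score.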
Test Inference Examples
| Reference | Prediction |
| --- | --- |
| ãã 鏿ãããŽã§ã¯ãĒããŠãčããĻ鏿ããããŽã | ãã æ´æŋ¯ãããŽã§ã¯ãĒããŠãčããĻæ´æããããŽã |
| ããŽåˇ¨å¤§ãĒæ§é įŠãåŽåŽãĢäŊããã¨ãã§ããäēēé | ããŽåˇ¨å¤§ãĒæ§é įŠãåŽåŽãĢäŊããã¨ãã§ããäēēé |
| äŊãããåĢããĢãĒãŖãĻããŖãĻããžãŖãããã§ããã | äŊãĢãããæ°æŽĩãĢãĒãŖãŖãĻããŖãĻããžãŖããããŠãã |
| ãããĒåã ããããč¨ãããã¨ã¯įčãå¤ããã°čĒåãå¤ããŖãĻããã | ããĒåãããããšãããã¨ã¯įčãå¤ããã°čĒåãå¤ããŖãĻãã |
| ããããã¨ããŽč¨čãäŊŋãŖãϿǿĨãŽã¤ãĄãŧã¸ãåŊĸäŊãŖãĻãããã¨ãã§ãã㨠| ããããã¨ããŽč¨čãäŊŋãŖãϿǿĨãŽã¤ãĄãŧãŧã¸ãåŊĸäŊãŖãĻããã¨ãã§ãã㨠|
License
This project is licensed under the Apache-2.0 license.
Citation
If you want to cite this model, you can use the following BibTeX entry:
@misc{Ivydata2023-wav2vec2-xlsr53-large-japanese,
  title={Fine-tuned Japanese Wav2Vec2 model for speech recognition using XLSR-53 large},
  author={Kosuke Suzuki},
  howpublished={\url{https://huggingface.co/Ivydata/wav2vec2-large-xlsr-53-japanese/}},
  year={2023}
}