Wav2vec2 Large Xlsr 53 Tw Gpt
A speech recognition model fine-tuned on Taiwan Mandarin (zh-tw) based on facebook/wav2vec2-large-xlsr-53, supporting 16kHz sampling rate audio input
Release Time : 3/2/2022
Model Overview
This is an automatic speech recognition (ASR) model optimized for Taiwan Mandarin, fine-tuned from Facebook's wav2vec2-large-xlsr-53 architecture and trained on the Common Voice zh-TW dataset
Model Features
Taiwan Mandarin Optimization
Specifically fine-tuned for the phonetic characteristics of Taiwan Mandarin
Language Model Fusion Support
Can be combined with language models like GPT or BERT to improve recognition accuracy
Efficient Inference
Reaches a CER of 18.36% on the Common Voice zh-TW test set with GPT + beam search decoding; greedy decoding without a language model gives a CER of 28.70% in 04:08 min
Model Capabilities
Taiwan Mandarin speech recognition
Supports 16kHz sampling rate audio processing
Can be combined with language models
Use Cases
Speech to Text
Taiwan Mandarin Speech Transcription
Convert Taiwan Mandarin speech content into text
CER 18.36% (evaluated using GPT+beam search)
Voice Assistant
Taiwan Region Voice Command Recognition
Used to recognize voice commands in Taiwan Mandarin
🚀 Wav2Vec2-Large-XLSR-53-tw-gpt
This is a fine-tuned model based on facebook/wav2vec2-large-xlsr-53 for Mandarin Chinese in Taiwan, trained on the Common Voice dataset. It's designed for automatic speech recognition tasks.
📋 Metadata
| Property | Details |
|---|---|
| Datasets | common_voice |
| Tags | audio, automatic-speech-recognition, hf-asr-leaderboard, robust-speech-event, speech, xlsr-fine-tuning-week |
| License | apache-2.0 |
| Model Name | XLSR Wav2Vec2 Taiwanese Mandarin (zh-tw) by Voidful |
| Results | Task: Speech Recognition (automatic-speech-recognition); Dataset: Common Voice zh-TW; Metrics: Test CER = 18.36 |
🚀 Quick Start
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on the zh-tw dataset from Common Voice. When using this model, ensure that your speech input is sampled at 16kHz.
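Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module; the helper names `needs_resampling` and `write_silence` are hypothetical, and the check only covers uncompressed WAV files.

```python
import wave

TARGET_RATE = 16_000  # sampling rate expected by the model


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return (needs_resampling, actual_rate) for an uncompressed WAV file."""
    with wave.open(path, "rb") as wav_file:
        actual_rate = wav_file.getframerate()
    return actual_rate != target_rate, actual_rate


def write_silence(path, rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file, used here only as demo input."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))
```

A file reported as needing resampling would then be converted to 16 kHz (for example with `torchaudio.transforms.Resample`, as in the usage example) before being passed to the processor.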
💻 Usage Examples
🔗 Colab trial
Basic Usage
import torch
import torchaudio
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

model_name = "voidful/wav2vec2-large-xlsr-53-tw-gpt"
processor_name = "voidful/wav2vec2-large-xlsr-53-tw-gpt"
device = "cuda"

model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(processor_name)

# Chinese GPT-2 used to rescore the acoustic predictions
tokenizer = AutoTokenizer.from_pretrained("ckiplab/gpt2-base-chinese")
gpt_model = AutoModelForCausalLM.from_pretrained("ckiplab/gpt2-base-chinese").to(device)

# Assumes the source audio is sampled at 48 kHz; adjust orig_freq for other inputs
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def load_file_to_data(file):
    batch = {}
    speech, _ = torchaudio.load(file)
    batch["speech"] = resampler.forward(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    return batch

def predict(data):
    features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    decoded_results = []
    for logit in logits:
        pred_ids = torch.argmax(logit, dim=-1)
        # Keep only the frames whose greedy prediction has an id >= 1
        mask = pred_ids.ge(1).unsqueeze(-1).expand(logit.size())
        vocab_size = logit.size()[-1]
        voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1, vocab_size)), dim=-1)
        # Prepend [CLS] so that the LM output at position i scores token i
        gpt_input = torch.cat((torch.tensor([tokenizer.cls_token_id]).to(device), pred_ids[pred_ids > 0]), 0)
        gpt_prob = torch.nn.functional.softmax(gpt_model(gpt_input).logits, dim=-1)[:voice_prob.size()[0], :]
        # Multiply acoustic and LM probabilities position-wise, then take the argmax
        comb_pred_ids = torch.argmax(gpt_prob * voice_prob, dim=-1)
        decoded_results.append(processor.decode(comb_pred_ids))
    return decoded_results
Prediction Example
predict(load_file_to_data('voice file path'))
📚 Documentation
Evaluation
The model can be evaluated on the zh-tw test data of Common Voice as follows. The CER calculation refers to https://huggingface.co/ctl/wav2vec2-large-xlsr-cantonese.
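For readers who want to see what the CER figure measures, here is a minimal pure-Python sketch of a corpus-level character error rate: total character edit distance divided by total reference length, the same ratio the `cer_cal` function in the evaluation scripts computes. The function names `levenshtein` and `corpus_cer` are illustrative and not part of the model card; the scripts themselves rely on the `editdistance` package.

```python
def levenshtein(reference, hypothesis):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def corpus_cer(references, hypotheses):
    """Total edit distance divided by total number of reference characters."""
    errors = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


# One wrong character out of six reference characters
example_cer = corpus_cer(["今天天氣很好"], ["今天天氣很號"])
```

Multiplying `example_cer` by 100 gives the percentage form used in the results reported for each evaluation.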
Environment Setup
!pip install editdistance
!pip install torchaudio
!pip install datasets transformers
Evaluation without LM
import re

import editdistance as ed
import torch
from datasets import Audio, load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "voidful/wav2vec2-large-xlsr-53-tw-gpt"
processor_name = "voidful/wav2vec2-large-xlsr-53-tw-gpt"
device = "cuda"

# Punctuation stripped from the reference sentences: full-width and CJK marks first, ASCII marks last
chars_to_ignore_regex = (
    r"[¥•＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､\u3000、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛"
    r"〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·'℃°．︰─﹖﹣﹂﹁﹐！？｡。"
    r"!\"#$%&()*+,\-.:;<=>?@\[\]\\/^_`{|}~]"
)

model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(processor_name)

ds = load_dataset("common_voice", 'zh-TW', split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def map_to_array(batch):
    audio = batch["audio"]
    batch["speech"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["sampling_rate"] = audio["sampling_rate"]
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    return batch

ds = ds.map(map_to_array)

def map_to_pred(batch):
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    # Greedy decoding: most likely token per frame, collapsed by the processor
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)
    batch["target"] = batch["sentence"]
    return batch

result = ds.map(map_to_pred, batched=True, batch_size=3, remove_columns=list(ds.features.keys()))

def cer_cal(groundtruth, hypothesis):
    # Corpus-level CER: total edit distance over total reference length
    err = 0
    tot = 0
    for p, t in zip(hypothesis, groundtruth):
        err += float(ed.eval(p.lower(), t.lower()))
        tot += len(t)
    return err / tot

print("CER: {:.2f}".format(100 * cer_cal(result["target"], result["predicted"])))
CER: 28.70
TIME: 04:08 min
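The greedy path above takes the most likely token id per audio frame and leaves it to `processor.batch_decode` to turn the frame-level ids into text. For a CTC model, that step merges consecutive repeated ids and drops the blank token. The sketch below illustrates the idea in plain Python; it assumes that id 0 is the blank/padding token (consistent with the `pred_ids > 0` filter used in the other scripts), and the three-character vocabulary is made up for the example.

```python
from itertools import groupby

BLANK_ID = 0  # assumption: id 0 is the CTC blank/padding token


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Merge consecutive duplicate ids, then drop blanks."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


toy_vocab = {1: "你", 2: "好", 3: "嗎"}  # hypothetical vocabulary
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3]  # one predicted id per audio frame
text = "".join(toy_vocab[i] for i in ctc_greedy_collapse(frames))
```

Note that a blank between two identical ids keeps them separate, so `[1, 0, 1]` collapses to `[1, 1]` rather than `[1]`.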
Evaluation with GPT
import re

import editdistance as ed
import torch
from datasets import Audio, load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

model_name = "voidful/wav2vec2-large-xlsr-53-tw-gpt"
processor_name = "voidful/wav2vec2-large-xlsr-53-tw-gpt"
device = "cuda"

# Punctuation stripped from the reference sentences: full-width and CJK marks first, ASCII marks last
chars_to_ignore_regex = (
    r"[¥•＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､\u3000、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛"
    r"〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·'℃°．︰─﹖﹣﹂﹁﹐！？｡。"
    r"!\"#$%&()*+,\-.:;<=>?@\[\]\\/^_`{|}~]"
)

tokenizer = AutoTokenizer.from_pretrained("ckiplab/gpt2-base-chinese")
lm_model = AutoModelForCausalLM.from_pretrained("ckiplab/gpt2-base-chinese").to(device)

model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(processor_name)

ds = load_dataset("common_voice", 'zh-TW', split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def map_to_array(batch):
    audio = batch["audio"]
    batch["speech"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["sampling_rate"] = audio["sampling_rate"]
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    return batch

ds = ds.map(map_to_array)

def map_to_pred(batch):
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    decoded_results = []
    for logit in logits:
        pred_ids = torch.argmax(logit, dim=-1)
        # Keep only the frames whose greedy prediction has an id >= 1
        mask = pred_ids.ge(1).unsqueeze(-1).expand(logit.size())
        vocab_size = logit.size()[-1]
        voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1, vocab_size)), dim=-1)
        # Prepend [CLS] so that the LM output at position i scores token i
        lm_input = torch.cat((torch.tensor([tokenizer.cls_token_id]).to(device), pred_ids[pred_ids > 0]), 0)
        lm_prob = torch.nn.functional.softmax(lm_model(lm_input).logits, dim=-1)[:voice_prob.size()[0], :]
        # Multiply acoustic and LM probabilities position-wise, then take the argmax
        comb_pred_ids = torch.argmax(lm_prob * voice_prob, dim=-1)
        decoded_results.append(processor.decode(comb_pred_ids))

    batch["predicted"] = decoded_results
    batch["target"] = batch["sentence"]
    return batch

result = ds.map(map_to_pred, batched=True, batch_size=3, remove_columns=list(ds.features.keys()))

def cer_cal(groundtruth, hypothesis):
    # Corpus-level CER: total edit distance over total reference length
    err = 0
    tot = 0
    for p, t in zip(hypothesis, groundtruth):
        err += float(ed.eval(p.lower(), t.lower()))
        tot += len(t)
    return err / tot

print("CER: {:.2f}".format(100 * cer_cal(result["target"], result["predicted"])))
CER: 25.70
TIME: 06:04 min
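The fusion step in the script above multiplies, at each retained position, the acoustic model's probabilities by the language model's next-token probabilities and picks the highest product. A minimal sketch of that idea with plain Python lists follows; the function name `fuse_and_pick` and all numbers are made up for illustration, and the real script operates on tensors over the full shared vocabulary.

```python
def fuse_and_pick(acoustic_probs, lm_probs):
    """For each position, multiply the two distributions element-wise and return the argmax id."""
    picked = []
    for acoustic_row, lm_row in zip(acoustic_probs, lm_probs):
        combined = [a * l for a, l in zip(acoustic_row, lm_row)]
        picked.append(max(range(len(combined)), key=combined.__getitem__))
    return picked


# Two positions over a 3-token toy vocabulary (hypothetical numbers)
acoustic = [[0.1, 0.5, 0.4],
            [0.2, 0.45, 0.35]]
language_model = [[0.2, 0.2, 0.6],
                  [0.1, 0.8, 0.1]]

fused_ids = fuse_and_pick(acoustic, language_model)
# A uniform weighting leaves the acoustic ranking unchanged
acoustic_only_ids = fuse_and_pick(acoustic, [[1.0] * 3] * 2)
```

In this toy case the acoustic model alone prefers token 1 at the first position, while the fused score prefers token 2 because the language model strongly favours it.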
Evaluation with GPT + beam search
import re
from math import log

import editdistance as ed
import torch
from datasets import Audio, load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

model_name = "voidful/wav2vec2-large-xlsr-53-tw-gpt"
processor_name = "voidful/wav2vec2-large-xlsr-53-tw-gpt"
device = "cuda"

# Punctuation stripped from the reference sentences: full-width and CJK marks first, ASCII marks last
chars_to_ignore_regex = (
    r"[¥•＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､\u3000、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛"
    r"〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·'℃°．︰─﹖﹣﹂﹁﹐！？｡。"
    r"!\"#$%&()*+,\-.:;<=>?@\[\]\\/^_`{|}~]"
)

tokenizer = AutoTokenizer.from_pretrained("ckiplab/gpt2-base-chinese")
lm_model = AutoModelForCausalLM.from_pretrained("ckiplab/gpt2-base-chinese").to(device)

model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(processor_name)

ds = load_dataset("common_voice", 'zh-TW', split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def map_to_array(batch):
    audio = batch["audio"]
    batch["speech"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["sampling_rate"] = audio["sampling_rate"]
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    return batch

ds = ds.map(map_to_array)

def map_to_pred(batch):
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    decoded_results = []
    for logit in logits:
        # Each hypothesis is [token ids, cumulative negative log probability]
        sequences = [[[], 1.0]]
        pred_ids = torch.argmax(logit, dim=-1)
        # Keep only the frames whose greedy prediction has an id >= 1
        mask = pred_ids.ge(1).unsqueeze(-1).expand(logit.size())
        vocab_size = logit.size()[-1]
        voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1, vocab_size)), dim=-1)
        while True:
            all_candidates = list()
            exceed = False
            for seq in sequences:
                tokens, score = seq
                gpt_input = torch.tensor([tokenizer.cls_token_id] + tokens).to(device)
                gpt_prob = torch.nn.functional.softmax(lm_model(gpt_input).logits, dim=-1)[:len(gpt_input), :]
                # Stop once the hypothesis covers every retained acoustic frame
                if len(gpt_input) >= len(voice_prob):
                    exceed = True
                comb_pred_ids = gpt_prob * voice_prob[:len(gpt_input)]
                # Expand each hypothesis with its 50 best next tokens
                v, i = torch.topk(comb_pred_ids, 50, dim=-1)
                for tok_id, tok_prob in zip(i.tolist()[-1], v.tolist()[-1]):
                    candidate = [tokens + [tok_id], score + -log(tok_prob)]
                    all_candidates.append(candidate)
            # Keep the 10 lowest-cost hypotheses
            ordered = sorted(all_candidates, key=lambda tup: tup[1])
            sequences = ordered[:10]
            if exceed:
                break
        decoded_results.append(processor.decode(sequences[0][0]))

    batch["predicted"] = decoded_results
    batch["target"] = batch["sentence"]
    return batch

result = ds.map(map_to_pred, batched=True, batch_size=3, remove_columns=list(ds.features.keys()))

def cer_cal(groundtruth, hypothesis):
    # Corpus-level CER: total edit distance over total reference length
    err = 0
    tot = 0
    for p, t in zip(hypothesis, groundtruth):
        err += float(ed.eval(p.lower(), t.lower()))
        tot += len(t)
    return err / tot

print("CER: {:.2f}".format(100 * cer_cal(result["target"], result["predicted"])))
CER: 18.36
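The beam search loop in the script above keeps the 10 lowest-cost hypotheses at every step, expands each with its 50 best next tokens, and scores hypotheses by cumulative negative log probability. The sketch below shows the same bookkeeping on a toy problem in plain Python. The scoring function `toy_probs`, the two-token vocabulary, and the small beam settings are hypothetical stand-ins for the combined acoustic and GPT probabilities, and the initial cost is set to 0.0 here (a constant offset does not change the ranking).

```python
from math import log


def beam_search(next_token_probs, num_steps, beam_width=2, top_k=2):
    """Keep the `beam_width` lowest-cost prefixes; cost is the summed negative log probability."""
    beams = [([], 0.0)]
    for _ in range(num_steps):
        candidates = []
        for tokens, cost in beams:
            probs = next_token_probs(tokens)
            best = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
            for token, prob in best:
                candidates.append((tokens + [token], cost - log(prob)))
        beams = sorted(candidates, key=lambda c: c[1])[:beam_width]
    return beams


def toy_probs(prefix):
    """Hypothetical next-token probabilities, conditioned on the previous token."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix[-1] == "a":
        return {"a": 0.5, "b": 0.5}
    return {"a": 0.1, "b": 0.9}


best_tokens, best_cost = beam_search(toy_probs, num_steps=2)[0]
```

Here a greedy decoder would commit to "a" at the first step (probability 0.6) and end with a total probability of 0.3, whereas the beam keeps "b" alive and finds the sequence "b", "b" with probability 0.36. This is the kind of gain that motivates using beam search over position-wise argmax.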
Evaluation with BERT
import torchaudio
from datasets import load_dataset, load_metric
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)
import torch
import re
import sys
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "voidful/wav2vec2-large-xlsr-53-tw-gpt"
device = "cuda"
processor_name = "voidful/wav2vec2-large-xlsr-53-tw-gpt"

# Punctuation stripped from the reference sentences: full-width and CJK marks first, ASCII marks last
chars_to_ignore_regex = (
    r"[¥•＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､\u3000、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛"
    r"〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·'℃°．︰─﹖﹣﹂﹁﹐！？｡。"
    r"!\"#$%&()*+,\-.:;<=>?@\[\]\\/^_`{|}~]"
)
📄 License
This project is licensed under the apache-2.0 license.