s2t-wav2vec2-large-en-ca: Open-Source Speech Translation Model for Free English-to-Catalan Translation

S2t Wav2vec2 Large En Ca

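Developed by facebook

A Transformer-based end-to-end speech translation model designed for translating English speech into Catalan text.

Speech Recognition

Transformers

Supports multiple languages | License: MIT | #End-to-end speech translation #Wav2Vec2 encoder #English-Catalan

Downloads: 35

Release date: 3/2/2022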

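Model Overview

The model uses a pretrained Wav2Vec2 as the encoder, paired with a Transformer decoder, and translates English speech directly into Catalan text.

Model Features

End-to-end speech translation

Generates target-language text directly from speech input, with no intermediate transcription step

Wav2Vec2 pretraining

Uses Wav2Vec2, pretrained with large-scale self-supervision, as the speech encoder

Transformer architecture

Uses a standard Transformer decoder for sequence generation

Model Capabilities

English speech recognition

English-to-Catalan translation

End-to-end speech translation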

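Use Cases

Speech translation

Real-time speech translation

Translates English speech into Catalan text in real time

Achieves a BLEU score of 34.1 on the CoVoST-V2 test set

Speech transcription and translation

Transcribes English speech content and translates it into Catalan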

🚀 S2T2-Wav2Vec2-CoVoST2-EN-CA-ST

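s2t-wav2vec2-large-en-ca is a Speech to Text Transformer model trained for end-to-end Speech Translation (ST). The S2T2 model was proposed in the paper Large-Scale Self- and Semi-Supervised Learning for Speech Translation and officially released in Fairseq.

🚀 Quick Start

This model can be used for end-to-end translation of English speech into Catalan text. See the model hub for other S2T2 checkpoints.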

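✨ Key Features

Model description

S2T2 is a Transformer-based sequence-to-sequence (speech encoder-decoder) model designed for end-to-end Automatic Speech Recognition (ASR) and Speech Translation (ST). It uses a pretrained Wav2Vec2 as the encoder and a Transformer-based decoder. The model is trained with the standard autoregressive cross-entropy loss and generates translations autoregressively.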
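To make the training objective and the generation loop described above concrete, here is a minimal pure-Python sketch. It is not the Fairseq or Transformers implementation: the function names are hypothetical, the decoder is abstracted into per-step score lists, and greedy decoding is assumed only for simplicity (the real generate method may use another strategy such as beam search).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def autoregressive_cross_entropy(step_logits, target_ids):
    """Mean negative log-likelihood of the target tokens.

    step_logits[t] stands for the decoder scores at position t, assumed to be
    computed from the encoder output and the gold tokens before t
    (teacher forcing).
    """
    assert len(step_logits) == len(target_ids)
    nll = 0.0
    for logits, token_id in zip(step_logits, target_ids):
        nll -= log_softmax(logits)[token_id]
    return nll / len(target_ids)


def greedy_decode(next_token_logits, bos_id, eos_id, max_len=20):
    """Generate one token at a time, feeding each prediction back in.

    next_token_logits is a hypothetical callable that maps the tokens
    generated so far to scores over the vocabulary.
    """
    tokens = [bos_id]
    for _ in range(max_len):
        logits = next_token_logits(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

With uniform scores over a four-token vocabulary, the loss equals log(4), which is a quick sanity check that the normalization is correct.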

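Intended uses and limitations

This model can be used to translate English speech directly into Catalan text.

Evaluation results

CoVoST-V2 test results for English to Catalan (BLEU score): 34.1. For more information, see the official paper, in particular row 10 of Table 2.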
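For readers unfamiliar with the metric, the following is a simplified sketch of how a corpus-level BLEU score is computed: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It assumes pre-tokenized text and a single reference per sentence, and the function names are illustrative. It is not the tooling behind the 34.1 figure, so scores from this sketch are not comparable to published numbers.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (one reference per hypothesis)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clipping: an n-gram is credited at most as often as it
            # appears in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than the references overall.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

A hypothesis identical to its reference scores 100, and one that shares no words with it scores 0.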

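💻 Usage Examples

Basic usage

Because this is a standard sequence-to-sequence Transformer model, you can use the generate method to produce the output text by passing the speech features to the model. You can also use the model directly through the automatic speech recognition (ASR) pipeline: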

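from datasets import load_dataset
from transformers import pipeline

# Small LibriSpeech sample, used here only as example English audio
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
# The pipeline wraps feature extraction, generation, and decoding
asr = pipeline("automatic-speech-recognition", model="facebook/s2t-wav2vec2-large-en-ca", feature_extractor="facebook/s2t-wav2vec2-large-en-ca")

# The result is a dict; the generated Catalan text is under the "text" key
translation = asr(librispeech_en[0]["file"])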

Advanced usage

You can also use the model step by step:

import soundfile as sf
import torch  # needed for return_tensors="pt"
from datasets import load_dataset
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-ca")
processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-ca")

def map_to_array(batch):
    # Read the waveform from disk; the sample rate is discarded because
    # the example audio is assumed to be 16 kHz already
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# The feature extractor returns the raw waveform as "input_values"
inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
generated_ids = model.generate(inputs=inputs["input_values"], attention_mask=inputs["attention_mask"])
# Decode the generated token ids into Catalan text
transcription = processor.batch_decode(generated_ids)
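The step-by-step example passes sampling_rate=16_000 and discards the rate returned by sf.read, so it assumes mono 16 kHz input. For audio recorded differently, something like the sketch below could prepare the waveform first. The helper names to_mono and resample_linear are hypothetical, the helpers operate on plain Python lists (for a NumPy array from sf.read, call .tolist() first), and linear interpolation is only a crude stand-in with no anti-aliasing filter; a dedicated resampling library is preferable in practice.

```python
def to_mono(frames):
    """Average the channels of each frame.

    frames is either a list of floats (already mono) or a list of
    per-frame channel lists, e.g. [[left, right], ...].
    """
    return [
        sum(frame) / len(frame) if isinstance(frame, (list, tuple)) else float(frame)
        for frame in frames
    ]


def resample_linear(samples, source_rate, target_rate=16_000):
    """Resample a mono signal by linear interpolation (no anti-aliasing)."""
    if source_rate == target_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * target_rate / source_rate)
    out = []
    for i in range(n_out):
        # Position of output sample i on the input time axis.
        pos = i * source_rate / target_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out
```

The resulting list can then be passed to the processor in place of ds["speech"][0], still with sampling_rate=16_000.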

📚 Documentation

Datasets

covost2
librispeech_asr

Inference widget examples

Example title: Common Voice 1, audio URL: https://cdn-media.huggingface.co/speech_samples/common_voice_en_18301577.mp3
Example title: Common Voice 2, audio URL: https://cdn-media.huggingface.co/speech_samples/common_voice_en_99989.mp3
Example title: Common Voice 3, audio URL: https://cdn-media.huggingface.co/speech_samples/common_voice_en_9999.mp3

BibTeX citation

@article{DBLP:journals/corr/abs-2104-06678,
  author    = {Changhan Wang and
               Anne Wu and
               Juan Miguel Pino and
               Alexei Baevski and
               Michael Auli and
               Alexis Conneau},
  title     = {Large-Scale Self- and Semi-Supervised Learning for Speech Translation},
  journal   = {CoRR},
  volume    = {abs/2104.06678},
  year      = {2021},
  url       = {https://arxiv.org/abs/2104.06678},
  archivePrefix = {arXiv},
  eprint    = {2104.06678},
  timestamp = {Thu, 12 Aug 2021 15:37:06 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2104-06678.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}