lang-id-voxlingua107-ecapa: open-source speech model, free to deploy, for identifying 107 languages and extracting embedding vectors

Lang Id Voxlingua107 Ecapa

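Developed by speechbrain

A spoken language identification model built on the SpeechBrain framework and the ECAPA-TDNN architecture. It identifies 107 languages and extracts speech embedding vectors.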

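Audio classification

PyTorch

Multilingual support

Open-source license: Apache-2.0

#multilingual spoken language recognition #107 languages supported #ECAPA-TDNN architecture

Downloads: 330.01k

Release date: 3/2/2022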

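Model overview

The model uses the ECAPA-TDNN architecture and is trained on the VoxLingua107 dataset. It can be used for spoken language identification or as a feature extractor for speech segments. It accepts mono audio input sampled at 16 kHz.

Model features

Multilingual support

Identifies 107 languages, covering the major world languages and some less widely spoken ones

Dual use

Can be used directly for language identification or as a feature extractor for building dedicated models

High-performance architecture

Uses the ECAPA-TDNN architecture; the error rate on the VoxLingua107 development set is 6.7%

Automatic audio processing

Built-in audio normalization handles resampling and channel conversion automatically

Model capabilities

Spoken language identification

Speech feature extraction

Multilingual processing

Use cases

Speech processing

Multilingual speech classification

Identifies the language of a speech segment

6.7% error rate on the VoxLingua107 development set

Speech feature extraction

Extracts embedding vectors from speech segments for downstream tasks

256-dimensional feature vectors

Content management

Multilingual content classification

Classifies and manages user-generated multilingual speech content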

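🚀 VoxLingua107 ECAPA-TDNN spoken language identification model

This is a spoken language identification model trained on the VoxLingua107 dataset using SpeechBrain. It identifies the language of a speech utterance, covers 107 different languages, and can serve as a basis for downstream language identification tasks.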

Model information

Attribute	Details
Model type	Spoken language identification model
Training data	VoxLingua107 dataset
Metric	Accuracy

Supported languages

multilingual, ab, af, am, ar, as, az, ba, be, bg, bi, bo, br, bs, ca, ceb, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fo, fr, gl, gn, gu, gv, ha, haw, hi, hr, ht, hu, hy, ia, id, is, it, he, ja, jv, ka, kk, km, kn, ko, la, lb, ln, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, nn, no, oc, pa, pl, ps, pt, ro, ru, sa, sco, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, uk, ur, uz, vi, war, yi, yo, zh

License

Apache-2.0

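🚀 Quick start

The model identifies the language spoken in an utterance and supports 107 different languages. It was trained on recordings sampled at 16 kHz (mono). When classify_file is called, the code automatically normalizes the audio (that is, resampling plus mono channel selection).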
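To see whether a given file already matches the 16 kHz mono format described above, one can inspect its WAV header before handing it to the model. The following is a minimal sketch using only Python's standard wave module; the helper names (describe_wav, needs_normalization, write_silence) are hypothetical and not part of SpeechBrain, and the sketch assumes uncompressed PCM WAV files.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16000   # sampling rate of the training recordings (from the text)
EXPECTED_CHANNELS = 1   # mono


def describe_wav(path):
    """Return (sample_rate, channels) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def needs_normalization(path):
    """True if the file is not already 16 kHz mono."""
    rate, channels = describe_wav(path)
    return rate != EXPECTED_RATE or channels != EXPECTED_CHANNELS


def write_silence(path, rate, channels, seconds=0.1):
    """Write a short silent 16-bit PCM file (for demonstration only)."""
    frames = int(rate * seconds)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * frames * channels)


# Create two throwaway files and check each against the expected format.
with tempfile.TemporaryDirectory() as tmp:
    stereo_44k = os.path.join(tmp, "stereo_44k.wav")
    mono_16k = os.path.join(tmp, "mono_16k.wav")
    write_silence(stereo_44k, 44100, 2)
    write_silence(mono_16k, 16000, 1)
    stereo_needs_fix = needs_normalization(stereo_44k)
    mono_needs_fix = needs_normalization(mono_16k)

print(stereo_needs_fix, mono_needs_fix)
```

A check like this is mainly useful when audio is loaded by other means and passed as a tensor, where the sampling rate has to be handled by the caller.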

Install dependencies

pip install git+https://github.com/speechbrain/speechbrain.git@develop

Code example

import torchaudio
from speechbrain.inference.classifiers import EncoderClassifier
language_id = EncoderClassifier.from_hparams(source="speechbrain/lang-id-voxlingua107-ecapa", savedir="tmp")
# Load the Thai sample (sourced from Omniglot) and convert it to a suitable format
signal = language_id.load_audio("speechbrain/lang-id-voxlingua107-ecapa/udhr_th.wav")
prediction =  language_id.classify_batch(signal)
print(prediction)
#  (tensor([[-2.8646e+01, -3.0346e+01, -2.0748e+01, -2.9562e+01, -2.2187e+01,
#         -3.2668e+01, -3.6677e+01, -3.3573e+01, -3.2545e+01, -2.4365e+01,
#         -2.4688e+01, -3.1171e+01, -2.7743e+01, -2.9918e+01, -2.4770e+01,
#         -3.2250e+01, -2.4727e+01, -2.6087e+01, -2.1870e+01, -3.2821e+01,
#         -2.2128e+01, -2.2822e+01, -3.0888e+01, -3.3564e+01, -2.9906e+01,
#         -2.2392e+01, -2.5573e+01, -2.6443e+01, -3.2429e+01, -3.2652e+01,
#         -3.0030e+01, -2.4607e+01, -2.2967e+01, -2.4396e+01, -2.8578e+01,
#         -2.5153e+01, -2.8475e+01, -2.6409e+01, -2.5230e+01, -2.7957e+01,
#         -2.6298e+01, -2.3609e+01, -2.5863e+01, -2.8225e+01, -2.7225e+01,
#         -3.0486e+01, -2.1185e+01, -2.7938e+01, -3.3155e+01, -1.9076e+01,
#         -2.9181e+01, -2.2160e+01, -1.8352e+01, -2.5866e+01, -3.3636e+01,
#         -4.2016e+00, -3.1581e+01, -3.1894e+01, -2.7834e+01, -2.5429e+01,
#         -3.2235e+01, -3.2280e+01, -2.8786e+01, -2.3366e+01, -2.6047e+01,
#         -2.2075e+01, -2.3770e+01, -2.2518e+01, -2.8101e+01, -2.5745e+01,
#         -2.6441e+01, -2.9822e+01, -2.7109e+01, -3.0225e+01, -2.4566e+01,
#         -2.9268e+01, -2.7651e+01, -3.4221e+01, -2.9026e+01, -2.6009e+01,
#         -3.1968e+01, -3.1747e+01, -2.8156e+01, -2.9025e+01, -2.7756e+01,
#         -2.8052e+01, -2.9341e+01, -2.8806e+01, -2.1636e+01, -2.3992e+01,
#         -2.3794e+01, -3.3743e+01, -2.8332e+01, -2.7465e+01, -1.5085e-02,
#         -2.9094e+01, -2.1444e+01, -2.9780e+01, -3.6046e+01, -3.7401e+01,
#         -3.0888e+01, -3.3172e+01, -1.8931e+01, -2.2679e+01, -3.0225e+01,
#         -2.4995e+01, -2.1028e+01]]), tensor([-0.0151]), tensor([94]), ['th'])
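# The scores in the prediction[0] tensor can be interpreted as the log-likelihood
# that the given utterance belongs to the given language (i.e., larger is better)
# The likelihood on a linear scale can be obtained with:
print(prediction[1].exp())
#  tensor([0.9850])
# The ISO code of the identified language is given in prediction[3]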
print(prediction[3])
#  ['th: Thai']

# Alternatively, use the utterance embedding extractor:
emb = language_id.encode_batch(signal)
print(emb.shape)
# torch.Size([1, 1, 256])
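The score interpretation described in the comments above can be reproduced by hand. The sketch below copies three of the 107 log scores from the printed tensor (positions 0, 55, and 94) into a plain dictionary and applies the same two steps: take the largest score, then exponentiate it to get a linear-scale likelihood. It uses only the standard library, so it illustrates the arithmetic rather than the SpeechBrain API.

```python
import math

# Three of the 107 scores printed above, keyed by their position in the tensor.
# Position 94 is the one reported as 'th' in the source output.
log_scores = {0: -28.646, 55: -4.2016, 94: -0.015085}

# Larger log score means a more likely language.
best_index = max(log_scores, key=log_scores.get)

# Exponentiating a log score gives the likelihood on a linear scale.
best_likelihood = math.exp(log_scores[best_index])
runner_up_likelihood = math.exp(log_scores[55])

print(best_index, round(best_likelihood, 4))
```

The winning position matches the index tensor([94]) in the printed output, and its likelihood agrees with the 0.9850 shown for prediction[1].exp().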

To run inference on a GPU, add run_opts={"device":"cuda"} when calling the from_hparams method.

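⚠️ Important note

Make sure the input tensor matches the expected sampling rate (16 kHz), especially when using encode_batch and classify_batch. In the dataset and in this model's default settings (see label_encoder.txt), the ISO language code used for Hebrew is outdated (it should be he rather than iw) and the ISO language code for Javanese is incorrect (it should be jv rather than jw). See issue #2396 for details.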
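If downstream code expects current ISO codes, the two problematic labels can be rewritten after prediction. The following is a minimal sketch with a hypothetical helper, fix_language_label, which is not part of SpeechBrain. It assumes labels look either like a bare code or like the 'th: Thai' form shown in the example output; the names 'Hebrew' and 'Javanese' in the sample labels are assumptions for illustration.

```python
# Hypothetical helper: rewrite the two problematic codes named in the note.
CODE_FIXES = {"iw": "he", "jw": "jv"}


def fix_language_label(label):
    """Map 'iw' to 'he' and 'jw' to 'jv', keeping any ': Name' suffix."""
    code, sep, name = label.partition(":")
    code = code.strip()
    fixed = CODE_FIXES.get(code, code)
    return f"{fixed}{sep}{name}"


# Sample labels (assumed format) before and after the fix.
labels = ["iw: Hebrew", "jw: Javanese", "th: Thai", "jw"]
fixed_labels = [fix_language_label(label) for label in labels]
print(fixed_labels)
```

Applying the mapping at the output stage leaves the model files and label_encoder.txt untouched.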

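✨ Key features

Language identification: classifies speech into 107 different languages.
Architecture refinement: uses the ECAPA-TDNN architecture with more fully connected hidden layers after the embedding layer and is trained with cross-entropy loss, which improved the performance of the extracted utterance embeddings on downstream tasks.
Automatic audio processing: the code automatically normalizes the audio (resampling plus mono channel selection).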

💻 Usage examples

Advanced usage

# Run inference on a GPU
language_id = EncoderClassifier.from_hparams(source="speechbrain/lang-id-voxlingua107-ecapa", savedir="tmp", run_opts={"device":"cuda"})
signal = language_id.load_audio("speechbrain/lang-id-voxlingua107-ecapa/udhr_th.wav")
prediction =  language_id.classify_batch(signal)
print(prediction)

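📚 Detailed documentation

Intended use

Direct use: can be used as is for spoken language identification.
Feature extraction: serves as an utterance-level feature (embedding) extractor for building a dedicated language identification model on your own data.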
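One simple way to build a dedicated classifier on top of extracted embeddings is a nearest-centroid rule with cosine similarity. The sketch below is an illustration under stated assumptions, not the authors' method: the toy 3-dimensional vectors stand in for the 256-dimensional embeddings, the language labels and values are made up, and the helper names (cosine, centroids, classify) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def centroids(examples):
    """Average the embeddings of each language: {label: [vectors]} -> {label: mean}."""
    result = {}
    for label, vectors in examples.items():
        dim = len(vectors[0])
        result[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return result


def classify(embedding, language_centroids):
    """Return the label whose centroid is most similar to the embedding."""
    return max(
        language_centroids,
        key=lambda label: cosine(embedding, language_centroids[label]),
    )


# Toy 3-dimensional vectors stand in for real 256-dimensional embeddings.
train = {
    "et": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "fi": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]],
}
model = centroids(train)
predicted = classify([0.7, 0.3, 0.0], model)
print(predicted)
```

With real data, each vector would come from encode_batch, and a trained classifier such as logistic regression may work better than centroids.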

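Limitations and bias

Less widely spoken languages: accuracy on these languages may be limited.
Gender gap: because the YouTube data contains more male speech, recognition of female speech may be worse than recognition of male speech.
Accent: recognition quality degrades on speech with a foreign accent.
Atypical speech: recognition may be poor on children's speech and on speech from people with speech disorders.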

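🔧 Technical details

The model is trained on the VoxLingua107 dataset using SpeechBrain. It uses the ECAPA-TDNN architecture, which was previously used for speaker recognition. This model uses more fully connected hidden layers after the embedding layer and is trained with cross-entropy loss, which improved the performance of the extracted utterance embeddings on downstream tasks. The training recordings are sampled at 16 kHz (mono).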
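For readers unfamiliar with the training objective mentioned above, cross-entropy loss for a single utterance is the negative log-probability that the classifier assigns to the correct language. The following is a minimal standard-library sketch with made-up scores for three languages; it illustrates the formula only and does not reproduce the actual training recipe.

```python
import math


def log_softmax(logits):
    """Turn raw scores into log-probabilities in a numerically stable way."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def cross_entropy(logits, target_index):
    """Negative log-probability of the correct class."""
    return -log_softmax(logits)[target_index]


logits = [2.0, 0.5, -1.0]  # made-up scores for three languages
loss_if_first_is_correct = cross_entropy(logits, 0)
loss_if_last_is_correct = cross_entropy(logits, 2)
print(loss_if_first_is_correct, loss_if_last_is_correct)
```

The loss is small when the highest-scoring language is the correct one and grows as the correct language receives a lower score.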

📄 License

This project is released under the Apache-2.0 license.

Citation

Citing SpeechBrain

@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}

Citing VoxLingua107

@inproceedings{valk2021slt,
  title={{VoxLingua107}: a Dataset for Spoken Language Recognition},
  author={J{\"o}rgen Valk and Tanel Alum{\"a}e},
  booktitle={Proc. IEEE SLT Workshop},
  year={2021},
}