whisper-large-v3-distil-multi7-v0.2 open-source model - free automatic speech recognition and code-switching for 7 European languages

Whisper Large V3 Distil Multi7 V0.2

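Developed by bofenghuang

A multilingual distilled Whisper model for automatic speech recognition in 7 European languages, with code-switching support

Speech recognition

Transformers

Multilingual | License: MIT | #multilingual-speech-recognition #code-switching #lightweight-distillation

Downloads: 119

Release date: 12/5/2024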

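Model overview

This is a distilled model based on Whisper-Large-v3, optimized for automatic speech recognition in 7 European languages (English, French, Spanish, German, Italian, Portuguese, and Dutch). It has 2 decoder layers and supports code-switching: it automatically detects and transcribes speech that mixes several languages.

Model features

Multilingual support

Speech recognition in 7 European languages (English, French, Spanish, German, Italian, Portuguese, and Dutch)

Code-switching

Automatically detects language changes in the audio and emits the corresponding language tokens, producing a seamless multilingual transcription

Efficient distilled architecture

Keeps only 2 decoder layers, which speeds up inference while retaining good accuracy

Capabilities

Automatic speech recognition

Multilingual transcription

Code-switching detection

Speech-to-text conversion

Use cases

Multilingual transcription

Multilingual meeting transcription

Automatically transcribe meeting recordings that contain several languages

Accurately identifies language switches and generates text in the corresponding language

Multilingual media processing

Process podcasts, videos, and other media that contain several languages

Produce transcriptions annotated with language tokens

Speech analytics

Multilingual speech data analysis

Analyze speech datasets that contain several languages

Provide accurate transcriptions for downstream analysis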

🚀 Whisper-Large-V3-Distil-Multi7-v0.2

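This is a multilingual distilled Whisper model with 2 decoder layers, supporting 7 European languages: English, French, Spanish, German, Italian, Portuguese, and Dutch.

The model was trained as part of the author's work on Distil-Large-v3.5.

Its key feature is native support for code-switching. The model can switch languages within a single transcription segment, automatically emitting a new language token whenever it detects a language change (see the example below).

During training, the <|yue|> language token was repurposed as an automatic language-detection token, which enables code-switching at inference time. To use this feature, simply set the language parameter to cantonese (this is the default).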
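A small sketch of the token logic described above, assuming Whisper's standard language codes (en, fr, es, de, it, pt, nl, and yue for Cantonese). The helpers `language_token` and `is_code_switching_mode` are hypothetical, not part of the model or of transformers; they only make explicit which decoder-prompt token each language setting selects.

```python
# Hypothetical helper: which decoder-prompt language token a given setting selects.
# Whisper language codes for the seven languages listed on this card, plus Cantonese,
# whose <|yue|> token this model reuses as its automatic language-detection token.
WHISPER_CODES = {
    "english": "en",
    "french": "fr",
    "spanish": "es",
    "german": "de",
    "italian": "it",
    "portuguese": "pt",
    "dutch": "nl",
    "cantonese": "yue",
}


def language_token(language: str = "cantonese") -> str:
    """Return the language token for `language`; the default selects auto-detection."""
    code = WHISPER_CODES.get(language.lower())
    if code is None:
        raise ValueError(f"{language!r} is not a supported language setting for this model")
    return f"<|{code}|>"


def is_code_switching_mode(language: str = "cantonese") -> bool:
    """True when the setting selects the repurposed <|yue|> auto-detection token."""
    return language_token(language) == "<|yue|>"


print(language_token(), language_token("French"), is_code_switching_mode("dutch"))
```

In the quick-start code, the equivalent explicit call would presumably be `model.generate(input_features.to(device, dtype=torch_dtype), language="cantonese", task="transcribe", max_new_tokens=128)`, since transformers' Whisper `generate` accepts a language name or code and a `task`. The card does not say how the model behaves when a single language such as "french" is forced instead.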

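The model's performance is below that of the monolingual distilled versions and Whisper-Large-v3-Turbo. Future work should explore better training recipes, and possibly more data, to compress multilingual capabilities into a single model effectively.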

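🚀 Quick start

Install dependencies

Make sure the required libraries are installed:

pip install torch datasets transformers

Run the example code

The following example shows how to use the model: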

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the processor and model
model_name_or_path = "bofenghuang/whisper-large-v3-distil-multi7-v0.2"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
model.to(device)

# Load an example audio clip
dataset = load_dataset("bofenghuang/asr-dummy", "cs", split="test")
sample, text = dataset[0]["audio"], dataset[0]["text"]

# Ground-truth (reference) text
print(text)
# Aber sei ihnen nicht böse, Habibi, vergib ihnen, sie vergaßen die Liebe, sie vergaßen die Bibel, 
# wünsch ihnen den Frieden. Nous allons construire des radiotélescopes géants comme celui-ci, 
# qui est mon préféré. Questa è un'immagine di Cairo Open City, una mostra che il museo Folkwang di 
# Essen ha dedicato al ruolo della mobile photography nella primavera Araba.

# Extract log-mel input features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Generate token IDs
predicted_ids = model.generate(
    input_features.to(device, dtype=torch_dtype),
    max_new_tokens=128,
)

# Decode token IDs to text, dropping special tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
#  Aber sei ihnen nicht böse, Habibi, vergib ihn. Sie vergaßen die Liebe, sie vergaßen die Liebe. 
# Wünsche ihnen dem Frieden. Nous allons construire des radiotelescopes géants, comme celui-ci qui 
# est mon préféré. Esta es una imagen de Cairo Open City, una muestra que el Museo Folk Punk de Essen 
# ha dedicado al ruolo de la mobile fotografía en la primavera árabe.

# Decode again, keeping special tokens to see the language tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)[0]
print(transcription)
# <|de|> Aber sei ihnen nicht böse, Habibi, vergib ihn. Sie vergaßen die Liebe, sie vergaßen die Liebe. 
# Wünsche ihnen dem Frieden.<|fr|> Nous allons construire des radiotelescopes géants, comme celui-ci qui 
# est mon préféré.<|es|> Esta es una imagen de Cairo Open City, una muestra que el Museo Folk Punk de Essen 
# ha dedicado al ruolo de la mobile fotografía en la primavera árabe.
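With skip_special_tokens=False the decoded string carries language tokens such as <|de|>, <|fr|> and <|es|>, as in the output above. A minimal sketch for splitting such a string into (language, text) segments, assuming every language token has the form <|xx|> or <|xxx|> and that any other <|...|> special token can be discarded; `split_by_language` is a hypothetical helper, not a transformers API. Segments left empty after stripping (for example the <|yue|> prompt token followed directly by other prompt tokens) are skipped.

```python
import re

# A language token such as <|de|> or <|fr|> (2- or 3-letter Whisper language code).
LANG_TOKEN = re.compile(r"<\|([a-z]{2,3})\|>")
# Any other special token, e.g. <|endoftext|> or <|notimestamps|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def split_by_language(decoded: str) -> list[tuple[str, str]]:
    """Split text decoded with skip_special_tokens=False into (language, text) pairs."""
    matches = list(LANG_TOKEN.finditer(decoded))
    segments = []
    for i, match in enumerate(matches):
        # A segment runs from the end of this language token to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(decoded)
        text = SPECIAL_TOKEN.sub("", decoded[match.end():end]).strip()
        if text:  # skip segments that contain only special tokens
            segments.append((match.group(1), text))
    return segments


example = "<|de|> Aber sei ihnen nicht böse.<|fr|> Nous allons construire.<|es|> Esta es una imagen.<|endoftext|>"
for language, text in split_by_language(example):
    print(language, "->", text)
```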

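✨ Key features

Multilingual support: English, French, Spanish, German, Italian, Portuguese, and Dutch.
Code-switching: native support for switching languages within a single transcription segment.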

📚 Detailed documentation

Evaluation results

English

Model	LIUM_tedlium	mcv17	voxpopuli	fleurs	kensho_spgispeech	librispeech-test_clean	librispeech-test_other	speechcolab_gigaspeech
openai/whisper-large-v3	10.58	10.13	8.93	5.72	2.95	1.87	3.58	10.07
openai/whisper-large-v3-turbo	10.20	11.74	11.78	6.13	2.95	1.98	3.94	10.11
distil-whisper/distil-large-v3	8.93	12.41	7.72	7.59	3.25	2.42	5.11	10.08
distil-whisper/distil-large-v3.5	8.65	11.07	7.54	6.74	2.86	2.28	4.94	9.84
bofenghuang/whisper-large-v3-distil-multi4-v0.2	8.88	11.33	7.60	6.97	3.03	2.51	5.24	10.12
bofenghuang/whisper-large-v3-distil-multi7-v0.2	9.36	11.32	7.65	7.02	2.99	2.46	5.24	10.06
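The card does not name the metric. Assuming the scores are word error rates in percent (lower is better), one quick way to compare the English rows is an unweighted mean over the eight test sets. This is a crude summary that ignores test-set size and difficulty; the values are copied from the table above, and `rank_by_mean` is a hypothetical helper.

```python
from statistics import mean

# English scores copied from the table above; each list follows this column order.
ENGLISH_DATASETS = [
    "LIUM_tedlium", "mcv17", "voxpopuli", "fleurs", "kensho_spgispeech",
    "librispeech-test_clean", "librispeech-test_other", "speechcolab_gigaspeech",
]
ENGLISH_SCORES = {
    "openai/whisper-large-v3": [10.58, 10.13, 8.93, 5.72, 2.95, 1.87, 3.58, 10.07],
    "openai/whisper-large-v3-turbo": [10.20, 11.74, 11.78, 6.13, 2.95, 1.98, 3.94, 10.11],
    "distil-whisper/distil-large-v3": [8.93, 12.41, 7.72, 7.59, 3.25, 2.42, 5.11, 10.08],
    "distil-whisper/distil-large-v3.5": [8.65, 11.07, 7.54, 6.74, 2.86, 2.28, 4.94, 9.84],
    "bofenghuang/whisper-large-v3-distil-multi4-v0.2": [8.88, 11.33, 7.60, 6.97, 3.03, 2.51, 5.24, 10.12],
    "bofenghuang/whisper-large-v3-distil-multi7-v0.2": [9.36, 11.32, 7.65, 7.02, 2.99, 2.46, 5.24, 10.06],
}


def rank_by_mean(scores: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Return (model, unweighted mean score) pairs, lowest (best) mean first."""
    return sorted(((model, mean(values)) for model, values in scores.items()), key=lambda pair: pair[1])


for model, avg in rank_by_mean(ENGLISH_SCORES):
    print(f"{avg:5.2f}  {model}")
```

On this crude summary, the two multilingual distilled checkpoints (about 6.96 and 7.01) land between distil-large-v3.5 (6.74) and distil-large-v3 (7.19).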

French

Model	mcv17	mls	voxpopuli	mtedx	af_accented	fleurs	hf_dev_data_chunk30	hf_dev_data_sequential	mtedx_chunk30	mtedx_sequential
openai/whisper-large-v3	10.98	4.69	11.15	8.67	7.51	5.4	9.87	8.97	9	8.01
openai/whisper-large-v3-turbo	12.41	5.1	12.21	9.87	8.37	5.48	10.12	9	8.49	8.39
bofenghuang/whisper_large_v3_distil_fr_v0.2	11.1	5	10.68	8.75	7.09	6.35	9.44	9.84	8.94	8.93
bofenghuang/whisper-large-v3-distil-multi4-v0.2	11.96	6.04	11.07	9.16	7.99	7.10	10.42	12.61	9.06	11.75
bofenghuang/whisper-large-v3-distil-multi7-v0.2	12.19	6.2	11.29	9.13	8.26	7.17	10.04	12.26	8.93	11.56
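The statement that the multilingual model trails the monolingual distilled version can be checked dataset by dataset on the French rows. A sketch under the assumption that lower scores are better; the values are copied from the French table above, and `per_dataset_gap` is a hypothetical helper.

```python
# French scores copied from the table above, in column order.
FRENCH_DATASETS = [
    "mcv17", "mls", "voxpopuli", "mtedx", "af_accented", "fleurs",
    "hf_dev_data_chunk30", "hf_dev_data_sequential", "mtedx_chunk30", "mtedx_sequential",
]
MONOLINGUAL_FR = [11.1, 5, 10.68, 8.75, 7.09, 6.35, 9.44, 9.84, 8.94, 8.93]  # whisper_large_v3_distil_fr_v0.2
MULTI7 = [12.19, 6.2, 11.29, 9.13, 8.26, 7.17, 10.04, 12.26, 8.93, 11.56]  # whisper-large-v3-distil-multi7-v0.2


def per_dataset_gap(baseline, candidate, datasets):
    """Return {dataset: candidate - baseline}; a positive gap means the candidate scores higher (worse)."""
    return {name: round(c - b, 2) for name, b, c in zip(datasets, baseline, candidate)}


gaps = per_dataset_gap(MONOLINGUAL_FR, MULTI7, FRENCH_DATASETS)
worse = [name for name, gap in gaps.items() if gap > 0]
largest = max(gaps, key=gaps.get)
print(f"multi7 scores higher on {len(worse)} of {len(gaps)} French test sets")
print(f"largest gap: {largest} ({gaps[largest]:+.2f})")
```

Running it reports that multi7 scores higher on 9 of the 10 French test sets, the exception being mtedx_chunk30 (8.93 vs 8.94); the two largest gaps are in the *_sequential columns (2.42 and 2.63 points).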

Spanish

Model	mcv17	mls	voxpopuli	mtedx	fleurs	hf_dev_data_chunk30	hf_dev_data_sequential	mtedx_chunk30	mtedx_sequential
openai/whisper-large-v3	4.91	3.97	11.06	6.52	4.22	10.85	10.36	5.90	5.22
openai/whisper-large-v3-turbo	5.74	4.41	16.02	6.66	4.59	11.55	10.68	6.46	5.41
bofenghuang/whisper-large-v3-distil-multi4-v0.2	5.58	4.34	8.52	7.43	5.20	11.26	13.43	5.69	8.95
bofenghuang/whisper-large-v3-distil-multi7-v0.2	5.70	4.35	8.55	7.56	5.15	11.45	13.54	5.84	8.27

German

Model	mcv17	mls	voxpopuli	mtedx	fleurs	hf_dev_data_chunk30	hf_dev_data_sequential	mtedx_chunk30	mtedx_sequential
openai/whisper-large-v3	6.11	5.60	17.75	19.63	5.92	11.21	10.35	17.64	17.76
openai/whisper-large-v3-turbo	7.45	6.43	20.48	20.00	6.45	10.57	9.70	18.04	18.37
bofenghuang/whisper-large-v3-distil-multi4-v0.2	7.31	6.45	12.41	21.48	8.20	11.04	13.55	19.54	21.76
bofenghuang/whisper-large-v3-distil-multi7-v0.2	7.57	6.67	12.42	21.95	8.28	11.21	13.84	19.90	21.67

Italian

Model	mcv17	mls	voxpopuli	mtedx	fleurs	hf_dev_data_chunk30	hf_dev_data_sequential	mtedx_chunk30	mtedx_sequential
openai/whisper-large-v3	5.71	9.58	28.45	7.21	4.28	6.95	6.37	6.83	7.28
openai/whisper-large-v3-turbo	6.77	10.64	30.69	7.41	4.69	6.88	6.52	7.98	7.37
bofenghuang/whisper_large_v3_distil_it_v0.2	6.15	9.22	17.27	7.52	5.26	6.06	6.99	7.84	8.42
bofenghuang/whisper-large-v3-distil-multi7-v0.2	6.78	11.42	17.53	8.07	5.68	7.04	9.51	7.51	10.47

Portuguese

Model	mcv17	mls	mtedx	fleurs	hf_dev_data_chunk30	hf_dev_data_sequential	mtedx_chunk30	mtedx_sequential
openai/whisper-large-v3	6.76	7.04	8.91	5.86	12.11	12.39	8.70	8.98
openai/whisper-large-v3-turbo	7.66	6.64	8.84	6.11	12.42	11.62	10.97	9.04
bofenghuang/whisper-large-v3-distil-multi7-v0.2	8.31	6.75	10.11	7.10	12.74	14.97	9.64	11.78

Dutch

Model	mcv17	mls	voxpopuli	fleurs
openai/whisper-large-v3	4.51	66.95	23.35	6.99
openai/whisper-large-v3-turbo	6.16	52.37	27.42	7.59
bofenghuang/whisper-large-v3-distil-multi7-v0.2	6.76	14.82	14.92	10.86
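Assuming the tables report word error rate (WER), here is a minimal sketch of the metric itself: the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. It splits on whitespace and applies no text normalization. Published evaluations typically normalize casing and punctuation first, so its output is not directly comparable with the numbers above. `word_error_rate` is a hypothetical helper, not the evaluation code used for this card.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,  # deletion of ref_word
                curr[j - 1] + 1,  # insertion of hyp_word
                prev[j - 1] + (ref_word != hyp_word),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c"))
```

With the quick-start variables, `100 * word_error_rate(text, transcription)` would give a percentage for the sample clip. Note that `transcription` at the end of that script holds the output decoded with skip_special_tokens=False, so it should be decoded again with skip_special_tokens=True before scoring.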