whisper-large-v3-turbo-swiss-german: an open-source model that transcribes Swiss German speech into Standard German text

Whisper Large V3 Turbo Swiss German

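Developed by Flurin17

A Whisper model optimized for Swiss German automatic speech recognition; it transcribes Swiss German speech into Standard German text.

Speech Recognition

Transformers

Supports multiple languages
License: Apache-2.0
Tags: Swiss German to Standard German, multi-dialect speech recognition, parliamentary speech transcription

Downloads: 154

Release date: 5/22/2025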

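Model Overview

This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, optimized for automatic speech recognition of Swiss German (Schweizerdeutsch). It transcribes Swiss German speech into Standard German text.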

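Model Features

Swiss German dialect support

Supports all major Swiss German dialects, including those of Aargau, Bern, and Basel.

High-quality transcription

Fine-tuned on more than 350 hours of high-quality Swiss German speech data for accurate speech-to-text conversion.

Timestamps

Supports word-level and sentence-level timestamp output, which helps with audio alignment and analysis.

Batch processing

Supports processing multiple audio files in one call, which makes large-scale transcription more efficient.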

Model Capabilities

Swiss German speech recognition

Dialect to Standard German conversion

Audio timestamping

Batch speech transcription

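Use Cases

Speech transcription

Parliamentary record transcription

Transcribe Swiss German speeches from Swiss parliaments into Standard German text.

Dialect research

Analyze and document Swiss German dialects in linguistic research.

Media processing

Broadcast transcription

Automatically transcribe Swiss German radio programs into text.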

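🚀 Whisper Large V3 Turbo - Swiss German Fine-tuned

This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, optimized for automatic speech recognition of Swiss German (Schweizerdeutsch). It transcribes Swiss German speech into Standard German text. Evaluation is still pending.

🚀 Quick Start

Basic Usage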

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Flurin17/whisper-large-v3-turbo-swiss-german"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True, 
    use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a Swiss German audio file
result = pipe("path/to/swiss_german_audio.wav")
print(result["text"])

Advanced Usage

Batch Processing

# Process multiple files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(audio_files, batch_size=8)

for result in results:
    print(result["text"])

Getting Timestamps

# Get word-level timestamps
result = pipe("swiss_german_audio.wav", return_timestamps="word")
print(result["chunks"])

# Get segment-level (roughly sentence-level) timestamps
result = pipe("swiss_german_audio.wav", return_timestamps=True)
print(result["chunks"])
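The `chunks` list printed above can be turned into readable lines. The following is a minimal sketch, assuming each chunk is a dict with a `text` string and a `timestamp` pair of start and end seconds, which is the shape the Transformers speech recognition pipeline returns. The helper names `format_timestamp` and `format_chunks` and the sample data are hypothetical, and the end time is treated as possibly `None` for a chunk that is cut off at the end of the audio.

```python
def format_timestamp(seconds):
    """Render seconds as MM:SS.mmm; None (an open-ended chunk) gets a placeholder."""
    if seconds is None:
        return "??:??.???"
    total_ms = int(round(seconds * 1000))
    minutes, ms = divmod(total_ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_chunks(chunks):
    """Turn pipeline-style chunks into '[start -> end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"[{format_timestamp(start)} -> {format_timestamp(end)}] {text}")
    return lines


# Hypothetical sample data in the assumed shape of result["chunks"].
sample_chunks = [
    {"text": " Guten Morgen", "timestamp": (0.0, 1.2)},
    {"text": " zusammen", "timestamp": (1.2, None)},
]
for line in format_chunks(sample_chunks):
    print(line)
```

With real output, `format_chunks(result["chunks"])` would replace the sample list.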

Advanced Usage with Model and Processor

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Flurin17/whisper-large-v3-turbo-swiss-german"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Load and preprocess audio
audio_array, sampling_rate = librosa.load("swiss_german_audio.wav", sr=16000)

inputs = processor(
    audio_array,
    sampling_rate=sampling_rate,
    return_tensors="pt"
)
inputs = inputs.to(device, dtype=torch_dtype)

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(**inputs)

# Decode the transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

✨ Key Features

Fine-tuned specifically for Swiss German automatic speech recognition.
Transcribes Swiss German speech into Standard German text.

📦 Installation

The source does not provide installation steps.
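The usage examples import `torch`, `transformers`, and `librosa`, so those packages need to be available. The sketch below checks for them before the model is loaded. Including `accelerate` is an assumption: `low_cpu_mem_usage=True` typically relies on it, but the source does not list dependencies. The helper name `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages imported by the usage examples; `accelerate` is an assumption.
required = ["torch", "transformers", "librosa", "accelerate"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```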

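📚 Documentation

Model Description

Property	Details
Base model	`openai/whisper-large-v3-turbo`
Language	Swiss German dialects → Standard German text
Model size	809M parameters
License	Apache 2.0
Fine-tuned from	openai/whisper-large-v3-turbo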

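Performance

The model targets state-of-the-art performance on Swiss German automatic speech recognition. Evaluation is still pending, so no error rates are reported yet:

Word error rate (WER): not yet reported
Character error rate (CER): not yet reported
Training data: more than 350 hours of Swiss German speech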

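Training Data

The model was fine-tuned on a comprehensive collection of Swiss German speech data, including:

SwissDial-Zh v1.1: 24 hours of dialect-balanced Swiss German
Swiss Parliaments Corpus V2 (SPC): 293 hours of parliamentary speech
All Swiss German Dialects Test Set: 13 hours with a representative dialect distribution
ArchiMob Release 2: 70 hours

Total training data: more than 350 hours of high-quality Swiss German speech with Standard German transcriptions.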
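As a quick arithmetic check, the per-corpus figures listed above can be summed. They add up to 400 hours, which is consistent with the stated "more than 350 hours". The source does not say whether filtering or deduplication reduces this total, so the sum is only an upper bound under that assumption.

```python
# Hours per corpus as listed in the training data section.
corpus_hours = {
    "SwissDial-Zh v1.1": 24,
    "Swiss Parliaments Corpus V2 (SPC)": 293,
    "All Swiss German Dialects Test Set": 13,
    "ArchiMob Release 2": 70,
}

total_hours = sum(corpus_hours.values())
print(total_hours)  # 400
```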

Supported Dialects

The model supports all major Swiss German dialects:

Aargau (AG)
Bern (BE)
Basel (BS)
Graubünden (GR)
Lucerne (LU)
St. Gallen (SG)
Valais (VS)
Zurich (ZH)

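Training Details

Training Hyperparameters

Learning rate: 2e-5
Batch size: 24 per device (training), 4 per device (evaluation)
Gradient accumulation steps: 2
Training epochs: 3
Weight decay: 0.005
Warmup ratio: 0.03
Precision: bfloat16
Optimizer: AdamW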

Training Infrastructure

Hardware: 4 NVIDIA A100 GPUs (80 GB each)
Compute platform: Azure Machine Learning
Training time: about 5 hours
Frameworks: 🤗 Transformers, PyTorch
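The per-device batch size, gradient accumulation steps, and GPU count together determine how many samples contribute to each optimizer step. The sketch below works this out under the assumption that training used plain data parallelism across all 4 GPUs, which the source does not state explicitly. The function name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(per_device_batch, num_devices, grad_accum_steps):
    """Samples contributing to one optimizer step under data parallelism."""
    return per_device_batch * num_devices * grad_accum_steps


# 24 per device, 4 GPUs (assumed data-parallel), 2 accumulation steps.
print(effective_batch_size(24, 4, 2))  # 192
```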

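Data Processing

The training data was processed with the following pipeline:

Audio resampled to 16 kHz
Log-mel spectrogram feature extraction (128 mel bins)
Text normalization and tokenization
Dynamic batching, grouped by sequence length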
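The last step, grouping by sequence length, can be illustrated with a small sketch. It is not the author's pipeline: it simply sorts sample indices by length and cuts the sorted order into batches, so that samples of similar length end up together and less padding is needed. The function name `group_by_length` and the sample durations are hypothetical.

```python
def group_by_length(lengths, batch_size):
    """Return batches of sample indices, with similar-length samples batched together."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


# Hypothetical clip durations in seconds (token counts would work the same way).
durations = [12.0, 3.5, 29.0, 4.0, 11.5, 28.0]
batches = group_by_length(durations, batch_size=2)
print(batches)  # [[1, 3], [4, 0], [5, 2]]
```

A training setup would usually also shuffle the resulting batches so that the model does not see clips in strictly increasing length.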

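Comparison with Other Models

Model	Word error rate (WER)	Character error rate (CER)	Parameters
whisper-large-v3-turbo-swiss-german	not yet reported	not yet reported	809M
whisper-large-v3-turbo (zero-shot)	not yet reported	not yet reported	809M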
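Both metrics in the table are edit-distance ratios: WER counts word-level substitutions, insertions, and deletions relative to the number of reference words, and CER does the same at character level. The sketch below is a plain-Python illustration of those definitions, not the evaluation script used for this model. The function names and the example sentences are hypothetical, and no text normalization is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One deleted word out of four reference words.
print(wer("das ist ein Test", "das ist Test"))  # 0.25
```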

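Limitations and Bias

Domain: trained mainly on read speech and parliamentary proceedings.
Dialects: performance may vary across Swiss German dialects.
Audio quality: works best on clean, high-quality recordings.
Speaker demographics: the training data may not fully represent all speaker groups.
Transcription style: outputs Standard German text rather than a dialectal transcription.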

Model Card Authors

Flurin17 - model development and fine-tuning

Citation

If you use this model in your research, please cite:

@misc{whisper-large-v3-turbo-swiss-german-2024,
  author = {Flurin17},
  title = {Whisper Large V3 Turbo Fine-tuned for Swiss German},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Flurin17/whisper-large-v3-turbo-swiss-german}
}

Please also consider citing the original Whisper paper:

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

And the Swiss German datasets used for training:

@article{dogan2021swissdial,
  title={SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German},
  author={Dogan-Schönberger, Pelin and Mäder, Julian and Hofmann, Thomas},
  journal={arXiv preprint arXiv:2103.11401},
  year={2021}
}

@inproceedings{samardzic2016archimob,
  title={ArchiMob - A Corpus of Spoken Swiss German},
  author={Samardžić, Tanja and Scherrer, Yves and Glaser, Elvira},
  booktitle={Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  pages={4061--4066},
  year={2016},
  url={https://aclanthology.org/L16-1641}
}

@article{scherrer2019digitising,
  title={Digitising Swiss German: how to process and study a polycentric spoken language},
  author={Scherrer, Yves and Samardžić, Tanja and Glaser, Elvira},
  journal={Language Resources and Evaluation},
  volume={53},
  pages={735--769},
  year={2019},
  doi={10.1007/s10579-019-09457-5}
}

@inproceedings{pluss2022sds200,
  title={SDS-200: A Swiss German speech to standard German text corpus},
  author={Plüss, Michel and Hürlimann, Manuela and Cuny, Marc and Stöckli, Alla and Kapotis, Nikolaos and Hartmann, Julia and Ulasik, Malgorzata Anna and Scheller, Christian and Schraner, Yanick and Jain, Amit and Deriu, Jan and Cieliebak, Mark and Vogel, Manfred},
  booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference},
  pages={3250--3256},
  year={2022},
  address={Marseille, France},
  publisher={European Language Resources Association}
}

@article{pluss2021spc,
  title={Swiss parliaments corpus, an automatically aligned swiss german speech to standard german text corpus},
  author={Plüss, Michel and Neukom, Lukas and Vogel, Manfred},
  journal={arXiv preprint arXiv:2010.02810},
  year={2020}
}

@article{pluss2023stt4sg,
  title={STT4SG-350: A Speech Corpus for Swiss German with Standard German Translations},
  author={Plüss, Michel and Neukom, Lukas and Scheller, Christian and Vogel, Manfred},
  journal={arXiv preprint arXiv:2305.13179},
  year={2023}
}

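Acknowledgments

OpenAI, for the original Whisper model
Hugging Face, for the Transformers library and model hosting
The contributors of the Swiss German speech datasets, for high-quality training data:
- SwissDial-Zh v1.1: Pelin Dogan-Schönberger, Julian Mäder, Thomas Hofmann (ETH Zurich)
- Swiss Parliaments Corpus V2 (SPC): University of Applied Sciences and Arts Northwestern Switzerland
- SDS-200 corpus: the research community, for comprehensive coverage of Swiss German dialects
- ArchiMob corpus: Tanja Samardžić, Yves Scherrer, Elvira Glaser (University of Zurich)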

License

This model is released under the Apache 2.0 license. The original Whisper model is also licensed under Apache 2.0.

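Technical Specifications

Property	Details
Architecture	Transformer encoder-decoder
Input	16 kHz mono audio
Output	Standard German text
Context length	30 seconds
Sampling rate	16,000 Hz
Feature extraction	128 mel frequency bins
Vocabulary size	51,865 tokens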
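The specifications above imply a fixed input size per 30-second window. The sketch below derives it. The hop length of 160 samples (10 ms at 16 kHz) is an assumption taken from the standard Whisper front end and is not stated in the source. The helper names are hypothetical, and `num_windows` counts simple non-overlapping windows, whereas real long-form decoding usually overlaps them.

```python
import math

SAMPLE_RATE = 16_000    # Hz, from the specification table
CONTEXT_SECONDS = 30    # seconds per window, from the specification table
N_MELS = 128            # mel bins, from the specification table
HOP_LENGTH = 160        # samples per frame step; assumed standard Whisper value


def samples_per_window():
    """Number of audio samples in one 30-second context window."""
    return SAMPLE_RATE * CONTEXT_SECONDS


def feature_shape():
    """(mel bins, frames) of the log-mel input for one window."""
    return (N_MELS, samples_per_window() // HOP_LENGTH)


def num_windows(duration_seconds):
    """Non-overlapping 30-second windows needed to cover a recording."""
    return max(1, math.ceil(duration_seconds / CONTEXT_SECONDS))


print(samples_per_window())  # 480000
print(feature_shape())       # (128, 3000)
print(num_windows(95))       # 4
```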