whisper-large-v3-turbo-swiss-german: an open-source model that transcribes Swiss German speech into Standard German text

Whisper Large V3 Turbo Swiss German

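Developed by Flurin17

A Whisper model optimized for Swiss German automatic speech recognition; it transcribes Swiss German speech into Standard German text.

Speech recognition

Transformers

Supports multiple languages
License: Apache-2.0
Tags: Swiss German to Standard German, multi-dialect speech recognition, parliamentary speech transcription

Downloads: 154

Released: 5/22/2025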

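Model overview

This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, optimized for automatic speech recognition of Swiss German (Schweizerdeutsch). It transcribes Swiss German speech into Standard German text.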

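Model features

Swiss German dialect support

Supports all major Swiss German dialects, including those of the cantons of Aargau, Bern, and Basel.

High-quality transcription

Fine-tuned on more than 350 hours of high-quality Swiss German speech for accurate speech-to-text output.

Timestamps

Outputs word-level and sentence-level timestamps, which helps with audio alignment and analysis.

Batch processing

Transcribes multiple audio files in one batched call, which speeds up large-scale transcription.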

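Model capabilities

Swiss German speech recognition

Dialect to Standard German conversion

Audio timestamping

Batch speech transcription

Use cases

Speech transcription

Parliamentary record transcription

Transcribe Swiss German speeches held in Swiss parliaments into Standard German text.

Dialect research

Analyze and document Swiss German dialects in linguistic research.

Media processing

Broadcast transcription

Automatically transcribe Swiss German broadcast programs into text.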

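🚀 Whisper Large V3 Turbo - Swiss German Fine-tuned

This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, optimized for automatic speech recognition of Swiss German (Schweizerdeutsch). It transcribes Swiss German speech into Standard German text. Evaluation is still pending.

🚀 Quick start

Basic usage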

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Flurin17/whisper-large-v3-turbo-swiss-german"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True, 
    use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a Swiss German audio file
result = pipe("path/to/swiss_german_audio.wav")
print(result["text"])

Advanced usage

Batch processing

# Process multiple files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(audio_files, batch_size=8)

for result in results:
    print(result["text"])

Getting timestamps

# Get word-level timestamps
result = pipe("swiss_german_audio.wav", return_timestamps="word")
print(result["chunks"])

# Get segment-level timestamps (roughly sentence-level)
result = pipe("swiss_german_audio.wav", return_timestamps=True)
print(result["chunks"])
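The timestamped chunks can be turned into subtitles. The following is a minimal sketch, assuming each chunk is a dict with a "text" string and a "timestamp" pair of (start, end) in seconds, and that the final end time may be None. The helper names `format_srt_time` and `chunks_to_srt`, the two-second fallback duration, and the example chunks are illustrative and not part of the model card.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks, fallback_duration=2.0):
    """Convert timestamped chunks into SRT subtitle text.

    Assumes each chunk looks like {"text": str, "timestamp": (start, end)}.
    """
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # the last chunk can come back with an open end
            end = start + fallback_duration
        entries.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hypothetical chunks, shaped like result["chunks"] from the pipeline
example_chunks = [
    {"text": " Guten Morgen.", "timestamp": (0.0, 1.5)},
    {"text": " Wie geht es Ihnen?", "timestamp": (1.5, None)},
]
srt_text = chunks_to_srt(example_chunks)
print(srt_text)
```

Segment-level chunks usually make more readable subtitles than word-level ones, which are better suited to alignment tasks.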

Advanced usage with the model and processor

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Flurin17/whisper-large-v3-turbo-swiss-german"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Load and preprocess audio
audio_array, sampling_rate = librosa.load("swiss_german_audio.wav", sr=16000)

inputs = processor(
    audio_array,
    sampling_rate=sampling_rate,
    return_tensors="pt"
)
inputs = inputs.to(device, dtype=torch_dtype)

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(**inputs)

# Decode the transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
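By default the Whisper processor pads or truncates each input to the 30-second context length, so the snippet above transcribes at most the first 30 seconds of a recording. One way to handle longer audio is to split it into consecutive windows and run each window through the same processor and `model.generate` steps. The helper below is a minimal sketch of that idea; `split_into_windows` is a hypothetical name, and the silent dummy audio stands in for the array returned by `librosa.load`.

```python
def split_into_windows(samples, sampling_rate=16000, window_seconds=30):
    """Split a 1-D sequence of audio samples into consecutive fixed-length windows.

    The last window may be shorter; the processor pads it to 30 seconds.
    """
    window_size = sampling_rate * window_seconds
    return [
        samples[start:start + window_size]
        for start in range(0, len(samples), window_size)
    ]


# 70 seconds of silence as a stand-in for real audio
dummy_audio = [0.0] * (70 * 16000)
windows = split_into_windows(dummy_audio)
print([len(window) / 16000 for window in windows])
```

Hard cuts like these can split a word in the middle. Passing `chunk_length_s` to the pipeline, which uses overlapping chunks, is likely the more robust route for long recordings.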

✨ Key features

Fine-tuned specifically for Swiss German automatic speech recognition.
Transcribes Swiss German speech into Standard German text.

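📚 Detailed documentation

Model description

Property	Details
Base model	`openai/whisper-large-v3-turbo`
Language	Swiss German dialects → Standard German text
Model size	809M parameters
License	Apache 2.0
Fine-tuned from	`openai/whisper-large-v3-turbo`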

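Performance

The model card describes the model as state of the art for Swiss German automatic speech recognition, but evaluation is still pending and no scores have been published:

Word error rate (WER): not yet reported
Character error rate (CER): not yet reported
Training data: more than 350 hours of Swiss German speech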
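For readers who want to run their own evaluation, both metrics are the edit distance between reference and hypothesis divided by the reference length, counted in words for WER and in characters for CER. The following is a minimal pure-Python sketch under that standard definition; it applies no text normalization, assumes a non-empty reference, and the function names and example sentences are illustrative rather than the author's evaluation setup.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "das ist ein test"
hypothesis = "das ist test"
print(word_error_rate(reference, hypothesis))
print(character_error_rate(reference, hypothesis))
```

In practice, scores depend heavily on how casing, punctuation, and numbers are normalized before comparison, so the normalization should be reported alongside any WER or CER figure.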

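Training data

The model was fine-tuned on a comprehensive collection of Swiss German speech datasets:

SwissDial-Zh v1.1: 24 hours of dialect-balanced Swiss German
Swiss Parliaments Corpus V2 (SPC): 293 hours of parliamentary speech
All Swiss German Dialects Test Set: 13 hours with a representative dialect distribution
ArchiMob Release 2: 70 hours

Total training data: more than 350 hours of high-quality Swiss German speech with Standard German transcriptions.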
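As a quick sanity check, the per-dataset hours listed above can be summed and compared with the stated total. The snippet below uses only the figures from the list; the variable names are illustrative.

```python
# Hours per dataset, as listed in the model card
hours_by_dataset = {
    "SwissDial-Zh v1.1": 24,
    "Swiss Parliaments Corpus V2 (SPC)": 293,
    "All Swiss German Dialects Test Set": 13,
    "ArchiMob Release 2": 70,
}

total_hours = sum(hours_by_dataset.values())
for name, hours in hours_by_dataset.items():
    # Share of each dataset in the listed total
    print(f"{name}: {hours} h ({hours / total_hours:.1%})")
print(f"Total: {total_hours} h")
```

The listed figures add up to 400 hours, which is consistent with the "more than 350 hours" claim, and parliamentary speech makes up roughly three quarters of that total.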

Supported dialects

The model supports all major Swiss German dialects:

Aargau (AG)
Bern (BE)
Basel (BS)
Graubünden (GR)
Lucerne (LU)
St. Gallen (SG)
Valais (VS)
Zurich (ZH)

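Training details

Training hyperparameters

Learning rate: 2e-5
Batch size: 24 per device (training), 4 per device (evaluation)
Gradient accumulation steps: 2
Epochs: 3
Weight decay: 0.005
Warmup ratio: 0.03
Precision: bfloat16
Optimizer: AdamW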

Training infrastructure

Hardware: 4 × NVIDIA A100 GPUs (80 GB each)
Compute platform: Azure Machine Learning
Training time: about 5 hours
Framework: 🤗 Transformers, PyTorch
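The per-device batch size, the gradient accumulation steps, and the GPU count together determine how many examples contribute to each optimizer step. The calculation below is a sketch that assumes plain data-parallel training across all four GPUs, which the model card does not state explicitly.

```python
per_device_train_batch_size = 24  # from the hyperparameter list
gradient_accumulation_steps = 2   # from the hyperparameter list
num_gpus = 4                      # assumption: all four A100s take part in data-parallel training

# Examples that contribute to a single optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)
```

Under that assumption each optimizer step sees 192 examples, which is the number to keep constant when reproducing the run on different hardware.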

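Data processing

The training data went through the following pipeline:

Audio resampling to 16 kHz
Log-Mel spectrogram feature extraction (128 Mel bins)
Text normalization and tokenization
Dynamic batching, grouped by sequence length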
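The last step, grouping by sequence length, puts examples of similar length into the same batch so that less compute is spent on padding. The model card does not say how this was implemented; the sketch below shows one simple way to do it by sorting example indices by length. The `group_by_length` helper and the token counts are hypothetical.

```python
def group_by_length(lengths, batch_size):
    """Return batches of example indices, grouping examples of similar length."""
    # Sort indices by example length, then cut the sorted order into batches
    order = sorted(range(len(lengths)), key=lambda index: lengths[index])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


# Hypothetical token counts of six target transcripts
lengths = [12, 48, 15, 50, 11, 47]
batches = group_by_length(lengths, batch_size=3)
print(batches)
```

A fully sorted order removes randomness between epochs, so training code typically shuffles within length buckets instead. Transformers offers a `group_by_length` training option for this purpose, though whether it was used here is not stated.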

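Comparison with other models

Model	Word error rate (WER)	Character error rate (CER)	Parameters
whisper-large-v3-turbo-swiss-german	not reported	not reported	809M
whisper-large-v3-turbo (zero-shot)	not reported	not reported	809M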

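Limitations and bias

Domain: trained mainly on read speech and parliamentary proceedings.
Dialects: performance may vary across Swiss German dialects.
Audio quality: works best on clean, high-quality recordings.
Speaker demographics: the training data may not fully represent all speaker groups.
Transcription style: outputs Standard German text, not a dialectal transcription.

Model card author

Flurin17 - model development and fine-tuning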

Citation

If you use this model in your research, please cite:

@misc{whisper-large-v3-turbo-swiss-german-2024,
  author = {Flurin17},
  title = {Whisper Large V3 Turbo Fine-tuned for Swiss German},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Flurin17/whisper-large-v3-turbo-swiss-german}
}

Please also consider citing the original Whisper paper:

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

and the Swiss German datasets used for training:

@article{dogan2021swissdial,
  title={SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German},
  author={Dogan-Schönberger, Pelin and Mäder, Julian and Hofmann, Thomas},
  journal={arXiv preprint arXiv:2103.11401},
  year={2021}
}

@inproceedings{samardzic2016archimob,
  title={ArchiMob - A Corpus of Spoken Swiss German},
  author={Samardžić, Tanja and Scherrer, Yves and Glaser, Elvira},
  booktitle={Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  pages={4061--4066},
  year={2016},
  url={https://aclanthology.org/L16-1641}
}

@article{scherrer2019digitising,
  title={Digitising Swiss German: how to process and study a polycentric spoken language},
  author={Scherrer, Yves and Samardžić, Tanja and Glaser, Elvira},
  journal={Language Resources and Evaluation},
  volume={53},
  pages={735--769},
  year={2019},
  doi={10.1007/s10579-019-09457-5}
}

@inproceedings{pluss2022sds200,
  title={SDS-200: A Swiss German speech to standard German text corpus},
  author={Plüss, Michel and Hürlimann, Manuela and Cuny, Marc and Stöckli, Alla and Kapotis, Nikolaos and Hartmann, Julia and Ulasik, Malgorzata Anna and Scheller, Christian and Schraner, Yanick and Jain, Amit and Deriu, Jan and Cieliebak, Mark and Vogel, Manfred},
  booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference},
  pages={3250--3256},
  year={2022},
  address={Marseille, France},
  publisher={European Language Resources Association}
}

@article{pluss2021spc,
  title={Swiss parliaments corpus, an automatically aligned swiss german speech to standard german text corpus},
  author={Plüss, Michel and Neukom, Lukas and Vogel, Manfred},
  journal={arXiv preprint arXiv:2010.02810},
  year={2020}
}

@article{pluss2023stt4sg,
  title={STT4SG-350: A Speech Corpus for Swiss German with Standard German Translations},
  author={Plüss, Michel and Neukom, Lukas and Scheller, Christian and Vogel, Manfred},
  journal={arXiv preprint arXiv:2305.13179},
  year={2023}
}

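Acknowledgments

OpenAI, for the original Whisper model
Hugging Face, for the Transformers library and model hosting
The contributors of the Swiss German speech datasets, for high-quality training data:
- SwissDial-Zh v1.1: Pelin Dogan-Schönberger, Julian Mäder, Thomas Hofmann (ETH Zurich)
- Swiss Parliaments Corpus V2 (SPC): University of Applied Sciences and Arts Northwestern Switzerland
- SDS-200 corpus: the research community, for comprehensive Swiss German dialect coverage
- ArchiMob corpus: Tanja Samardžić, Yves Scherrer, Elvira Glaser (University of Zurich)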

License

This model is released under the Apache 2.0 license. The original Whisper model is also licensed under Apache 2.0.

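Technical specifications

Property	Details
Architecture	Transformer encoder-decoder
Input	16 kHz mono audio
Output	Standard German text
Context length	30 seconds
Sampling rate	16,000 Hz
Feature extraction	128 Mel frequency bins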
Vocabulary size	51,866 tokens
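These figures fix the size of the encoder input. The arithmetic below is a small sketch; the hop length of 160 samples (10 ms) is Whisper's usual setting and is an assumption here, since the model card does not state it.

```python
sampling_rate = 16_000  # Hz, from the table above
context_seconds = 30    # from the table above
n_mels = 128            # from the table above
hop_length = 160        # assumption: Whisper's usual 10 ms hop, not stated in the card

# Raw samples in one 30-second window
samples_per_window = sampling_rate * context_seconds
# Spectrogram frames in one window, one frame per hop
frames_per_window = samples_per_window // hop_length

print(samples_per_window)
print((n_mels, frames_per_window))
```

Under that assumption, every input is a 128 × 3000 log-Mel spectrogram regardless of the clip's actual duration, because shorter clips are padded to the full window.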