whisper-large-v3-turbo-swiss-german: an open-source model that efficiently transcribes Swiss German speech into Standard German text

Whisper Large V3 Turbo Swiss German

Developed by Flurin17

A Whisper model optimized for automatic speech recognition of Swiss German. It transcribes Swiss German audio into Standard German text.

Task: speech recognition

Library: Transformers

Multilingual · Open source · License: Apache-2.0 · #SwissGermanToStandardGerman #MultiDialectSpeechRecognition #ParliamentaryTranscription

Released: May 22, 2025

Model Overview

This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, optimized for automatic speech recognition of Swiss German (Schweizerdeutsch). It transcribes Swiss German audio into Standard German text.

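Model Features

Swiss German dialect support

Supports the major Swiss German dialects, including those of the cantons of Aargau, Bern, and Basel.

High-quality transcription

Fine-tuned on more than 350 hours of high-quality Swiss German speech data for accurate speech-to-text conversion.

Timestamps

Supports word-level and sentence-level timestamp output, which simplifies audio alignment analysis.

Batch processing

Supports processing audio files in batches, which improves throughput for large-scale transcription.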
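The timestamp and batch features described above can be sketched with the Transformers `pipeline` API. This is a minimal sketch, not the author's code: `transcribe_with_timestamps`, `format_chunks`, the default `batch_size=8`, and the sample dictionary are illustrative assumptions, and the source does not confirm how well the fine-tuned checkpoint preserves Whisper's timestamp quality. The model is only downloaded when `transcribe_with_timestamps` is called.

```python
from typing import Any, Dict, List, Optional

MODEL_ID = "Flurin17/whisper-large-v3-turbo-swiss-german"


def transcribe_with_timestamps(paths: List[str], word_level: bool = False,
                               batch_size: int = 8) -> List[Dict[str, Any]]:
    """Transcribe several audio files in one call and request timestamps.

    The import is local so the formatting helper below can be used
    without transformers installed.
    """
    from transformers import pipeline

    pipe = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        chunk_length_s=30,      # split long recordings into 30 s windows
        batch_size=batch_size,  # number of windows decoded together
    )
    # "word" requests word-level timestamps, True requests segment-level ones.
    return pipe(paths, return_timestamps="word" if word_level else True)


def _fmt(seconds: Optional[float]) -> str:
    """Format a timestamp, tolerating a missing boundary."""
    return "?" if seconds is None else f"{seconds:.2f}"


def format_chunks(result: Dict[str, Any]) -> List[str]:
    """Render the pipeline's "chunks" list as "[start - end] text" lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{_fmt(start)} - {_fmt(end)}] {chunk['text'].strip()}")
    return lines


# Hand-written sample in the pipeline's output format (not real model output).
sample = {
    "text": " Guten Tag. Willkommen im Parlament.",
    "chunks": [
        {"text": " Guten Tag.", "timestamp": (0.0, 1.2)},
        {"text": " Willkommen im Parlament.", "timestamp": (1.2, None)},
    ],
}
for line in format_chunks(sample):
    print(line)
```

Passing a list of paths returns one result dictionary per file, so `format_chunks` can be applied to each element in turn.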

Model Capabilities

Swiss German speech recognition

Dialect-to-Standard-German conversion

Audio timestamping

Batch speech transcription

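Use Cases

Speech transcription

Parliamentary record transcription

Transcribes Swiss German speeches given in Swiss parliaments into Standard German text.

Dialect research

Supports the analysis and documentation of Swiss German dialects in linguistic research.

Media processing

Radio content transcription

Automatically transcribes Swiss German radio programs into text.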

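🚀 Whisper Large V3 Turbo - Swiss German Fine-Tuned Version

This model is a version of OpenAI's Whisper Large V3 Turbo fine-tuned specifically for automatic speech recognition of Swiss German (Schweizerdeutsch). It transcribes Swiss German audio into Standard German text. It has not yet been evaluated.

🚀 Quick Start

The model takes Swiss German audio and returns Standard German text. Usage examples are given below.

✨ Key Features

Based on OpenAI's Whisper Large V3 Turbo and fine-tuned specifically for Swiss German.
Transcribes Swiss German audio into Standard German text.
Supports the major Swiss German dialects.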

📦 Installation

Install the required libraries with the following command.

pip install transformers librosa torch

💻 Usage Examples

Basic usage

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Flurin17/whisper-large-v3-turbo-swiss-german"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True, 
    use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a Swiss German audio file
result = pipe("path/to/swiss_german_audio.wav")
print(result["text"])

Advanced usage

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Flurin17/whisper-large-v3-turbo-swiss-german"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Load the audio at 16 kHz and extract the model's input features
audio_array, sampling_rate = librosa.load("swiss_german_audio.wav", sr=16000)

inputs = processor(
    audio_array,
    sampling_rate=sampling_rate,
    return_tensors="pt"
)
inputs = inputs.to(device, dtype=torch_dtype)

# Generate the transcription
with torch.no_grad():
    predicted_ids = model.generate(**inputs)

# Decode the transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

📚 Documentation

Model Description

Base model: openai/whisper-large-v3-turbo
Language: Swiss German dialects → Standard German text
Model size: 809M parameters
License: Apache 2.0
Fine-tuned from: openai/whisper-large-v3-turbo

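Performance

The model is intended to deliver state-of-the-art performance on Swiss German automatic speech recognition. No evaluation has been run yet, so the metrics below are not reported.

Word error rate (WER): not reported
Character error rate (CER): not reported
Training data: more than 350 hours of Swiss German speech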
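Because the card gives no WER or CER figures, readers may want to measure them on their own data. The following is a minimal sketch of both metrics using a plain Levenshtein distance. The function names are illustrative, and the sketch applies no text normalization (casing, punctuation), which a real evaluation would need to define.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One missing word out of four reference words.
print(wer("das ist ein Test", "das ist Test"))
# One substituted character out of four.
print(cer("Haus", "Maus"))
```

Both functions compare a model transcription against a Standard German reference, which matches the model's output style.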

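Training Data

The model was fine-tuned on a comprehensive set of Swiss German speech datasets:

SwissDial-Zh v1.1: 24 hours of balanced Swiss German dialects
Swiss Parliament Corpus V2 (SPC): 293 hours of parliamentary speech
All Swiss German Dialects Test Set: 13 hours with a representative dialect distribution
ArchiMob Release 2: 70 hours

Total training data: more than 350 hours of high-quality Swiss German speech with Standard German transcriptions.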
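As a quick arithmetic check, the per-dataset hours listed above can be summed. The short sketch below uses only the figures from the list. It shows that they add up to 400 hours, which is consistent with the "more than 350 hours" total, and that the parliamentary corpus makes up roughly 73% of the audio.

```python
# Hours per dataset, copied from the list above.
hours = {
    "SwissDial-Zh v1.1": 24,
    "Swiss Parliament Corpus V2 (SPC)": 293,
    "All Swiss German Dialects Test Set": 13,
    "ArchiMob Release 2": 70,
}

total_hours = sum(hours.values())
spc_share = hours["Swiss Parliament Corpus V2 (SPC)"] / total_hours

print(f"total: {total_hours} h")
print(f"SPC share: {spc_share:.1%}")
```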

Supported Dialects

The model supports all major Swiss German dialects:

Aargau (AG)
Bern (BE)
Basel (BS)
Graubünden (GR)
Lucerne (LU)
St. Gallen (SG)
Valais (VS)
Zurich (ZH)

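Training Details

Training hyperparameters

Learning rate: 2e-5
Batch size: 24 per device (training), 4 per device (evaluation)
Gradient accumulation steps: 2
Epochs: 3
Weight decay: 0.005
Warmup ratio: 0.03
Precision: bfloat16
Optimizer: AdamW

Training infrastructure

Hardware: 4 NVIDIA A100 GPUs (80 GB each)
Compute: Azure Machine Learning
Training time: ~5 hours
Framework: 🤗 Transformers, PyTorch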
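The hyperparameters and hardware above imply an effective batch size, and they map naturally onto Transformers training arguments. The sketch below is an assumption about how they could be expressed, not the author's training script: the dictionary keys are `Seq2SeqTrainingArguments` parameter names, `"adamw_torch"` is one possible reading of "AdamW", and `build_training_arguments` and `effective_batch_size` are illustrative helpers.

```python
from typing import Any, Dict

# Values from the hyperparameter list; keys are Seq2SeqTrainingArguments names.
TRAINING_KWARGS: Dict[str, Any] = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 24,
    "per_device_eval_batch_size": 4,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 3,
    "weight_decay": 0.005,
    "warmup_ratio": 0.03,
    "bf16": True,
    "optim": "adamw_torch",  # assumption: the card only says "AdamW"
}

NUM_GPUS = 4  # from the training infrastructure list


def effective_batch_size(kwargs: Dict[str, Any], num_devices: int) -> int:
    """Samples contributing to each optimizer step across all devices."""
    return (kwargs["per_device_train_batch_size"]
            * kwargs["gradient_accumulation_steps"]
            * num_devices)


def build_training_arguments(output_dir: str):
    """Build the arguments object; needs transformers and bf16-capable hardware."""
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


print(effective_batch_size(TRAINING_KWARGS, NUM_GPUS))
```

With 24 samples per device, 2 accumulation steps, and 4 GPUs, each optimizer step sees 24 × 2 × 4 = 192 samples.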

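Data processing

The training data was processed with the following pipeline:

Resample audio to 16 kHz
Extract log-mel spectrogram features (128 mel bins)
Normalize and tokenize the text
Batch dynamically by grouping sequences of similar length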
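The last pipeline step, batching by sequence length, can be illustrated in its simplest form. The source does not say which sampler was used, so the sketch below is an assumption: it sorts samples by length and slices them into batches, then compares the padding needed against batches taken in the original order. Function names and the example lengths are illustrative. Production samplers (such as the `group_by_length` option in Transformers) also add shuffling, which this sketch omits.

```python
from typing import List


def length_grouped_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Return batches of sample indices, grouping samples of similar length."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def sequential_batches(num_samples: int, batch_size: int) -> List[List[int]]:
    """Return batches of sample indices in their original order."""
    indices = list(range(num_samples))
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]


def padding_waste(lengths: List[int], batches: List[List[int]]) -> int:
    """Count padded positions when each batch is padded to its longest sample."""
    padded = sum(max(lengths[i] for i in batch) * len(batch) for batch in batches)
    return padded - sum(lengths)


# Illustrative sequence lengths (for example, token counts of transcripts).
lengths = [5, 30, 7, 28, 6, 29]

grouped = length_grouped_batches(lengths, batch_size=3)
naive = sequential_batches(len(lengths), batch_size=3)

print("grouped:", grouped, "padding:", padding_waste(lengths, grouped))
print("naive:  ", naive, "padding:", padding_waste(lengths, naive))
```

Grouping by length reduces the number of padded positions per batch, which is the motivation for this step.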

Comparison with Other Models

Model	WER	CER	Parameters
whisper-large-v3-turbo-swiss-german	not reported	not reported	809M
whisper-large-v3-turbo (zero-shot)	not reported	not reported	809M

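Limitations and Bias

Domain: trained mainly on read speech and parliamentary proceedings.
Dialects: performance may vary across Swiss German dialects.
Audio quality: works best on clean, high-quality recordings.
Speaker demographics: the training data may not fully represent all speaker groups.
Transcription style: outputs Standard German text, not a dialectal transcription.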

Model Card Author

Flurin17 - model development and fine-tuning

Citation

If you use this model in your research, please cite it as follows.

@misc{whisper-large-v3-turbo-swiss-german-2024,
  author = {Flurin17},
  title = {Whisper Large V3 Turbo Fine-tuned for Swiss German},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Flurin17/whisper-large-v3-turbo-swiss-german}
}

Please also consider citing the original Whisper paper.

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

Please also cite the Swiss German datasets used for training.

@article{dogan2021swissdial,
  title={SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German},
  author={Dogan-Schönberger, Pelin and Mäder, Julian and Hofmann, Thomas},
  journal={arXiv preprint arXiv:2103.11401},
  year={2021}
}

@inproceedings{samardzic2016archimob,
  title={ArchiMob - A Corpus of Spoken Swiss German},
  author={Samardžić, Tanja and Scherrer, Yves and Glaser, Elvira},
  booktitle={Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  pages={4061--4066},
  year={2016},
  url={https://aclanthology.org/L16-1641}
}

@article{scherrer2019digitising,
  title={Digitising Swiss German: how to process and study a polycentric spoken language},
  author={Scherrer, Yves and Samardžić, Tanja and Glaser, Elvira},
  journal={Language Resources and Evaluation},
  volume={53},
  pages={735--769},
  year={2019},
  doi={10.1007/s10579-019-09457-5}
}

@inproceedings{pluss2022sds200,
  title={SDS-200: A Swiss German speech to standard German text corpus},
  author={Plüss, Michel and Hürlimann, Manuela and Cuny, Marc and Stöckli, Alla and Kapotis, Nikolaos and Hartmann, Julia and Ulasik, Malgorzata Anna and Scheller, Christian and Schraner, Yanick and Jain, Amit and Deriu, Jan and Cieliebak, Mark and Vogel, Manfred},
  booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference},
  pages={3250--3256},
  year={2022},
  address={Marseille, France},
  publisher={European Language Resources Association}
}

@article{pluss2021spc,
  title={Swiss parliaments corpus, an automatically aligned swiss german speech to standard german text corpus},
  author={Plüss, Michel and Neukom, Lukas and Vogel, Manfred},
  journal={arXiv preprint arXiv:2010.02810},
  year={2020}
}

@article{pluss2023stt4sg,
  title={STT4SG-350: A Speech Corpus for Swiss German with Standard German Translations},
  author={Plüss, Michel and Neukom, Lukas and Scheller, Christian and Vogel, Manfred},
  journal={arXiv preprint arXiv:2305.13179},
  year={2023}
}

Acknowledgments

OpenAI - for the original Whisper model
Hugging Face - for the Transformers library and model hosting
Contributors of the Swiss German speech datasets - for high-quality training data
- SwissDial-Zh v1.1: Pelin Dogan-Schönberger, Julian Mäder, Thomas Hofmann (ETH Zurich)
- Swiss Parliament Corpus V2 (SPC): FHNW University of Applied Sciences and Arts Northwestern Switzerland
- SDS-200 Corpus: the research community, for comprehensive Swiss German dialect coverage
- ArchiMob Corpus: Tanja Samardžić, Yves Scherrer, Elvira Glaser (University of Zurich)

📄 License

This model is released under the Apache 2.0 license. The original Whisper model is also under the Apache 2.0 license.

🔧 Technical Details

Architecture: Transformer encoder-decoder
Input: 16 kHz mono audio
Output: Standard German text
Context length: 30 seconds
Sampling rate: 16,000 Hz
Feature extraction: 128 mel-frequency bins
Vocabulary size: 51,866 tokens
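The 30-second context and 16 kHz sampling rate fix the size of each input window. The sketch below works through that arithmetic and shows one way to split a longer recording into windows. The 10 ms hop (160 samples) is an assumption taken from Whisper's default feature extractor and is not stated in this card. `window_bounds` and its `overlap_seconds` parameter are illustrative.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # Hz, from the list above
WINDOW_SECONDS = 30    # context length, from the list above
N_MELS = 128           # mel-frequency bins, from the list above
HOP_LENGTH = 160       # samples per frame step; assumed Whisper default (10 ms)

SAMPLES_PER_WINDOW = SAMPLE_RATE * WINDOW_SECONDS     # samples in one window
FRAMES_PER_WINDOW = SAMPLES_PER_WINDOW // HOP_LENGTH  # spectrogram frames per window


def window_bounds(num_samples: int, overlap_seconds: float = 0.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive 30-second windows."""
    step = SAMPLES_PER_WINDOW - int(overlap_seconds * SAMPLE_RATE)
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")
    bounds = []
    start = 0
    while start < num_samples:
        bounds.append((start, min(start + SAMPLES_PER_WINDOW, num_samples)))
        if start + SAMPLES_PER_WINDOW >= num_samples:
            break  # this window already reaches the end of the audio
        start += step
    return bounds


print(SAMPLES_PER_WINDOW, FRAMES_PER_WINDOW, (N_MELS, FRAMES_PER_WINDOW))
print(window_bounds(1_000_000))
```

Each window therefore holds 480,000 samples, and under the assumed hop length the encoder receives a 128 × 3000 log-mel feature matrix per window. Audio longer than one window has to be split this way or passed through a pipeline that chunks it.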