whisper-large-v3-narrow-accent: an open-source model for classifying 16 English accent categories

Whisper Large V3 Narrow Accent

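Developed by tiantiaf

A fine-grained accent classification model based on Whisper-Large v3 that recognizes 16 English accent categories.

Audio classification

Safetensors

Language: English
License: BSD-3-Clause
Tags: #multi-accent-classification #speech-feature-extraction #short-audio-optimization

Downloads: 237

Released: 5/22/2025

Model overview

This model implements fine-grained accent classification. It distinguishes 16 English accent categories and is suited to speech feature analysis and speaker classification tasks.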

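Model features

Fine-grained accent classification

Distinguishes 16 English accent categories, including East Asia, English (England), and Germanic.

Behavior on synthetic speech

Synthetic speech (TTS) samples tend to be classified as Germanic.

Built on Whisper-Large v3

Built on the Whisper-Large v3 base model and inherits its speech processing capabilities.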

Model capabilities

English accent classification

Speech feature extraction

Speaker trait analysis

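Use cases

Speech analysis

Speaker accent identification

Identifies the English accent category of a speaker

Assigns each utterance to one of 16 English accent categories

Speech feature analysis

Extracts accent-related features from speech

Can be used for speaker trait analysis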
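
The page does not say how extracted features should be compared. One common option, shown here as an assumption rather than the authors' method, is cosine similarity between two embedding vectors. The sketch below uses only the standard library and toy vectors; in practice each vector would come from the `embeddings` value returned by `model(data, return_feature=True)` in the Quick Start, flattened to a list of floats (the exact shape of that value is not documented here, so treat the flattening step as hypothetical).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # An all-zero vector has no direction; report 0.0 instead of dividing by zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for real model embeddings.
emb_a = [0.2, 0.1, 0.7, 0.0]
emb_b = [0.1, 0.1, 0.8, 0.1]
print(round(cosine_similarity(emb_a, emb_b), 3))
```

Values near 1 indicate similar directions in the feature space. Whether that corresponds to similar accents depends on the embedding and should be checked on labeled data.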

Speech technology research

Speech model benchmarking

Serves as part of a speech foundation model benchmark

Provides a standardized accent classification evaluation

🚀 Whisper-Large v3 for Narrow Accent Classification

This model performs narrow accent classification of a speaker. It includes the implementation of the accent classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).

🚀 Quick Start

The steps below show how to use this model.

Download the repository

git clone git@github.com:tiantiaf0627/vox-profile-release.git

Install the package

conda create -n vox_profile python=3.8
# Activate the new environment so the package is installed into it
conda activate vox_profile
cd vox-profile-release
pip install -e .

Load the model

# Load libraries
import torch
import torch.nn.functional as F
from src.model.accent.whisper_accent import WhisperWrapper

# Select the device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model from Huggingface
model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-narrow-accent").to(device)
model.eval()

Prediction

# Label List
english_accent_list = [
    'East Asia', 'English', 'Germanic', 'Irish', 
    'North America', 'Northern Irish', 'Oceania', 
    'Other', 'Romance', 'Scottish', 'Semitic', 'Slavic', 
    'South African', 'Southeast Asia', 'South Asia', 'Welsh'
]
    
# Load data; zeros are used here only as a placeholder input
# The training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computational limits)
# So prepare your audio as mono, 16 kHz, and at most 15 seconds long
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
# Run inference without tracking gradients
with torch.no_grad():
    logits, embeddings = model(data, return_feature=True)
    
# Probability and output
accent_prob = F.softmax(logits, dim=1)
print(english_accent_list[torch.argmax(accent_prob).detach().cpu().item()])
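
The comments above give a usable range of 3 to 15 seconds at 16 kHz. For longer recordings, one possible approach (an assumption, not something the authors describe) is to split the audio into consecutive windows of at most 15 seconds and skip a trailing piece shorter than 3 seconds. The helper below, `segment_bounds`, is a hypothetical name; it only computes sample index ranges and does not depend on any audio library.

```python
SAMPLE_RATE = 16_000  # the model expects 16 kHz mono audio
MIN_SECONDS = 3       # shorter clips are described above as unreliable
MAX_SECONDS = 15      # longer clips were excluded from training


def segment_bounds(num_samples, sample_rate=SAMPLE_RATE,
                   min_seconds=MIN_SECONDS, max_seconds=MAX_SECONDS):
    """Return (start, end) sample indices for consecutive windows of at most
    max_seconds, dropping any window shorter than min_seconds."""
    max_len = max_seconds * sample_rate
    min_len = min_seconds * sample_rate
    bounds = []
    for start in range(0, num_samples, max_len):
        end = min(start + max_len, num_samples)
        if end - start >= min_len:
            bounds.append((start, end))
    return bounds


# A 40-second recording yields two 15 s windows and one 10 s window.
print(segment_bounds(40 * SAMPLE_RATE))
```

Each `(start, end)` pair could then be used to slice the waveform, for example `data[:, start:end]`, before calling the model on that window.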

✨ Key features

Includes the implementation of narrow accent classification.
Classifies the following English accents:

[
  'East Asia', 'English', 'Germanic', 'Irish', 
  'North America', 'Northern Irish', 'Oceania', 
  'Other', 'Romance', 'Scottish', 'Semitic', 'Slavic', 
  'South African', 'Southeast Asia', 'South Asia', 'Welsh'
]
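
In the Prediction example, the index of the largest logit is looked up in this list, so the list order is taken to match the model's output order. Building on that, the sketch below ranks the top few accents with their softmax probabilities instead of reporting only the single best label. It is a plain-Python illustration: `top_k_accents` is a hypothetical helper, and the scores are made up, not real model output.

```python
import math

ENGLISH_ACCENT_LIST = [
    'East Asia', 'English', 'Germanic', 'Irish',
    'North America', 'Northern Irish', 'Oceania',
    'Other', 'Romance', 'Scottish', 'Semitic', 'Slavic',
    'South African', 'Southeast Asia', 'South Asia', 'Welsh'
]


def top_k_accents(logits, k=3):
    """Apply a softmax to raw scores and return the k most likely
    (label, probability) pairs, highest probability first."""
    if len(logits) != len(ENGLISH_ACCENT_LIST):
        raise ValueError("expected one score per accent label")
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(ENGLISH_ACCENT_LIST, probs),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up scores for illustration only.
example_logits = [0.0] * len(ENGLISH_ACCENT_LIST)
example_logits[4] = 2.0  # 'North America'
example_logits[1] = 1.0  # 'English'
for label, prob in top_k_accents(example_logits):
    print(f"{label}: {prob:.3f}")
```

With real output, the scores could be obtained from the tensor in the Prediction example, for instance via `logits[0].tolist()`, assuming a batch of one.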

Observations about this model

TTS samples tend to be classified as Germanic.

Library: https://github.com/tiantiaf0627/vox-profile-release

📚 Documentation

Model description

This model includes the implementation of the narrow accent classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).

Contact

If you have any questions, please contact Tiantian Feng (tiantiaf@usc.edu).

Citation

If you use this model, please cite the following paper.

@article{feng2025vox,
  title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
  author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
}