Speaker Diarization 3.1

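Developed by fatymatariq

A pyannote audio speaker diarization pipeline that automatically detects and separates the different speakers in an audio recording.

Speaker processing · Open-source license: MIT · #multi-speaker diarization #overlapped speech detection #pure PyTorch inference

Downloads: 1,120

Release date: 11/21/2024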

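Model overview

This is an audio processing pipeline for speaker diarization. It automatically detects and separates the different speakers in an audio recording and processes mono audio sampled at 16 kHz.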

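Model features

Pure PyTorch implementation

Removes the problematic use of onnxruntime: speaker segmentation and embedding both run in pure PyTorch, which should ease deployment and may speed up inference

Automatic audio processing

Automatically downmixes stereo or multi-channel audio and resamples audio recorded at a different sample rate

Speaker count control

Supports specifying the exact number of speakers, or lower and upper bounds on it

Comprehensive benchmarking

Benchmarked on multiple public datasets under a strict evaluation setup, with the performance metrics published openly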

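Model capabilities

Speaker diarization

Speaker change detection

Voice activity detection

Overlapped speech detection

Automatic audio resampling

Multi-channel audio processing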

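Use cases

Meeting records

Meeting speech logging

Automatically identifies when each participant speaks in a meeting recording

Produces timestamped speaker diarization output

Media analysis

Interview program analysis

Analyzes how speaking time is distributed between the host and guests of an interview program

Provides detailed speaker-turn statistics

Speech processing

Speech recognition preprocessing

Supplies speaker diarization information to automatic speech recognition (ASR) systems

Improves ASR accuracy in multi-speaker scenarios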
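As an illustration of the ASR use case above, the sketch below attaches a speaker label to each recognized word by picking the diarization turn that overlaps it most. This is only one simple way to combine the two outputs, not something the pipeline does itself. The tuple layouts, the function names `overlap` and `assign_speakers`, and the sample data are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

Turn = Tuple[float, float, str]  # (start, end, speaker label), in seconds
Word = Tuple[float, float, str]  # (start, end, text), in seconds


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose turn overlaps it the most.

    Words that overlap no turn at all are labeled None.
    """
    labeled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(w_start, w_end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((text, best_speaker))
    return labeled


# Hypothetical diarization turns and ASR word timestamps.
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.5, "SPEAKER_01")]
words = [(0.2, 0.6, "hello"), (1.8, 2.6, "there"), (5.0, 5.3, "bye")]
labeled = assign_speakers(words, turns)
```

The word "there" straddles the speaker change and goes to the turn that covers most of it. The word "bye" falls outside every turn and stays unlabeled.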

tags:

pyannote
pyannote-audio
pyannote-audio-pipeline
audio
voice
speech
speaker
speaker-diarization
speaker-change-detection
voice-activity-detection
overlapped-speech-detection
automatic-speech-recognition

license: mit
extra_gated_prompt: "The collected information will help acquire a better knowledge of pyannote.audio userbase and help its maintainers improve it further. Though this pipeline uses MIT license and will always remain open-source, we will occasionally email you about premium pipelines and paid services around pyannote."
extra_gated_fields:
  Company/university: text
  Website: text

Using this open-source model in production?
Consider switching to pyannoteAI for better and faster options.

🎹 Speaker diarization 3.1

This pipeline is the same as pyannote/speaker-diarization-3.0 except it removes the problematic use of onnxruntime.
Both speaker segmentation and embedding now run in pure PyTorch. This should ease deployment and possibly speed up inference.
It requires pyannote.audio version 3.1 or higher.

It ingests mono audio sampled at 16kHz and outputs speaker diarization as an Annotation instance:

stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
audio files sampled at a different rate are resampled to 16kHz automatically upon loading.
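To make the downmixing rule concrete, here is a minimal sketch of channel averaging on plain Python lists. The pipeline performs this step itself when it loads audio, so the code only illustrates the arithmetic. The function name `downmix_to_mono` and the sample values are assumptions made for the example.

```python
from typing import List


def downmix_to_mono(channels: List[List[float]]) -> List[float]:
    """Average equally long channels sample by sample into one mono channel."""
    if not channels:
        raise ValueError("at least one channel is required")
    length = len(channels[0])
    if any(len(channel) != length for channel in channels):
        raise ValueError("all channels must have the same number of samples")
    # zip(*channels) yields one tuple per time step holding every channel's sample.
    return [sum(samples) / len(channels) for samples in zip(*channels)]


# Hypothetical stereo signal: two channels of three samples each.
stereo = [[0.5, -0.5, 1.0], [0.0, 0.5, 0.0]]
mono = downmix_to_mono(stereo)
```

A signal that is already mono passes through unchanged, since averaging a single channel returns the same samples.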

Requirements

Install pyannote.audio 3.1 with pip install pyannote.audio
Accept pyannote/segmentation-3.0 user conditions
Accept pyannote/speaker-diarization-3.1 user conditions
Create access token at hf.co/settings/tokens.

Usage

# instantiate the pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
  "pyannote/speaker-diarization-3.1",
  use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
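The RTTM file written above is plain text with one line per speaker turn: the record type `SPEAKER`, a file identifier, a channel number, the turn onset and duration in seconds, and the speaker label, padded with `<NA>` placeholders. As a rough sketch of what can be done with it, the code below sums the speaking time per speaker using only the standard library. The function name `speaking_time_from_rttm` and the sample lines are assumptions made for the example, not output from a real recording.

```python
from collections import defaultdict
from typing import Dict, Iterable


def speaking_time_from_rttm(lines: Iterable[str]) -> Dict[str, float]:
    """Sum turn durations (in seconds) per speaker label from RTTM lines."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.split()
        # Skip blank lines and anything that is not a SPEAKER record.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        duration = float(fields[4])  # field 5: turn duration
        speaker = fields[7]          # field 8: speaker label
        totals[speaker] += duration
    return dict(totals)


# Hypothetical RTTM content for illustration.
sample = [
    "SPEAKER audio 1 0.000 2.500 <NA> <NA> SPEAKER_00 <NA> <NA>",
    "SPEAKER audio 1 2.500 1.250 <NA> <NA> SPEAKER_01 <NA> <NA>",
    "SPEAKER audio 1 4.000 0.500 <NA> <NA> SPEAKER_00 <NA> <NA>",
]
totals = speaking_time_from_rttm(sample)
```

To process the file from the example above, pass the open file object directly, for instance `speaking_time_from_rttm(open("audio.rttm"))`.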

Processing on GPU

pyannote.audio pipelines run on CPU by default. You can send them to GPU with the following lines:

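import torch
# use the GPU when one is available, otherwise stay on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline.to(device)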

Processing from memory

Pre-loading audio files in memory may result in faster processing:

import torchaudio
# load the whole file as a (channel, time) tensor plus its sample rate
waveform, sample_rate = torchaudio.load("audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

Monitoring progress

Hooks are available to monitor the progress of the pipeline:

from pyannote.audio.pipelines.utils.hook import ProgressHook
with ProgressHook() as hook:
    diarization = pipeline("audio.wav", hook=hook)

Controlling the number of speakers

In case the number of speakers is known in advance, one can use the num_speakers option:

diarization = pipeline("audio.wav", num_speakers=2)

One can also provide lower and/or upper bounds on the number of speakers using min_speakers and max_speakers options:

diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
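When the speaker count comes from user input or a configuration file, it can help to validate it before calling the pipeline. The helper below is a hypothetical convenience, not part of pyannote.audio. It builds the keyword arguments shown above and rejects inconsistent combinations. Refusing to mix an exact count with bounds is a policy of this sketch, not a documented rule of the library.

```python
from typing import Dict, Optional


def speaker_kwargs(
    num_speakers: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> Dict[str, int]:
    """Build keyword arguments for the pipeline call, checking consistency first."""
    given = {
        "num_speakers": num_speakers,
        "min_speakers": min_speakers,
        "max_speakers": max_speakers,
    }
    for name, value in given.items():
        if value is not None and value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")
    if num_speakers is not None:
        # Policy of this sketch: an exact count excludes bounds.
        if min_speakers is not None or max_speakers is not None:
            raise ValueError("pass either num_speakers or bounds, not both")
        return {"num_speakers": num_speakers}
    if min_speakers is not None and max_speakers is not None and min_speakers > max_speakers:
        raise ValueError("min_speakers cannot exceed max_speakers")
    # Keep only the options that were actually provided.
    return {name: value for name, value in given.items() if value is not None}


bounds = speaker_kwargs(min_speakers=2, max_speakers=5)
# Intended use: diarization = pipeline("audio.wav", **bounds)
```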

Benchmark

This pipeline has been benchmarked on a large collection of datasets.

Processing is fully automatic:

no manual voice activity detection (as is sometimes the case in the literature)
no manual number of speakers (though it is possible to provide it to the pipeline)
no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named "Full" in this paper):

no forgiveness collar
evaluation of overlapped speech

Benchmark	DER%	FA%	Miss%	Conf%	Expected output	File-level evaluation
AISHELL-4	12.2	3.8	4.4	4.0	RTTM	eval
AliMeeting (channel 1)	24.4	4.4	10.0	10.0	RTTM	eval
AMI (headset mix, only_words)	18.8	3.6	9.5	5.7	RTTM	eval
AMI (array1, channel 1, only_words)	22.4	3.8	11.2	7.5	RTTM	eval
AVA-AVD	50.0	10.8	15.7	23.4	RTTM	eval
DIHARD 3 (Full)	21.7	6.2	8.1	7.3	RTTM	eval
MSDWild	25.3	5.8	8.0	11.5	RTTM	eval
REPERE (phase 2)	7.8	1.8	2.6	3.5	RTTM	eval
VoxConverse (v0.3)	11.3	4.1	3.4	3.8	RTTM	eval
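Diarization error rate is conventionally the sum of false alarm, missed detection and speaker confusion durations divided by the total duration of reference speech, so the DER% column should equal FA% + Miss% + Conf%. The short check below recomputes that sum from the table values. The differences of 0.1 on a few rows are presumably due to each column being rounded independently. The variable and function names are assumptions made for the example.

```python
# (DER%, FA%, Miss%, Conf%) copied from the table above.
rows = {
    "AISHELL-4": (12.2, 3.8, 4.4, 4.0),
    "AliMeeting (channel 1)": (24.4, 4.4, 10.0, 10.0),
    "AMI (headset mix, only_words)": (18.8, 3.6, 9.5, 5.7),
    "AMI (array1, channel 1, only_words)": (22.4, 3.8, 11.2, 7.5),
    "AVA-AVD": (50.0, 10.8, 15.7, 23.4),
    "DIHARD 3 (Full)": (21.7, 6.2, 8.1, 7.3),
    "MSDWild": (25.3, 5.8, 8.0, 11.5),
    "REPERE (phase 2)": (7.8, 1.8, 2.6, 3.5),
    "VoxConverse (v0.3)": (11.3, 4.1, 3.4, 3.8),
}


def component_sum(fa: float, miss: float, conf: float) -> float:
    """Sum of the three error components, rounded to the table's precision."""
    return round(fa + miss + conf, 1)


# Absolute gap between the reported DER and the recomputed sum, per benchmark.
gaps = {
    name: round(abs(der - component_sum(fa, miss, conf)), 1)
    for name, (der, fa, miss, conf) in rows.items()
}

for name, gap in gaps.items():
    print(f"{name}: gap {gap}")
```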

Citations

@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}

@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}