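pyannote-segmentation: open-source speaker segmentation model that processes 10-second audio chunks and detects multiple speakers and overlapped speech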


Pyannote Segmentation

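Developed by it-just-works

A speaker segmentation model based on powerset encoding. It processes 10-second audio chunks and detects multiple speakers, including the frames where they overlap.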

Speaker processing

PyTorch

License: MIT #multi-speaker-overlap-detection #voice-activity-segmentation #real-time-audio-processing

Downloads: 771

Released: 4/10/2025

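Model overview

The model performs speaker segmentation on audio. It detects up to 3 speakers per chunk, including overlapped speech, and outputs one of 7 possible speaker-combination states per frame.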

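Model features

Powerset encoding

Multi-speaker audio is modeled with a powerset encoding, so single-speaker frames and overlapped-speech frames are recognized by the same classifier.

Multi-task support

The same model can be used for speaker segmentation, voice activity detection, and overlapped speech detection.

Efficient processing

Designed for 10-second audio chunks, which suits both real-time and batch processing.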

Model capabilities

Speaker segmentation

Voice activity detection

Overlapped speech detection

Multi-speaker recognition

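Use cases

Meeting transcription

Meeting speaker logging

Automatically identifies the different speakers in a meeting and when each of them speaks

Accurately segments each speaker's speech

Speech analysis

Overlapped speech detection

Detects moments in a conversation where several people speak at once

Identifies overlapped speech segments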

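🚀 "Powerset" speaker segmentation model

This project is an open-source speaker segmentation model. It takes 10 seconds of mono audio sampled at 16kHz as input and outputs speaker diarization as a matrix, which can serve as a building block for downstream audio processing and analysis.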
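The expected input size can be made concrete with a small helper that pads or crops a mono signal to exactly 10 seconds at 16kHz. This is an illustrative sketch in plain Python: the name `fit_to_chunk` is hypothetical, zero-padding is only one possible choice for short clips, and resampling and down-mixing to mono are assumed to have been done beforehand.

```python
SAMPLE_RATE = 16000      # Hz, as stated in the text
CHUNK_DURATION = 10      # seconds, as stated in the text
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_DURATION  # 160000 samples per chunk


def fit_to_chunk(samples):
    """Zero-pad or crop a sequence of mono samples to CHUNK_SAMPLES.

    Hypothetical helper for illustration; not part of pyannote.audio.
    """
    if len(samples) >= CHUNK_SAMPLES:
        return list(samples[:CHUNK_SAMPLES])
    padding = [0.0] * (CHUNK_SAMPLES - len(samples))
    return list(samples) + padding


short_clip = [0.1] * (3 * SAMPLE_RATE)   # 3 seconds of audio
chunk = fit_to_chunk(short_clip)
print(len(chunk))
```

Wrapped in a tensor of shape (batch_size, 1, 160000), such a chunk has the layout the model expects.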

🚀 Quick start

If you use this open-source model in production, consider switching to pyannoteAI for better and faster options.

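✨ Key features

The model ingests 10 seconds of mono audio sampled at 16kHz and outputs speaker diarization as a (num_frames, num_classes) matrix. The 7 classes are non-speech, speaker #1, speaker #2, speaker #3, speakers #1 and #2, speakers #1 and #3, and speakers #2 and #3.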
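The seven classes follow from enumerating every set of at most two simultaneous speakers out of three: 1 empty set, 3 single speakers, and 3 pairs. A minimal sketch in plain Python (the function name `powerset_classes` is illustrative and not part of pyannote.audio):

```python
from itertools import combinations


def powerset_classes(max_speakers_per_chunk, max_speakers_per_frame):
    """List every speaker subset of size 0..max_speakers_per_frame.

    Illustrative helper, not a pyannote.audio function. Subsets are ordered
    by size, then lexicographically, which reproduces the class order
    listed in the text.
    """
    speakers = range(1, max_speakers_per_chunk + 1)
    classes = []
    for size in range(max_speakers_per_frame + 1):
        classes.extend(combinations(speakers, size))
    return classes


classes = powerset_classes(3, 2)
for index, subset in enumerate(classes):
    label = " and ".join(f"#{s}" for s in subset) or "non-speech"
    print(index, label)
```

Treating each subset as its own class turns a multi-label problem (three independent speaker flags) into a single 7-way classification per frame.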

Example output

📦 Installation

Install pyannote.audio 3.0 with pip install pyannote.audio.
Accept the pyannote/segmentation-3.0 user conditions.
Create an access token at hf.co/settings/tokens.

💻 Usage

Basic usage

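import torch
from pyannote.audio import Model
from pyannote.audio.utils.powerset import Powerset

# instantiate the pretrained model (assumption: the pyannote/segmentation-3.0
# checkpoint named in the installation steps, with your own access token)
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# waveform (first row): random noise standing in for real audio
batch_size = 1
duration, sample_rate, num_channels = 10, 16000, 1
waveform = torch.randn(batch_size, num_channels, duration * sample_rate)

# powerset multi-class encoding (second row)
powerset_encoding = model(waveform)

# multi-label encoding (third row)
max_speakers_per_chunk, max_speakers_per_frame = 3, 2
to_multilabel = Powerset(
    max_speakers_per_chunk,
    max_speakers_per_frame).to_multilabel
multilabel_encoding = to_multilabel(powerset_encoding)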
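Conceptually, the conversion to multi-label picks the most likely powerset class for each frame and expands it into one 0/1 flag per speaker. The sketch below reproduces that idea in plain Python on toy scores. It is an illustration under the class order listed earlier, not the implementation of `Powerset.to_multilabel`, and `CLASS_TO_SPEAKERS` and `to_multilabel_sketch` are hypothetical names.

```python
# Speakers (0-indexed) active in each of the 7 powerset classes.
# Assumed order: non-speech, #1, #2, #3, #1+#2, #1+#3, #2+#3.
CLASS_TO_SPEAKERS = [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]


def to_multilabel_sketch(powerset_scores, num_speakers=3):
    """Map per-frame powerset scores to per-frame 0/1 speaker activity."""
    multilabel = []
    for frame_scores in powerset_scores:
        # hard decision: index of the highest-scoring class
        best = max(range(len(frame_scores)), key=frame_scores.__getitem__)
        active = CLASS_TO_SPEAKERS[best]
        multilabel.append([1 if s in active else 0 for s in range(num_speakers)])
    return multilabel


# toy scores: 3 frames x 7 classes
toy_scores = [
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],  # highest score: non-speech
    [0.0, 0.1, 0.8, 0.0, 0.1, 0.0, 0.0],  # highest score: speaker #2
    [0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.8],  # highest score: speakers #2 and #3
]
toy_multilabel = to_multilabel_sketch(toy_scores)
print(toy_multilabel)
```

The result has one row per frame and one column per speaker, so overlapped frames simply have two columns set.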

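Advanced usage

Speaker diarization

This model cannot perform speaker diarization of full recordings on its own (it only processes 10-second chunks). See the pyannote/speaker-diarization-3.0 pipeline, which uses an additional speaker embedding model to diarize full recordings.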
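To see why a separate pipeline is needed, consider how a long recording has to be cut into 10-second chunks before the model can process it. The sketch below only computes chunk boundaries. The 5-second step is an assumed value chosen for illustration, `chunk_boundaries` is a hypothetical name, and matching speakers across chunks (the role the text attributes to the speaker embedding model) is not shown.

```python
def chunk_boundaries(total_duration, chunk_duration=10.0, step=5.0):
    """Return (start, end) times in seconds of overlapping fixed-size chunks.

    `step` is an illustrative assumption. The last chunk is shifted back so
    that it still spans `chunk_duration` seconds. A recording shorter than
    one chunk yields a single chunk that would need padding.
    """
    if total_duration <= chunk_duration:
        return [(0.0, chunk_duration)]
    boundaries = []
    start = 0.0
    while start + chunk_duration < total_duration:
        boundaries.append((start, start + chunk_duration))
        start += step
    # final chunk aligned with the end of the recording
    boundaries.append((total_duration - chunk_duration, total_duration))
    return boundaries


print(chunk_boundaries(27.0))
```

Each chunk is segmented independently, so "speaker #1" in one chunk is not guaranteed to be the same person as "speaker #1" in the next; resolving that is what a full diarization pipeline adds.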

Voice activity detection

from pyannote.audio.pipelines import VoiceActivityDetection
pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
  # remove speech regions shorter than that many seconds.
  "min_duration_on": 0.0,
  # fill non-speech regions shorter than that many seconds.
  "min_duration_off": 0.0
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
# `vad` is a pyannote.core.Annotation instance containing speech regions
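The two hyper-parameters can be read as a post-processing step on the detected regions. The following sketch applies them to a plain list of (start, end) tuples. It mirrors the comments in the snippet above and is not pyannote.audio's implementation; the function name `postprocess` and the order of operations (fill gaps first, then drop short regions) are assumptions.

```python
def postprocess(regions, min_duration_on=0.0, min_duration_off=0.0):
    """Fill short gaps, then drop short regions.

    `regions` is a list of (start, end) tuples in seconds, sorted by start.
    """
    merged = []
    for start, end in regions:
        if merged and start - merged[-1][1] < min_duration_off:
            # gap shorter than min_duration_off: extend the previous region
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # remove regions shorter than min_duration_on
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


raw = [(0.0, 1.0), (1.2, 3.0), (5.0, 5.1), (8.0, 9.5)]
print(postprocess(raw, min_duration_on=0.5, min_duration_off=0.3))
```

With both values left at 0.0, as in the snippet above, disjoint regions pass through unchanged; raising them trades temporal precision for smoother output.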

Overlapped speech detection

from pyannote.audio.pipelines import OverlappedSpeechDetection
pipeline = OverlappedSpeechDetection(segmentation=model)
HYPER_PARAMETERS = {
  # remove overlapped speech regions shorter than that many seconds.
  "min_duration_on": 0.0,
  # fill non-overlapped speech regions shorter than that many seconds.
  "min_duration_off": 0.0
}
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
# `osd` is a pyannote.core.Annotation instance containing overlapped speech regions
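Both detection tasks can be understood as simple reductions of the multi-label encoding: a frame contains speech when at least one speaker is active, and overlapped speech when at least two are. A minimal sketch on toy data follows. How the pipelines compute this internally is not stated here, so treat it as an illustration; `speech_and_overlap` is a hypothetical name.

```python
def speech_and_overlap(multilabel):
    """Return per-frame (is_speech, is_overlap) flags.

    `multilabel` is a list of frames, each a list of 0/1 speaker flags.
    """
    flags = []
    for frame in multilabel:
        active = sum(frame)  # number of simultaneously active speakers
        flags.append((active >= 1, active >= 2))
    return flags


toy_frames = [
    [0, 0, 0],  # silence
    [1, 0, 0],  # one speaker
    [1, 0, 1],  # two speakers at once
    [0, 1, 0],  # one speaker
]
frame_flags = speech_and_overlap(toy_frames)
print(frame_flags)
```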

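📚 Documentation

The concepts behind this model are described in detail in the powerset paper (Plaquet23 in the citations below).

It was trained by Séverin Baroudi with pyannote.audio 3.0.0 on the combined training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.

The companion repository maintained by Alexis Plaquet also provides instructions on how to train or fine-tune such a model on your own data.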

📄 License

This project is released under the MIT license.

📚 Citations

@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}

@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}