pyannote-segmentation-30: a free, open-source audio processing model for detecting voice activity and multiple speakers

Pyannote Segmentation 30

Developed by collinbarnwell

A speaker segmentation model for audio processing that detects voice activity, overlapped speech, and multiple speakers.

Speaker processing

PyTorch

License: MIT #multi-speaker overlap detection #voice activity recognition #real-time audio processing

Downloads: 873

Release date: 2/9/2024

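Model overview

The model processes 10-second chunks of mono audio sampled at 16 kHz and outputs a speaker segmentation over 7 classes. It supports voice activity detection and overlapped speech detection.

Model features

Multi-speaker detection

Detects up to 3 speakers per chunk, including the regions where two of them overlap.

Short-chunk processing

Designed for segmenting 10-second audio chunks.

Multi-task output

Supports both voice activity detection and overlapped speech detection.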

Model capabilities

Speaker segmentation

Voice activity detection

Overlapped speech detection

Multi-speaker recognition

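Use cases

Meeting records

Meeting speaker identification

Automatically identifies the different speakers in a meeting recording and when each of them speaks

Makes meeting note-taking more efficient by generating speaker-attributed records automatically

Speech analysis

Overlapped speech detection

Detects moments in a conversation when several people speak at the same time

Improves speech recognition performance in overlapped-speech scenarios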

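🚀 "Powerset" speaker segmentation model

This project provides an open-source speaker segmentation model that segments and identifies the speakers in an audio signal. It takes mono audio sampled at 16 kHz as input and outputs a speaker segmentation, which makes it applicable to speech processing and audio analysis tasks.

🚀 Quick start

If you use this open-source model in production, our consulting services can help you make the most of it.

✨ Key features

Well-defined input and output: the model ingests 10 seconds of mono audio sampled at 16 kHz and outputs the speaker segmentation as a (num_frames, num_classes) matrix, where the 7 classes are non-speech, speaker #1, speaker #2, speaker #3, speakers #1 and #2, speakers #1 and #3, and speakers #2 and #3.
Visual example: an example output figure (Example output) shows what the model produces at a glance.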
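
The 7 classes listed above are the subsets of up to 2 speakers drawn from 3. The following is a minimal pure-Python sketch of that enumeration and of mapping one class index back to per-speaker activity. The function names `powerset_classes` and `class_to_multilabel` are hypothetical and are not part of pyannote.audio; the library's own `Powerset` utility operates on tensors instead.

```python
from itertools import combinations


def powerset_classes(max_speakers_per_chunk=3, max_speakers_per_frame=2):
    """List speaker subsets by increasing size: (), (0,), (1,), (2,), (0, 1), ..."""
    classes = []
    for size in range(max_speakers_per_frame + 1):
        classes.extend(combinations(range(max_speakers_per_chunk), size))
    return classes


def class_to_multilabel(class_index, classes, num_speakers=3):
    """Turn one powerset class index into a 0/1 activity flag per speaker."""
    active = classes[class_index]
    return [1 if speaker in active else 0 for speaker in range(num_speakers)]


classes = powerset_classes()
# Index 0 is non-speech, indices 1-3 are single speakers, 4-6 are speaker pairs.
pair_1_and_3 = class_to_multilabel(5, classes)
```

Speakers are zero-indexed here, so class 5, `(0, 2)`, corresponds to "speakers #1 and #3" in the list above.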

📦 Installation

Install pyannote.audio 3.0 with pip install pyannote.audio.
Accept the pyannote/segmentation-3.0 user conditions.
Create an access token at hf.co/settings/tokens.

💻 Usage examples

Basic usage

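import torch
from pyannote.audio import Model
from pyannote.audio.utils.powerset import Powerset

# instantiate the model with the access token created during installation
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# waveform (first row)
batch_size = 1  # example value, any positive batch size works
duration, sample_rate, num_channels = 10, 16000, 1
waveform = torch.randn(batch_size, num_channels, duration * sample_rate)

# powerset multi-class encoding (second row)
powerset_encoding = model(waveform)

# multi-label encoding (third row)
max_speakers_per_chunk, max_speakers_per_frame = 3, 2
to_multilabel = Powerset(
    max_speakers_per_chunk,
    max_speakers_per_frame).to_multilabel
multilabel_encoding = to_multilabel(powerset_encoding)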

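Speaker diarization

This model cannot perform speaker diarization of full recordings on its own, because it only processes 10-second chunks. See the pyannote/speaker-diarization-3.0 pipeline, which uses an additional speaker embedding model to diarize full recordings.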
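
To illustrate the 10-second constraint, here is a minimal sketch of cutting a longer recording into fixed-length windows that a chunk-level model could consume one at a time. The helper name `chunk_bounds` and the 5-second step are assumptions made for illustration; they are not taken from the pyannote pipeline, which additionally relies on speaker embeddings to link speakers across chunks.

```python
def chunk_bounds(num_samples, sample_rate=16000, duration=10.0, step=5.0):
    """Return (start, end) sample indices of fixed-length windows.

    The last window may extend past num_samples; the caller is expected
    to zero-pad it before passing it to the model.
    """
    window = int(duration * sample_rate)
    hop = int(step * sample_rate)
    bounds = []
    start = 0
    while True:
        bounds.append((start, start + window))
        if start + window >= num_samples:
            break
        start += hop
    return bounds


# A 25-second recording at 16 kHz, cut into 10-second windows every 5 seconds.
bounds = chunk_bounds(25 * 16000)
```

Overlapping windows are a common choice because they give every region of the recording more than one prediction, but the right step depends on the application.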

Voice activity detection

from pyannote.audio.pipelines import VoiceActivityDetection
pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
  # remove speech regions shorter than that many seconds.
  "min_duration_on": 0.0,
  # fill non-speech regions shorter than that many seconds.
  "min_duration_off": 0.0
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
# `vad` is a pyannote.core.Annotation instance containing speech regions
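
The two hyperparameters above can be pictured as a post-processing pass over the detected regions. The sketch below works on plain (start, end) tuples in seconds rather than on a pyannote.core.Annotation, and it assumes that short gaps are filled before short regions are removed; `postprocess` is a hypothetical helper, not a library function.

```python
def postprocess(regions, min_duration_on=0.0, min_duration_off=0.0):
    """Apply gap filling, then short-region removal.

    regions: sorted, non-overlapping (start, end) tuples in seconds.
    """
    merged = []
    for start, end in regions:
        # Fill non-speech gaps shorter than min_duration_off.
        if merged and start - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    # Remove speech regions shorter than min_duration_on.
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


regions = [(0.0, 1.0), (1.2, 3.0), (5.0, 5.1)]
unchanged = postprocess(regions)             # both thresholds at 0.0
cleaned = postprocess(regions, 0.5, 0.5)     # 0.2 s gap filled, 0.1 s blip dropped
```

With both values at 0.0, as in the configuration above, the regions pass through untouched.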

Overlapped speech detection

from pyannote.audio.pipelines import OverlappedSpeechDetection
pipeline = OverlappedSpeechDetection(segmentation=model)
HYPER_PARAMETERS = {
  # remove overlapped speech regions shorter than that many seconds.
  "min_duration_on": 0.0,
  # fill non-overlapped speech regions shorter than that many seconds.
  "min_duration_off": 0.0
}
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
# `osd` is a pyannote.core.Annotation instance containing overlapped speech regions
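
Conceptually, overlapped speech is any frame in which two or more speakers are active. The sketch below derives overlap regions from a multi-label encoding given as plain nested lists. The `overlap_regions` helper and the `frame_step` value are hypothetical: the model's actual frame resolution is not stated here, and the library pipeline applies its own processing to the model output.

```python
def overlap_regions(multilabel_frames, frame_step):
    """Return (start, end) times where at least two speakers are active.

    multilabel_frames: one list of 0/1 speaker flags per frame.
    frame_step: assumed duration of one frame, in seconds.
    """
    regions = []
    start = None
    for i, frame in enumerate(multilabel_frames):
        overlapped = sum(frame) >= 2
        if overlapped and start is None:
            start = i
        elif not overlapped and start is not None:
            regions.append((start * frame_step, i * frame_step))
            start = None
    if start is not None:
        # The overlap runs until the end of the chunk.
        regions.append((start * frame_step, len(multilabel_frames) * frame_step))
    return regions


frames = [[1, 0, 0], [1, 1, 0], [1, 1, 0], [0, 1, 0], [0, 1, 1]]
overlaps = overlap_regions(frames, frame_step=0.5)
```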

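📚 Documentation

The concepts behind this model are described in detail in the accompanying paper (see the Plaquet23 entry under Citation).

The model was trained by Séverin Baroudi with pyannote.audio 3.0.0 on a combination of the AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse training sets.

The companion repository maintained by Alexis Plaquet also explains how to train or fine-tune this model on your own data.

📄 License

This model is released under the MIT license.

The information collected when you use this model helps us better understand the pyannote.audio user base and helps its maintainers improve it further. Although this model uses the MIT license and will always remain open-source, we may occasionally email you about premium models and paid services around pyannote.

📚 Citation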

@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}

@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}