speaker-diarization-v1: a free, open-source speaker segmentation model for 10-second mono audio


Speaker Diarization V1

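Developed by objects76

This speaker segmentation model, trained with a powerset multi-class cross-entropy loss, processes 10-second mono audio and outputs speaker diarization results.

Speaker processing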

PyTorch

License: MIT #multi-speaker overlap detection #real-time speech segmentation #meeting optimization

Downloads: 13

Released: 9/9/2024

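Model overview

The model is mainly used for speaker segmentation, voice activity detection, and overlapped speech detection in audio, and supports speech analysis in multi-speaker scenarios.

Model features

Powerset multi-class encoding

Trained with a powerset multi-class cross-entropy loss, so a single class can represent several speakers talking at once.

Multi-speaker support

Handles up to 3 speakers per chunk, with up to 2 speaking at the same time.

Multiple training datasets

The training data combines AISHELL, AliMeeting, AMI, and several other well-known datasets.

Model capabilities

Speaker segmentation

Voice activity detection

Overlapped speech detection

Multi-speaker identification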

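Use cases

Speech analysis

Meeting record analysis

Automatically identify the speech segments of different speakers in a meeting recording

Makes meeting minutes faster to produce by separating speakers automatically

Transcription preprocessing

Run speaker segmentation before speech recognition

Improves transcription accuracy and enables speaker labeling

Audio processing

Overlapped speech detection

Identify the parts of the audio where several people speak at once

Helps analyze conversational interaction patterns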

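🚀 "Powerset" speaker segmentation model

This is an open-source speaker segmentation model that processes audio and outputs speaker diarization results. If you use this model in production, consider pyannoteAI for better and faster options.

🚀 Quick start

Model overview

The model takes 10 seconds of mono audio sampled at 16kHz and outputs speaker diarization as a (num_frames, num_classes) matrix, where the 7 classes are non-speech, speaker #1, speaker #2, speaker #3, speakers #1 and #2, speakers #1 and #3, and speakers #2 and #3.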
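The seven classes are exactly the subsets of three speakers in which at most two are active at once. A minimal sketch in plain Python of how such a class list, and its mapping to per-speaker activity, could be enumerated. The helper names `powerset_classes` and `to_multilabel_row` are hypothetical, and the ordering is assumed to follow the list above; this is not the pyannote.audio implementation.

```python
from itertools import combinations


def powerset_classes(max_speakers_per_chunk=3, max_speakers_per_frame=2):
    """Enumerate speaker subsets, smallest first (hypothetical helper)."""
    speakers = range(max_speakers_per_chunk)
    classes = []
    for size in range(max_speakers_per_frame + 1):
        # size 0 -> non-speech, size 1 -> single speakers, size 2 -> pairs
        classes.extend(combinations(speakers, size))
    return classes


def to_multilabel_row(speaker_set, num_speakers=3):
    """Turn one subset, e.g. (0, 2), into a 0/1 activity vector."""
    return [1 if s in speaker_set else 0 for s in range(num_speakers)]


classes = powerset_classes()
mapping = [to_multilabel_row(c) for c in classes]

# Print class index, speaker subset (0-based), and multi-label row.
for index, (subset, row) in enumerate(zip(classes, mapping)):
    print(index, subset, row)
```

Each row of `mapping` is the multi-label vector for one powerset class, so picking the most likely class per frame and looking up its row gives one activity flag per speaker.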

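Example output (figure not shown; its three rows are the waveform, the powerset multi-class encoding, and the multi-label encoding)

Requirements

Install pyannote.audio 3.0 with pip install pyannote.audio.
Accept the pyannote/segmentation-3.0 user conditions.
Create an access token at hf.co/settings/tokens.

Usage examples

Basic usage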

import torch
from pyannote.audio.utils.powerset import Powerset

# `model` is the segmentation model loaded with Model.from_pretrained
# (see the model instantiation snippet in this section)

# waveform (first row): random noise standing in for real audio
batch_size = 1  # any batch size works; 1 is used here as an example
duration, sample_rate, num_channels = 10, 16000, 1
waveform = torch.randn(batch_size, num_channels, duration * sample_rate)

# powerset multi-class encoding (second row)
powerset_encoding = model(waveform)

# multi-label encoding (third row)
max_speakers_per_chunk, max_speakers_per_frame = 3, 2
to_multilabel = Powerset(
    max_speakers_per_chunk,
    max_speakers_per_frame).to_multilabel
multilabel_encoding = to_multilabel(powerset_encoding)

Instantiate the model

# instantiate the model
from pyannote.audio import Model
model = Model.from_pretrained(
  "pyannote/segmentation-3.0",
  use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

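Speaker diarization

This model cannot perform speaker diarization of full recordings on its own (it only processes 10-second chunks). See the pyannote/speaker-diarization-3.0 pipeline, which uses an additional speaker embedding model to perform full-recording speaker diarization.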
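Because the model only sees 10-second chunks, a longer recording has to be cut into windows before the model can be applied. A rough sketch of that windowing step alone follows. The helper `chunk_bounds` and the 5-second step are illustrative assumptions, not the pipeline's actual settings, and matching speakers across chunks (the job of the speaker embedding model) is not shown.

```python
def chunk_bounds(total_seconds, chunk_seconds=10.0, step_seconds=5.0):
    """Return (start, end) times of fixed-length windows covering a recording.

    Hypothetical helper: the last window may extend past the end of the
    recording, in which case the caller would need to pad the audio.
    """
    bounds = []
    start = 0.0
    while True:
        bounds.append((start, start + chunk_seconds))
        if start + chunk_seconds >= total_seconds:
            break  # the recording is fully covered
        start += step_seconds
    return bounds


# A 25-second recording cut into overlapping 10-second windows.
bounds = chunk_bounds(25.0)
for start, end in bounds:
    print(f"{start:5.1f}s to {end:5.1f}s")
```

Overlapping windows mean most frames are scored more than once, which gives a full pipeline several predictions to aggregate per frame.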

Voice activity detection

from pyannote.audio.pipelines import VoiceActivityDetection
pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
  # remove speech regions shorter than that many seconds.
  "min_duration_on": 0.0,
  # fill non-speech regions shorter than that many seconds.
  "min_duration_off": 0.0
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
# `vad` is a pyannote.core.Annotation instance containing speech regions
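To make the two hyperparameters concrete, here is a small sketch of the kind of post-processing their comments describe, applied to a plain list of (start, end) regions. The function `smooth_regions` is a hypothetical illustration, and the order of the two steps (fill gaps first, then drop short regions) is an assumption, not a statement about the pipeline's internals.

```python
def smooth_regions(regions, min_duration_on=0.0, min_duration_off=0.0):
    """Post-process sorted, non-overlapping (start, end) speech regions."""
    # Step 1: fill non-speech gaps shorter than min_duration_off
    # by merging the regions on either side of the gap.
    merged = []
    for start, end in regions:
        if merged and start - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    # Step 2: remove speech regions shorter than min_duration_on.
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


# Made-up regions: a short pause after 1.0s and a brief blip at 5.0s.
regions = [(0.0, 1.0), (1.2, 3.0), (5.0, 5.1)]
smoothed = smooth_regions(regions, min_duration_on=0.5, min_duration_off=0.3)
print(smoothed)
```

With both values left at 0.0, as in the snippet above, the regions pass through unchanged.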

Overlapped speech detection

from pyannote.audio.pipelines import OverlappedSpeechDetection
pipeline = OverlappedSpeechDetection(segmentation=model)
HYPER_PARAMETERS = {
  # remove overlapped speech regions shorter than that many seconds.
  "min_duration_on": 0.0,
  # fill non-overlapped speech regions shorter than that many seconds.
  "min_duration_off": 0.0
}
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
# `osd` is a pyannote.core.Annotation instance containing overlapped speech regions
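With a powerset output, overlapped speech corresponds to the classes that contain two speakers. The sketch below flags such frames from per-frame class scores. It assumes the class order given in the model overview (indices 4 to 6 are the speaker pairs), uses made-up scores, and only illustrates the idea; it is not how `OverlappedSpeechDetection` is implemented.

```python
# Number of active speakers for each of the 7 classes (assumed ordering):
# 0 non-speech, 1-3 single speakers, 4-6 speaker pairs.
NUM_SPEAKERS_PER_CLASS = [0, 1, 1, 1, 2, 2, 2]


def argmax(scores):
    """Index of the highest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)


def overlap_frames(frame_scores):
    """Return one 0/1 flag per frame: 1 where the best class has two speakers."""
    return [
        1 if NUM_SPEAKERS_PER_CLASS[argmax(scores)] >= 2 else 0
        for scores in frame_scores
    ]


# Toy scores for three frames (made-up numbers, 7 classes each).
frame_scores = [
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],  # non-speech
    [0.0, 0.8, 0.1, 0.0, 0.1, 0.0, 0.0],  # speaker #1 only
    [0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.8],  # speakers #2 and #3
]
flags = overlap_frames(frame_scores)
print(flags)
```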

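📚 Documentation

The concepts behind this model are described in detail in the paper by Plaquet and Bredin cited below. The model was trained by Séverin Baroudi with pyannote.audio 3.0.0 on a combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.

The companion repository maintained by Alexis Plaquet also explains how to train or fine-tune this model on your own data.

📄 License

This model is released under the MIT license.

📖 Citation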

@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}

@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}