🚀 "Powerset" Speaker Segmentation
This open-source model focuses on speaker segmentation. It takes 10-second mono audio sampled at 16kHz as input and outputs speaker diarization as a (num_frames, num_classes) matrix. The 7 classes are: non-speech, speaker #1, speaker #2, speaker #3, speakers #1 and #2, speakers #1 and #3, and speakers #2 and #3. It is useful for audio processing tasks such as speaker diarization, voice activity detection, and overlapped speech detection.
Using this open-source model in production?
Consider switching to pyannoteAI for better and faster options.
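To make the encoding above concrete, the seven classes can be viewed as sets of simultaneously active speakers (an illustrative sketch only; the exact class ordering is defined by pyannote.audio's `Powerset` utility):
```python
# Illustrative only: one set of active speaker indices per powerset class.
POWERSET_CLASSES = [
    set(),    # non-speech
    {0},      # speaker #1
    {1},      # speaker #2
    {2},      # speaker #3
    {0, 1},   # speakers #1 and #2
    {0, 2},   # speakers #1 and #3
    {1, 2},   # speakers #2 and #3
]
```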

🚀 Quick Start
Prerequisites
- Install the necessary libraries and set up the environment as described in the 📦 Installation section below.
Basic Usage
```python
import torch
from pyannote.audio import Model
from pyannote.audio.utils.powerset import Powerset

# load the pretrained segmentation model (see Usage Examples below)
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# dummy batch of 10-second mono chunks sampled at 16kHz
batch_size = 1
duration, sample_rate, num_channels = 10, 16000, 1
waveform = torch.randn(batch_size, num_channels, duration * sample_rate)

# powerset multi-class encoding
powerset_encoding = model(waveform)

# convert powerset encoding to multilabel encoding
max_speakers_per_chunk, max_speakers_per_frame = 3, 2
to_multilabel = Powerset(
    max_speakers_per_chunk,
    max_speakers_per_frame).to_multilabel
multilabel_encoding = to_multilabel(powerset_encoding)
```
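As a sanity check, the two encodings should have the following shapes (the default model outputs 7 powerset classes for up to 3 speakers):
```python
print(powerset_encoding.shape)    # (batch_size, num_frames, 7): powerset classes
print(multilabel_encoding.shape)  # (batch_size, num_frames, 3): one activation per speaker
```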
✨ Features
- Speaker Segmentation: Ingests 10-second mono audio at 16kHz and outputs speaker diarization as a multi-class matrix.
- Multiple Applications: Can be used for speaker diarization, voice activity detection, overlapped speech detection, etc.
📦 Installation
- Install [pyannote.audio](https://github.com/pyannote/pyannote-audio) 3.0 with `pip install pyannote.audio`.
- Accept [pyannote/segmentation-3.0](https://hf.co/pyannote/segmentation-3.0) user conditions.
- Create an access token at [hf.co/settings/tokens](https://hf.co/settings/tokens).
💻 Usage Examples
Basic Usage
```python
from pyannote.audio import Model

model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
```
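Since the model is a regular PyTorch module, inference can optionally run on GPU (a sketch, assuming a CUDA-capable device is available):
```python
import torch

model.to(torch.device("cuda"))
```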
Speaker Diarization
This model cannot perform speaker diarization of full recordings on its own (it only processes 10-second chunks).
See the [pyannote/speaker-diarization-3.0](https://hf.co/pyannote/speaker-diarization-3.0) pipeline, which combines this model with an additional speaker embedding model to perform speaker diarization of full recordings.
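A minimal sketch of applying that pipeline (assuming its user conditions have been accepted and a valid access token is available):
```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# apply the pipeline to a full recording
diarization = pipeline("audio.wav")

# iterate over speech turns and their speaker labels
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s to {turn.end:.1f}s: {speaker}")
```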
Voice Activity Detection
```python
from pyannote.audio.pipelines import VoiceActivityDetection

pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove speech regions shorter than this many seconds
    "min_duration_on": 0.0,
    # fill non-speech gaps shorter than this many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
```
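The pipeline output `vad` is a `pyannote.core.Annotation`; a short sketch of reading the detected speech regions from it:
```python
# iterate over detected speech regions (merging contiguous segments)
for speech in vad.get_timeline().support():
    print(f"speech from {speech.start:.1f}s to {speech.end:.1f}s")
```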
Overlapped Speech Detection
```python
from pyannote.audio.pipelines import OverlappedSpeechDetection

pipeline = OverlappedSpeechDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove overlapped speech regions shorter than this many seconds
    "min_duration_on": 0.0,
    # fill gaps between overlapped speech regions shorter than this many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
```
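The result can be read in exactly the same way as the voice activity detection output (a sketch):
```python
# iterate over detected overlapped speech regions
for overlap in osd.get_timeline().support():
    print(f"overlapped speech from {overlap.start:.1f}s to {overlap.end:.1f}s")
```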
📚 Documentation
The various concepts behind this model are described in detail in this [paper](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html).
It has been trained by Séverin Baroudi with [pyannote.audio](https://github.com/pyannote/pyannote-audio) 3.0.0,
using the combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.
This [companion repository](https://github.com/FrenchKrab/IS2023-powerset-diarization/) by Alexis Plaquet also provides instructions on how to train or finetune such a model on your own data.
📄 License
This project is licensed under the MIT license.
Citations
```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```
⚠️ Important Note
The collected information will help the maintainers acquire a better knowledge of the pyannote.audio user base and improve it further. Though this model uses the MIT license and will always remain open-source, we will occasionally email you about premium models and paid services around pyannote.
| Property | Details |
|----------|---------|
| Model Type | "Powerset" speaker segmentation |
| Training Data | Combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse |