🚀 "Powerset" Speaker Segmentation
This open-source model performs speaker segmentation: it takes 10 seconds of mono audio sampled at 16kHz as input and outputs speaker diarization results.
Using this open-source model in production?
Make the most of it thanks to our consulting services.
✨ Features
This model ingests 10 seconds of mono audio sampled at 16kHz and outputs speaker diarization as a (num_frames, num_classes) matrix, where the 7 classes are non-speech, speaker #1, speaker #2, speaker #3, speakers #1 and #2, speakers #1 and #3, and speakers #2 and #3.
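For intuition, these 7 classes are exactly the subsets of {speaker #1, speaker #2, speaker #3} with at most two speakers active at once, the empty set being non-speech. Here is a minimal illustration in plain Python (not part of the pyannote API) that enumerates them:

```python
from itertools import combinations

max_speakers_per_chunk, max_speakers_per_frame = 3, 2

# powerset classes = all subsets of at most 2 out of 3 speakers
classes = [
    subset
    for k in range(max_speakers_per_frame + 1)
    for subset in combinations(range(1, max_speakers_per_chunk + 1), k)
]
print(len(classes))  # 7
for subset in classes:
    # prints: non-speech, (1,), (2,), (3,), (1, 2), (1, 3), (2, 3)
    print(subset or "non-speech")
```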

💻 Usage Examples
Basic Usage
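The example below assumes a `model` instance. Assuming pyannote.audio is installed and you have accepted the user conditions and created an access token (see Installation below), it can be loaded like this:

```python
from pyannote.audio import Model

# replace HUGGINGFACE_ACCESS_TOKEN_GOES_HERE with your own access token
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
```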
```python
import torch
from pyannote.audio.utils.powerset import Powerset

# one batch of 10 seconds of mono audio sampled at 16kHz
batch_size = 1
duration, sample_rate, num_channels = 10, 16000, 1
waveform = torch.randn(batch_size, num_channels, duration * sample_rate)

# powerset multi-class encoding: (batch_size, num_frames, num_classes)
powerset_encoding = model(waveform)

# convert to multi-label encoding: (batch_size, num_frames, num_speakers)
max_speakers_per_chunk, max_speakers_per_frame = 3, 2
to_multilabel = Powerset(
    max_speakers_per_chunk,
    max_speakers_per_frame).to_multilabel
multilabel_encoding = to_multilabel(powerset_encoding)
```
Advanced Usage
The various concepts behind this model are described in detail in this [paper](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html).
It has been trained by Séverin Baroudi with [pyannote.audio](https://github.com/pyannote/pyannote-audio) 3.0.0
using the combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.
This [companion repository](https://github.com/FrenchKrab/IS2023-powerset-diarization/) by Alexis Plaquet also provides instructions on how to train or finetune such a model on your own data, as sketched below.
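As a rough sketch of what fine-tuning looks like with pyannote.audio 3.0 (see the companion repository for authoritative instructions), assuming your data is exposed as a hypothetical pyannote.database protocol named MyDatabase.SpeakerDiarization.MyProtocol:

```python
from pyannote.audio import Model
from pyannote.audio.tasks import SpeakerDiarization
from pyannote.database import registry
from pytorch_lightning import Trainer

# hypothetical protocol name; replace with your own registered protocol
protocol = registry.get_protocol("MyDatabase.SpeakerDiarization.MyProtocol")

model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# attach a speaker diarization task with the same powerset configuration
model.task = SpeakerDiarization(
    protocol, duration=10.0,
    max_speakers_per_chunk=3, max_speakers_per_frame=2)

trainer = Trainer(max_epochs=1)  # tune for your own dataset
trainer.fit(model)
```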
📦 Installation
- Install [pyannote.audio](https://github.com/pyannote/pyannote-audio) 3.0 with `pip install pyannote.audio`
- Accept [pyannote/segmentation-3.0](https://hf.co/pyannote/segmentation-3.0) user conditions
- Create an access token at [hf.co/settings/tokens](https://hf.co/settings/tokens)
📚 Documentation
Speaker diarization
This model cannot be used to perform speaker diarization of full recordings on its own (it only processes 10-second chunks).
See the [pyannote/speaker-diarization-3.0](https://hf.co/pyannote/speaker-diarization-3.0) pipeline, which uses an additional speaker embedding model to perform speaker diarization of full recordings.
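For reference, here is how that pipeline is typically applied to a full recording (same access token as above):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# diarize a full recording and print speaker turns
diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```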
Voice activity detection
```python
from pyannote.audio.pipelines import VoiceActivityDetection

pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove speech regions shorter than that many seconds
    "min_duration_on": 0.0,
    # fill non-speech regions shorter than that many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
# `vad` is a pyannote.core.Annotation instance containing speech regions
```
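For example, the detected speech regions can be listed like this:

```python
# print detected speech regions (start/end in seconds)
for speech in vad.get_timeline().support():
    print(f"speech from {speech.start:.1f}s to {speech.end:.1f}s")
```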
Overlapped speech detection
```python
from pyannote.audio.pipelines import OverlappedSpeechDetection

pipeline = OverlappedSpeechDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove overlapped speech regions shorter than that many seconds
    "min_duration_on": 0.0,
    # fill non-overlapped speech regions shorter than that many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
# `osd` is a pyannote.core.Annotation instance containing overlapped speech regions
```
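Similarly, `osd` is a pyannote.core.Annotation; for instance, the total amount of overlapped speech in the file can be computed like this:

```python
# total duration (in seconds) of overlapped speech in the file
total_overlap = osd.get_timeline().support().duration()
print(f"{total_overlap:.1f}s of overlapped speech")
```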
📄 License
This model is licensed under the MIT license.
The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this model uses the MIT license and will always remain open-source, we will occasionally email you about premium models and paid services around pyannote.
📚 Citations
```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```