🚀 "Powerset" Speaker Segmentation
This open-source model is designed for speaker segmentation. It takes in 10 seconds of mono audio sampled at 16kHz and outputs speaker diarization as a matrix. It can be used in various audio analysis tasks such as speaker diarization, voice activity detection, and overlapped speech detection.
Using this open-source model in production?
Consider switching to pyannoteAI for better and faster options.
✨ Features
- Speaker Segmentation: Ingests 10 seconds of 16kHz mono audio and outputs speaker diarization as a (num_frames, num_classes) matrix with 7 classes covering non-speech and the different speaker combinations (enumerated in the sketch after this list).
- Multi-Encoding: Outputs powerset multi-class encoding, which can be converted to multi-label encoding.
- Multiple Applications: Can be used for voice activity detection and overlapped speech detection through pipelines.
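To make the 7 classes concrete, here is a minimal illustration (not a library API) that enumerates the powerset classes for at most 3 speakers per chunk and at most 2 simultaneous speakers per frame: non-speech, each single speaker, and each pair of speakers.

```python
from itertools import combinations

# illustration only: powerset classes for max_speakers_per_chunk=3
# and max_speakers_per_frame=2
speakers = ["spk1", "spk2", "spk3"]
powerset_classes = [()]  # the empty set encodes non-speech
for size in (1, 2):
    powerset_classes += list(combinations(speakers, size))

print(len(powerset_classes))  # 7
# [(), ('spk1',), ('spk2',), ('spk3',),
#  ('spk1', 'spk2'), ('spk1', 'spk3'), ('spk2', 'spk3')]
```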
📦 Installation
Requirements
- Install `pyannote.audio` 3.0 with `pip install pyannote.audio`.
- Accept the `pyannote/segmentation-3.0` user conditions at hf.co/pyannote/segmentation-3.0.
- Create an access token at hf.co/settings/tokens.
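Optionally, you can sanity-check the installed version from Python:

```python
# optional sanity check: pyannote.audio 3.0 or later is required
import pyannote.audio
print(pyannote.audio.__version__)
```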
💻 Usage Examples
Basic Usage
```python
import torch
from pyannote.audio import Model
from pyannote.audio.utils.powerset import Powerset

# instantiate the pretrained model (see Installation for the access token)
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# one batch of 10s mono chunks sampled at 16kHz
batch_size, duration, sample_rate, num_channels = 1, 10, 16000, 1
waveform = torch.randn(batch_size, num_channels, duration * sample_rate)
powerset_encoding = model(waveform)

# convert from (default) powerset multi-class encoding to multi-label encoding
max_speakers_per_chunk, max_speakers_per_frame = 3, 2
to_multilabel = Powerset(
    max_speakers_per_chunk,
    max_speakers_per_frame).to_multilabel
multilabel_encoding = to_multilabel(powerset_encoding)
```
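The exact frame count depends on the model's internal stride, so checking shapes at runtime is the safest way to see what you got; a minimal sketch (shape layout per the Features section above):

```python
# powerset_encoding: (batch_size, num_frames, 7) — one score per powerset class
# multilabel_encoding: (batch_size, num_frames, 3) — one activity per speaker
print(powerset_encoding.shape, multilabel_encoding.shape)
```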
Advanced Usage - Speaker Diarization
This model cannot be used to perform speaker diarization of full recordings on its own (it only processes 10s chunks).
See the pyannote/speaker-diarization-3.0 pipeline, which uses an additional speaker embedding model to perform speaker diarization of full recordings.
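For reference, a minimal sketch of running that pipeline (same access token as in Basic Usage; `itertracks` is standard pyannote.core usage):

```python
from pyannote.audio import Pipeline

# full-recording speaker diarization with the dedicated pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
diarization = pipeline("audio.wav")

# print each speaker turn
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```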
Advanced Usage - Voice Activity Detection
```python
from pyannote.audio.pipelines import VoiceActivityDetection

# reuse the segmentation model instantiated in Basic Usage
pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove speech regions shorter than that many seconds
    "min_duration_on": 0.0,
    # fill non-speech regions shorter than that many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
```
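The returned `vad` is a `pyannote.core.Annotation`; a minimal sketch of listing the detected speech regions with standard pyannote.core calls:

```python
# merge overlapping segments and print each speech region
for speech in vad.get_timeline().support():
    print(f"speech from {speech.start:.1f}s to {speech.end:.1f}s")
```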
Advanced Usage - Overlapped Speech Detection
```python
from pyannote.audio.pipelines import OverlappedSpeechDetection

# reuse the segmentation model instantiated in Basic Usage
pipeline = OverlappedSpeechDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove overlapped speech regions shorter than that many seconds
    "min_duration_on": 0.0,
    # fill non-overlapped regions shorter than that many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
```
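Like `vad` above, the returned `osd` is a `pyannote.core.Annotation`, so its overlap regions can be listed with the same `get_timeline().support()` idiom.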
📚 Documentation
The various concepts behind this model are described in detail in this paper.
It has been trained by Séverin Baroudi with pyannote.audio 3.0.0, using the combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.
This companion repository by Alexis Plaquet also provides instructions on how to train or finetune such a model on your own data.
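For orientation only, a hedged sketch of what finetuning on your own data might look like with pyannote.audio 3.0; the protocol name and trainer settings below are placeholders, and the companion repository remains the authoritative recipe:

```python
import pytorch_lightning as pl
from pyannote.audio import Model
from pyannote.audio.tasks import SpeakerDiarization
from pyannote.database import get_protocol

# placeholder protocol: point this at your own pyannote.database setup
protocol = get_protocol("MyDatabase.SpeakerDiarization.MyProtocol")

# start from the pretrained checkpoint and attach a training task
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
model.task = SpeakerDiarization(
    protocol, duration=10.0,
    max_speakers_per_chunk=3, max_speakers_per_frame=2)

trainer = pl.Trainer(max_epochs=1)  # placeholder trainer settings
trainer.fit(model)
```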
📄 License
This model is licensed under the MIT license.
📖 Citations
```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```
📋 Model Information
| Property | Details |
|----------|---------|
| Model Type | "Powerset" speaker segmentation |
| Training Data | Combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse |
| License | MIT |
⚠️ Important Note
The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this model uses the MIT license and will always remain open-source, we will occasionally email you about premium models and paid services around pyannote.
💡 Usage Tip
If you want to perform speaker diarization of full recordings, use the pyannote/speaker-diarization-3.0 pipeline instead of this model alone.