🚀 "Powerset" Speaker Segmentation
This open-source model ingests 10 seconds of mono audio sampled at 16kHz and outputs speaker diarization, making it a building block for voice activity detection, overlapped speech detection, and full-recording speaker diarization pipelines.
🚀 Quick Start
Requirements
- Install `pyannote.audio` 3.0 with `pip install pyannote.audio`.
- Accept `pyannote/segmentation-3.0` user conditions.
- Create an access token at `hf.co/settings/tokens`.
Usage
```python
from pyannote.audio import Model

model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
```
✨ Features
- Speaker Diarization: This model ingests 10 seconds of mono audio sampled at 16kHz and outputs speaker diarization as a (num_frames, num_classes) matrix. The 7 classes are non-speech, speaker #1, speaker #2, speaker #3, speakers #1 and #2, speakers #1 and #3, and speakers #2 and #3 (see the sketch after this list).
- Voice Activity Detection: Can be used in a voice activity detection pipeline.
- Overlapped Speech Detection: Can be used in an overlapped speech detection pipeline.
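With 3 speakers per chunk and at most 2 active per frame, the powerset contains 1 + 3 + 3 = 7 classes: every subset of at most 2 speakers, including the empty one (non-speech). Here is a minimal, self-contained sketch of that enumeration, in plain Python for illustration only; `pyannote.audio.utils.powerset.Powerset` builds an equivalent mapping internally:

```python
from itertools import combinations

# enumerate every subset of at most 2 out of 3 speakers: 1 + 3 + 3 = 7 classes
speakers = ["speaker #1", "speaker #2", "speaker #3"]
classes = [combo for k in range(3) for combo in combinations(speakers, k)]

print(len(classes))  # 7
for combo in classes:
    print("non-speech" if not combo else " and ".join(combo))
```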
📦 Installation
Install the necessary library using the following command:
```
pip install pyannote.audio
```
💻 Usage Examples
Basic Usage
```python
import torch
from pyannote.audio.utils.powerset import Powerset

# `model` is loaded as in the Quick Start section above
batch_size = 1
duration, sample_rate, num_channels = 10, 16000, 1
waveform = torch.randn(batch_size, num_channels, duration * sample_rate)

# (batch_size, num_frames, num_classes) powerset encoding
powerset_encoding = model(waveform)

# convert from powerset to multilabel (one column per speaker) encoding
max_speakers_per_chunk, max_speakers_per_frame = 3, 2
to_multilabel = Powerset(
    max_speakers_per_chunk,
    max_speakers_per_frame).to_multilabel
multilabel_encoding = to_multilabel(powerset_encoding)
```
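As a quick sanity check, continuing the example above (the sizes follow from the 7 powerset classes and 3 speakers described under Features):

```python
print(powerset_encoding.shape)    # (batch_size, num_frames, 7): one score per powerset class
print(multilabel_encoding.shape)  # (batch_size, num_frames, 3): one column per speaker
```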
Advanced Usage
Speaker Diarization
This model cannot perform speaker diarization of full recordings on its own (it only processes 10s chunks). See the [pyannote/speaker-diarization-3.0](https://hf.co/pyannote/speaker-diarization-3.0) pipeline, which uses an additional speaker embedding model to perform full-recording speaker diarization.
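For reference, a minimal sketch of running that pipeline with the standard `pyannote.audio` pipeline API; the output is a `pyannote.core.Annotation` with one labeled segment per speaker turn:

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
diarization = pipeline("audio.wav")

# print one line per speaker turn
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```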
Voice Activity Detection
```python
from pyannote.audio.pipelines import VoiceActivityDetection

pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove speech regions shorter than that many seconds
    "min_duration_on": 0.0,
    # fill non-speech regions shorter than that many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
```
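`vad` is a `pyannote.core.Annotation`; one way to list the detected speech regions (a usage sketch, not part of the original example):

```python
# merge adjacent/overlapping segments and print each speech region
for speech in vad.get_timeline().support():
    print(f"speech from {speech.start:.1f}s to {speech.end:.1f}s")
```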
Overlapped Speech Detection
```python
from pyannote.audio.pipelines import OverlappedSpeechDetection

pipeline = OverlappedSpeechDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove overlapped-speech regions shorter than that many seconds
    "min_duration_on": 0.0,
    # fill non-overlapped regions shorter than that many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
```
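Similarly, `osd` is a `pyannote.core.Annotation`. As an illustrative follow-up (assuming a local `audio.wav`), one can estimate how much of the file is overlapped speech:

```python
from pyannote.audio import Audio

# fraction of the recording covered by overlapped speech
total = Audio().get_duration("audio.wav")
overlap = osd.get_timeline().support().duration()
print(f"{100 * overlap / total:.1f}% of the file is overlapped speech")
```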
📚 Documentation
The various concepts behind this model are described in detail in this [paper](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html).
This [companion repository](https://github.com/FrenchKrab/IS2023-powerset-diarization/) by Alexis Plaquet also provides instructions on how to train or fine-tune such a model on your own data.
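For orientation only, a minimal fine-tuning sketch following the usual `pyannote.audio` task/trainer pattern; `MyDatabase.SpeakerDiarization.MyProtocol` is a hypothetical `pyannote.database` protocol name to replace with your own, and the companion repository remains the authoritative recipe:

```python
import pytorch_lightning as pl
from pyannote.audio import Model
from pyannote.audio.tasks import Segmentation
from pyannote.database import registry

# hypothetical protocol name: replace with your own pyannote.database protocol
protocol = registry.get_protocol("MyDatabase.SpeakerDiarization.MyProtocol")

# same chunk/speaker configuration as the pretrained model
task = Segmentation(
    protocol,
    duration=10.0,
    max_speakers_per_chunk=3,
    max_speakers_per_frame=2)

# start from the pretrained checkpoint and fine-tune on the new task
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
model.task = task

trainer = pl.Trainer(max_epochs=1)
trainer.fit(model)
```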
📄 License
This project is licensed under the MIT license.
📄 Citations
```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}

@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```
⚠️ Important Note
Using this open-source model in production? Consider switching to pyannoteAI for better and faster options.
⚠️ Important Note
The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this model uses the MIT license and will always remain open-source, we will occasionally email you about premium models and paid services around pyannote.
| Property | Details |
|----------|---------|
| Tags | pyannote, pyannote-audio, pyannote-audio-model, audio, voice, speech, speaker, speaker-diarization, speaker-change-detection, speaker-segmentation, voice-activity-detection, overlapped-speech-detection, resegmentation |
| License | MIT |
| Inference | false |