🚀 Speaker diarization 3.1
This open-source pipeline offers speaker diarization capabilities. It resolves the onnxruntime issue in the previous version, runs on pure PyTorch for easier deployment and potentially faster inference, and requires pyannote.audio version 3.1 or higher.
Using this open-source pipeline in production? Make the most of it thanks to our consulting services.
🚀 Quick Start
This pipeline is the same as [pyannote/speaker-diarization-3.0](https://hf.co/pyannote/speaker-diarization-3.0) except it removes the [problematic](https://github.com/pyannote/pyannote-audio/issues/1537) use of onnxruntime. Both speaker segmentation and embedding now run in pure PyTorch. This should ease deployment and possibly speed up inference. It requires pyannote.audio version 3.1 or higher.
It ingests mono audio sampled at 16kHz and outputs speaker diarization as an [Annotation](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance:
- Stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
- Audio files sampled at a different rate are resampled to 16kHz automatically upon loading.
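Speaker turns can then be iterated over directly from that instance. A minimal sketch, assuming a `diarization` result obtained by running the pipeline as in the usage examples below:

```python
# iterate over speaker turns in the diarization output
# (`diarization` is the Annotation returned by the pipeline)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # `turn` is a pyannote.core Segment with .start and .end in seconds
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```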
📦 Installation
- Install [pyannote.audio](https://github.com/pyannote/pyannote-audio) 3.1 with `pip install pyannote.audio`
- Accept [pyannote/segmentation-3.0](https://hf.co/pyannote/segmentation-3.0) user conditions
- Accept [pyannote/speaker-diarization-3.1](https://hf.co/pyannote/speaker-diarization-3.1) user conditions
- Create access token at [hf.co/settings/tokens](https://hf.co/settings/tokens).
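To confirm that the installed version meets the 3.1 requirement, a quick check (a minimal sketch, assuming the package exposes its version via the usual `__version__` attribute):

```python
import pyannote.audio

# the pipeline below requires pyannote.audio 3.1 or higher
print(pyannote.audio.__version__)
```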
💻 Usage Examples
Basic Usage
```python
from pyannote.audio import Pipeline

# instantiate the pretrained pipeline
# (requires accepting the user conditions listed above)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# apply the pipeline to an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```
Advanced Usage
Processing on GPU
pyannote.audio pipelines run on CPU by default. You can send them to GPU with the following lines:

```python
import torch

pipeline.to(torch.device("cuda"))
```
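A common pattern (not specific to pyannote.audio) is to fall back to CPU when no GPU is available:

```python
import torch

# use the GPU when available, otherwise stay on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline.to(device)
```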
Processing from memory
Pre-loading audio files in memory may result in faster processing:

```python
import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```
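If the in-memory waveform is not sampled at 16kHz, you may prefer to resample it yourself before calling the pipeline. A minimal sketch using torchaudio, where the 16000 target matches the expected input rate described above:

```python
import torchaudio.functional as F

target_sample_rate = 16000
if sample_rate != target_sample_rate:
    # resample the in-memory waveform to the 16kHz expected by the pipeline
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=target_sample_rate)
    sample_rate = target_sample_rate

diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```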
Monitoring progress
Hooks are available to monitor the progress of the pipeline:

```python
from pyannote.audio.pipelines.utils.hook import ProgressHook

with ProgressHook() as hook:
    diarization = pipeline("audio.wav", hook=hook)
```
Controlling the number of speakers
In case the number of speakers is known in advance, one can use the `num_speakers` option:

```python
diarization = pipeline("audio.wav", num_speakers=2)
```
One can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` options:

```python
diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
```
📚 Documentation
This pipeline has been benchmarked on a large collection of datasets.

Processing is fully automatic:
- No manual voice activity detection (as is sometimes the case in the literature)
- No manual number of speakers (though it is possible to provide it to the pipeline)
- No fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named "Full" in this paper; see the sketch after this list):
- No forgiveness collar
- Evaluation of overlapped speech
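A minimal sketch of this "Full" evaluation setup using pyannote.metrics, where `reference` and `hypothesis` are illustrative placeholders for the ground-truth and pipeline-output Annotation objects of the same file:

```python
from pyannote.metrics.diarization import DiarizationErrorRate

# "Full" setup: no forgiveness collar, overlapped speech is evaluated
metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)

# reference: ground-truth Annotation, hypothesis: pipeline output Annotation
der = metric(reference, hypothesis)
print(f"DER = {100 * der:.1f}%")
```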
| Property | Details |
| --- | --- |
| Model Type | Speaker diarization pipeline |
| Training Data | Not specified in the provided README |

| Benchmark | DER% | FA% | Miss% | Conf% | Expected output | File-level evaluation |
| --- | --- | --- | --- | --- | --- | --- |
| AISHELL-4 | 12.2 | 3.8 | 4.4 | 4.0 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.eval) |
| AliMeeting (channel 1) | 24.4 | 4.4 | 10.0 | 10.0 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.eval) |
| AMI (headset mix, [only_words](https://github.com/BUTSpeechFIT/AMI-diarization-setup)) | 18.8 | 3.6 | 9.5 | 5.7 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.eval) |
| AMI (array1, channel 1, [only_words](https://github.com/BUTSpeechFIT/AMI-diarization-setup)) | 22.4 | 3.8 | 11.2 | 7.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.eval) |
| AVA-AVD | 50.0 | 10.8 | 15.7 | 23.4 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.eval) |
| DIHARD 3 (Full) | 21.7 | 6.2 | 8.1 | 7.3 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.eval) |
| [MSDWild](https://x-lance.github.io/MSDWILD/) | 25.3 | 5.8 | 8.0 | 11.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.eval) |
| [REPERE (phase 2)](https://islrn.org/resources/360-758-359-485-0/) | 7.8 | 1.8 | 2.6 | 3.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.eval) |
| VoxConverse (v0.3) | 11.3 | 4.1 | 3.4 | 3.8 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.eval) |
📄 License
The pipeline uses the MIT license.
⚠️ Important Note
The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this pipeline uses the MIT license and will always remain open-source, we will occasionally email you about premium pipelines and paid services around pyannote.
Citations
```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```

```bibtex
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```