Open-source Speaker Diarization Model pyannote-speaker-diarization-endpoint - Automatically Detect Speaker Changes and Voice Activity in Audio

Pyannote Speaker Diarization Endpoint

Developed by philschmid

Speaker diarization model based on pyannote.audio 2.0 for automatic detection of speaker changes and speech activity in audio

Category: Speaker Analysis. License: MIT (open source). Tags: speaker diarization, overlapping speech detection, automatic speaker counting.

Downloads 51

Release date: 10/7/2022

Model Overview

This model is an end-to-end speaker diarization system capable of automatically detecting speaker changes, speech activity, and overlapping speech in audio, completing speaker diarization tasks without manual intervention.

Model Features

Fully automated processing

Performs segmentation without manual speech activity detection or specifying the number of speakers

Overlapping speech detection

Capable of detecting and handling overlapping speech scenarios

Speaker count adaptation

Automatically determines the number of speakers; the exact count, or lower and upper bounds on it, can also be specified manually

High performance

Diarization error rates between 12.62% and 30.24% across seven benchmark datasets, with no per-dataset tuning

Model Capabilities

Speaker diarization

Speech activity detection

Overlapping speech detection

Speaker change detection

Automatic speaker counting

Use Cases

Meeting transcription

Meeting transcription segmentation

Automatically segments different speakers in meeting recordings

Achieves an 18.21% diarization error rate (DER) on the AMI dataset (Mix-Headset, only_words)

Call recording analysis

Customer service call analysis

Automatically distinguishes between agent and customer speech segments

Achieves 30.24% DER on the CALLHOME dataset (Part2)

Media content analysis

Interview program analysis

Automatically identifies hosts and guests in interview programs

Achieves 12.76% DER on the VoxConverse dataset (v0.0.2)

🚀 🎹 Speaker diarization

This project provides a speaker diarization solution relying on pyannote.audio 2.0, enabling automatic processing of audio files to distinguish different speakers.

🚀 Quick Start

Relies on pyannote.audio 2.0: see installation instructions.

💻 Usage Examples

Basic Usage

# load the pipeline from Hugging Face Hub
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2022.07")

# apply the pipeline to an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
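
Once the RTTM file is on disk, it can be read back without pyannote.audio. The following is a minimal sketch, assuming the standard RTTM layout in which each SPEAKER record carries the onset in the fourth field, the duration in the fifth, and the speaker label in the eighth. The `Turn` type, the `parse_rttm` function, and the sample records are made up for illustration and are not part of pyannote.audio.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    start: float    # seconds from the beginning of the file
    end: float      # seconds from the beginning of the file
    speaker: str    # speaker label, e.g. "SPEAKER_00"


def parse_rttm(text: str) -> List[Turn]:
    """Parse SPEAKER records from RTTM text into a list of turns."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and record types other than SPEAKER.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset, duration = float(fields[3]), float(fields[4])
        turns.append(Turn(onset, onset + duration, fields[7]))
    return turns


# Hypothetical two-speaker excerpt, made up for illustration.
sample = """\
SPEAKER audio 1 0.500 2.000 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 2.250 1.500 <NA> <NA> SPEAKER_01 <NA> <NA>
"""

for turn in parse_rttm(sample):
    print(f"{turn.start:.2f}-{turn.end:.2f} {turn.speaker}")
```

In practice, the text would come from reading "audio.rttm" rather than from an inline string.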

Advanced Usage

In case the number of speakers is known in advance, one can use the num_speakers option:

diarization = pipeline("audio.wav", num_speakers=2)

One can also provide lower and/or upper bounds on the number of speakers using min_speakers and max_speakers options:

diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
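
When these options come from user input or a config file, it can help to sanity-check them before calling the pipeline. The sketch below is a hypothetical helper (`speaker_options` is not part of pyannote.audio) that drops unset options and rejects obviously inconsistent values; the checks are assumptions about what is sensible, not a description of the library's own validation.

```python
from typing import Dict, Optional


def speaker_options(num_speakers: Optional[int] = None,
                    min_speakers: Optional[int] = None,
                    max_speakers: Optional[int] = None) -> Dict[str, int]:
    """Hypothetical helper: validate speaker-count options and drop unset ones."""
    given = {"num_speakers": num_speakers,
             "min_speakers": min_speakers,
             "max_speakers": max_speakers}
    options = {name: value for name, value in given.items() if value is not None}
    for name, value in options.items():
        if value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")
    if "min_speakers" in options and "max_speakers" in options:
        if options["min_speakers"] > options["max_speakers"]:
            raise ValueError("min_speakers cannot exceed max_speakers")
    return options


print(speaker_options(min_speakers=2, max_speakers=5))

# Intended use (requires pyannote.audio and the pipeline loaded earlier):
# diarization = pipeline("audio.wav", **speaker_options(min_speakers=2, max_speakers=5))
```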

If you feel adventurous, you can try and play with the various pipeline hyper-parameters.
For instance, one can use a more aggressive voice activity detection by increasing the value of the segmentation_onset threshold:

hparams = pipeline.parameters(instantiated=True)
hparams["segmentation_onset"] += 0.1
pipeline.instantiate(hparams)
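
To build intuition for what an onset threshold does, here is a toy sketch of onset/offset (hysteresis) thresholding over per-frame speech scores. It is an illustration only, not pyannote.audio's implementation: the `active_regions` function, the scores, and the threshold values are all made up. A region opens when a score rises above the onset threshold and closes when it falls below the offset threshold, so a higher onset admits fewer frames as speech.

```python
from typing import List, Tuple


def active_regions(scores: List[float], onset: float, offset: float) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges detected by onset/offset thresholding."""
    regions = []
    start = None  # index of the frame where the current region opened, if any
    for i, score in enumerate(scores):
        if start is None and score > onset:
            start = i                      # region opens
        elif start is not None and score < offset:
            regions.append((start, i))     # region closes
            start = None
    if start is not None:
        regions.append((start, len(scores)))  # region still open at the end
    return regions


# Made-up per-frame speech scores in [0, 1].
scores = [0.1, 0.55, 0.7, 0.9, 0.6, 0.2, 0.58, 0.3]

print(active_regions(scores, onset=0.5, offset=0.5))
print(active_regions(scores, onset=0.6, offset=0.5))
```

With the higher onset, the weak burst near the end no longer opens a region and the first region starts one frame later.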

📚 Documentation

Benchmark

Real-time factor

Real-time factor is around 5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).

In other words, it takes approximately 3 minutes to process a one-hour conversation.
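
The arithmetic behind that estimate is simply the audio duration multiplied by the real-time factor. A minimal sketch follows; the function name is made up, and actual processing time will depend on hardware.

```python
def processing_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Estimated processing time for a recording of the given duration."""
    return audio_seconds * real_time_factor


one_hour = 60 * 60  # seconds
estimate = processing_seconds(one_hour, real_time_factor=0.05)
print(f"{estimate / 60:.1f} minutes")
```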

Accuracy

This pipeline is benchmarked on a growing collection of datasets.

Processing is fully automatic:

no manual voice activity detection (as is sometimes the case in the literature)
no manual number of speakers (though it is possible to provide it to the pipeline)
no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named "Full" in this paper):

no forgiveness collar
evaluation of overlapped speech

Benchmark	DER%	FA%	Miss%	Conf%	Expected output	File-level evaluation
AISHELL-4	14.61	3.31	4.35	6.95	RTTM	eval
AMI Mix-Headset only_words	18.21	3.28	11.07	3.87	RTTM	eval
AMI Array1-01 only_words	29.00	2.71	21.61	4.68	RTTM	eval
CALLHOME Part2	30.24	3.71	16.86	9.66	RTTM	eval
DIHARD 3 Full	20.99	4.25	10.74	6.00	RTTM	eval
REPERE Phase 2	12.62	1.55	3.30	7.76	RTTM	eval
VoxConverse v0.0.2	12.76	3.45	3.85	5.46	RTTM	eval
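
In the table above, DER is the sum of the three error components: false alarm (FA), missed detection (Miss), and speaker confusion (Conf), each expressed as a percentage of the total reference speech duration. The sketch below recomputes DER from the component columns as a consistency check. The values are copied from the table; the function name is made up, and differences of 0.01 are to be expected from rounding in the reported figures.

```python
# (benchmark, DER%, FA%, Miss%, Conf%) copied from the benchmark table
rows = [
    ("AISHELL-4", 14.61, 3.31, 4.35, 6.95),
    ("AMI Mix-Headset only_words", 18.21, 3.28, 11.07, 3.87),
    ("AMI Array1-01 only_words", 29.00, 2.71, 21.61, 4.68),
    ("CALLHOME Part2", 30.24, 3.71, 16.86, 9.66),
    ("DIHARD 3 Full", 20.99, 4.25, 10.74, 6.00),
    ("REPERE Phase 2", 12.62, 1.55, 3.30, 7.76),
    ("VoxConverse v0.0.2", 12.76, 3.45, 3.85, 5.46),
]


def der_from_components(false_alarm: float, miss: float, confusion: float) -> float:
    """DER as the sum of its three error components (all in percent)."""
    return false_alarm + miss + confusion


for name, reported, fa, miss, conf in rows:
    recomputed = der_from_components(fa, miss, conf)
    print(f"{name}: reported {reported:.2f}, recomputed {recomputed:.2f}")
```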

📄 Support

For commercial enquiries and scientific consulting, please contact me.
For technical questions and bug reports, please check the pyannote.audio GitHub repository.

📚 Citations

@inproceedings{Bredin2021,
  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Address = {Brno, Czech Republic},
  Month = {August},
  Year = {2021},
}

@inproceedings{Bredin2020,
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Address = {Barcelona, Spain},
  Month = {May},
  Year = {2020},
}
