Speaker Diarization

Developed by pyannote

Speaker diarization pipeline based on pyannote.audio 2.1.1, used to automatically detect speaker changes and overlapped speech in audio.

Speaker Analysis | Open Source | License: MIT | #Overlap speech detection #Speaker diarization #Real-time processing

Downloads 910.93k

Release Time : 3/2/2022

Model Overview

This model is an end-to-end speaker diarization pipeline. It detects speaker changes and overlapped speech and segments the audio by speaker, without requiring the number of speakers to be specified in advance.

Model Features

Fully automatic processing

Completes segmentation without manual voice activity detection or specifying the number of speakers

Overlap speech detection

Accurately identifies and processes speech segments with overlapping speakers

Speaker count adaptation

Automatically adapts to different numbers of speakers. The exact number of speakers, or a lower and/or upper bound on it, can also be specified manually.

High performance

Benchmarked on multiple datasets (for example, a DER of 8.17% on REPERE phase 2 and 11.24% on VoxConverse v0.3), with a real-time factor of approximately 2.5%

Model Capabilities

Speaker diarization

Speaker change detection

Voice activity detection

Overlap speech detection

Automatic speech recognition assistance

Use Cases

Meeting transcription

Meeting transcription speaker diarization

Automatically identifies speech segments from different speakers in meeting recordings

DER of 18.91% on the AMI dataset (headset mix, only_words)

Media analysis

Broadcast program speaker analysis

Analyzes speaker changes and overlap situations in broadcast programs

DER of 20.82% on This American Life dataset

Speech recognition preprocessing

ASR system preprocessing

Provides speaker diarization information for automatic speech recognition systems

tags:
- pyannote
- pyannote-audio
- pyannote-audio-pipeline
- audio
- voice
- speech
- speaker
- speaker-diarization
- speaker-change-detection
- voice-activity-detection
- overlapped-speech-detection
- automatic-speech-recognition
datasets:
- ami
- dihard
- voxconverse
- aishell
- repere
- voxceleb
license: mit
extra_gated_prompt: "The collected information will help acquire a better knowledge of pyannote.audio userbase and help its maintainers apply for grants to improve it further. If you are an academic researcher, please cite the relevant papers in your own publications using the model. If you work for a company, please consider contributing back to pyannote.audio development (e.g. through unrestricted gifts). We also provide scientific consulting services around speaker diarization and machine listening."
extra_gated_fields:
  Company/university: text
  Website: text
  I plan to use this model for (task, type of audio data, etc): text

Using this open-source model in production?
Consider switching to pyannoteAI for better and faster options.

🎹 Speaker diarization

Relies on pyannote.audio 2.1.1: see installation instructions.

TL;DR

# 1. visit hf.co/pyannote/speaker-diarization and accept user conditions
# 2. visit hf.co/pyannote/segmentation and accept user conditions
# 3. visit hf.co/settings/tokens to create an access token
# 4. instantiate pretrained speaker diarization pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1",
                                    use_auth_token="ACCESS_TOKEN_GOES_HERE")


# apply the pipeline to an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
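
The RTTM file written above is plain text with one speaker turn per line, so it can be inspected without pyannote.audio. The following is a minimal sketch, assuming the standard RTTM field order (turn duration in the fifth field, speaker label in the eighth). The helper name `speaking_time_per_speaker` and the sample lines are made up for illustration and are not part of the pipeline.

```python
from collections import defaultdict


def speaking_time_per_speaker(rttm_lines):
    """Sum turn durations (in seconds) per speaker label from RTTM lines."""
    totals = defaultdict(float)
    for line in rttm_lines:
        fields = line.split()
        # Skip blank lines and anything that is not a SPEAKER record.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        duration = float(fields[4])  # fifth field: turn duration
        speaker = fields[7]          # eighth field: speaker label
        totals[speaker] += duration
    return dict(totals)


# Hypothetical RTTM content, invented for this example.
SAMPLE_RTTM = """\
SPEAKER audio 1 0.500 2.000 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 2.250 1.500 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER audio 1 4.000 1.000 <NA> <NA> SPEAKER_00 <NA> <NA>
"""

totals = speaking_time_per_speaker(SAMPLE_RTTM.splitlines())
for speaker, seconds in sorted(totals.items()):
    print(f"{speaker}: {seconds:.2f} s")
```

To run it on a real file, pass an open file object instead of the sample lines, for example `speaking_time_per_speaker(open("audio.rttm"))`.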

Advanced usage

In case the number of speakers is known in advance, one can use the num_speakers option:

diarization = pipeline("audio.wav", num_speakers=2)

One can also provide lower and/or upper bounds on the number of speakers using min_speakers and max_speakers options:

diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)

Benchmark

Real-time factor

Real-time factor is around 2.5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).

In other words, it takes approximately 1.5 minutes to process a one-hour conversation.
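
The arithmetic behind that estimate can be sketched as follows. The function name `processing_time_seconds` is hypothetical, and the 2.5% figure applies only to the hardware quoted above, so treat the result as a rough estimate rather than a guarantee on other machines.

```python
def processing_time_seconds(audio_seconds, real_time_factor=0.025):
    """Estimate processing time as real-time factor times audio duration."""
    return audio_seconds * real_time_factor


one_hour = 60 * 60  # seconds of audio
estimate = processing_time_seconds(one_hour)
print(f"{estimate / 60:.1f} minutes")
```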

Accuracy

This pipeline is benchmarked on a growing collection of datasets.

Processing is fully automatic:

no manual voice activity detection (as is sometimes the case in the literature)
no manual number of speakers (though it is possible to provide it to the pipeline)
no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named "Full" in this paper):

no forgiveness collar
evaluation of overlapped speech

Benchmark	DER%	FA%	Miss%	Conf%	Expected output	File-level evaluation
AISHELL-4	14.09	5.17	3.27	5.65	RTTM	eval
Albayzin (RTVE 2022)	25.60	5.58	6.84	13.18	RTTM	eval
AliMeeting (channel 1)	27.42	4.84	14.00	8.58	RTTM	eval
AMI (headset mix, only_words)	18.91	4.48	9.51	4.91	RTTM	eval
AMI (array1, channel 1, only_words)	27.12	4.11	17.78	5.23	RTTM	eval
CALLHOME (part2)	32.37	6.30	13.72	12.35	RTTM	eval
DIHARD 3 (Full)	26.94	10.50	8.41	8.03	RTTM	eval
Ego4D v1 (validation)	63.99	3.91	44.42	15.67	RTTM	eval
REPERE (phase 2)	8.17	2.23	2.49	3.45	RTTM	eval
This American Life	20.82	2.03	11.89	6.90	RTTM	eval
VoxConverse (v0.3)	11.24	4.42	2.88	3.94	RTTM	eval
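
In the table above, the DER% column matches the sum of the FA% (false alarm), Miss% (missed detection) and Conf% (speaker confusion) columns, up to rounding of the published figures. This is consistent with the usual definition of diarization error rate, although the source does not state the formula explicitly. A minimal sketch that checks this on a few rows copied from the table (the function name `der_from_components` is an assumption made for illustration):

```python
# (benchmark, DER%, FA%, Miss%, Conf%), copied from the table above.
ROWS = [
    ("AISHELL-4", 14.09, 5.17, 3.27, 5.65),
    ("AMI (headset mix, only_words)", 18.91, 4.48, 9.51, 4.91),
    ("REPERE (phase 2)", 8.17, 2.23, 2.49, 3.45),
    ("VoxConverse (v0.3)", 11.24, 4.42, 2.88, 3.94),
]


def der_from_components(false_alarm, missed, confusion):
    """Combine the three error components (all in percent) into a DER."""
    return false_alarm + missed + confusion


# Largest difference between the reported DER and the sum of its components.
max_gap = max(
    abs(der - der_from_components(fa, miss, conf))
    for _, der, fa, miss, conf in ROWS
)
print(f"largest gap: {max_gap:.2f} percentage points")
```

Gaps of about 0.01 percentage points are expected, since each column is rounded to two decimals independently.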

Technical report

This report describes the main principles behind version 2.1 of the pyannote.audio speaker diarization pipeline.
It also provides recipes explaining how to adapt the pipeline to your own set of annotated data. In particular, these recipes are applied to the above benchmark and consistently lead to significant improvements over the out-of-the-box performance reported above.

Citations

@inproceedings{Bredin2021,
  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Address = {Brno, Czech Republic},
  Month = {August},
  Year = {2021},
}

@inproceedings{Bredin2020,
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Address = {Barcelona, Spain},
  Month = {May},
  Year = {2020},
}
