Open-source Speaker Diarization Model speaker-diarization-2.5 - Superior Performance for Precise Speaker Identification

Speaker Diarization 2.5

Developed by Willy030125

A speaker diarization model modified based on pyannote/speaker-diarization-3.0, using speechbrain/spkrec-ecapa-voxceleb for speaker embedding, with better performance in certain tests

Speaker Analysis Open Source License:MIT #Speaker diarization #Overlapping speech detection #Automatic speaker counting

Downloads 26

Release Time : 3/24/2025

Model Overview

Used for speaker segmentation and change detection in audio, supporting automatic voice activity detection, overlapping speech detection, and automatic speaker counting

Model Features

Automatic speaker counting

No need to manually specify the number of speakers, the model can automatically detect

Improved speaker embedding

Uses speechbrain/spkrec-ecapa-voxceleb for speaker embedding, with better performance in certain scenarios

Fully automatic processing

No manual voice activity detection or hyperparameter tuning required

GPU acceleration support

Supports GPU processing with a real-time factor of about 2.5%

Model Capabilities

Speaker diarization

Speaker change detection

Voice activity detection

Overlapping speech detection

Automatic speaker counting

Use Cases

Meeting transcription

Meeting transcription analysis

Automatically identifies speech segments from different speakers in meetings

DER 12.3% (AISHELL-4 dataset)

Speech transcription

Automatic speech recognition preprocessing

Provides speaker segmentation information for ASR systems

Media analysis

Broadcast program analysis

Analyzes speaking patterns of different hosts and guests in broadcast programs

DER 7.8% (REPERE dataset)

🚀 Speaker diarization 2.5

This is a speaker diarization pipeline modified from pyannote/speaker-diarization-3.0, offering efficient and accurate speaker segmentation and identification.

🚀 Quick Start

Using this open-source model in production?
Consider switching to pyannoteAI for better and faster options.

This pipeline is modified from pyannote/speaker-diarization-3.0. It uses pyannote/segmentation-3.0 for Speaker Segments and the Speaker Embedding: speechbrain/spkrec-ecapa-voxceleb from pyannote/speaker-diarization@2.1. For some testings, Embeddings from speechbrain/spkrec-ecapa-voxceleb seems better to detect automatic number of speakers.

✨ Features

Speaker Segmentation: Utilizes pyannote/segmentation-3.0 for accurate speaker segments.
Speaker Embedding: Employs speechbrain/spkrec-ecapa-voxceleb for better speaker identification.
Automatic Speaker Detection: Can automatically detect the number of speakers in some cases.

📦 Installation

Install pyannote.audio 3.0 with pip install pyannote.audio
Accept pyannote/segmentation-3.0 user conditions
Create access token at hf.co/settings/tokens.

💻 Usage Examples

Basic Usage

# instantiate the pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
  "Willy030125/speaker-diarization-2.5",
  use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)

Advanced Usage

Processing on GPU

pyannote.audio pipelines run on CPU by default. You can send them to GPU with the following lines:

import torch
pipeline.to(torch.device("cuda"))

Real-time factor is around 2.5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part). In other words, it takes approximately 1.5 minutes to process a one hour conversation.

Processing from memory

Pre-loading audio files in memory may result in faster processing:

waveform, sample_rate = torchaudio.load("audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

Monitoring progress

Hooks are available to monitor the progress of the pipeline:

from pyannote.audio.pipelines.utils.hook import ProgressHook
with ProgressHook() as hook:
    diarization = pipeline("audio.wav", hook=hook)

Controlling the number of speakers

In case the number of speakers is known in advance, one can use the num_speakers option:

diarization = pipeline("audio.wav", num_speakers=2)

One can also provide lower and/or upper bounds on the number of speakers using min_speakers and max_speakers options:

diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)

📚 Documentation

This pipeline has been benchmarked on a large collection of datasets. Processing is fully automatic:

no manual voice activity detection (as is sometimes the case in the literature)
no manual number of speakers (though it is possible to provide it to the pipeline)
no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named "Full" in this paper):

no forgiveness collar
evaluation of overlapped speech

Property	Details
Model Type	Speaker diarization pipeline
Training Data	Multiple datasets including AISHELL - 4, AliMeeting, AMI, etc.

Benchmark	DER%	FA%	Miss%	Conf%	Expected output	File - level evaluation
AISHELL-4	12.3	3.8	4.4	4.1	RTTM	eval
AliMeeting (channel 1)	24.3	4.4	10.0	9.9	RTTM	eval
AMI (headset mix, only_words)	19.0	3.6	9.5	5.9	RTTM	eval
AMI (array1, channel 1, only_words)	22.2	3.8	11.2	7.3	RTTM	eval
AVA-AVD	49.1	10.8	15.7	22.5	RTTM	eval
DIHARD 3 (Full)	21.7	6.2	8.1	7.3	RTTM	eval
MSDWild	24.6	5.8	8.0	10.7	RTTM	eval
REPERE (phase 2)	7.8	1.8	2.6	3.5	RTTM	eval
VoxConverse (v0.3)	11.3	4.1	3.4	3.8	RTTM	eval

📄 License

This project is licensed under the MIT license.

⚠️ Important Note

The collected information will help acquire a better knowledge of pyannote.audio userbase and help its maintainers improve it further. Though this pipeline uses MIT license and will always remain open - source, we will occasionnally email you about premium pipelines and paid services around pyannote.

📚 Citations

@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}

@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご