🚀 Speaker diarization 3.1
This open-source pipeline offers speaker diarization capabilities. It resolves the onnxruntime issue in the previous version, runs on pure PyTorch for easier deployment and potentially faster inference, and requires pyannote.audio version 3.1 or higher.
Using this open-source pipeline in production? Make the most of it thanks to our consulting services.
🚀 Quick Start
This pipeline is the same as [pyannote/speaker-diarization-3.0](https://hf.co/pyannote/speaker-diarization-3.0) except it removes the [problematic](https://github.com/pyannote/pyannote-audio/issues/1537) use of onnxruntime. Both speaker segmentation and embedding now run in pure PyTorch. This should ease deployment and possibly speed up inference. It requires pyannote.audio version 3.1 or higher.
It ingests mono audio sampled at 16kHz and outputs speaker diarization as an [Annotation](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance:
- Stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
- Audio files sampled at a different rate are resampled to 16kHz automatically upon loading.
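Speaker turns can then be iterated over directly from that instance. A minimal sketch, assuming a `diarization` result obtained by running the pipeline as in the usage examples below:

```python
# iterate over speaker turns in the diarization output
# (`diarization` is the Annotation returned by the pipeline)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # `turn` is a pyannote.core Segment with .start and .end in seconds
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```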
📦 Installation
- Install [pyannote.audio](https://github.com/pyannote/pyannote-audio) 3.1 with `pip install pyannote.audio`
- Accept [pyannote/segmentation-3.0](https://hf.co/pyannote/segmentation-3.0) user conditions
- Accept [pyannote/speaker-diarization-3.1](https://hf.co/pyannote/speaker-diarization-3.1) user conditions
- Create access token at [hf.co/settings/tokens](https://hf.co/settings/tokens).
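To confirm that the installed version meets the 3.1 requirement, a quick check (a minimal sketch, assuming the package exposes its version via the usual `__version__` attribute):

```python
import pyannote.audio

# the pipeline below requires pyannote.audio 3.1 or higher
print(pyannote.audio.__version__)
```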
💻 Usage Examples
Basic Usage
```python
from pyannote.audio import Pipeline

# instantiate the pretrained pipeline
# (requires accepting the user conditions listed above)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# apply the pipeline to an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```
Advanced Usage
Processing on GPU
pyannote.audio pipelines run on CPU by default. You can send them to GPU with the following lines:

```python
import torch

pipeline.to(torch.device("cuda"))
```
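A common pattern (not specific to pyannote.audio) is to fall back to CPU when no GPU is available:

```python
import torch

# use the GPU when available, otherwise stay on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline.to(device)
```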
Processing from memory
Pre-loading audio files in memory may result in faster processing:

```python
import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```
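If the in-memory waveform is not sampled at 16kHz, you may prefer to resample it yourself before calling the pipeline. A minimal sketch using torchaudio, where the 16000 target matches the expected input rate described above:

```python
import torchaudio.functional as F

target_sample_rate = 16000
if sample_rate != target_sample_rate:
    # resample the in-memory waveform to the 16kHz expected by the pipeline
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=target_sample_rate)
    sample_rate = target_sample_rate

diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```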
Monitoring progress
Hooks are available to monitor the progress of the pipeline:

```python
from pyannote.audio.pipelines.utils.hook import ProgressHook

with ProgressHook() as hook:
    diarization = pipeline("audio.wav", hook=hook)
```
Controlling the number of speakers
In case the number of speakers is known in advance, one can use the `num_speakers` option:

```python
diarization = pipeline("audio.wav", num_speakers=2)
```
One can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` options:

```python
diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
```
📚 Documentation
This pipeline has been benchmarked on a large collection of datasets.

Processing is fully automatic:
- No manual voice activity detection (as is sometimes the case in the literature)
- No manual number of speakers (though it is possible to provide it to the pipeline)
- No fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named "Full" in this paper; see the sketch after this list):
- No forgiveness collar
- Evaluation of overlapped speech
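A minimal sketch of this "Full" evaluation setup using pyannote.metrics, where `reference` and `hypothesis` are illustrative placeholders for the ground-truth and pipeline-output Annotation objects of the same file:

```python
from pyannote.metrics.diarization import DiarizationErrorRate

# "Full" setup: no forgiveness collar, overlapped speech is evaluated
metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)

# reference: ground-truth Annotation, hypothesis: pipeline output Annotation
der = metric(reference, hypothesis)
print(f"DER = {100 * der:.1f}%")
```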
| Property | Details |
| --- | --- |
| Model Type | Speaker diarization pipeline |
| Training Data | Not specified in the provided README |

| Benchmark | DER% | FA% | Miss% | Conf% | Expected output | File-level evaluation |
| --- | --- | --- | --- | --- | --- | --- |
| AISHELL-4 | 12.2 | 3.8 | 4.4 | 4.0 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.eval) |
| AliMeeting (channel 1) | 24.4 | 4.4 | 10.0 | 10.0 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.eval) |
| AMI (headset mix, [only_words](https://github.com/BUTSpeechFIT/AMI-diarization-setup)) | 18.8 | 3.6 | 9.5 | 5.7 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.eval) |
| AMI (array1, channel 1, [only_words](https://github.com/BUTSpeechFIT/AMI-diarization-setup)) | 22.4 | 3.8 | 11.2 | 7.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.eval) |
| AVA-AVD | 50.0 | 10.8 | 15.7 | 23.4 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.eval) |
| DIHARD 3 (Full) | 21.7 | 6.2 | 8.1 | 7.3 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.eval) |
| [MSDWild](https://x-lance.github.io/MSDWILD/) | 25.3 | 5.8 | 8.0 | 11.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.eval) |
| [REPERE (phase 2)](https://islrn.org/resources/360-758-359-485-0/) | 7.8 | 1.8 | 2.6 | 3.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.eval) |
| VoxConverse (v0.3) | 11.3 | 4.1 | 3.4 | 3.8 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.eval) |
📄 License
The pipeline uses the MIT license.
⚠️ Important Note
The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this pipeline uses the MIT license and will always remain open-source, we will occasionally email you about premium pipelines and paid services around pyannote.
Citations
```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```

```bibtex
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```