Speaker-diarization-3.1 Open-source Audio Model - Free Deployment for Automatic Detection and Segmentation of Audio Speakers

Speaker Diarization 3.1

Developed by pyannote

An audio processing model for speaker segmentation that can automatically detect and segment different speakers in audio.

Speaker Analysis Open Source License: MIT #Multi-speaker segmentation #Automatic speech recognition #Real-time audio processing

Downloads 11.7M

Release date: 11/16/2023

Model Overview

This model accepts mono audio sampled at 16kHz and outputs speaker diarization results. Stereo or multi-channel input is downmixed automatically, and audio at other sampling rates is resampled automatically. Processing requires neither manual voice activity detection nor a known number of speakers.

Model Features

Pure PyTorch implementation

Removes the problematic use of onnxruntime, simplifies deployment, and may accelerate inference.

Automatic processing

Automatically processes stereo/multi-channel audio and different sampling rates without manual preprocessing.

Speaker number control

Allows specifying the number of speakers or providing upper and lower limits to improve segmentation accuracy.

Progress monitoring

Supports monitoring the processing progress through hooks.

Model Capabilities

Speaker segmentation

Speaker change detection

Voice activity detection

Overlapping speech detection

Automatic speech recognition assistance

Use Cases

Meeting minutes

Meeting minutes segmentation

Automatically identify the time periods of different speakers in the meeting recording

Achieved a diarization error rate (DER) of 12.2% on the AISHELL-4 benchmark

Media analysis

Radio program analysis

Analyze the speech time distribution of different hosts and guests in the radio program

Achieved a diarization error rate (DER) of 7.8% on the REPERE (phase 2) benchmark
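
As a rough illustration of the speech time analysis described for this use case, the sketch below totals the speaking time per speaker from a list of turns. The function name `speaking_time_share`, the (start, end, speaker) tuple layout, and the sample values are assumptions made for this example; they are not part of pyannote.audio.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical turn layout for this sketch: (start_seconds, end_seconds, speaker_label).
Turn = Tuple[float, float, str]


def speaking_time_share(turns: Iterable[Turn]) -> Dict[str, float]:
    """Return each speaker's share of the total speaking time (0.0 to 1.0)."""
    totals: Dict[str, float] = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {speaker: seconds / grand_total for speaker, seconds in totals.items()}


# Made-up turns for a short radio segment.
turns = [(0.0, 6.0, "host"), (6.0, 8.0, "guest"), (8.0, 10.0, "host")]
print(speaking_time_share(turns))  # {'host': 0.8, 'guest': 0.2}
```

Overlapped speech is counted once for every speaker involved, so the shares describe speaking time rather than wall-clock time.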

Speech transcription

Multi-speaker transcription assistance

Provide speaker segmentation information for the automatic speech recognition system
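
One common way to combine diarization with a transcript is to give each recognized word the speaker whose turn overlaps it the most. The sketch below shows that idea under stated assumptions: the word and turn tuples, the function name `assign_speakers`, and the sample data are hypothetical, and neither pyannote.audio nor any particular speech recognition system prescribes this method.

```python
from typing import List, Optional, Tuple

Word = Tuple[float, float, str]   # (start_seconds, end_seconds, text), assumed ASR output
Turn = Tuple[float, float, str]   # (start_seconds, end_seconds, speaker), assumed diarization output


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose turn overlaps it the longest."""
    labelled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            # Length of the intersection; negative or zero means no overlap.
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append((text, best_speaker))
    return labelled


turns = [(0.0, 2.0, "A"), (2.0, 4.0, "B")]
words = [(0.2, 0.6, "hello"), (1.8, 2.5, "there"), (5.0, 5.4, "bye")]
print(assign_speakers(words, turns))
# [('hello', 'A'), ('there', 'B'), ('bye', None)]
```

Words that fall outside every turn are labelled None, which leaves the caller free to decide how to handle them.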

🚀 Speaker diarization 3.1

This open-source pipeline performs speaker diarization: it identifies which speaker is talking at each point in an audio file. Compared with its predecessor, pyannote/speaker-diarization-3.0, it removes the problematic use of onnxruntime and runs both speaker segmentation and embedding in pure PyTorch, which should ease deployment and may speed up inference.

🚀 Quick Start

Using this open-source model in production? Consider switching to pyannoteAI for better and faster options.

This pipeline is the same as pyannote/speaker-diarization-3.0 except it removes the problematic use of onnxruntime. Both speaker segmentation and embedding now run in pure PyTorch. This should ease deployment and possibly speed up inference. It requires pyannote.audio version 3.1 or higher.

It ingests mono audio sampled at 16kHz and outputs speaker diarization as an Annotation instance:

Stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
Audio files sampled at a different rate are resampled to 16kHz automatically upon loading.
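
To make "averaging the channels" concrete, here is a minimal pure-Python sketch of the downmix step. The pipeline does this internally on tensors when it loads audio; the function name `downmix_to_mono` and the list-of-lists representation are assumptions made for illustration, and resampling is not shown.

```python
from typing import List


def downmix_to_mono(channels: List[List[float]]) -> List[float]:
    """Average equally long channels sample by sample to get one mono channel."""
    if not channels:
        raise ValueError("expected at least one channel")
    length = len(channels[0])
    if any(len(channel) != length for channel in channels):
        raise ValueError("all channels must have the same number of samples")
    count = len(channels)
    return [sum(channel[i] for channel in channels) / count for i in range(length)]


# Made-up stereo signal: two channels, three samples each.
stereo = [[0.5, 1.0, -1.0], [0.5, 0.0, 1.0]]
print(downmix_to_mono(stereo))  # [0.5, 0.5, 0.0]
```

The last sample shows why averaging can cancel content that is out of phase between channels.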

✨ Features

Pure PyTorch: Runs speaker segmentation and embedding in pure PyTorch, removing the use of onnxruntime.
Automatic Pre-processing: Automatically downmixes multi-channel audio to mono and resamples audio to 16kHz.
Benchmarked Performance: Has been benchmarked on a large collection of datasets with a strict DER setup.

📦 Installation

Install pyannote.audio 3.1 with pip install pyannote.audio
Accept pyannote/segmentation-3.0 user conditions
Accept pyannote/speaker-diarization-3.1 user conditions
Create access token at hf.co/settings/tokens.

💻 Usage Examples

Basic Usage

# instantiate the pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
  "pyannote/speaker-diarization-3.1",
  use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
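
An RTTM file holds one speaker turn per line, with the start time and duration in the fourth and fifth fields and the speaker label in the eighth. The sketch below reads such text back into (start, end, speaker) turns using only the standard library. The helper name `parse_rttm` and the sample lines are assumptions for illustration, not output produced by running the pipeline.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    start: float    # seconds
    end: float      # seconds
    speaker: str


def parse_rttm(text: str) -> List[Turn]:
    """Parse SPEAKER lines of RTTM text into turns sorted by start time."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and any record type other than SPEAKER.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        turns.append(Turn(start, start + duration, fields[7]))
    return sorted(turns)


# Made-up RTTM content in the usual ten-field layout.
sample = """\
SPEAKER audio 1 2.250 1.500 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER audio 1 0.500 2.000 <NA> <NA> SPEAKER_00 <NA> <NA>
"""
for turn in parse_rttm(sample):
    print(f"{turn.start:.2f}-{turn.end:.2f} {turn.speaker}")
# 0.50-2.50 SPEAKER_00
# 2.25-3.75 SPEAKER_01
```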

Advanced Usage

Processing on GPU

pyannote.audio pipelines run on CPU by default. You can send them to GPU with the following lines:

import torch
pipeline.to(torch.device("cuda"))

Processing from memory

Pre-loading audio files in memory may result in faster processing:

import torchaudio

# waveform is a (channel, time) tensor; sample_rate is an int
waveform, sample_rate = torchaudio.load("audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

Monitoring progress

Hooks are available to monitor the progress of the pipeline:

from pyannote.audio.pipelines.utils.hook import ProgressHook
with ProgressHook() as hook:
    diarization = pipeline("audio.wav", hook=hook)

Controlling the number of speakers

In case the number of speakers is known in advance, one can use the num_speakers option:

diarization = pipeline("audio.wav", num_speakers=2)

One can also provide lower and/or upper bounds on the number of speakers using min_speakers and max_speakers options:

diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)

📚 Documentation

This pipeline has been benchmarked on a large collection of datasets. Processing is fully automatic:

No manual voice activity detection (as is sometimes the case in the literature)
No manual number of speakers (though it is possible to provide it to the pipeline)
No fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named "Full" in this paper):

No forgiveness collar
Evaluation of overlapped speech

Benchmark	DER%	FA%	Miss%	Conf%	Expected output	File-level evaluation
AISHELL-4	12.2	3.8	4.4	4.0	RTTM	eval
AliMeeting (channel 1)	24.4	4.4	10.0	10.0	RTTM	eval
AMI (headset mix, only_words)	18.8	3.6	9.5	5.7	RTTM	eval
AMI (array1, channel 1, only_words)	22.4	3.8	11.2	7.5	RTTM	eval
AVA-AVD	50.0	10.8	15.7	23.4	RTTM	eval
DIHARD 3 (Full)	21.7	6.2	8.1	7.3	RTTM	eval
MSDWild	25.3	5.8	8.0	11.5	RTTM	eval
REPERE (phase 2)	7.8	1.8	2.6	3.5	RTTM	eval
VoxConverse (v0.3)	11.3	4.1	3.4	3.8	RTTM	eval
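
In the table, FA%, Miss% and Conf% are the false alarm, missed detection and speaker confusion components of the error, and they add up to DER% up to rounding (for AISHELL-4, 3.8 + 4.4 + 4.0 = 12.2). The sketch below applies the standard DER definition to durations in seconds. The function name and the sample durations are assumptions for illustration; this is not the evaluation code used to produce the benchmark.

```python
def diarization_error_rate(
    false_alarm: float, missed: float, confusion: float, total_reference_speech: float
) -> float:
    """DER = (false alarm + missed detection + confusion) / total reference speech.

    All arguments are durations in seconds. With no forgiveness collar and
    overlapped speech evaluated, every second of reference speech counts.
    """
    if total_reference_speech <= 0:
        raise ValueError("total_reference_speech must be positive")
    return (false_alarm + missed + confusion) / total_reference_speech


# Made-up durations: 10 seconds of errors over 100 seconds of reference speech.
der = diarization_error_rate(false_alarm=2.0, missed=3.0, confusion=5.0,
                             total_reference_speech=100.0)
print(f"DER = {der:.1%}")  # DER = 10.0%
```

Because false alarms are counted in the numerator but not in the denominator, DER can exceed 100% on difficult audio.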

📄 License

This project is licensed under the MIT license.

📚 Citations

@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}

@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
