Speaker Diarization 3.1

Developed by tensorlake

An audio processing model for speaker diarization and embedding, supporting automatic voice activity detection and overlapping speech detection.

Category: Audio Processing. License: MIT (open source). Tags: #Speaker Diarization #Pure PyTorch Inference #Automatic Voice Activity Detection

Downloads 393

Release date: 7/25/2024

Model Overview

This model takes mono audio sampled at 16 kHz as input and outputs speaker diarization results. Stereo or multi-channel audio is downmixed automatically, and audio at other sample rates is resampled automatically. Neither manual voice activity detection nor a speaker count is required.

Model Features

Pure PyTorch Implementation

Removes problematic onnxruntime usage, simplifying deployment and potentially accelerating inference.

Automatic Processing

Automatically handles stereo/multi-channel audio and varying sample rates without preprocessing.

Speaker Count Control

Supports specifying speaker count or setting upper/lower bounds.

Progress Monitoring

Allows monitoring pipeline processing progress via hooks.

Model Capabilities

Speaker Diarization

Voice Activity Detection

Overlapping Speech Detection

Speaker Change Detection

Automatic Speech Recognition Assistance

Use Cases

Meeting Transcription

Meeting Transcription Analysis

Automatically identifies speech segments from different speakers in meetings

Generates timestamped speaker diarization results

Media Production

Podcast/Interview Analysis

Automatically segments different speakers in podcasts or interviews

Generates RTTM format segmentation files

Speech Analysis

Voice Activity Detection

Detects speech activity regions in audio

Accurately identifies speech and non-speech segments
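
As a rough illustration of how speech activity regions relate to diarization output, the sketch below merges speaker turns into non-overlapping speech regions. It is not how the pipeline computes voice activity internally. The function name speech_regions and the sample turns are hypothetical, and turns are assumed to be plain (start, end) pairs in seconds.

```python
def speech_regions(turns):
    """Merge (start, end) speaker turns into non-overlapping speech regions."""
    regions = []
    for start, end in sorted(turns):
        if regions and start <= regions[-1][1]:
            # This turn overlaps or touches the previous region: extend it.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            # A gap of non-speech precedes this turn: open a new region.
            regions.append([start, end])
    return [tuple(region) for region in regions]


# Made-up turns: the first two overlap between 2.0 s and 2.5 s.
turns = [(0.5, 2.5), (2.0, 3.0), (4.0, 5.0)]
print(speech_regions(turns))
```

Everything outside the returned regions is treated as non-speech.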

🚀 Speaker diarization 3.1

This pipeline performs speaker diarization, removing the problematic use of onnxruntime and running in pure PyTorch for easier deployment and potentially faster inference.

Using this open-source model in production?
Consider switching to pyannoteAI for better and faster options.

🚀 Quick Start

This pipeline is the same as pyannote/speaker-diarization-3.0 except it removes the problematic use of onnxruntime.
Both speaker segmentation and embedding now run in pure PyTorch. This should ease deployment and possibly speed up inference.
It requires pyannote.audio version 3.1 or higher.

It ingests mono audio sampled at 16 kHz and outputs speaker diarization as a pyannote.core Annotation instance:

Stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
Audio files sampled at a different rate are automatically resampled to 16 kHz upon loading.
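
To make "downmixed to mono by averaging the channels" concrete, here is a minimal pure-Python sketch of that operation. It only illustrates the arithmetic and is not the library's own code: the helper name downmix_to_mono and the sample values are hypothetical, and resampling is deliberately left out.

```python
def downmix_to_mono(channels):
    """Average equally long channels sample by sample into one mono channel."""
    if not channels:
        raise ValueError("expected at least one channel")
    length = len(channels[0])
    if any(len(channel) != length for channel in channels):
        raise ValueError("all channels must have the same number of samples")
    # zip(*channels) yields one tuple per time step holding that step's
    # sample from every channel; the mono sample is their mean.
    return [sum(frame) / len(channels) for frame in zip(*channels)]


# Made-up stereo signal: left and right channels with three samples each.
stereo = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
print(downmix_to_mono(stereo))
```

A single-channel input passes through unchanged, since averaging over one channel is the identity.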

📦 Installation

Install pyannote.audio 3.1 or later with pip install pyannote.audio.
Accept the pyannote/segmentation-3.0 user conditions.
Accept the pyannote/speaker-diarization-3.1 user conditions.
Create an access token at hf.co/settings/tokens.

💻 Usage Examples

Basic Usage

# instantiate the pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
  "pyannote/speaker-diarization-3.1",
  use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
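
Once the RTTM file is on disk, it can be read back without any pyannote dependency. The sketch below assumes the standard ten-field SPEAKER line layout (file id, channel, onset, duration, then the speaker label in the eighth field). The names Turn and parse_rttm and the sample lines are hypothetical and for illustration only.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    uri: str       # file identifier from the RTTM line
    start: float   # onset in seconds
    end: float     # onset + duration in seconds
    speaker: str   # speaker label


def parse_rttm(text):
    """Parse SPEAKER lines of an RTTM document into a list of Turn tuples."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and any record type other than SPEAKER.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        turns.append(Turn(fields[1], start, start + duration, fields[7]))
    return turns


# Made-up RTTM content in the assumed layout.
example = """\
SPEAKER audio 1 0.500 2.000 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 2.250 1.500 <NA> <NA> SPEAKER_01 <NA> <NA>
"""
for turn in parse_rttm(example):
    print(f"{turn.start:.2f}s to {turn.end:.2f}s: {turn.speaker}")
```

To parse the file written above, pass the contents of audio.rttm to parse_rttm instead of the example string.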

Advanced Usage

Processing on GPU

pyannote.audio pipelines run on CPU by default. You can send them to GPU with the following lines:

import torch
pipeline.to(torch.device("cuda"))

Processing from memory

Pre-loading audio files in memory may result in faster processing:

import torchaudio
# torchaudio.load returns a (channel, time) tensor and the file's sample rate
waveform, sample_rate = torchaudio.load("audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

Monitoring progress

Hooks are available to monitor the progress of the pipeline:

from pyannote.audio.pipelines.utils.hook import ProgressHook
with ProgressHook() as hook:
    diarization = pipeline("audio.wav", hook=hook)

Controlling the number of speakers

In case the number of speakers is known in advance, one can use the num_speakers option:

diarization = pipeline("audio.wav", num_speakers=2)

One can also provide lower and/or upper bounds on the number of speakers using min_speakers and max_speakers options:

diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
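
When these options come from user input or a config file, it can help to validate them before calling the pipeline. The helper below is a hypothetical convenience wrapper, not part of pyannote.audio. Rejecting a mix of num_speakers with bounds is a design choice of this sketch, not documented pipeline behavior.

```python
def speaker_count_options(num_speakers=None, min_speakers=None, max_speakers=None):
    """Build keyword arguments for the pipeline call after basic sanity checks."""
    for name, value in (("num_speakers", num_speakers),
                        ("min_speakers", min_speakers),
                        ("max_speakers", max_speakers)):
        if value is not None and value < 1:
            raise ValueError(f"{name} must be at least 1")
    if num_speakers is not None:
        # Assumption of this sketch: an exact count and bounds are exclusive.
        if min_speakers is not None or max_speakers is not None:
            raise ValueError("give either num_speakers or bounds, not both")
        return {"num_speakers": num_speakers}
    if (min_speakers is not None and max_speakers is not None
            and min_speakers > max_speakers):
        raise ValueError("min_speakers cannot exceed max_speakers")
    options = {}
    if min_speakers is not None:
        options["min_speakers"] = min_speakers
    if max_speakers is not None:
        options["max_speakers"] = max_speakers
    return options


options = speaker_count_options(min_speakers=2, max_speakers=5)
print(options)
# Intended use: diarization = pipeline("audio.wav", **options)
```

With no arguments the helper returns an empty dictionary, which leaves the speaker count fully automatic.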

📚 Documentation

This pipeline has been benchmarked on a large collection of datasets.

Processing is fully automatic:

no manual voice activity detection (as is sometimes the case in the literature)
no manual number of speakers (though it is possible to provide it to the pipeline)
no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named "Full" in this paper):

no forgiveness collar
evaluation of overlapped speech

Benchmark	DER%	FA%	Miss%	Conf%	Expected output	File-level evaluation
AISHELL-4	12.2	3.8	4.4	4.0	RTTM	eval
AliMeeting (channel 1)	24.4	4.4	10.0	10.0	RTTM	eval
AMI (headset mix, only_words)	18.8	3.6	9.5	5.7	RTTM	eval
AMI (array1, channel 1, only_words)	22.4	3.8	11.2	7.5	RTTM	eval
AVA-AVD	50.0	10.8	15.7	23.4	RTTM	eval
DIHARD 3 (Full)	21.7	6.2	8.1	7.3	RTTM	eval
MSDWild	25.3	5.8	8.0	11.5	RTTM	eval
REPERE (phase 2)	7.8	1.8	2.6	3.5	RTTM	eval
VoxConverse (v0.3)	11.3	4.1	3.4	3.8	RTTM	eval
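
The DER column is the sum of the three error components: false alarm (FA), missed detection (Miss), and speaker confusion (Conf), each expressed relative to the total duration of reference speech. The sketch below uses that standard definition with a hypothetical function name and checks it against the AISHELL-4 figures. Some rows differ from the plain sum by 0.1, presumably because each column is rounded independently.

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total_speech):
    """Return DER as a fraction of the total reference speech duration."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed_detection + confusion) / total_speech


# AISHELL-4 figures: the components are already percentages of reference
# speech, so a nominal total of 100 reproduces the reported DER.
der = diarization_error_rate(3.8, 4.4, 4.0, 100.0)
print(f"DER = {100 * der:.1f}%")
```

Because false alarms count as errors but are not part of the reference speech, DER can exceed 100% on very noisy output.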

📄 License

This project is licensed under the MIT license.

📚 Citations

@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}

@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
