Speaker segmentation
This open-source model performs speaker segmentation, providing building blocks for voice activity detection, overlapped speech detection, and resegmentation.
Using this open-source model in production?
Consider switching to pyannoteAI for better and faster options.
Paper | Demo | Blog post

Quick Start
This model relies on pyannote.audio 2.1.1: see the installation instructions.
Usage Examples
Basic Usage
```python
from pyannote.audio import Model

model = Model.from_pretrained("pyannote/segmentation",
                              use_auth_token="ACCESS_TOKEN_GOES_HERE")
```
Advanced Usage
Voice activity detection
```python
from pyannote.audio.pipelines import VoiceActivityDetection

pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
    # onset/offset activation thresholds
    "onset": 0.5, "offset": 0.5,
    # remove speech regions shorter than that many seconds
    "min_duration_on": 0.0,
    # fill non-speech gaps shorter than that many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
```
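The four hyperparameters above control how frame-level scores are turned into speech regions. As a simplified sketch of that idea (this is not pyannote.audio's actual `Binarize` implementation; `frame_duration` and all values are illustrative assumptions):

```python
def binarize(scores, frame_duration, onset=0.5, offset=0.5,
             min_duration_on=0.0, min_duration_off=0.0):
    """Turn per-frame scores into (start, end) regions, in seconds."""
    regions, start, active = [], None, False
    for i, score in enumerate(scores):
        if not active and score >= onset:
            # region starts when the score rises above the onset threshold
            start, active = i * frame_duration, True
        elif active and score < offset:
            # region ends when the score falls below the offset threshold
            regions.append((start, i * frame_duration))
            active = False
    if active:
        regions.append((start, len(scores) * frame_duration))
    # fill gaps shorter than min_duration_off ...
    merged = []
    for seg in regions:
        if merged and seg[0] - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    # ... then drop regions shorter than min_duration_on
    return [(s, e) for s, e in merged if e - s >= min_duration_on]
```

For example, with 100 ms frames, raising `min_duration_off` merges two nearby speech regions into one.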
Overlapped speech detection
```python
from pyannote.audio.pipelines import OverlappedSpeechDetection

pipeline = OverlappedSpeechDetection(segmentation=model)
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
```
Resegmentation
```python
from pyannote.audio.pipelines import Resegmentation

pipeline = Resegmentation(segmentation=model,
                          diarization="baseline")
pipeline.instantiate(HYPER_PARAMETERS)
# `baseline` should be a pyannote.core.Annotation holding the diarization to refine
resegmented_baseline = pipeline({"audio": "audio.wav", "baseline": baseline})
```
Raw scores
```python
from pyannote.audio import Inference

inference = Inference(model)
segmentation = inference("audio.wav")
```
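The raw output boils down to a matrix of per-frame, per-speaker activations. As a rough sketch of how such scores relate to the tasks above (the shapes, values, and 0.5 threshold here are illustrative assumptions, not the pipelines' exact post-processing):

```python
import numpy as np

# made-up (num_frames, num_speakers) activation matrix
activations = np.array([
    [0.9, 0.1, 0.0],   # one active speaker
    [0.8, 0.7, 0.0],   # two active speakers (overlap)
    [0.1, 0.2, 0.0],   # non-speech
])

# speech is present if any speaker is active
speech_score = activations.max(axis=-1)
# overlap requires a second speaker to be active too
overlap_score = np.sort(activations, axis=-1)[:, -2]

is_speech = speech_score >= 0.5
is_overlap = overlap_score >= 0.5
```

This is why the same segmentation model can back voice activity detection, overlapped speech detection, and resegmentation: each pipeline derives a different score from the same activations.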
Documentation
Citation
```bibtex
@inproceedings{Bredin2021,
  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Address = {Brno, Czech Republic},
  Month = {August},
  Year = {2021},
}

@inproceedings{Bredin2020,
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Address = {Barcelona, Spain},
  Month = {May},
  Year = {2020},
}
```
Reproducible research
To reproduce the results of the paper "End-to-end speaker segmentation for overlap-aware resegmentation", use pyannote/segmentation@Interspeech2021 with the following hyperparameters:
| Property | AMI Mix-Headset | DIHARD3 | VoxConverse |
|---|---|---|---|
| Voice activity detection - onset | 0.684 | 0.767 | 0.767 |
| Voice activity detection - offset | 0.577 | 0.377 | 0.713 |
| Voice activity detection - min_duration_on | 0.181 | 0.136 | 0.182 |
| Voice activity detection - min_duration_off | 0.037 | 0.067 | 0.501 |
| Overlapped speech detection - onset | 0.448 | 0.430 | 0.587 |
| Overlapped speech detection - offset | 0.362 | 0.320 | 0.426 |
| Overlapped speech detection - min_duration_on | 0.116 | 0.091 | 0.337 |
| Overlapped speech detection - min_duration_off | 0.187 | 0.144 | 0.112 |
| Resegmentation of VBx - onset | 0.542 | 0.592 | 0.537 |
| Resegmentation of VBx - offset | 0.527 | 0.489 | 0.724 |
| Resegmentation of VBx - min_duration_on | 0.044 | 0.163 | 0.410 |
| Resegmentation of VBx - min_duration_off | 0.705 | 0.182 | 0.563 |
Expected outputs (and the VBx baseline) are also provided in the /reproducible_research subdirectories.
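For instance, the DIHARD3-tuned voice activity detection values from the table can be passed to `pipeline.instantiate` exactly as in the usage examples above:

```python
# Voice activity detection hyperparameters tuned on DIHARD3 (values from the table)
VAD_DIHARD3 = {
    "onset": 0.767,
    "offset": 0.377,
    "min_duration_on": 0.136,
    "min_duration_off": 0.067,
}
# pipeline.instantiate(VAD_DIHARD3)  # with a VoiceActivityDetection pipeline as above
```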
License
This project is licensed under the MIT license.
Important Note
The collected information helps the maintainers better understand the pyannote.audio user base and apply for grants to improve it further. If you are an academic researcher, please cite the relevant papers in your own publications that use this model. If you work for a company, please consider contributing back to pyannote.audio development (e.g. through unrestricted gifts). We also provide scientific consulting services around speaker diarization and machine listening.
Usage Tip
Before using the model, visit hf.co/pyannote/segmentation and accept the user conditions, then visit hf.co/settings/tokens to create an access token.