Pyannote Segmentation

Developed by philschmid

This is an end-to-end speaker diarization model that supports voice activity detection, overlap speech detection, and resegmentation tasks.

Speaker Analysis

PyTorch

Open Source License:MIT #Overlap speech detection #Speaker diarization #End-to-end resegmentation

Downloads 427

Release Time : 11/8/2022

Model Overview

This model is primarily used for speaker diarization tasks in audio processing, capable of detecting voice activity, identifying overlapping speech regions, and optimizing baseline segmentation results through resegmentation.

Model Features

End-to-end speaker diarization

Adopts an end-to-end architecture to directly handle speaker diarization tasks, simplifying the processing workflow

Overlap speech detection

Accurately identifies overlapping regions where multiple speakers talk simultaneously in audio

Resegmentation optimization

Can optimize baseline segmentation results to improve segmentation accuracy

Multi-dataset validation

Validated on multiple standard datasets including AMI, DIHARD3, and VoxConverse

Model Capabilities

Voice activity detection

Overlap speech recognition

Speaker diarization optimization

Audio feature extraction

Use Cases

Meeting transcription

Meeting audio segmentation

Automatically segments different speaker segments in meeting recordings

Validated effective on the AMI dataset

Speech analysis

Overlap speech detection

Identifies instances of multiple people speaking simultaneously in conversations

Validated effective on the DIHARD3 dataset

Speech processing optimization

Segmentation result optimization

Optimizes and improves existing speech segmentation results

Validated effective on the VoxConverse dataset

tags:

pyannote
pyannote-audio
pyannote-audio-model
audio
voice
speech
speaker
speaker-segmentation
voice-activity-detection
overlapped-speech-detection
resegmentation datasets:
ami
dihard
voxconverse license: mit inference: false

🎹 Speaker segmentation

Example

Model from End-to-end speaker segmentation for overlap-aware resegmentation,
by Hervé Bredin and Antoine Laurent.

Online demo is available as a Hugging Face Space.

Support

For commercial enquiries and scientific consulting, please contact me.
For technical questions and bug reports, please check pyannote.audio Github repository.

Usage

Relies on pyannote.audio 2.0 currently in development: see installation instructions.

Voice activity detection

from pyannote.audio.pipelines import VoiceActivityDetection
pipeline = VoiceActivityDetection(segmentation="pyannote/segmentation")
HYPER_PARAMETERS = {
  # onset/offset activation thresholds
  "onset": 0.5, "offset": 0.5,
  # remove speech regions shorter than that many seconds.
  "min_duration_on": 0.0,
  # fill non-speech regions shorter than that many seconds.
  "min_duration_off": 0.0
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
# `vad` is a pyannote.core.Annotation instance containing speech regions

Overlapped speech detection

from pyannote.audio.pipelines import OverlappedSpeechDetection
pipeline = OverlappedSpeechDetection(segmentation="pyannote/segmentation")
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
# `osd` is a pyannote.core.Annotation instance containing overlapped speech regions

Resegmentation

from pyannote.audio.pipelines import Resegmentation
pipeline = Resegmentation(segmentation="pyannote/segmentation", 
                          diarization="baseline")
pipeline.instantiate(HYPER_PARAMETERS)
resegmented_baseline = pipeline({"audio": "audio.wav", "baseline": baseline})
# where `baseline` should be provided as a pyannote.core.Annotation instance

Raw scores

from pyannote.audio import Inference
inference = Inference("pyannote/segmentation")
segmentation = inference("audio.wav")
# `segmentation` is a pyannote.core.SlidingWindowFeature
# instance containing raw segmentation scores like the 
# one pictured above (output)

Reproducible research

In order to reproduce the results of the paper "End-to-end speaker segmentation for overlap-aware resegmentation ", use pyannote/segmentation@Interspeech2021 with the following hyper-parameters:

Voice activity detection	`onset`	`offset`	`min_duration_on`	`min_duration_off`
AMI Mix-Headset	0.684	0.577	0.181	0.037
DIHARD3	0.767	0.377	0.136	0.067
VoxConverse	0.767	0.713	0.182	0.501

Overlapped speech detection	`onset`	`offset`	`min_duration_on`	`min_duration_off`
AMI Mix-Headset	0.448	0.362	0.116	0.187
DIHARD3	0.430	0.320	0.091	0.144
VoxConverse	0.587	0.426	0.337	0.112

Resegmentation of VBx	`onset`	`offset`	`min_duration_on`	`min_duration_off`
AMI Mix-Headset	0.542	0.527	0.044	0.705
DIHARD3	0.592	0.489	0.163	0.182
VoxConverse	0.537	0.724	0.410	0.563

Expected outputs (and VBx baseline) are also provided in the /reproducible_research sub-directories.

Citation

@inproceedings{Bredin2021,
  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Address = {Brno, Czech Republic},
  Month = {August},
  Year = {2021},

@inproceedings{Bredin2020,
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Address = {Barcelona, Spain},
  Month = {May},
  Year = {2020},
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご