phil-pyannote-speaker-diarization-endpoint Open-source Model - Free to Segment Different Speakers in Audio

Phil Pyannote Speaker Diarization Endpoint

Developed by tawkit

A speaker diarization model based on pyannote.audio 2.0, designed for automatic detection and segmentation of different speakers in audio.

Speaker Analysis | Open Source | License: MIT | #Speaker Diarization #Overlapping Speech Detection #Real-time Speech Processing

Downloads 215

Release Time: 11/13/2022

Model Overview

This model can automatically detect speaker changes in audio, identify different speakers, and support overlapping speech detection. Suitable for scenarios such as meeting records and call recording analysis.

Model Features

Fully Automated Processing

No manual voice activity detection or speaker count specification required; the model automatically completes all processing steps.

Supports Speaker Count Constraints

Allows specifying lower and upper bounds for the number of speakers via parameters to improve segmentation accuracy.

High-Performance Real-Time Processing

Uses GPU acceleration with a real-time factor of approximately 5%, processing one hour of audio in about 3 minutes.

Multi-Dataset Validation

Benchmarked on multiple public datasets, including AMI, DIHARD, and VoxConverse.

Model Capabilities

Speaker Diarization

Voice Activity Detection

Overlapping Speech Detection

Automatic Speech Recognition Assistance

Use Cases

Meeting Records

Meeting Speaker Segmentation

Automatically identifies segments of different speakers in meeting recordings

Diarization error rate (DER) ranges from 12.62% to 30.24% across the benchmark datasets (lower is better)

Customer Service Call Analysis

Customer Service Dialogue Analysis

Automatically segments dialogue fragments between customer service agents and customers

DER of 30.24% on the CALLHOME Part2 dataset

Media Content Processing

Interview Program Subtitle Generation

Automatically identifies speaking times of different guests in interview programs

DER of 12.76% on the VoxConverse v0.0.2 dataset

🎹 Speaker diarization

This project relies on pyannote.audio 2.0 for speaker diarization, offering real-time and accurate solutions for audio processing.

🚀 Quick Start

Relies on pyannote.audio 2.0: see installation instructions.

💻 Usage Examples

Basic Usage

# load the pipeline from Hugging Face Hub
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2022.07")

# apply the pipeline to an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
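
Once the RTTM file is on disk, it can be read back with plain Python. The following is a minimal sketch, assuming the standard ten-field RTTM `SPEAKER` line layout (onset in field 4, duration in field 5, speaker label in field 8). The sample text, the speaker labels, and the helper names `parse_rttm` and `speaking_time` are made up for illustration and are not part of pyannote.audio.

```python
from collections import defaultdict

# Made-up RTTM content for illustration only; a real file would come
# from diarization.write_rttm() as in the basic usage example.
SAMPLE_RTTM = """\
SPEAKER audio 1 0.500 2.000 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 2.250 1.500 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER audio 1 4.000 1.000 <NA> <NA> SPEAKER_00 <NA> <NA>
"""


def parse_rttm(text):
    """Return a list of (start, end, speaker) tuples from RTTM text."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and any record type other than SPEAKER.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset = float(fields[3])
        duration = float(fields[4])
        turns.append((onset, onset + duration, fields[7]))
    return turns


def speaking_time(turns):
    """Sum the duration of all turns per speaker label."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


turns = parse_rttm(SAMPLE_RTTM)
totals = speaking_time(turns)
for speaker, seconds in sorted(totals.items()):
    print(f"{speaker}: {seconds:.2f} s")
```

Per-speaker totals like these are a convenient starting point for meeting or call summaries.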

Advanced Usage

In case the number of speakers is known in advance, one can use the num_speakers option:

diarization = pipeline("audio.wav", num_speakers=2)

One can also provide lower and/or upper bounds on the number of speakers using min_speakers and max_speakers options:

diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
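
When these options come from user input or a config file, it can help to validate them before calling the pipeline. The sketch below is one possible approach: `speaker_count_options` is a hypothetical helper, not a pyannote.audio function, and the rule that an exact count and bounds are mutually exclusive is a design choice of this sketch rather than documented pipeline behavior.

```python
def speaker_count_options(num_speakers=None, min_speakers=None, max_speakers=None):
    """Hypothetical helper: validate speaker-count hints and return them as kwargs."""
    hints = {
        "num_speakers": num_speakers,
        "min_speakers": min_speakers,
        "max_speakers": max_speakers,
    }
    for name, value in hints.items():
        if value is not None and (not isinstance(value, int) or value < 1):
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    # Design choice of this sketch: an exact count and bounds are not combined.
    if num_speakers is not None and (min_speakers is not None or max_speakers is not None):
        raise ValueError("pass either num_speakers or min/max bounds, not both")
    if min_speakers is not None and max_speakers is not None and min_speakers > max_speakers:
        raise ValueError("min_speakers must not exceed max_speakers")
    # Drop unset hints so the pipeline falls back to automatic estimation.
    return {name: value for name, value in hints.items() if value is not None}


options = speaker_count_options(min_speakers=2, max_speakers=5)
print(options)
# With a loaded pipeline, the result could be passed as:
# diarization = pipeline("audio.wav", **options)
```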

If you feel adventurous, you can try and play with the various pipeline hyper-parameters.
For instance, one can use a more aggressive voice activity detection by increasing the value of the segmentation_onset threshold:

hparams = pipeline.parameters(instantiated=True)
hparams["segmentation_onset"] += 0.1
pipeline.instantiate(hparams)
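
To build intuition for what an onset threshold does, here is a simplified, stand-alone sketch of hysteresis thresholding over frame-level speech scores. It is not pyannote.audio's implementation: the function name `binarize`, the `offset` parameter, and the scores are assumptions made for illustration. A region becomes active when a score rises above `onset` and stays active until a score falls below `offset`.

```python
def binarize(scores, onset=0.5, offset=0.5):
    """Illustrative hysteresis thresholding: return one active/inactive flag per frame."""
    active = False
    flags = []
    for score in scores:
        if active:
            # Stay active until the score drops below the offset threshold.
            if score < offset:
                active = False
        else:
            # Become active only when the score exceeds the onset threshold.
            if score > onset:
                active = True
        flags.append(active)
    return flags


# Made-up frame-level speech scores.
scores = [0.1, 0.55, 0.7, 0.58, 0.3, 0.62, 0.2]

baseline = binarize(scores, onset=0.5, offset=0.5)
raised = binarize(scores, onset=0.6, offset=0.5)

print("active frames, onset=0.5:", sum(baseline))
print("active frames, onset=0.6:", sum(raised))
```

In this toy example, raising the onset by 0.1 drops the frame scored 0.55, so fewer frames are marked as speech.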

📚 Documentation

Benchmark

Real-time factor

Real-time factor is around 5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).

In other words, it takes approximately 3 minutes to process a one hour conversation.
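
As a quick arithmetic check of these figures, processing time is the audio duration multiplied by the real-time factor. The helper name below is hypothetical, and 0.05 is the approximate factor quoted for the hardware listed above; other hardware will give different numbers.

```python
def processing_time_seconds(audio_seconds, real_time_factor=0.05):
    """Estimate processing time as audio duration times real-time factor."""
    return audio_seconds * real_time_factor


one_hour = 60 * 60  # seconds of audio
estimate = processing_time_seconds(one_hour)
print(f"Estimated processing time: {estimate / 60:.0f} minutes")
```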

Accuracy

This pipeline is benchmarked on a growing collection of datasets.

Processing is fully automatic:

no manual voice activity detection (as is sometimes the case in the literature)
no manual number of speakers (though it is possible to provide it to the pipeline)
no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named "Full" in this paper):

no forgiveness collar
evaluation of overlapped speech

| Benchmark | DER% | FA% | Miss% | Conf% | Expected output | File-level evaluation |
| --- | --- | --- | --- | --- | --- | --- |
| AISHELL-4 | 14.61 | 3.31 | 4.35 | 6.95 | RTTM | eval |
| AMI Mix-Headset [only_words](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 18.21 | 3.28 | 11.07 | 3.87 | RTTM | eval |
| AMI Array1-01 [only_words](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 29.00 | 2.71 | 21.61 | 4.68 | [RTTM](reproducible_research/2022.07/AMI-SDM.SpeakerDiarization.only_words.test.rttm) | [eval](reproducible_research/2022.07/AMI-SDM.SpeakerDiarization.only_words.test.eval) |
| CALLHOME Part2 | 30.24 | 3.71 | 16.86 | 9.66 | RTTM | eval |
| DIHARD 3 Full | 20.99 | 4.25 | 10.74 | 6.00 | RTTM | eval |
| [REPERE Phase 2](https://islrn.org/resources/360-758-359-485-0/) | 12.62 | 1.55 | 3.30 | 7.76 | RTTM | eval |
| VoxConverse v0.0.2 | 12.76 | 3.45 | 3.85 | 5.46 | RTTM | eval |
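
The percentage columns above are related: under the conventional definition, DER is the sum of the false alarm (FA), missed detection (Miss), and speaker confusion (Conf) rates. The sketch below checks this against the reported values, assuming that reading of the column abbreviations. The numbers are copied from the table, and the helper name `der_from_components` is hypothetical.

```python
# (FA%, Miss%, Conf%, reported DER%) copied from the benchmark table.
BENCHMARKS = {
    "AISHELL-4": (3.31, 4.35, 6.95, 14.61),
    "AMI Mix-Headset": (3.28, 11.07, 3.87, 18.21),
    "AMI Array1-01": (2.71, 21.61, 4.68, 29.00),
    "CALLHOME Part2": (3.71, 16.86, 9.66, 30.24),
    "DIHARD 3 Full": (4.25, 10.74, 6.00, 20.99),
    "REPERE Phase 2": (1.55, 3.30, 7.76, 12.62),
    "VoxConverse v0.0.2": (3.45, 3.85, 5.46, 12.76),
}


def der_from_components(false_alarm, miss, confusion):
    """DER as the sum of its three error components (all in percent)."""
    return false_alarm + miss + confusion


for name, (fa, miss, conf, reported) in BENCHMARKS.items():
    total = der_from_components(fa, miss, conf)
    print(f"{name}: components sum to {total:.2f}, reported DER is {reported:.2f}")
```

Small differences of about 0.01 between the sum and the reported DER are consistent with each column being rounded independently.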

📄 License

This project is licensed under the MIT license.

✉️ Support

For commercial enquiries and scientific consulting, please contact me.
For technical questions and bug reports, please check the pyannote.audio GitHub repository.

📖 Citations

@inproceedings{Bredin2021,
  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Address = {Brno, Czech Republic},
  Month = {August},
  Year = {2021},
}

@inproceedings{Bredin2020,
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Address = {Barcelona, Spain},
  Month = {May},
  Year = {2020},
}
