🚀 "Powerset" Speaker Segmentation
This open-source model focuses on speaker segmentation. It takes in 10-second mono audio sampled at 16kHz and outputs speaker diarization as a matrix. It offers a practical solution for audio processing tasks such as speaker identification and speech analysis.
Using this open-source model in production?
Consider switching to pyannoteAI for better and faster options.
🚀 Quick Start
Prerequisites
- Install [pyannote.audio](https://github.com/pyannote/pyannote-audio) 3.0 with pip install pyannote.audio
- Accept [pyannote/segmentation-3.0](https://hf.co/pyannote/segmentation-3.0) user conditions
- Create access token at [hf.co/settings/tokens](https://hf.co/settings/tokens).
Example of Initializing the Model
from pyannote.audio import Model

model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
✨ Features
This model ingests 10 seconds of mono audio sampled at 16kHz and outputs speaker diarization as a (num_frames, num_classes) matrix where the 7 classes are non-speech, speaker #1, speaker #2, speaker #3, speakers #1 and #2, speakers #1 and #3, and speakers #2 and #3.
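As a rough sketch (not an official API, and it assumes the classes are indexed in the order listed above), the mapping from a powerset class index to the set of active speakers looks like this:

# Hypothetical helper: map a powerset class index to the set of active speakers,
# assuming the classes are indexed in the order listed above.
POWERSET_CLASSES = [
    set(),     # non-speech
    {1},       # speaker #1
    {2},       # speaker #2
    {3},       # speaker #3
    {1, 2},    # speakers #1 and #2
    {1, 3},    # speakers #1 and #3
    {2, 3},    # speakers #2 and #3
]

def active_speakers(class_index: int) -> set:
    """Return the set of speakers active for a given powerset class index."""
    return POWERSET_CLASSES[class_index]

In practice, the Powerset utility shown in the basic usage example below performs this conversion for whole batches.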

💻 Usage Examples
Basic Usage
import torch
from pyannote.audio.utils.powerset import Powerset

# 10 seconds of mono audio sampled at 16kHz (model is loaded in the Quick Start above)
batch_size, duration, sample_rate, num_channels = 1, 10, 16000, 1
waveform = torch.randn(batch_size, num_channels, duration * sample_rate)

# (batch_size, num_frames, num_classes) output in the 7-class powerset space
powerset_encoding = model(waveform)

# convert the powerset encoding to a multilabel encoding (one column per speaker)
max_speakers_per_chunk, max_speakers_per_frame = 3, 2
to_multilabel = Powerset(
    max_speakers_per_chunk,
    max_speakers_per_frame).to_multilabel
multilabel_encoding = to_multilabel(powerset_encoding)
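As a small follow-up sketch (assuming the conversion returns hard 0/1 labels), the multilabel encoding can be used to count active speakers per frame:

# multilabel_encoding: (batch_size, num_frames, max_speakers_per_chunk),
# with 1 where a speaker is active and 0 otherwise (hard-label assumption)
speakers_per_frame = multilabel_encoding.sum(dim=-1)      # (batch_size, num_frames)
overlap_frames = (speakers_per_frame > 1).sum().item()    # frames with overlapped speech
print(f"{overlap_frames} frames contain overlapped speech")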
Advanced Usage - Speaker Diarization
This model cannot be used to perform speaker diarization of full recordings on its own (it only processes 10s chunks).
See the [pyannote/speaker-diarization-3.0](https://hf.co/pyannote/speaker-diarization-3.0) pipeline, which uses an additional speaker embedding model to perform speaker diarization of full recordings.
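For reference, a minimal sketch of running that pipeline (it assumes you have also accepted the pyannote/speaker-diarization-3.0 user conditions and reuses the same access token):

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run speaker diarization on a full recording
diarization = pipeline("audio.wav")

# iterate over speech turns and their speaker labels
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker} speaks from {turn.start:.1f}s to {turn.end:.1f}s")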
Advanced Usage - Voice Activity Detection
from pyannote.audio.pipelines import VoiceActivityDetection

pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove speech regions shorter than that many seconds
    "min_duration_on": 0.0,
    # fill non-speech regions shorter than that many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
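The pipeline returns a pyannote.core.Annotation; as a brief usage sketch, the detected speech regions can be listed like this:

# vad is a pyannote.core.Annotation whose segments are detected speech regions
for speech in vad.get_timeline().support():
    print(f"speech from {speech.start:.1f}s to {speech.end:.1f}s")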
Advanced Usage - Overlapped Speech Detection
from pyannote.audio.pipelines import OverlappedSpeechDetection

pipeline = OverlappedSpeechDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove overlapped speech regions shorter than that many seconds
    "min_duration_on": 0.0,
    # fill non-overlapped speech regions shorter than that many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
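Similarly, osd is a pyannote.core.Annotation; a minimal sketch to report the total amount of overlapped speech:

# osd is a pyannote.core.Annotation whose segments are overlapped speech regions
total_overlap = sum(segment.duration for segment in osd.get_timeline().support())
print(f"{total_overlap:.1f} seconds of overlapped speech detected in audio.wav")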
📚 Documentation
The various concepts behind this model are described in detail in this [paper](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html).
It has been trained by Séverin Baroudi with [pyannote.audio](https://github.com/pyannote/pyannote-audio) 3.0.0 using the combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.
This [companion repository](https://github.com/FrenchKrab/IS2023-powerset-diarization/) by Alexis Plaquet also provides instructions on how to train or finetune such a model on your own data.
📄 License
This project is licensed under the MIT license.
📚 Citations
@inproceedings{Plaquet23,
author={Alexis Plaquet and Hervé Bredin},
title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
}
@inproceedings{Bredin23,
author={Hervé Bredin},
title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
}
📋 Metadata
Property | Details
--- | ---
Tags | pyannote, pyannote-audio, pyannote-audio-model, audio, voice, speech, speaker, speaker-diarization, speaker-change-detection, speaker-segmentation, voice-activity-detection, overlapped-speech-detection, resegmentation
License | MIT
Inference | false
Extra Gated Prompt | The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this model uses the MIT license and will always remain open-source, we will occasionally email you about premium models and paid services around pyannote.
Extra Gated Fields | Company/university: text, Website: text