🚀 "Powerset" Speaker Segmentation
This open-source model is designed for speaker segmentation. It takes in 10 seconds of mono audio sampled at 16kHz and outputs speaker diarization as a matrix. It can be used in various audio analysis tasks such as speaker diarization, voice activity detection, and overlapped speech detection.
Using this open-source model in production?
Consider switching to pyannoteAI for better and faster options.
✨ Features
- Speaker Segmentation: Ingests 10 seconds of 16kHz mono audio and outputs speaker diarization as a (num_frames, num_classes) matrix with 7 classes covering non-speech and the different speaker combinations (enumerated in the sketch after this list).
- Multi-Encoding: Outputs powerset multi-class encoding, which can be converted to multi-label encoding.
- Multiple Applications: Can be used for voice activity detection and overlapped speech detection through pipelines.
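To make the 7 classes concrete, here is a minimal illustration (not a library API) that enumerates the powerset classes for at most 3 speakers per chunk and at most 2 simultaneous speakers per frame: non-speech, each single speaker, and each pair of speakers.

```python
from itertools import combinations

# illustration only: powerset classes for max_speakers_per_chunk=3
# and max_speakers_per_frame=2
speakers = ["spk1", "spk2", "spk3"]
powerset_classes = [()]  # the empty set encodes non-speech
for size in (1, 2):
    powerset_classes += list(combinations(speakers, size))

print(len(powerset_classes))  # 7
# [(), ('spk1',), ('spk2',), ('spk3',),
#  ('spk1', 'spk2'), ('spk1', 'spk3'), ('spk2', 'spk3')]
```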
📦 Installation
Requirements
- Install `pyannote.audio` 3.0 with `pip install pyannote.audio`.
- Accept the `pyannote/segmentation-3.0` user conditions at hf.co/pyannote/segmentation-3.0.
- Create an access token at hf.co/settings/tokens.
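Optionally, you can sanity-check the installed version from Python:

```python
# optional sanity check: pyannote.audio 3.0 or later is required
import pyannote.audio
print(pyannote.audio.__version__)
```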
💻 Usage Examples
Basic Usage
```python
import torch
from pyannote.audio import Model
from pyannote.audio.utils.powerset import Powerset

# instantiate the pretrained model (see Installation for the access token)
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# one batch of 10s mono chunks sampled at 16kHz
batch_size, duration, sample_rate, num_channels = 1, 10, 16000, 1
waveform = torch.randn(batch_size, num_channels, duration * sample_rate)
powerset_encoding = model(waveform)

# convert from (default) powerset multi-class encoding to multi-label encoding
max_speakers_per_chunk, max_speakers_per_frame = 3, 2
to_multilabel = Powerset(
    max_speakers_per_chunk,
    max_speakers_per_frame).to_multilabel
multilabel_encoding = to_multilabel(powerset_encoding)
```
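The exact frame count depends on the model's internal stride, so checking shapes at runtime is the safest way to see what you got; a minimal sketch (shape layout per the Features section above):

```python
# powerset_encoding: (batch_size, num_frames, 7) — one score per powerset class
# multilabel_encoding: (batch_size, num_frames, 3) — one activity per speaker
print(powerset_encoding.shape, multilabel_encoding.shape)
```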
Advanced Usage - Speaker Diarization
This model cannot be used to perform speaker diarization of full recordings on its own (it only processes 10s chunks).
See the pyannote/speaker-diarization-3.0 pipeline, which uses an additional speaker embedding model to perform speaker diarization of full recordings.
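For reference, a minimal sketch of running that pipeline (same access token as in Basic Usage; `itertracks` is standard pyannote.core usage):

```python
from pyannote.audio import Pipeline

# full-recording speaker diarization with the dedicated pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
diarization = pipeline("audio.wav")

# print each speaker turn
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```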
Advanced Usage - Voice Activity Detection
```python
from pyannote.audio.pipelines import VoiceActivityDetection

# reuse the segmentation model instantiated in Basic Usage
pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove speech regions shorter than that many seconds
    "min_duration_on": 0.0,
    # fill non-speech regions shorter than that many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
```
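The returned `vad` is a `pyannote.core.Annotation`; a minimal sketch of listing the detected speech regions with standard pyannote.core calls:

```python
# merge overlapping segments and print each speech region
for speech in vad.get_timeline().support():
    print(f"speech from {speech.start:.1f}s to {speech.end:.1f}s")
```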
Advanced Usage - Overlapped Speech Detection
```python
from pyannote.audio.pipelines import OverlappedSpeechDetection

# reuse the segmentation model instantiated in Basic Usage
pipeline = OverlappedSpeechDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove overlapped speech regions shorter than that many seconds
    "min_duration_on": 0.0,
    # fill non-overlapped regions shorter than that many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
```
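Like `vad` above, the returned `osd` is a `pyannote.core.Annotation`, so its overlap regions can be listed with the same `get_timeline().support()` idiom.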
📚 Documentation
The various concepts behind this model are described in detail in this paper.
It has been trained by Séverin Baroudi with pyannote.audio 3.0.0, using the combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.
This companion repository by Alexis Plaquet also provides instructions on how to train or finetune such a model on your own data.
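For orientation only, a hedged sketch of what finetuning on your own data might look like with pyannote.audio 3.0; the protocol name and trainer settings below are placeholders, and the companion repository remains the authoritative recipe:

```python
import pytorch_lightning as pl
from pyannote.audio import Model
from pyannote.audio.tasks import SpeakerDiarization
from pyannote.database import get_protocol

# placeholder protocol: point this at your own pyannote.database setup
protocol = get_protocol("MyDatabase.SpeakerDiarization.MyProtocol")

# start from the pretrained checkpoint and attach a training task
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
model.task = SpeakerDiarization(
    protocol, duration=10.0,
    max_speakers_per_chunk=3, max_speakers_per_frame=2)

trainer = pl.Trainer(max_epochs=1)  # placeholder trainer settings
trainer.fit(model)
```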
📄 License
This model is licensed under the MIT license.
📖 Citations
```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```
📋 Model Information
| Property | Details |
|----------|---------|
| Model Type | "Powerset" speaker segmentation |
| Training Data | Combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse |
| License | MIT |
⚠️ Important Note
The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this model uses the MIT license and will always remain open-source, we will occasionally email you about premium models and paid services around pyannote.
💡 Usage Tip
If you want to perform speaker diarization of full recordings, use the pyannote/speaker-diarization-3.0 pipeline instead of this model alone.