🚀 "Powerset" Speaker Segmentation
This open-source model focuses on speaker segmentation. It takes 10-second mono audio sampled at 16kHz as input and outputs speaker diarization as a (num_frames, num_classes) matrix. The 7 classes are: non-speech, speaker #1, speaker #2, speaker #3, speakers #1 and #2, speakers #1 and #3, and speakers #2 and #3. It is useful for audio processing tasks such as speaker diarization, voice activity detection, and overlapped speech detection.
Using this open-source model in production?
Consider switching to pyannoteAI for better and faster options.
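To make the encoding above concrete, the seven classes can be viewed as sets of simultaneously active speakers (an illustrative sketch only; the exact class ordering is defined by pyannote.audio's `Powerset` utility):
```python
# Illustrative only: one set of active speaker indices per powerset class.
POWERSET_CLASSES = [
    set(),    # non-speech
    {0},      # speaker #1
    {1},      # speaker #2
    {2},      # speaker #3
    {0, 1},   # speakers #1 and #2
    {0, 2},   # speakers #1 and #3
    {1, 2},   # speakers #2 and #3
]
```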

🚀 Quick Start
Prerequisites
- Install the necessary libraries and set up the environment as described in the 📦 Installation section below.
Basic Usage
```python
import torch
from pyannote.audio import Model
from pyannote.audio.utils.powerset import Powerset

# load the pretrained segmentation model (see Usage Examples below)
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# dummy batch of 10-second mono chunks sampled at 16kHz
batch_size = 1
duration, sample_rate, num_channels = 10, 16000, 1
waveform = torch.randn(batch_size, num_channels, duration * sample_rate)

# powerset multi-class encoding
powerset_encoding = model(waveform)

# convert powerset encoding to multilabel encoding
max_speakers_per_chunk, max_speakers_per_frame = 3, 2
to_multilabel = Powerset(
    max_speakers_per_chunk,
    max_speakers_per_frame).to_multilabel
multilabel_encoding = to_multilabel(powerset_encoding)
```
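As a sanity check, the two encodings should have the following shapes (the default model outputs 7 powerset classes for up to 3 speakers):
```python
print(powerset_encoding.shape)    # (batch_size, num_frames, 7): powerset classes
print(multilabel_encoding.shape)  # (batch_size, num_frames, 3): one activation per speaker
```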
✨ Features
- Speaker Segmentation: Ingests 10-second mono audio at 16kHz and outputs speaker diarization as a multi-class matrix.
- Multiple Applications: Can be used for speaker diarization, voice activity detection, overlapped speech detection, etc.
📦 Installation
- Install [pyannote.audio](https://github.com/pyannote/pyannote-audio) 3.0 with `pip install pyannote.audio`.
- Accept [pyannote/segmentation-3.0](https://hf.co/pyannote/segmentation-3.0) user conditions.
- Create an access token at [hf.co/settings/tokens](https://hf.co/settings/tokens).
💻 Usage Examples
Basic Usage
```python
from pyannote.audio import Model

model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
```
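Since the model is a regular PyTorch module, inference can optionally run on GPU (a sketch, assuming a CUDA-capable device is available):
```python
import torch

model.to(torch.device("cuda"))
```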
Speaker Diarization
This model cannot perform speaker diarization of full recordings on its own (it only processes 10-second chunks).
See the [pyannote/speaker-diarization-3.0](https://hf.co/pyannote/speaker-diarization-3.0) pipeline, which combines this model with an additional speaker embedding model to perform speaker diarization of full recordings.
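A minimal sketch of applying that pipeline (assuming its user conditions have been accepted and a valid access token is available):
```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# apply the pipeline to a full recording
diarization = pipeline("audio.wav")

# iterate over speech turns and their speaker labels
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s to {turn.end:.1f}s: {speaker}")
```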
Voice Activity Detection
```python
from pyannote.audio.pipelines import VoiceActivityDetection

pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove speech regions shorter than this many seconds
    "min_duration_on": 0.0,
    # fill non-speech gaps shorter than this many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
```
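The pipeline output `vad` is a `pyannote.core.Annotation`; a short sketch of reading the detected speech regions from it:
```python
# iterate over detected speech regions (merging contiguous segments)
for speech in vad.get_timeline().support():
    print(f"speech from {speech.start:.1f}s to {speech.end:.1f}s")
```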
Overlapped Speech Detection
```python
from pyannote.audio.pipelines import OverlappedSpeechDetection

pipeline = OverlappedSpeechDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove overlapped speech regions shorter than this many seconds
    "min_duration_on": 0.0,
    # fill gaps between overlapped speech regions shorter than this many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
```
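The result can be read in exactly the same way as the voice activity detection output (a sketch):
```python
# iterate over detected overlapped speech regions
for overlap in osd.get_timeline().support():
    print(f"overlapped speech from {overlap.start:.1f}s to {overlap.end:.1f}s")
```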
📚 Documentation
The various concepts behind this model are described in detail in this [paper](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html).
It has been trained by Séverin Baroudi with [pyannote.audio](https://github.com/pyannote/pyannote-audio) 3.0.0,
using the combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.
This [companion repository](https://github.com/FrenchKrab/IS2023-powerset-diarization/) by Alexis Plaquet also provides instructions on how to train or finetune such a model on your own data.
📄 License
This project is licensed under the MIT license.
Citations
```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```
⚠️ Important Note
The collected information will help the maintainers acquire a better knowledge of the pyannote.audio user base and improve it further. Though this model uses the MIT license and will always remain open-source, we will occasionally email you about premium models and paid services around pyannote.
| Property | Details |
|----------|---------|
| Model Type | "Powerset" speaker segmentation |
| Training Data | Combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse |