🚀 "Powerset" Speaker Segmentation
This open-source model ingests 10 seconds of mono audio sampled at 16kHz and outputs speaker diarization, making it a building block for voice activity detection, overlapped speech detection, and full-recording speaker diarization pipelines.
🚀 Quick Start
Requirements
- Install `pyannote.audio` 3.0 with `pip install pyannote.audio`.
- Accept `pyannote/segmentation-3.0` user conditions.
- Create an access token at `hf.co/settings/tokens`.
Usage
```python
from pyannote.audio import Model

model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
```
✨ Features
- Speaker Diarization: This model ingests 10 seconds of mono audio sampled at 16kHz and outputs speaker diarization as a (num_frames, num_classes) matrix. The 7 classes are non-speech, speaker #1, speaker #2, speaker #3, speakers #1 and #2, speakers #1 and #3, and speakers #2 and #3 (see the sketch after this list).
- Voice Activity Detection: Can be used in a voice activity detection pipeline.
- Overlapped Speech Detection: Can be used in an overlapped speech detection pipeline.
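With 3 speakers per chunk and at most 2 active per frame, the powerset contains 1 + 3 + 3 = 7 classes: every subset of at most 2 speakers, including the empty one (non-speech). Here is a minimal, self-contained sketch of that enumeration, in plain Python for illustration only; `pyannote.audio.utils.powerset.Powerset` builds an equivalent mapping internally:

```python
from itertools import combinations

# enumerate every subset of at most 2 out of 3 speakers: 1 + 3 + 3 = 7 classes
speakers = ["speaker #1", "speaker #2", "speaker #3"]
classes = [combo for k in range(3) for combo in combinations(speakers, k)]

print(len(classes))  # 7
for combo in classes:
    print("non-speech" if not combo else " and ".join(combo))
```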
📦 Installation
Install the necessary library using the following command:
```
pip install pyannote.audio
```
💻 Usage Examples
Basic Usage
```python
import torch
from pyannote.audio.utils.powerset import Powerset

# `model` is loaded as in the Quick Start section above
batch_size = 1
duration, sample_rate, num_channels = 10, 16000, 1
waveform = torch.randn(batch_size, num_channels, duration * sample_rate)

# (batch_size, num_frames, num_classes) powerset encoding
powerset_encoding = model(waveform)

# convert from powerset to multilabel (one column per speaker) encoding
max_speakers_per_chunk, max_speakers_per_frame = 3, 2
to_multilabel = Powerset(
    max_speakers_per_chunk,
    max_speakers_per_frame).to_multilabel
multilabel_encoding = to_multilabel(powerset_encoding)
```
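As a quick sanity check, continuing the example above (the sizes follow from the 7 powerset classes and 3 speakers described under Features):

```python
print(powerset_encoding.shape)    # (batch_size, num_frames, 7): one score per powerset class
print(multilabel_encoding.shape)  # (batch_size, num_frames, 3): one column per speaker
```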
Advanced Usage
Speaker Diarization
This model cannot perform speaker diarization of full recordings on its own (it only processes 10s chunks). See the [pyannote/speaker-diarization-3.0](https://hf.co/pyannote/speaker-diarization-3.0) pipeline, which uses an additional speaker embedding model to perform full-recording speaker diarization.
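For reference, a minimal sketch of running that pipeline with the standard `pyannote.audio` pipeline API; the output is a `pyannote.core.Annotation` with one labeled segment per speaker turn:

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
diarization = pipeline("audio.wav")

# print one line per speaker turn
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```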
Voice Activity Detection
```python
from pyannote.audio.pipelines import VoiceActivityDetection

pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove speech regions shorter than that many seconds
    "min_duration_on": 0.0,
    # fill non-speech regions shorter than that many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
```
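`vad` is a `pyannote.core.Annotation`; one way to list the detected speech regions (a usage sketch, not part of the original example):

```python
# merge adjacent/overlapping segments and print each speech region
for speech in vad.get_timeline().support():
    print(f"speech from {speech.start:.1f}s to {speech.end:.1f}s")
```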
Overlapped Speech Detection
```python
from pyannote.audio.pipelines import OverlappedSpeechDetection

pipeline = OverlappedSpeechDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove overlapped-speech regions shorter than that many seconds
    "min_duration_on": 0.0,
    # fill non-overlapped regions shorter than that many seconds
    "min_duration_off": 0.0,
}
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
```
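Similarly, `osd` is a `pyannote.core.Annotation`. As an illustrative follow-up (assuming a local `audio.wav`), one can estimate how much of the file is overlapped speech:

```python
from pyannote.audio import Audio

# fraction of the recording covered by overlapped speech
total = Audio().get_duration("audio.wav")
overlap = osd.get_timeline().support().duration()
print(f"{100 * overlap / total:.1f}% of the file is overlapped speech")
```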
📚 Documentation
The various concepts behind this model are described in detail in this [paper](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html).
This [companion repository](https://github.com/FrenchKrab/IS2023-powerset-diarization/) by Alexis Plaquet also provides instructions on how to train or fine-tune such a model on your own data.
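For orientation only, a minimal fine-tuning sketch following the usual `pyannote.audio` task/trainer pattern; `MyDatabase.SpeakerDiarization.MyProtocol` is a hypothetical `pyannote.database` protocol name to replace with your own, and the companion repository remains the authoritative recipe:

```python
import pytorch_lightning as pl
from pyannote.audio import Model
from pyannote.audio.tasks import Segmentation
from pyannote.database import registry

# hypothetical protocol name: replace with your own pyannote.database protocol
protocol = registry.get_protocol("MyDatabase.SpeakerDiarization.MyProtocol")

# same chunk/speaker configuration as the pretrained model
task = Segmentation(
    protocol,
    duration=10.0,
    max_speakers_per_chunk=3,
    max_speakers_per_frame=2)

# start from the pretrained checkpoint and fine-tune on the new task
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
model.task = task

trainer = pl.Trainer(max_epochs=1)
trainer.fit(model)
```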
📄 License
This project is licensed under the MIT license.
📄 Citations
```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}

@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```
⚠️ Important Note
Using this open-source model in production? Consider switching to pyannoteAI for better and faster options.
⚠️ Important Note
The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this model uses the MIT license and will always remain open-source, we will occasionally email you about premium models and paid services around pyannote.
| Property | Details |
|----------|---------|
| Tags | pyannote, pyannote-audio, pyannote-audio-model, audio, voice, speech, speaker, speaker-diarization, speaker-change-detection, speaker-segmentation, voice-activity-detection, overlapped-speech-detection, resegmentation |
| License | MIT |
| Inference | false |