The open-source speaker recognition model "hubert-large-superb-sid" - Highly practical for voice classification tasks

Hubert Large Superb Sid

Developed by superb

Speaker recognition model based on Hubert-Large architecture, trained on the VoxCeleb1 dataset for speech classification tasks

Speaker Analysis

Transformers

English

Open Source License: Apache-2.0

#Speaker Recognition #High Accuracy #16kHz Audio

Release Time: 3/2/2022

Model Overview

This model is a speaker recognition system based on the Hubert-Large architecture. It assigns each speech segment to one speaker from a fixed, predefined set of identities. The base model is pre-trained on speech sampled at 16kHz, so input audio should use the same sampling rate.

Model Features

High Accuracy

Achieves 90.35% accuracy on the VoxCeleb1 test set

16kHz Sampling Support

Expects speech audio sampled at 16kHz, matching the pre-training data

Pre-trained Model Fine-tuning

Fine-tuned from the hubert-large-ll60k pre-trained model

Model Capabilities

Speaker Recognition

Speech Classification

Audio Feature Extraction

Use Cases

Security Authentication

Voice Biometrics

Used for voice-based authentication systems

Accurately recognizes registered users' voice characteristics

Speech Analysis

Speaker Diarization

Distinguishing between different speakers in meeting recordings

Helps automatically generate meeting transcripts with speaker labels

🚀 Hubert-Large for Speaker Identification

A model for speaker identification based on the Hubert architecture, ported for the SUPERB Speaker Identification task.

🚀 Quick Start

This model is a ported version for the SUPERB Speaker Identification task. The base model is hubert-large-ll60k, pretrained on 16kHz sampled speech audio. Make sure your speech input is also sampled at 16kHz when using the model.
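The 16kHz requirement can be checked before any audio reaches the model. The following is a minimal sketch that reads the sampling rate from a WAV header with Python's standard `wave` module. The helper names `wav_sample_rate` and `needs_resampling` are illustrative and not part of the model's API, and the sketch assumes uncompressed WAV input.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the base model was pretrained on


def wav_sample_rate(path):
    """Return the sampling rate stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True when the file must be resampled before inference."""
    return wav_sample_rate(path) != expected_rate
```

Files that fail the check can be resampled on load, for example with `librosa.load(path, sr=16000)` as in the advanced usage example.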

✨ Features

Ported from S3PRL's Hubert for the SUPERB Speaker Identification task.
Based on the hubert-large-ll60k base model.
Suitable for multi-class speaker identification tasks using the VoxCeleb1 dataset.

📚 Documentation

Model description

This is a ported version of S3PRL's Hubert for the SUPERB Speaker Identification task. The base model is hubert-large-ll60k, which is pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. For more information, refer to SUPERB: Speech processing Universal PERformance Benchmark.

Task and dataset description

Speaker Identification (SI) treats each utterance as a multi-class classification problem over speaker identities, where the speakers belong to the same predefined set for both training and testing. The widely used VoxCeleb1 dataset is adopted. For the original model's training and evaluation instructions, refer to the S3PRL downstream task README.
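Because the speaker set is closed, a prediction is a ranking over a fixed list of class labels. The sketch below shows one common way to turn per-class scores (logits) into ranked speaker probabilities with a softmax. The three-speaker label map and the logit values are made up for illustration, and `top_k_speakers` is a hypothetical helper, not a function from S3PRL or Transformers.

```python
import math


def top_k_speakers(logits, id2label, k=5):
    """Rank the speakers of a closed set by softmax probability."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    scored = [(id2label[i], e / total) for i, e in enumerate(exps)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical three-speaker set and made-up logits, for illustration only.
id2label = {0: "speaker_a", 1: "speaker_b", 2: "speaker_c"}
ranking = top_k_speakers([0.2, 2.5, -1.0], id2label, k=2)
```

A speaker outside the predefined set still receives one of the known labels, since the classifier has no "unknown" class in this formulation.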

💻 Usage Examples

Basic Usage

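from datasets import load_dataset
from transformers import pipeline

# load the speaker identification ("si") split of the demo dataset
dataset = load_dataset("anton-l/superb_demo", "si", split="test")

# build an audio classification pipeline around the model
classifier = pipeline("audio-classification", model="superb/hubert-large-superb-sid")
# return the five most likely speakers for the first audio file
labels = classifier(dataset[0]["file"], top_k=5)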

Advanced Usage

import torch
import librosa
from datasets import load_dataset
from transformers import HubertForSequenceClassification, Wav2Vec2FeatureExtractor

def map_to_array(example):
    speech, _ = librosa.load(example["file"], sr=16000, mono=True)
    example["speech"] = speech
    return example

# load a demo dataset and read audio files
dataset = load_dataset("anton-l/superb_demo", "si", split="test")
dataset = dataset.map(map_to_array)

model = HubertForSequenceClassification.from_pretrained("superb/hubert-large-superb-sid")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("superb/hubert-large-superb-sid")

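# pad the batch, compute attention masks, and normalize the waveform if needed
inputs = feature_extractor(dataset[:2]["speech"], sampling_rate=16000, padding=True, return_tensors="pt")

# inference only: disable gradient tracking to save memory
with torch.no_grad():
    logits = model(**inputs).logits

# pick the highest-scoring class per utterance and map its id to a speaker label
predicted_ids = torch.argmax(logits, dim=-1)
labels = [model.config.id2label[_id] for _id in predicted_ids.tolist()]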
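The comment about attention masks and normalization can be made concrete. The following is a rough, pure-Python sketch of the two preprocessing steps a feature extractor typically performs on a batch of waveforms of different lengths: scaling each waveform to zero mean and unit variance, and right-padding to a common length with a mask marking real samples. It is an assumption-laden illustration, not the library's implementation; the function names, the `eps` constant, and the order of the steps are chosen here for clarity.

```python
import statistics


def normalize(waveform, eps=1e-7):
    """Scale a waveform to zero mean and (approximately) unit variance."""
    mean = statistics.fmean(waveform)
    variance = statistics.pvariance(waveform, mu=mean)
    scale = (variance + eps) ** 0.5  # eps guards against silent (constant) input
    return [(sample - mean) / scale for sample in waveform]


def pad_batch(waveforms, pad_value=0.0):
    """Right-pad waveforms to the longest one and mark real samples with 1."""
    longest = max(len(w) for w in waveforms)
    input_values, attention_mask = [], []
    for w in waveforms:
        missing = longest - len(w)
        input_values.append(list(w) + [pad_value] * missing)
        attention_mask.append([1] * len(w) + [0] * missing)
    return input_values, attention_mask
```

The mask lets the model ignore padded positions, so short utterances in a batch are not scored on trailing silence that was never in the recording.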

📄 License

This model is licensed under the Apache-2.0 license.

📊 Eval results

The evaluation metric is accuracy.

Split	s3prl	transformers
test	0.9033	0.9035
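For reference, accuracy here is the fraction of utterances whose predicted speaker matches the reference label. A minimal sketch, with a hypothetical `accuracy` helper and made-up labels:

```python
def accuracy(predicted, reference):
    """Fraction of utterances whose predicted speaker matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Made-up labels for illustration: three of four predictions are correct.
score = accuracy(["a", "b", "c", "a"], ["a", "b", "a", "a"])
```

On this scale, the reported 0.9035 corresponds to roughly 90 correct predictions per 100 test utterances.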

BibTeX entry and citation info

@article{yang2021superb,
  title={SUPERB: Speech processing Universal PERformance Benchmark},
  author={Yang, Shu-wen and Chi, Po-Han and Chuang, Yung-Sung and Lai, Cheng-I Jeff and Lakhotia, Kushal and Lin, Yist Y and Liu, Andy T and Shi, Jiatong and Chang, Xuankai and Lin, Guan-Ting and others},
  journal={arXiv preprint arXiv:2105.01051},
  year={2021}
}
