Open-source Speaker Identification Model wav2vec2-base-superb-sid

Wav2vec2 Base Superb Sid

Developed by superb

A speaker identification model fine-tuned on the VoxCeleb1 dataset based on the Wav2Vec2-base pre-trained model, designed for voice classification tasks

Speaker Analysis

Transformers

Language: English

Open Source License: Apache-2.0

#Speaker Identification #16kHz Audio Processing #VoxCeleb1 Dataset

Downloads 1,489

Release Time: 3/2/2022

Model Overview

This model is a ported version of S3PRL's Wav2Vec2 for the SUPERB speaker identification task, capable of classifying each speech segment by its speaker identity

Model Features

Based on Wav2Vec2 Pre-trained Model

Uses facebook/wav2vec2-base as the base model, which is pre-trained on 16kHz sampled speech audio

Fine-tuned on VoxCeleb1 Dataset

Fine-tuned on the widely-used VoxCeleb1 dataset, suitable for speaker identification tasks

Test Accuracy

Achieves 75.18% accuracy on the test set, matching the original S3PRL implementation

Model Capabilities

Speaker Identification

Voice Classification

Audio Feature Extraction

Use Cases

Security Verification

Voiceprint Recognition System

Used for speaker identification in authentication systems

Can identify speakers that belong to the predefined speaker set used for training (VoxCeleb1)

Speech Analysis

Meeting Transcription Analysis

Identifies speech segments from different speakers in meeting recordings

Automatically distinguishes between different speakers
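
For the meeting use case above, one possible approach (an assumption, not something this model card prescribes) is to classify fixed-length windows of a recording and then merge consecutive windows that receive the same label. The sketch below covers only the merging step. The helper name `merge_speaker_turns`, the window length, and the label strings are hypothetical; the per-window labels would come from the classifier, which can only assign speakers from its predefined VoxCeleb1 set.

```python
def merge_speaker_turns(window_labels, window_seconds=3.0):
    """Merge consecutive windows with the same label into (start, end, label) turns."""
    turns = []
    for index, label in enumerate(window_labels):
        start = index * window_seconds
        end = start + window_seconds
        if turns and turns[-1][2] == label:
            # Same speaker as the previous window: extend the current turn.
            turns[-1] = (turns[-1][0], end, label)
        else:
            # Speaker changed (or first window): open a new turn.
            turns.append((start, end, label))
    return turns
```

Shorter windows give finer turn boundaries but leave the classifier less audio per decision, so the window length is a trade-off to tune on real recordings.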

🚀 Wav2Vec2-Base for Speaker Identification

This project provides a ported version of S3PRL's Wav2Vec2 for the SUPERB Speaker Identification task, enabling accurate speaker identification.

🚀 Quick Start

The quickest way to try the model is through the Audio Classification pipeline:

from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("anton-l/superb_demo", "si", split="test")

classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-sid")
labels = classifier(dataset[0]["file"], top_k=5)
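
The `top_k=5` argument asks the pipeline for the five highest-scoring speakers. As a rough illustration of that ranking step (not the pipeline's actual implementation), the sketch below applies a softmax to raw logits and returns the best entries as a list of dicts with `score` and `label` keys. The helper names `softmax` and `top_k_labels` and the toy `id2label` mapping used to call them are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def top_k_labels(logits, id2label, k=5):
    """Return, per utterance, the k most probable labels with their scores."""
    probs = np.atleast_2d(softmax(logits))
    results = []
    for row in probs:
        # argsort is ascending, so reverse it and keep the first k indices.
        top_ids = np.argsort(row)[::-1][:k]
        results.append(
            [{"score": float(row[i]), "label": id2label[int(i)]} for i in top_ids]
        )
    return results
```

With a PyTorch model, the logits could be passed in as `logits.detach().numpy()` together with `model.config.id2label`.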

Or use the model directly:

import librosa
import torch
from datasets import load_dataset
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

def map_to_array(example):
    # decode the file as mono audio and resample it to the 16 kHz rate the model expects
    speech, _ = librosa.load(example["file"], sr=16000, mono=True)
    example["speech"] = speech
    return example

# load a demo dataset and read audio files
dataset = load_dataset("anton-l/superb_demo", "si", split="test")
dataset = dataset.map(map_to_array)

model = Wav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-base-superb-sid")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-sid")

# compute attention masks and normalize the waveform if needed
inputs = feature_extractor(dataset[:2]["speech"], sampling_rate=16000, padding=True, return_tensors="pt")

# inference only, so skip gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits

# pick the highest-scoring class per utterance and map it to its speaker label
predicted_ids = torch.argmax(logits, dim=-1)
labels = [model.config.id2label[_id] for _id in predicted_ids.tolist()]
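
The comment about attention masks and normalization in the code above can be made concrete with a small NumPy sketch. It assumes zero-mean, unit-variance normalization per utterance and zero-padding to the longest utterance in the batch, which is roughly what a feature extractor does when normalization and `padding=True` are enabled. It is an illustration, not the library's implementation, and the helper names `normalize` and `pad_batch` are hypothetical.

```python
import numpy as np


def normalize(waveform, eps=1e-7):
    """Scale one utterance to zero mean and (approximately) unit variance."""
    waveform = np.asarray(waveform, dtype=np.float32)
    return (waveform - waveform.mean()) / np.sqrt(waveform.var() + eps)


def pad_batch(waveforms):
    """Normalize each utterance, zero-pad to a common length, and build a mask."""
    max_len = max(len(w) for w in waveforms)
    batch = np.zeros((len(waveforms), max_len), dtype=np.float32)
    mask = np.zeros((len(waveforms), max_len), dtype=np.int64)
    for i, w in enumerate(waveforms):
        batch[i, : len(w)] = normalize(w)
        mask[i, : len(w)] = 1  # 1 marks real samples, 0 marks padding
    return batch, mask
```

The mask lets a model tell real audio apart from padding, so shorter utterances in a batch are not scored on the zeros appended to them.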

✨ Features

This is a ported version of S3PRL's Wav2Vec2 for the SUPERB Speaker Identification task.
The base model is wav2vec2-base, which is pretrained on 16kHz sampled speech audio.
Speaker Identification (SI) classifies each utterance by speaker identity as a multi-class classification task, using the widely used VoxCeleb1 dataset.

📚 Documentation

Model description

This is a ported version of S3PRL's Wav2Vec2 for the SUPERB Speaker Identification task. The base model is wav2vec2-base, which is pretrained on 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz. For more information, refer to SUPERB: Speech processing Universal PERformance Benchmark.
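
In the Quick Start code, `librosa.load(..., sr=16000)` already resamples audio while loading. For audio that is already in memory at another rate, a minimal sketch of the conversion is shown below. It uses plain linear interpolation, which is a crude stand-in with no anti-aliasing filter; a dedicated resampling routine would normally be preferable. The helper name `resample_linear` is hypothetical.

```python
import numpy as np

TARGET_SR = 16_000  # sampling rate the model was pretrained on


def resample_linear(waveform, orig_sr, target_sr=TARGET_SR):
    """Resample a mono waveform to target_sr using linear interpolation."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if orig_sr == target_sr:
        return waveform
    # Keep the duration fixed and recompute the number of samples.
    n_target = int(round(len(waveform) * target_sr / orig_sr))
    old_times = np.arange(len(waveform)) / orig_sr
    new_times = np.arange(n_target) / target_sr
    return np.interp(new_times, old_times, waveform).astype(np.float32)
```

The result can be passed to the feature extractor with `sampling_rate=16000`, as in the Quick Start code.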

Task and dataset description

Speaker Identification (SI) classifies each utterance by speaker identity as a multi-class classification task, where speakers are in the same predefined set for both training and testing. The widely used VoxCeleb1 dataset is adopted. For the original model's training and evaluation instructions, refer to the S3PRL downstream task README.

Eval results

The evaluation metric is accuracy.

Split	s3prl	transformers
test	0.7518	0.7518
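
For reference, accuracy here is the fraction of utterances whose predicted speaker matches the reference speaker. A minimal sketch of that computation follows; the helper name `accuracy` is hypothetical, and this is not the S3PRL evaluation script.

```python
def accuracy(predicted, reference):
    """Fraction of positions where the predicted label equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)
```

On this scale, the reported 0.7518 means roughly three out of four test utterances were assigned to the correct speaker.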

BibTeX entry and citation info

@article{yang2021superb,
  title={SUPERB: Speech processing Universal PERformance Benchmark},
  author={Yang, Shu-wen and Chi, Po-Han and Chuang, Yung-Sung and Lai, Cheng-I Jeff and Lakhotia, Kushal and Lin, Yist Y and Liu, Andy T and Shi, Jiatong and Chang, Xuankai and Lin, Guan-Ting and others},
  journal={arXiv preprint arXiv:2105.01051},
  year={2021}
}

📄 License

This project is licensed under the Apache 2.0 license.
