wav2vec2-base-superb-sid: Open-Source Speaker Identification Model

Wav2vec2 Base Superb Sid

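Developed by superb

A speaker identification model fine-tuned on the VoxCeleb1 dataset from the pre-trained Wav2Vec2-base model, used for audio classification.

Speaker processing

Transformers

Language: English  License: Apache-2.0  #speaker-identification #16kHz-audio-processing #VoxCeleb1-dataset

Downloads: 1,489

Released: 3/2/2022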

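Model Overview

This model is a port of S3PRL's Wav2Vec2 for the SUPERB Speaker Identification task. It classifies each utterance by speaker identity as a multi-class classification problem.

Model Features

Based on the pre-trained Wav2Vec2 model

Uses facebook/wav2vec2-base as the base model, which is pre-trained on speech audio sampled at 16kHz.

Fine-tuned on VoxCeleb1

Fine-tuned on the widely used VoxCeleb1 dataset for the speaker identification task.

Accuracy

Reaches 75.18% accuracy on the test set.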
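
To make the accuracy figure concrete: accuracy is the fraction of test utterances whose predicted speaker label matches the reference label. The following is a minimal sketch only. The `accuracy` helper and the sample labels are hypothetical and are not part of the model or its evaluation code.

```python
def accuracy(predicted, reference):
    """Return the fraction of positions where predicted matches reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Hypothetical speaker labels for four utterances (illustration only).
predicted = ["id10003", "id10004", "id10003", "id10007"]
reference = ["id10003", "id10004", "id10005", "id10007"]

# Three of the four predictions match the reference.
print(f"{accuracy(predicted, reference):.2%}")
```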

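Model Capabilities

Speaker identification

Audio classification

Audio feature extraction

Use Cases

Security verification

Voiceprint recognition systems

Speaker identification for identity verification systems

Can identify the identity of a specific speaker

Speech analysis

Meeting recording analysis

Identifies the speech segments of different speakers in a meeting recording

Automatically distinguishes between speakers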

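🚀 Wav2Vec2-Base for Speaker Identification

This model performs speaker identification. It builds on the pre-trained wav2vec2-base model and classifies speech to identify the speaker, and it is fine-tuned on the VoxCeleb1 dataset.

🚀 Quick Start

You can use the model in either of the following two ways:

Method 1: Use the audio classification pipeline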

from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("anton-l/superb_demo", "si", split="test")

classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-sid")
labels = classifier(dataset[0]["file"], top_k=5)
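
The audio classification pipeline returns a list of dictionaries, each holding a `label` and a `score`. As a rough sketch of how that result might be consumed, the snippet below picks the top prediction and rejects it if the score is too low. The sample values, the `best_prediction` helper, and the `min_score` threshold are assumptions made for illustration; they do not come from the model card.

```python
# Hypothetical pipeline output for one utterance (values are made up).
labels = [
    {"score": 0.62, "label": "id10003"},
    {"score": 0.21, "label": "id10004"},
    {"score": 0.09, "label": "id10007"},
]


def best_prediction(predictions, min_score=0.5):
    """Return (label, score) for the top prediction, or None if it scores below min_score."""
    if not predictions:
        return None
    top = max(predictions, key=lambda p: p["score"])
    if top["score"] < min_score:
        return None
    return top["label"], top["score"]


print(best_prediction(labels))
print(best_prediction(labels, min_score=0.9))
```

A threshold like this is one simple way to avoid acting on low-confidence predictions; a suitable value would have to be chosen empirically.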

Method 2: Use the model directly

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

def map_to_array(example):
    speech, _ = librosa.load(example["file"], sr=16000, mono=True)
    example["speech"] = speech
    return example

# load a demo dataset and read audio files
dataset = load_dataset("anton-l/superb_demo", "si", split="test")
dataset = dataset.map(map_to_array)

model = Wav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-base-superb-sid")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-sid")

# compute attention masks and normalize the waveform if needed
inputs = feature_extractor(dataset[:2]["speech"], sampling_rate=16000, padding=True, return_tensors="pt")

# inference only, so skip gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits
# pick the highest-scoring class per utterance and map it to its speaker label
predicted_ids = torch.argmax(logits, dim=-1)
labels = [model.config.id2label[_id] for _id in predicted_ids.tolist()]
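
The code above keeps only the single best class per utterance. To rank several candidate speakers with probabilities, the logits can be passed through a softmax and sorted. The sketch below uses plain Python so that it runs without the model; with tensors, `torch.softmax` and `torch.topk` play the same role. The logits, the four-entry label map, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits for one utterance over a four-speaker label set.
example_logits = [1.2, 3.4, 0.3, 2.1]
example_id2label = {0: "id10001", 1: "id10002", 2: "id10003", 3: "id10004"}

for label, prob in top_k(example_logits, example_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```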

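✨ Key Features

This model is a port of S3PRL's Wav2Vec2 for the SUPERB Speaker Identification task.
The base model is wav2vec2-base, which is pre-trained on speech audio sampled at 16kHz. When using the model, make sure your speech input is also sampled at 16kHz.
Speaker Identification (SI) classifies each utterance by speaker identity as a multi-class classification task, where the speakers in training and testing come from the same predefined set. The widely used VoxCeleb1 dataset is adopted.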
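
The 16kHz requirement can be checked before audio reaches the model. Loading with `librosa.load(..., sr=16000)` resamples on load; for WAV files read some other way, a check like the sketch below can catch a mismatch early. It uses only the standard-library `wave` module, and the helper names and the silent demo files are assumptions made for illustration.

```python
import os
import tempfile
import wave

TARGET_SAMPLE_RATE = 16000  # sampling rate the base model was pre-trained on


def wav_sample_rate(path):
    """Read the sampling rate from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target=TARGET_SAMPLE_RATE):
    """Return True if the file's sampling rate differs from the target."""
    return wav_sample_rate(path) != target


def write_silence(path, sample_rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file (demo helper only)."""
    n_frames = int(sample_rate * seconds)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)


with tempfile.TemporaryDirectory() as tmp:
    path_8k = os.path.join(tmp, "demo_8k.wav")
    path_16k = os.path.join(tmp, "demo_16k.wav")
    write_silence(path_8k, 8000)
    write_silence(path_16k, 16000)
    print(needs_resampling(path_8k))   # 8kHz file does not match the target
    print(needs_resampling(path_16k))  # 16kHz file matches the target
```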

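📦 Installation

The model card does not give installation steps. The examples rely on the datasets, transformers, torch, and librosa libraries; see each library's official installation instructions.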

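📚 Documentation

For the model description, see SUPERB: Speech processing Universal PERformance Benchmark.
For the original model's training and evaluation instructions, see the S3PRL downstream task README.

🔧 Technical Details

The model card does not give further implementation details.

📄 License

This model is released under the Apache-2.0 license.

BibTeX Entry and Citation Info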

@article{yang2021superb,
  title={SUPERB: Speech processing Universal PERformance Benchmark},
  author={Yang, Shu-wen and Chi, Po-Han and Chuang, Yung-Sung and Lai, Cheng-I Jeff and Lakhotia, Kushal and Lin, Yist Y and Liu, Andy T and Shi, Jiatong and Chang, Xuankai and Lin, Guan-Ting and others},
  journal={arXiv preprint arXiv:2105.01051},
  year={2021}
}