# Wav2Vec2-Base for Intent Classification
A ported Wav2Vec2 model for intent classification: it assigns speech utterances to predefined intent classes.
## Quick Start
When using this model, make sure your speech input is sampled at 16kHz, since the base model wav2vec2-base is pretrained on 16kHz sampled speech audio.
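If your audio is stored at a different rate, resample it before feature extraction. A minimal sketch using librosa (the path `utterance.wav` is a placeholder):

```python
import librosa

# librosa resamples to the target rate while loading; 16 kHz matches
# the pretraining data of wav2vec2-base.
speech, sr = librosa.load("utterance.wav", sr=16000, mono=True)
```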
## Usage Examples

### Basic Usage
```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

def map_to_array(example):
    # Load each audio file at 16 kHz mono, matching the model's pretraining data.
    speech, _ = librosa.load(example["file"], sr=16000, mono=True)
    example["speech"] = speech
    return example

# Load a demo subset of the SUPERB IC data and decode the audio files.
dataset = load_dataset("anton-l/superb_demo", "ic", split="test")
dataset = dataset.map(map_to_array)

model = Wav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-base-superb-ic")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ic")

# Batch the first four utterances, padding them to a common length.
inputs = feature_extractor(dataset[:4]["speech"], sampling_rate=16000, padding=True, return_tensors="pt")
logits = model(**inputs).logits

# The 24 logits are grouped by intent slot: action (0-5), object (6-19), location (20-23).
action_ids = torch.argmax(logits[:, :6], dim=-1).tolist()
action_labels = [model.config.id2label[_id] for _id in action_ids]

object_ids = torch.argmax(logits[:, 6:20], dim=-1).tolist()
object_labels = [model.config.id2label[_id + 6] for _id in object_ids]

location_ids = torch.argmax(logits[:, 20:24], dim=-1).tolist()
location_labels = [model.config.id2label[_id + 20] for _id in location_ids]
```
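Each slice of the logits covers one intent slot, so the three label lists line up per utterance. A quick, illustrative way to print the combined predictions:

```python
# Combine the per-slot predictions; each tuple describes one utterance's intent.
for action, obj, location in zip(action_labels, object_labels, location_labels):
    print(f"action={action}, object={obj}, location={location}")
```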
## Documentation

### Model description

This is a ported version of S3PRL's Wav2Vec2 for the SUPERB Intent Classification task. The base model is wav2vec2-base, which is pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. For more information, refer to SUPERB: Speech processing Universal PERformance Benchmark.
### Task and dataset description
Intent Classification (IC) classifies utterances into predefined classes to determine the intent of speakers. SUPERB uses the Fluent Speech Commands dataset, where each utterance is tagged with three intent labels: action, object, and location. For the original model's training and evaluation instructions, refer to the S3PRL downstream task README.
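As the slicing in the usage example suggests, the 24 output classes appear to be laid out as consecutive groups: 6 actions, 14 objects, and 4 locations. A short sketch to inspect that layout from the model config:

```python
from transformers import Wav2Vec2ForSequenceClassification

model = Wav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-base-superb-ic")

# id2label concatenates the three slots: actions first, then objects, then locations.
print("actions:  ", [model.config.id2label[i] for i in range(0, 6)])
print("objects:  ", [model.config.id2label[i] for i in range(6, 20)])
print("locations:", [model.config.id2label[i] for i in range(20, 24)])
```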
### Eval results
The evaluation metric is accuracy; a sketch of the computation follows the table.
|          | **s3prl** | **transformers** |
|----------|-----------|------------------|
| **test** | 0.9235    | N/A              |
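As a reference for how the number above could be reproduced, here is a minimal sketch of per-utterance accuracy, assuming a prediction counts as correct only when all three slots match (the `golds` list of reference tuples is a placeholder you would build from the dataset's labels):

```python
def intent_accuracy(preds, golds):
    """preds and golds are equal-length lists of (action, object, location) tuples.
    An utterance counts as correct only if all three slots match."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

# Example wiring with the predictions from the usage example above:
# preds = list(zip(action_labels, object_labels, location_labels))
# accuracy = intent_accuracy(preds, golds)
```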
### BibTeX entry and citation info

```bibtex
@article{yang2021superb,
  title={SUPERB: Speech processing Universal PERformance Benchmark},
  author={Yang, Shu-wen and Chi, Po-Han and Chuang, Yung-Sung and Lai, Cheng-I Jeff and Lakhotia, Kushal and Lin, Yist Y and Liu, Andy T and Shi, Jiatong and Chang, Xuankai and Lin, Guan-Ting and others},
  journal={arXiv preprint arXiv:2105.01051},
  year={2021}
}
```
## License

This model is released under the Apache 2.0 license (`apache-2.0`).