The open-source speech intent recognition model wav2vec2-large-superb-ic accurately identifies the intent of speech commands.

Wav2Vec2 Large SUPERB IC

Developed by superb

Intent classification model based on Wav2Vec2-Large-LV60, fine-tuned on the SUPERB intent classification task for speech command intent recognition

Audio Classification

Transformers

English

Open Source License: Apache-2.0

#Speech Intent Recognition #Multi-label Classification #16kHz Audio Processing

Downloads 110

Release Time : 3/2/2022

Model Overview

This model is a fine-tuned version of Facebook's wav2vec2-large-lv60 model on the SUPERB intent classification task, specifically designed to recognize action, object, and location intents in speech commands.

Model Features

High Accuracy

Achieves 95.28% accuracy on the SUPERB intent classification test set, as reported for the original S3PRL implementation

Multi-label Classification

Predicts three intent slots for each speech command (action, object, and location), each chosen from its own set of classes

16kHz Audio Support

Expects speech audio sampled at 16 kHz, the rate the base model was pretrained on

Model Capabilities

Speech Intent Recognition

Multi-label Classification

Speech Command Understanding

Use Cases

Smart Home

Voice Control Command Understanding

Recognizes user control commands for smart devices, such as 'turn on the kitchen light'

Accurately identifies action (turn on), object (light), and location (kitchen)

Voice Assistants

User Intent Understanding

Maps a spoken command to a predefined action, object, and location intent

Helps voice assistants respond to user requests more accurately

🚀 Wav2Vec2-Large for Intent Classification

This model is designed for intent classification using speech data, leveraging the power of Wav2Vec2.

🚀 Quick Start

This model is a port of S3PRL's Wav2Vec2 for intent classification. To use it, make sure your speech input is sampled at 16 kHz, because the base model was pretrained on 16 kHz speech audio.

✨ Features

Based on the wav2vec2-large-lv60 base model.
Ported from S3PRL's Wav2Vec2 for the SUPERB Intent Classification task.
Uses the Fluent Speech Commands dataset for intent classification.

📚 Documentation

Model description

This is a ported version of S3PRL's Wav2Vec2 for the SUPERB Intent Classification task.

The base model is wav2vec2-large-lv60, which is pretrained on speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.

For more information, refer to SUPERB: Speech processing Universal PERformance Benchmark.
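If audio is already in memory at a different rate, it has to be resampled before it reaches the feature extractor. The following is a minimal sketch of one way to do that, assuming NumPy and SciPy are available; the helper name `to_16k` and the `(channels, samples)` layout for multi-channel input are assumptions made for this illustration, not part of the model's API.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # rate the base model was pretrained on


def to_16k(waveform, orig_sr):
    """Return a mono float32 waveform resampled to 16 kHz (hypothetical helper)."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim == 2:
        # assumed layout: (channels, samples); average the channels to get mono
        waveform = waveform.mean(axis=0)
    if orig_sr == TARGET_SR:
        return waveform
    # polyphase resampling by the rational factor TARGET_SR / orig_sr
    g = gcd(TARGET_SR, orig_sr)
    resampled = resample_poly(waveform, TARGET_SR // g, orig_sr // g)
    return resampled.astype(np.float32)


# one second of a 440 Hz tone recorded at 44.1 kHz
t = np.arange(44_100) / 44_100
tone_44k = np.sin(2 * np.pi * 440 * t)
tone_16k = to_16k(tone_44k, orig_sr=44_100)
print(tone_16k.shape)  # (16000,)
```

Loading files with `librosa.load(path, sr=16000)`, as the usage example does, resamples on load, so a helper like this is only needed for audio that arrives from another source.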

Task and dataset description

Intent Classification (IC) classifies utterances into predefined classes to determine the intent of speakers. SUPERB uses the Fluent Speech Commands dataset, where each utterance is tagged with three intent labels: action, object, and location.
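The three labels per utterance can be read off a single output vector by treating it as three concatenated groups. The sketch below assumes the 6/14/4 split for action, object, and location that the usage example in this card slices by; the function name `decode_intents` and the placeholder label names are invented for illustration, and the real names come from `model.config.id2label`.

```python
import numpy as np

# Slot layout assumed from the usage example: 6 actions, 14 objects, 4 locations.
SLOT_SIZES = {"action": 6, "object": 14, "location": 4}


def decode_intents(logits, id2label):
    """Pick the best class per slot from (batch, 24) concatenated logits."""
    logits = np.asarray(logits)
    decoded = [{} for _ in range(logits.shape[0])]
    start = 0
    for slot, size in SLOT_SIZES.items():
        # argmax is taken inside each slot's own slice of the logits
        best = logits[:, start:start + size].argmax(axis=-1)
        for row, idx in enumerate(best):
            # shift the slot-local index back to a global label id
            decoded[row][slot] = id2label[start + int(idx)]
        start += size
    return decoded


# placeholder label names, hypothetical stand-ins for model.config.id2label
names = [f"{slot}_{i}" for slot, n in SLOT_SIZES.items() for i in range(n)]
id2label = dict(enumerate(names))

fake_logits = np.zeros((1, 24))
fake_logits[0, [3, 6 + 7, 20 + 1]] = 5.0
print(decode_intents(fake_logits, id2label))
# [{'action': 'action_3', 'object': 'object_7', 'location': 'location_1'}]
```

Because each slot gets its own argmax, every utterance always receives exactly one action, one object, and one location.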

For the original model's training and evaluation instructions, refer to the S3PRL downstream task README.

💻 Usage Examples

Basic Usage

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

def map_to_array(example):
    speech, _ = librosa.load(example["file"], sr=16000, mono=True)
    example["speech"] = speech
    return example

# load a demo dataset and read audio files
dataset = load_dataset("anton-l/superb_demo", "ic", split="test")
dataset = dataset.map(map_to_array)

model = Wav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-large-superb-ic")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("superb/wav2vec2-large-superb-ic")

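# pad the batch to a common length; the feature extractor also computes
# attention masks and normalizes the waveform if needed
inputs = feature_extractor(dataset[:4]["speech"], sampling_rate=16000, padding=True, return_tensors="pt")

# inference only, so skip gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits

# the 24 logits are three concatenated groups:
# action [0, 6), object [6, 20), location [20, 24)
action_ids = torch.argmax(logits[:, :6], dim=-1).tolist()
action_labels = [model.config.id2label[_id] for _id in action_ids]

# add each group's offset to map the slice-local index back to a label id
object_ids = torch.argmax(logits[:, 6:20], dim=-1).tolist()
object_labels = [model.config.id2label[_id + 6] for _id in object_ids]

location_ids = torch.argmax(logits[:, 20:24], dim=-1).tolist()
location_labels = [model.config.id2label[_id + 20] for _id in location_ids]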
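To make the padding and normalization step in the example above more concrete, here is a rough NumPy approximation of what such a step typically involves: per-utterance zero-mean, unit-variance scaling, zero-padding to the longest clip, and a mask that marks real samples. It is an illustration only, not the library's implementation; the function name `pad_and_normalize` and the epsilon value are assumptions.

```python
import numpy as np


def pad_and_normalize(waveforms, normalize=True):
    """Zero-pad 1-D waveforms to the longest one and build an attention mask."""
    max_len = max(len(w) for w in waveforms)
    batch = np.zeros((len(waveforms), max_len), dtype=np.float32)
    mask = np.zeros((len(waveforms), max_len), dtype=np.int64)
    for i, w in enumerate(waveforms):
        w = np.asarray(w, dtype=np.float32)
        if normalize:
            # zero mean, unit variance per utterance; epsilon avoids division by zero
            w = (w - w.mean()) / np.sqrt(w.var() + 1e-7)
        batch[i, : len(w)] = w
        mask[i, : len(w)] = 1  # 1 = real sample, 0 = padding
    return batch, mask


rng = np.random.default_rng(0)
clips = [rng.normal(size=16_000), rng.normal(size=12_000)]
input_values, attention_mask = pad_and_normalize(clips)
print(input_values.shape, attention_mask.sum(axis=1))  # (2, 16000) [16000 12000]
```

The mask matters for batches of unequal length: without it, the model would treat the trailing zeros of shorter clips as silence that is part of the utterance.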

📄 License

This model is licensed under the Apache-2.0 license.

📊 Eval results

The evaluation metric is accuracy.

| Split | s3prl | transformers |
|-------|-------|--------------|
| test | `0.9528` | `N/A` |
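As a sketch of how accuracy could be computed for three-slot predictions, the function below counts an utterance as correct only when action, object, and location all match. That exact-match rule is an assumption about the scoring, not something this card states; the helper name `intent_accuracy` and the toy labels are likewise for illustration only.

```python
def intent_accuracy(predictions, references):
    """Fraction of utterances whose (action, object, location) triple matches exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # an utterance scores 1 only if all three slots agree (assumed scoring rule)
    correct = sum(tuple(p) == tuple(r) for p, r in zip(predictions, references))
    return correct / len(references)


# toy labels for illustration only
refs = [("activate", "lights", "kitchen"), ("increase", "volume", "none")]
preds = [("activate", "lights", "kitchen"), ("increase", "heat", "none")]
print(intent_accuracy(preds, refs))  # 0.5
```

Under this rule, a prediction with two of three slots right earns no partial credit, which makes the metric stricter than per-slot accuracy.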

BibTeX entry and citation info

@article{yang2021superb,
  title={SUPERB: Speech processing Universal PERformance Benchmark},
  author={Yang, Shu-wen and Chi, Po-Han and Chuang, Yung-Sung and Lai, Cheng-I Jeff and Lakhotia, Kushal and Lin, Yist Y and Liu, Andy T and Shi, Jiatong and Chang, Xuankai and Lin, Guan-Ting and others},
  journal={arXiv preprint arXiv:2105.01051},
  year={2021}
}
