DASS_small_AudioSet_47.2 Open-source Audio Classification Model - Achieve State-of-the-Art Performance in AudioSet Classification with a Small Footprint

DASS Small AudioSet 47.2

Developed by saurabhati

The first state space model to surpass Transformer-based audio classifiers, achieving state-of-the-art performance on AudioSet audio classification tasks while significantly reducing model size.

Audio Classification

Transformers

Open Source License: BSD-3-Clause #Efficient Audio Classification #Long-duration Audio Processing #Lightweight Model

Release Time: 3/29/2025

Model Overview

An audio classification model fine-tuned on AudioSet-2M, utilizing a state space architecture that outperforms traditional Transformer models in audio classification tasks with enhanced duration robustness.

Model Features

Efficient Performance

DASS-small with only 30M parameters outperforms the 87M-parameter AST model (mAP 47.2 vs 45.9).

Duration Robustness

Performance remains stable with long audio inputs, maintaining 96% of 10-second input performance even with 50-second inputs.

Ultra-long Audio Processing

Capable of processing audio inputs up to 2.5 hours long on a single A6000 GPU while maintaining 62% of 10-second input performance.
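
To put these durations in perspective, the arithmetic below converts them to sample counts and to multiples of a 10-second clip. It is only a sketch: the 16 kHz rate is taken from the Quick Start example, and the "implied mAP" column assumes the retention percentages are relative to the 47.2 mAP reported for 10-second inputs, which the source does not state explicitly.

```python
SAMPLE_RATE_HZ = 16_000   # assumption: same sampling rate as the Quick Start example
BASELINE_SECONDS = 10     # reference input length quoted in the text
BASELINE_MAP = 47.2       # reported mAP for DASS-small


def describe(duration_s, retained_fraction):
    """Return (num_samples, length_ratio, implied_mAP) for one input duration."""
    num_samples = duration_s * SAMPLE_RATE_HZ
    length_ratio = duration_s / BASELINE_SECONDS
    # Assumption: the retained fraction is relative to BASELINE_MAP.
    implied_map = round(BASELINE_MAP * retained_fraction, 1)
    return num_samples, length_ratio, implied_map


fifty_seconds = describe(50, 0.96)
two_and_a_half_hours = describe(int(2.5 * 3600), 0.62)

print(fifty_seconds)
print(two_and_a_half_hours)
```

Under these assumptions, a 2.5-hour input is 900 times longer than a 10-second clip.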

Distillation Learning

Trained with a KL divergence loss against a teacher AST model, in addition to a binary cross-entropy loss against the ground-truth labels.

Model Capabilities

Audio Classification

Multi-label Audio Recognition

Long Audio Processing

Use Cases

Audio Content Analysis

Environmental Sound Classification

Identify various sound categories in natural or urban environments.

Recognizes sound categories such as animal sounds and vehicle noises.

Audio Event Detection

Detect specific events or sounds in audio streams.

Capable of detecting critical events like glass breaking or alarm sounds.

Media Content Management

Video Content Tagging

Assist video content classification through audio analysis.

Improves efficiency in video content retrieval and classification.

🚀 DASS: Distilled Audio State-space Models

DASS is an audio classification model fine-tuned on AudioSet-2M. It outperforms Transformer-based audio classifiers with a smaller model and shows strong duration robustness.

🚀 Quick Start

The following example loads the DASS model and classifies a single audio clip:

import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

model_id = 'saurabhati/DASS_small_AudioSet_47.2'

# trust_remote_code=True runs the model code shipped in the model repository
audio_model = AutoModelForAudioClassification.from_pretrained(model_id, trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id, trust_remote_code=True)

# Load the clip and resample it to 16 kHz
waveform, sr = librosa.load("audio/eval/_/_/--4gqARaEJE_0.000.flac", sr=16000)
inputs = feature_extractor(waveform, sampling_rate=sr, return_tensors='pt')

# Multi-label task: a sigmoid gives an independent probability per class
with torch.no_grad():
    probs = torch.sigmoid(audio_model(**inputs).logits)

# Keep every class whose probability exceeds 0.5
predicted_class_ids = torch.where(probs[0] > 0.5)[0]
predicted_labels = [audio_model.config.id2label[i.item()] for i in predicted_class_ids]
print(predicted_labels)
# Output reported for this clip:
# ['Animal', 'Domestic animals, pets', 'Dog']
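
A fixed 0.5 threshold can return an empty list when no class is confident. The helper below is a minimal sketch of one way to handle that. `select_labels`, its `top_k` fallback, and the toy probabilities are illustrative assumptions and are not part of the DASS repository.

```python
def select_labels(probs, id2label, threshold=0.5, top_k=None):
    """Pick labels from per-class probabilities for a single clip.

    probs: list of floats, one per class (already passed through a sigmoid).
    id2label: dict mapping class index to label name.
    """
    # Class indices ordered from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = [i for i in ranked if probs[i] > threshold]
    if not chosen and top_k:
        # Fallback: nothing cleared the threshold, so report the top_k best guesses.
        chosen = ranked[:top_k]
    return [(id2label[i], probs[i]) for i in chosen]


# Toy values for illustration only; these are not real model outputs.
toy_id2label = {0: "Speech", 1: "Dog", 2: "Animal", 3: "Music"}
confident = select_labels([0.02, 0.81, 0.93, 0.10], toy_id2label)
uncertain = select_labels([0.30, 0.20, 0.45, 0.10], toy_id2label, top_k=2)
print(confident)
print(uncertain)
```

With the real model, the first row of the sigmoid output converted with `.tolist()` and `audio_model.config.id2label` could be passed in place of the toy values.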

✨ Features

High performance: DASS is the first state-space model that outperforms Transformer-based audio classifiers such as AST, HTS-AT, and Audio-MAE on the audio classification task.
Small model size: DASS significantly reduces the model size. For example, DASS-small contains about one-third of the parameters of AST but still outperforms it.
Duration robustness: DASS is significantly more duration-robust than the AST model. It maintains a high level of performance even when the input audio is much longer than the training duration.

📚 Documentation

Model Details

The DASS model is based on VMamba (Visual State Space Model), applied to audio. It is trained with a binary cross-entropy loss with respect to the ground-truth labels and a KL-divergence loss with respect to the teacher AST model.
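
The two-term objective described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' training code: the KL term is computed per class between Bernoulli distributions obtained from sigmoid outputs, and `alpha` is a hypothetical mixing weight. The source specifies neither the exact form of the KL term nor the weighting.

```python
import math

EPS = 1e-7  # keeps probabilities away from 0 and 1 so the logs stay finite


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def clipped_prob(logit):
    return min(max(sigmoid(logit), EPS), 1.0 - EPS)


def bce(student_logits, targets):
    """Mean binary cross-entropy between student probabilities and 0/1 labels."""
    total = 0.0
    for z, y in zip(student_logits, targets):
        p = clipped_prob(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)


def bernoulli_kl(teacher_logits, student_logits):
    """Mean per-class KL(teacher || student), treating each class as a Bernoulli."""
    total = 0.0
    for zt, zs in zip(teacher_logits, student_logits):
        t, s = clipped_prob(zt), clipped_prob(zs)
        total += t * math.log(t / s) + (1 - t) * math.log((1 - t) / (1 - s))
    return total / len(teacher_logits)


def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5):
    """Weighted sum of the ground-truth term and the teacher-matching term.

    alpha is a hypothetical mixing weight; the source does not give its value.
    """
    hard = bce(student_logits, targets)
    soft = bernoulli_kl(teacher_logits, student_logits)
    return alpha * hard + (1 - alpha) * soft


# Toy logits for three classes; these are not real model outputs.
student = [2.0, -1.0, 0.5]
teacher = [3.0, -2.0, 0.0]
labels = [1, 0, 0]
loss = distillation_loss(student, teacher, labels)
print(loss)
```

The KL term is zero when the student reproduces the teacher's outputs exactly, so it only penalizes disagreement with the teacher, while the cross-entropy term keeps the student tied to the ground-truth labels.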

Results

The following table shows the results of DASS models fine-tuned and evaluated on AudioSet-2M:

Model	Params	Pretrain	mAP
DASS-Small	30M	IN SL	47.2
DASS-Medium	49M	IN SL	47.6

Comparison with other models

Model	Params
AST	87M
HTS-AT	31M
PaSST	(not listed)
Audio-MAE	86M
AuM	26M
Audio Mamba	40M
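
The parameter counts listed above can be turned into relative sizes with a few lines of Python. This is a small illustrative calculation over the figures given in this section; PaSST is omitted because no parameter count is listed for it, and `relative_size` is a helper name introduced here.

```python
# Parameter counts in millions, copied from the figures in this section.
params_millions = {
    "AST": 87,
    "HTS-AT": 31,
    "Audio-MAE": 86,
    "AuM": 26,
    "Audio Mamba": 40,
    "DASS-Small": 30,
    "DASS-Medium": 49,
}


def relative_size(model, reference):
    """Size of `model` as a fraction of `reference`, rounded to two decimals."""
    return round(params_millions[model] / params_millions[reference], 2)


small_vs_ast = relative_size("DASS-Small", "AST")
medium_vs_ast = relative_size("DASS-Medium", "AST")
small_vs_audio_mae = relative_size("DASS-Small", "Audio-MAE")
print(small_vs_ast, medium_vs_ast, small_vs_audio_mae)
```

The first ratio is consistent with the statement that DASS-small has roughly one-third of AST's parameters.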

📄 License

This project uses the BSD 3-Clause license.

📖 Citation

@article{bhati2024dass,
  title={DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners},
  author={Bhati, Saurabhchand and Gong, Yuan and Karlinsky, Leonid and Kuehne, Hilde and Feris, Rogerio and Glass, James},
  journal={arXiv preprint arXiv:2407.04082},
  year={2024}
}

Acknowledgements

This project is based on AST and VMamba. Thanks to their authors for the excellent work; please make sure to check them out.
