Ced-base Open-source Audio Tagging Model - Free Deployment, Demonstrating Advanced Performance on Audioset

Ced Base

Developed by mispeech

CED is a simple ViT-based Transformer model for audio tagging that achieves state-of-the-art performance on AudioSet.

Audio Classification

Transformers

Open Source License: Apache-2.0
Tags: #Efficient Audio Classification #Variable-Length Input #Lightweight Transformer

Downloads 1,318

Release Time: 11/24/2023

Model Overview

CED is a Transformer model for audio classification, featuring efficient inference speed and excellent performance.

Model Features

Simplified Fine-tuning

Uses batch normalization for Mel spectrograms, eliminating the need to precompute dataset mean/variance during fine-tuning.

Supports Variable-Length Input

Most models use static time-frequency positional encoding, which limits generalization to clips shorter than 10 seconds. CED supports variable-length input and avoids this limitation.

Training/Inference Acceleration

Employs 64-dimensional Mel filter banks and 16x16 non-overlapping patches, yielding 248 patches from a 10-second spectrogram versus 1212 for AST, which speeds up training and inference.

Performance Advantage

A CED model with only 10M parameters outperforms most previous solutions with around 80M parameters.

Model Capabilities

Audio Classification

Audio Tagging

Use Cases

Audio Recognition

Finger Snap Recognition

Can accurately identify finger snap sounds in audio

Accurate classification

🚀 CED-Base Model

CED models are simple ViT-Transformer-based models for audio tagging that achieve state-of-the-art performance on AudioSet. They offer an efficient and effective solution for audio classification tasks.

✨ Features

Simplification for fine-tuning: Batch normalization of Mel spectrograms. During fine-tuning, there is no need to first compute the mean/variance over the dataset, a step that is common for AST.
Support for variable-length inputs: Most other models use a static time-frequency position embedding, which hinders generalization to segments shorter than 10 s. Many previous transformers simply pad their input to 10 s to avoid the performance impact, which in turn slows down training/inference drastically.
Training/inference speedup: 64-dimensional mel filterbanks and 16x16 patches without overlap, leading to 248 patches from a 10 s spectrogram. In comparison, AST uses 128 mel filterbanks with a 16x16 convolution at stride 10 (overlapping patches), leading to 1212 patches during training/inference. CED-Tiny runs on a common CPU as fast as a comparable MobileNetV3.
Performance: CED with 10M parameters outperforms the majority of previous approaches (~80M).
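
The patch counts quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: `patch_count` is a hypothetical helper, not part of the CED code base, and the frame counts (about 1000 frames for CED and 1024 for AST on a 10 s clip) are assumptions chosen to match a 10 ms hop, not values stated in this card.

```python
def patch_count(n_mels, n_frames, patch=16, stride=16):
    """Patches produced by an unpadded 2-D patch embedding (hypothetical helper)."""
    freq_patches = (n_mels - patch) // stride + 1    # patches along the mel axis
    time_patches = (n_frames - patch) // stride + 1  # patches along the time axis
    return freq_patches * time_patches


# Assumption: a 10 s clip gives roughly 1000 frames for CED and 1024 for AST.
ced_patches = patch_count(n_mels=64, n_frames=1000, patch=16, stride=16)
ast_patches = patch_count(n_mels=128, n_frames=1024, patch=16, stride=10)

print(ced_patches, ast_patches)
print(f"AST produces about {ast_patches / ced_patches:.1f}x as many patches as CED")
```

Under these assumptions the two counts come out to 248 and 1212. Because self-attention cost grows roughly quadratically with sequence length, the gap in compute is larger than the gap in patch count alone.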

📦 Installation

pip install git+https://github.com/jimbozhang/hf_transformers_custom_model_ced.git

💻 Usage Examples

Basic Usage

>>> from ced_model.feature_extraction_ced import CedFeatureExtractor
>>> from ced_model.modeling_ced import CedForAudioClassification

>>> model_name = "mispeech/ced-base"
>>> feature_extractor = CedFeatureExtractor.from_pretrained(model_name)
>>> model = CedForAudioClassification.from_pretrained(model_name)

>>> import torchaudio
>>> audio, sampling_rate = torchaudio.load("resources/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")

>>> import torch
>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> predicted_class_id = torch.argmax(logits, dim=-1).item()
>>> model.config.id2label[predicted_class_id]
'Finger snapping'
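
Since audio tagging is typically multi-label, it can be useful to look at the several highest-scoring tags rather than only the argmax. The following is a minimal sketch in plain Python; `top_k_tags` is a hypothetical helper, and the toy scores and labels are made up for illustration. It only ranks the scores, so the result is the same whether the model output is raw logits or probabilities.

```python
def top_k_tags(scores, id2label, k=3):
    """Return the k highest-scoring (label, score) pairs (hypothetical helper)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(id2label[i], scores[i]) for i in ranked[:k]]


# Toy data for illustration only; real scores come from the model output.
toy_scores = [0.1, 2.3, -1.0, 1.7]
toy_labels = {0: "Speech", 1: "Finger snapping", 2: "Music", 3: "Clapping"}

top_tags = top_k_tags(toy_scores, toy_labels, k=2)
print(top_tags)
```

With the model from the example above, and assuming `logits` has shape (1, num_classes) as the `.item()` call implies, this would be called as `top_k_tags(logits[0].tolist(), model.config.id2label)`.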

Advanced Usage

example_finetune_esc50.ipynb demonstrates how to train a linear head on the ESC-50 dataset with the CED encoder frozen.

📚 Documentation

Model Performance Table

Model	Parameters (M)	AS-20K (mAP)	AS-2M (mAP)
CED-Tiny	5.5	36.5	48.1
CED-Mini	9.6	38.5	49.0
CED-Small	22	41.6	49.6
CED-Base	86	44.0	50.0

Model Sources

Original Repository: https://github.com/RicherMans/CED
Repository: https://github.com/jimbozhang/hf_transformers_custom_model_ced
Paper: CED: Consistent ensemble distillation for audio tagging
Demo: https://huggingface.co/spaces/mispeech/ced-base

📄 License

This model is licensed under the Apache-2.0 license.
