ced-small Open-source Audio Annotation Model - Achieve High-quality Audioset Annotations Based on Advanced Technologies

Ced Small

Developed by mispeech

CED is a simple audio tagging model based on ViT-Transformer, achieving state-of-the-art performance on Audioset.

Audio Classification

Transformers

Open Source License: Apache-2.0 #Lightweight Audio Classification #Variable-Length Input #Efficient Transformer

Downloads 18

Release Time: 11/24/2023

Model Overview

CED is a Transformer model for audio classification, specifically optimized for audio tagging tasks, supporting variable-length input and simplifying the fine-tuning process.

Model Features

Simplified Fine-Tuning

Batch normalization for Mel spectrograms eliminates the need to precompute dataset mean/variance during fine-tuning.

Variable-Length Input Support

Handles inputs of varying length instead of assuming fixed 10-second segments, which improves generalization to shorter clips.

Efficient Training/Inference

Uses 64-dimensional mel-filterbanks and 16x16 non-overlapping patches, which yields fewer patches than AST and reduces computational cost.

High-Performance Compact Model

The roughly 10M-parameter CED model outperforms many earlier approaches with about 80M parameters.

Model Capabilities

Audio Classification

Audio Tagging

Sound Event Detection

Use Cases

Sound Recognition

Environmental Sound Classification

Identify various types of environmental sounds

Achieves 49.6 mAP on Audioset

Specific Sound Detection

Detect specific sound events like finger snaps

Accurately recognizes 527 sound categories

🚀 CED-Small Model

CED models are simple ViT-Transformer-based models for audio tagging that achieve state-of-the-art performance on Audioset.

🚀 Quick Start

CED models apply ViT-Transformer architectures to audio tagging. They are designed for high performance on the Audioset dataset while addressing several limitations of existing models.

✨ Features

Simplification for fine-tuning: Batch normalization of Mel-spectrograms. Unlike AST, this eliminates the need to compute the mean/variance over the dataset during fine-tuning.
Support for variable-length inputs: Most other models use static time-frequency position embeddings, which can limit generalization to shorter segments. CED models handle variable-length inputs more effectively.
Training/inference speedup: Uses 64-dimensional mel-filterbanks and 16x16 non-overlapping patches, resulting in fewer patches than AST and therefore faster training and inference.
Performance: CED models with relatively few parameters (e.g., 10M) outperform many previous approaches with much larger parameter counts (~80M).
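
To make the patch-count argument concrete, here is a rough back-of-the-envelope sketch. The helper `num_patches` is hypothetical, and the clip settings are assumptions that this page does not state: a 10-second clip at roughly 100 spectrogram frames per second, and AST's commonly cited defaults of 128 mel bins with 16x16 patches taken at a stride of 10.

```python
def num_patches(n_mels: int, n_frames: int, patch: int = 16, stride: int = 16) -> int:
    """Count the patches a ViT-style front end cuts from an (n_mels x n_frames) spectrogram."""
    if n_mels < patch or n_frames < patch:
        return 0
    # Number of patch positions along each axis for a sliding window without padding.
    freq_patches = (n_mels - patch) // stride + 1
    time_patches = (n_frames - patch) // stride + 1
    return freq_patches * time_patches


# Assumed 10-second clip settings (not taken from this page):
# CED-like: 64 mel bins, ~1000 frames, non-overlapping 16x16 patches (stride 16).
ced_patches = num_patches(n_mels=64, n_frames=1000, patch=16, stride=16)
# AST-like: 128 mel bins, 1024 frames, overlapping 16x16 patches (stride 10).
ast_patches = num_patches(n_mels=128, n_frames=1024, patch=16, stride=10)

print(ced_patches, ast_patches)  # 248 1212
```

Under these assumptions the CED-style front end produces roughly five times fewer patches. Since self-attention cost grows quadratically with sequence length, a shorter patch sequence is where most of the speedup would come from.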

Model Performance

Model	Parameters (M)	AS-20K (mAP)	AS-2M (mAP)
CED-Tiny	5.5	36.5	48.1
CED-Mini	9.6	38.5	49.0
CED-Small	22	41.6	49.6
CED-Base	86	44.0	50.0

Model Sources

Original Repository: CED Original
Repository: CED on Hugging Face
Paper: CED: Consistent ensemble distillation for audio tagging
Demo: CED Demo

📦 Installation

pip install git+https://github.com/jimbozhang/hf_transformers_custom_model_ced.git

💻 Usage Examples

Basic Usage

>>> from ced_model.feature_extraction_ced import CedFeatureExtractor
>>> from ced_model.modeling_ced import CedForAudioClassification

>>> model_name = "mispeech/ced-small"
>>> feature_extractor = CedFeatureExtractor.from_pretrained(model_name)
>>> model = CedForAudioClassification.from_pretrained(model_name)

>>> import torchaudio
>>> audio, sampling_rate = torchaudio.load("resources/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")

>>> import torch
>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> predicted_class_id = torch.argmax(logits, dim=-1).item()
>>> model.config.id2label[predicted_class_id]
'Finger snapping'

Fine-tuning

You can refer to the example_finetune_esc50.ipynb notebook to learn how to train a linear head on the ESC-50 dataset with the CED encoder frozen.

📄 License

This project is licensed under the Apache-2.0 license.
