vit_base_patch16_1024_128.audiomae_as2m_ft_as20k
A Vision Transformer (ViT)-based audio model, pre-trained on AudioSet-2M with the self-supervised masked autoencoder (MAE) objective and fine-tuned on AudioSet-20k.
Downloads 335
Release Time: 11/16/2023
Model Overview
This model is primarily used for audio classification and feature extraction. It processes audio sampled at 16 kHz and outputs classification scores or feature vectors.
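Because the model expects 16 kHz input, audio recorded at other rates must be resampled first. A minimal numpy sketch using linear interpolation (a real pipeline would use a proper polyphase resampler such as those in torchaudio or librosa):

```python
import numpy as np

def resample_to_16k(waveform, orig_sr, target_sr=16000):
    """Resample a 1-D waveform via linear interpolation (illustrative only)."""
    duration = len(waveform) / orig_sr
    n_target = int(round(duration * target_sr))
    t_orig = np.arange(len(waveform)) / orig_sr
    t_target = np.arange(n_target) / target_sr
    return np.interp(t_target, t_orig, waveform)

audio_44k = np.random.randn(44100)          # 1 second at 44.1 kHz
audio_16k = resample_to_16k(audio_44k, 44100)
print(audio_16k.shape)                      # (16000,)
```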
Model Features
Self-Supervised Pre-training
Applies the masked autoencoder (MAE) method for self-supervised pre-training on AudioSet-2M, learning general audio representations without labels.
Fine-Tuning Optimization
Fine-tuned on the AudioSet-20k dataset to improve performance on specific tasks
Efficient Processing
Processes fixed-length inputs of 1024 spectrogram frames, making it suitable for batch processing of standardized-length audio segments.
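The "1024_128" in the model name suggests a fixed input of 1024 time frames by 128 mel bins, so variable-length clips must be padded or truncated before batching. A numpy sketch (the frame and bin counts here are inferred from the name, not confirmed preprocessing parameters):

```python
import numpy as np

TARGET_FRAMES = 1024  # fixed time dimension inferred from the model name
N_MELS = 128          # mel-bin count inferred from the model name

def fix_length(spec, target=TARGET_FRAMES):
    """Zero-pad or truncate a (frames, mels) spectrogram along the time axis."""
    frames = spec.shape[0]
    if frames < target:
        pad = np.zeros((target - frames, spec.shape[1]), dtype=spec.dtype)
        return np.concatenate([spec, pad], axis=0)
    return spec[:target]

short = np.random.randn(600, N_MELS)    # clip shorter than the target
long_ = np.random.randn(1500, N_MELS)   # clip longer than the target
print(fix_length(short).shape, fix_length(long_).shape)  # (1024, 128) (1024, 128)
```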
Model Capabilities
Audio Classification
Audio Feature Extraction
Mel-Spectrogram Analysis
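Mel-spectrogram analysis means the waveform is first mapped to a mel-frequency representation before reaching the transformer. A self-contained numpy sketch of a triangular mel filterbank applied to one frame (the FFT size and filter parameters are illustrative, not the model's actual preprocessing):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=400, n_mels=128):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                 # rising slope
            fb[i - 1, k] = (k - l) / (c - l)
        for k in range(c, r):                 # falling slope
            fb[i - 1, k] = (r - k) / (r - c)
    return fb

frame = np.random.randn(400)                  # one 25 ms frame at 16 kHz
power = np.abs(np.fft.rfft(frame)) ** 2       # power spectrum, 201 bins
logmel = np.log(mel_filterbank() @ power + 1e-10)
print(logmel.shape)                           # (128,)
```

Stacking one such 128-dimensional log-mel vector per frame yields the (frames, mels) spectrogram the model consumes.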
Use Cases
Audio Analysis
Audio Event Detection
Identify specific events or sound categories in audio
Audio Content Understanding
Extract feature representations of audio content for downstream tasks
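A common downstream pattern for such extracted features is a linear probe on frozen embeddings. A toy numpy sketch with synthetic stand-in features (768-dimensional, matching ViT-Base's embedding width; the data and labels here are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 768))       # stand-in for extracted audio features
labels = (feats[:, 0] > 0).astype(int)    # toy binary labels

# least-squares linear probe on the frozen features
w, *_ = np.linalg.lstsq(feats, labels * 2.0 - 1.0, rcond=None)
preds = (feats @ w > 0).astype(int)
acc = (preds == labels).mean()
print(acc)
```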