
vit_base_patch16_1024_128.audiomae_as2m_ft_as20k

Developed by gaunernst
A Vision Transformer (ViT)-based audio model, pre-trained on AudioSet-2M with the self-supervised masked autoencoder (MAE) method and fine-tuned on AudioSet-20k.
Downloads 335
Release Date: 11/16/2023

Model Overview

This model is primarily used for audio classification and feature extraction. It processes 16 kHz sampled audio inputs and outputs classification results or feature vectors.
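The model name suggests a front end of 1024 spectrogram frames by 128 mel bins computed from 16 kHz audio. Below is a minimal NumPy sketch of such a log-mel front end; the window (25 ms) and hop (10 ms) values are assumptions for illustration, and real pipelines would use torchaudio or librosa with the exact upstream preprocessing parameters.

```python
import numpy as np

SAMPLE_RATE = 16_000   # model expects 16 kHz input (per the model card)
N_MELS = 128           # mel bins -- the "128" in the model name
N_FRAMES = 1024        # fixed frame count -- the "1024" in the model name
N_FFT = 400            # assumed 25 ms analysis window
HOP = 160              # assumed 10 ms hop (1024 frames is roughly 10 s of audio)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel(audio):
    # Frame, window, and FFT the waveform, then project onto the mel filterbank
    n = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack([audio[i * HOP:i * HOP + N_FFT] for i in range(n)])
    frames = frames * np.hanning(N_FFT)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(N_MELS, N_FFT, SAMPLE_RATE).T
    return np.log(mel + 1e-10)

# A clip exactly long enough to yield the model's fixed 1024 frames
audio = np.random.default_rng(0).standard_normal(HOP * (N_FRAMES - 1) + N_FFT)
spec = log_mel(audio)
print(spec.shape)  # -> (1024, 128)
```

The resulting (1024, 128) log-mel array is the image-like input a ViT patch-embedding layer can consume.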

Model Features

Self-Supervised Pre-training
Uses the masked autoencoder (MAE) method for self-supervised pre-training on AudioSet-2M, learning general-purpose audio representations without labels
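The core of MAE pre-training is hiding most spectrogram patches and training the model to reconstruct them. The sketch below shows the masking step only, assuming 16x16 patches over a (1024, 128) spectrogram and an 80% mask ratio (a value typical of audio MAE setups, not confirmed by this card).

```python
import numpy as np

PATCH = 16
MASK_RATIO = 0.8  # assumed high mask ratio typical of MAE-style pre-training

def random_mask(spec, mask_ratio=MASK_RATIO, seed=0):
    """Split a (frames, mels) spectrogram into 16x16 patches and hide most of them."""
    f, m = spec.shape
    patches = spec.reshape(f // PATCH, PATCH, m // PATCH, PATCH)
    patches = patches.transpose(0, 2, 1, 3).reshape(-1, PATCH * PATCH)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = np.random.default_rng(seed).permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False  # False = patch stays visible to the encoder
    return patches[keep_idx], mask

spec = np.zeros((1024, 128))
visible, mask = random_mask(spec)
print(visible.shape, int(mask.sum()))  # visible: (102, 256); 410 of 512 patches masked
```

Only the visible patches pass through the encoder, which is what makes MAE pre-training cheap relative to processing the full spectrogram.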
Fine-Tuning Optimization
Fine-tuned on the AudioSet-20k dataset to improve performance on specific tasks
Efficient Processing
Processes fixed-length inputs of 1024 spectrogram frames, suitable for batched processing of standardized-length audio segments
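Because the input length is fixed at 1024 frames, variable-length clips must be padded or truncated before batching. A minimal sketch, assuming zero-padding at the end (the actual padding strategy used upstream is not stated on this card):

```python
import numpy as np

N_FRAMES = 1024  # fixed input length expected by the model

def fix_length(spec, n_frames=N_FRAMES):
    """Pad with zeros or truncate so every clip has exactly n_frames frames."""
    f = spec.shape[0]
    if f >= n_frames:
        return spec[:n_frames]                      # truncate long clips
    pad = np.zeros((n_frames - f, spec.shape[1]), dtype=spec.dtype)
    return np.concatenate([spec, pad], axis=0)      # zero-pad short clips

short = fix_length(np.ones((600, 128)))   # padded up to 1024 frames
long_ = fix_length(np.ones((1500, 128)))  # truncated to 1024 frames
print(short.shape, long_.shape)  # -> (1024, 128) (1024, 128)
```

Uniform shapes let clips of different durations be stacked into a single batch tensor.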

Model Capabilities

Audio Classification
Audio Feature Extraction
Mel-Spectrogram Analysis

Use Cases

Audio Analysis
Audio Event Detection
Identify specific events or sound categories in audio
Audio Content Understanding
Extract feature representations of audio content for downstream tasks