Cnn8rnn W2vmean Audiocaps Grounding
Developed by wsntxxn
This is a text-to-audio grounding model: given a sound event described in text, it predicts the probability of that event occurring at each moment in an audio clip.
Downloads 456
Release Time: 6/22/2024
Model Overview
This model performs audio event localization: given an audio clip and a text phrase describing a sound event, it predicts the probability of that event occurring over time, at a time resolution of 40 milliseconds.
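To illustrate what a 40 ms time resolution means in practice, the sketch below shows a hypothetical post-processing step (not part of the model's own API): converting per-frame event probabilities, one value per 40 ms frame, into time-stamped segments by thresholding.

```python
# Sketch: turn per-frame event probabilities (one value per 40 ms frame)
# into (onset, offset) segments in seconds. Hypothetical helper; the
# frame probabilities would come from the grounding model's output.

FRAME_SEC = 0.04  # the model's 40 ms time resolution

def probs_to_segments(probs, threshold=0.5):
    """Group consecutive frames whose probability exceeds the
    threshold into (onset_seconds, offset_seconds) segments."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p > threshold and start is None:
            start = i                      # a segment opens at this frame
        elif p <= threshold and start is not None:
            segments.append((round(start * FRAME_SEC, 2), round(i * FRAME_SEC, 2)))
            start = None
    if start is not None:                  # segment runs to the end of the clip
        segments.append((round(start * FRAME_SEC, 2), round(len(probs) * FRAME_SEC, 2)))
    return segments

# Example: a 10-frame (0.4 s) clip with one event around frames 3-5
probs = [0.1, 0.2, 0.1, 0.9, 0.8, 0.7, 0.2, 0.1, 0.1, 0.1]
print(probs_to_segments(probs))  # [(0.12, 0.24)]
```

The threshold of 0.5 is a placeholder; in practice it would be tuned per event type or validation data.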
Model Features
High Temporal Resolution
Capable of predicting audio event probabilities with a 40ms time resolution.
Simple and Effective Architecture
Adopts a simple architecture: a Cnn8Rnn audio encoder paired with a text encoder consisting of a single embedding layer.
Weakly Supervised Training
Trained on the AudioCaps dataset with weak supervision, i.e., from audio-caption pairs without frame-level event timestamps.
Model Capabilities
Audio Event Localization
Text-to-Audio Matching
Sound Event Probability Prediction
Use Cases
Audio Analysis
Audio Content Retrieval
Locate the time points at which specific sound events occur in long audio clips.
Localization precision: 40 ms time resolution
Multimedia Content Analysis
Analyze the occurrence of specific sound events in video or audio content.
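For the audio content retrieval use case, one simple strategy (a sketch under assumed inputs, not the model's documented API) is to score each clip by the peak frame probability the model assigns to the queried event, then rank clips by that score.

```python
# Sketch: rank audio clips for a text query by the peak per-frame
# probability assigned to that event in each clip. The probability
# arrays below are placeholders; in practice they would be produced
# by running the grounding model on each clip with the query text.

def rank_clips(clip_probs, top_k=None):
    """clip_probs: dict mapping clip id -> list of frame probabilities.
    Returns clip ids sorted by descending peak probability."""
    ranked = sorted(clip_probs, key=lambda cid: max(clip_probs[cid]), reverse=True)
    return ranked[:top_k] if top_k is not None else ranked

# Hypothetical query "dog barking" scored over three clips
scores = {
    "clip_a": [0.1, 0.2, 0.15],   # event likely absent
    "clip_b": [0.1, 0.95, 0.9],   # strong match around frame 1
    "clip_c": [0.4, 0.5, 0.45],   # weak match
}
print(rank_clips(scores, top_k=2))  # ['clip_b', 'clip_c']
```

Peak pooling is one choice among several; mean pooling over frames would instead favor clips where the event persists throughout.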