
Cnn8Rnn-W2vMean AudioCaps Grounding

Developed by wsntxxn
This is a text-to-audio grounding model capable of predicting the probability of specific sound events occurring in audio segments.
Downloads: 456
Release Date: 6/22/2024

Model Overview

This model performs audio event localization. Given an audio clip and a text prompt describing a sound event, it predicts the frame-level probability that the event is occurring, at a time resolution of 40 milliseconds.
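The stated 40 ms resolution means each output frame covers one 40 ms hop, so frame i starts at i × 0.040 s. A minimal sketch of this frame-to-time mapping (the exact output shape of the model is not shown here; the hop-based mapping is an assumption derived from the stated resolution):

```python
# Assumption: the model emits one probability per 40 ms frame,
# so frame i corresponds to the interval starting at i * 0.040 s.
HOP_SECONDS = 0.040  # 40 ms per output frame

def frame_times(num_frames, hop_s=HOP_SECONDS):
    """Return the start time (in seconds) of each output frame."""
    return [i * hop_s for i in range(num_frames)]

# For a 5-frame output, the frames start at 0.00, 0.04, 0.08, 0.12, 0.16 s.
times = frame_times(5)
```

This mapping is what turns a vector of per-frame probabilities into a time axis for plotting or segment extraction.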

Model Features

High Temporal Resolution
Predicts audio event probabilities at a 40 ms frame resolution.
Simple and Effective Architecture
Pairs a Cnn8Rnn audio encoder with a lightweight text encoder consisting of a single embedding layer.
Weakly Supervised Training
Trained on the AudioCaps dataset using weakly supervised learning, without frame-level timestamp annotations.

Model Capabilities

Audio Event Localization
Text-to-Audio Matching
Sound Event Probability Prediction

Use Cases

Audio Analysis
Audio Content Retrieval
Locate the time points at which specific sound events occur in long audio recordings.
Localization precision down to the model's 40 ms frame resolution.
Multimedia Content Analysis
Analyze the occurrence of specific sound events in video or audio content.