cnn8rnn-w2vmean-audiocaps-grounding: Open-Source Audio Grounding Model

Cnn8rnn W2vmean Audiocaps Grounding

Developed by wsntxxn

This is a text-to-audio grounding model that predicts the probability that a specific sound event occurs in an audio clip.

Text-to-Audio

Transformers

Language: English. Open Source License: Apache-2.0. Tags: #AudioEventGrounding #TextToAudioMatching #40msHighPrecision

Downloads 456

Release Time : 6/22/2024

Model Overview

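The model performs audio event grounding: given an audio clip and a text prompt, it predicts the probability that the described event occurs, at a temporal resolution of 40 ms.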
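A 40 ms resolution means 25 output frames per second of audio. The following is a minimal sketch of converting between frame indices and timestamps. The helper names are hypothetical and not part of the model's API, and the assumption that frame `i` starts at `i * 40` ms is a simplification.

```python
FRAME_MS = 40                          # temporal resolution stated in the model card
FRAMES_PER_SECOND = 1000 // FRAME_MS   # 25 frames per second


def frame_to_time(frame_index: int) -> float:
    """Start time in seconds of an output frame (assumes frame i starts at i * 40 ms)."""
    return frame_index * FRAME_MS / 1000


def time_to_frame(seconds: float) -> int:
    """Index of the output frame that contains the given time.

    Works in integer milliseconds to avoid floating-point drift
    (for example, 0.12 / 0.04 is slightly below 3 in binary floating point).
    """
    return round(seconds * 1000) // FRAME_MS
```

For example, `time_to_frame(0.12)` selects frame 3, which covers 120 ms to 160 ms under the assumption above.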

Model Features

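High temporal resolution

Predicts the probability of an audio event at a temporal resolution of 40 ms.

Simple and effective architecture

Uses a simple architecture: a Cnn8Rnn audio encoder and a text encoder consisting of a single embedding layer.

Weakly supervised training

Trained with weak supervision on the AudioCaps dataset.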

Model Capabilities

Audio event grounding

Text-to-audio matching

Sound event probability prediction

Use Cases

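Audio analysis

Audio content retrieval

Locate when a specific sound event occurs within a long audio recording.

Timestamps are accurate to the 40 ms temporal resolution.

Multimedia content analysis

Analyze when specific sound events appear in video or audio content.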
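Locating an event in time amounts to turning the per-frame probabilities into start and end timestamps. The sketch below shows one simple way to do that by thresholding. The function name and the 0.5 threshold are assumptions for illustration; the model card does not prescribe a post-processing method.

```python
from typing import List, Tuple

FRAME_MS = 40  # temporal resolution stated in the model card


def probs_to_segments(
    probs: List[float], threshold: float = 0.5
) -> List[Tuple[float, float]]:
    """Group consecutive frames at or above `threshold` into (start, end) segments in seconds."""
    segments = []
    start = None  # index of the first frame of the currently open segment
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a segment opens at this frame
        elif p < threshold and start is not None:
            # the segment closes at the start of the first inactive frame
            segments.append((start * FRAME_MS / 1000, i * FRAME_MS / 1000))
            start = None
    if start is not None:
        # the event is still active at the end of the clip
        segments.append((start * FRAME_MS / 1000, len(probs) * FRAME_MS / 1000))
    return segments
```

With a model output for one clip converted to a plain list of floats, `probs_to_segments(clip_probs)` returns the time spans in which the prompted event is likely present. In practice the threshold would be tuned on held-out data.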

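🚀 Transformers - Text-to-Audio Grounding Model

This is a text-to-audio grounding model. Given an audio clip and a text prompt describing a sound event, it predicts the probability of that event at a temporal resolution of 40 ms. This frame-level output can support downstream tasks such as audio classification.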

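🚀 Quick Start

This model is a text-to-audio grounding model. Given an audio clip and a text prompt describing a sound event, it predicts the probability of that event at a temporal resolution of 40 ms.

It is trained on the AudioCaps dataset and uses a simple architecture: a Cnn8Rnn audio encoder plus a text encoder consisting of a single embedding layer.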
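To make the two-encoder idea concrete, here is a toy sketch in plain Python. The source only states the two components. The mean pooling of word embeddings (hinted at by "w2vmean" in the model name), the dot-product score, the sigmoid, and the tiny embedding table are all assumptions for illustration, not the model's confirmed implementation.

```python
import math
from typing import Dict, List


def mean_text_embedding(words: List[str], table: Dict[str, List[float]]) -> List[float]:
    """Average the embeddings of known words (assumed pooling; unknown words are skipped)."""
    dim = len(next(iter(table.values())))
    vectors = [table[w] for w in words if w in table]
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def frame_probabilities(audio_frames: List[List[float]], text_vec: List[float]) -> List[float]:
    """Score each audio frame embedding against the text embedding.

    Assumed scoring: sigmoid of the dot product, giving one probability per frame.
    """
    probs = []
    for frame in audio_frames:
        score = sum(a * t for a, t in zip(frame, text_vec))
        probs.append(1.0 / (1.0 + math.exp(-score)))
    return probs


# Hypothetical 2-dimensional embeddings, for illustration only.
toy_table = {"dog": [1.0, 0.0], "barking": [0.0, 1.0]}
text_vec = mean_text_embedding("a dog is barking".split(), toy_table)
probs = frame_probabilities([[0.0, 0.0], [4.0, 4.0]], text_vec)
```

In this toy setup, a frame whose embedding aligns with the text embedding gets a probability near 1, while a zero embedding gets exactly 0.5. In the real model the audio frame embeddings would come from the Cnn8Rnn encoder.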

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from transformers import AutoModel


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(
    "wsntxxn/cnn8rnn-w2vmean-audiocaps-grounding",
    trust_remote_code=True
).to(device)

# Load each clip, resample it to the model's sample rate, and downmix to mono.
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]

wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]

# Zero-pad the shorter clip so both fit in one (batch, samples) tensor.
wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True).to(device)

# One text prompt per audio clip, in the same order as the batch.
text = ["a man speaks", "a dog is barking"]

with torch.no_grad():
    output = model(
        audio=wav_batch,
        audio_len=[wav1.size(0), wav2.size(0)],  # unpadded lengths in samples
        text=text
    )
    # output: (2, n_seconds * 25), i.e. one probability per 40 ms frame
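Because the batch is zero-padded to the longest clip, the trailing frames of a shorter clip correspond to padding rather than audio. The sketch below trims each row to its approximate number of valid frames. The helper names are hypothetical, and the frame count is an approximation based on 40 ms per frame; the exact count depends on the model's internal downsampling.

```python
from typing import List

FRAME_MS = 40  # temporal resolution stated in the model card


def valid_frame_count(num_samples: int, sample_rate: int) -> int:
    """Approximate number of 40 ms frames needed to cover `num_samples` (ceiling division)."""
    samples_per_frame_x1000 = sample_rate * FRAME_MS
    return (num_samples * 1000 + samples_per_frame_x1000 - 1) // samples_per_frame_x1000


def trim_padded(
    probs_batch: List[List[float]], audio_lens: List[int], sample_rate: int
) -> List[List[float]]:
    """Drop frames that fall beyond each clip's unpadded length."""
    return [
        row[: valid_frame_count(n, sample_rate)]
        for row, n in zip(probs_batch, audio_lens)
    ]
```

Assuming `output` is a tensor of the shape noted above, a call such as `trim_padded(output.tolist(), [wav1.size(0), wav2.size(0)], model.config.sample_rate)` would give one variable-length list of probabilities per clip.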

📚 Documentation

Citation

If you use this model in your research, please cite the following paper:

@article{xu2024towards,
    title={Towards Weakly Supervised Text-to-Audio Grounding},
    author={Xu, Xuenan and Ma, Ziyang and Wu, Mengyue and Yu, Kai},
    journal={arXiv preprint arXiv:2401.02584},
    year={2024}
}