LanguageBind_Video: Open-Source Multimodal Pretraining Framework - Video and Multimodal Applications via Language Semantics

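Developed by LanguageBind

LanguageBind is a multimodal pretraining framework that extends video-language pretraining to N modalities through language-based semantic alignment. It was accepted at ICLR 2024.

Multimodal alignment

Transformers

License: MIT #multimodal-alignment #zero-shot-learning #video-language-pretraining

Downloads: 166

Released: 10/6/2023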

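Model overview

LanguageBind uses a language-centered multimodal pretraining framework: language serves as the bridge between modalities, taking advantage of the rich semantics of the language modality.

Model features

High performance without an intermediate modality

Because language bridges the modalities directly, no intermediate modality is needed. The approach extends readily to tasks such as segmentation and detection and can, in principle, be extended to an unlimited number of modalities.

Multimodal, fully aligned, large-scale dataset

The project releases the VIDAL-10M dataset, which contains 10 million entries of video, infrared, depth, audio and language data, greatly extending coverage beyond the visual modality.

Multi-view language enhancement

Text descriptions are built from multiple views (metadata, spatial and temporal) and further enhanced with ChatGPT, giving each modality a well-formed semantic space to align to.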

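Model capabilities

Multimodal semantic alignment

Video understanding

Audio understanding

Infrared image understanding

Depth image understanding

Language semantic enhancement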

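Use cases

Video understanding

Video content analysis

Aligns video with language semantics to support in-depth understanding of video content.

Achieves state-of-the-art performance on multiple video understanding tasks.

Audio understanding

Audio content analysis

Aligns audio with language semantics to support in-depth understanding of audio content.

Achieves state-of-the-art performance on 5 datasets.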

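🚀 [ICLR 2024 🔥] LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

LanguageBind is a language-centric multimodal pretraining approach that uses language as the bind between modalities. It introduces VIDAL-10M, a dataset covering video, infrared, depth, audio and their corresponding language, and trains on multi-view enhanced text descriptions. The approach performs strongly without an intermediate modality and extends readily to tasks such as segmentation and detection.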

🚀 Quick start

Local demo

We highly recommend trying our web demo, which incorporates all features currently supported by LanguageBind.

python gradio_app.py

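Online demo

We provide an online demo on Hugging Face Spaces. With it you can compute the similarity between a modality and language, such as audio-to-language, video-to-language and depth-to-image.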

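✨ Key features

💡 High performance without an intermediate modality

LanguageBind is a language-centric multimodal pretraining approach that uses language as the bind between modalities, because the language modality is well explored and semantically rich.

The following figure shows the architecture of LanguageBind. LanguageBind can be easily extended to segmentation and detection tasks, and potentially to an unlimited number of modalities.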
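
The text above does not spell out the training objective, so the following is only a sketch under an assumption: that each modality encoder is aligned to the language encoder with a CLIP-style symmetric contrastive (InfoNCE) loss. The function names and the temperature value are illustrative and are not taken from the LanguageBind code base.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(modality_embs, text_embs, temperature=0.07):
    """Assumed CLIP-style loss: modality_embs[i] and text_embs[i] are a matched pair."""
    m = [l2_normalize(v) for v in modality_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(m)
    # logits[i][j] = cosine similarity of modality sample i and text j, scaled by 1/temperature
    logits = [[sum(a * b for a, b in zip(m[i], t[j])) / temperature for j in range(n)]
              for i in range(n)]
    # Modality -> text direction: the correct text for row i is column i.
    loss_m2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> modality direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2m = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_m2t + loss_t2m)


# Toy 2-D embeddings (made up for illustration).
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

With the toy vectors, correctly paired embeddings give a much lower loss than shuffled pairs, which is the property such an objective relies on to pull each modality toward its matching text.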

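⚡️ A multimodal, fully aligned and large-scale dataset

We propose VIDAL-10M, a dataset of 10 million entries covering video, infrared, depth, audio and their corresponding language, which greatly expands the data beyond the visual modality.

The second figure shows the proposed VIDAL-10M dataset, which includes five modalities: video, infrared, depth, audio and language.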

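🔥 Multi-view enhanced descriptions for training

We apply multi-view enhancement to the language. We generate multi-view descriptions that combine metadata, spatial and temporal information to greatly enrich the semantics of the text. We then further enhance the language with ChatGPT, creating a good semantic space for the language that each modality is aligned to.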
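
One way to picture the multi-view idea is as a small record with one field per view. The sketch below is hypothetical: the class, its field names, the `enhance` callback (a stand-in for the ChatGPT step) and the sample captions are all made up for illustration and do not come from the LanguageBind code or the VIDAL-10M data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MultiViewText:
    """Hypothetical container for the three views named in the text."""
    metadata: str  # e.g. a title or tags attached to the clip
    spatial: str   # what is visible in a frame
    temporal: str  # what happens over time

    def combined(self) -> str:
        """Join the non-empty views into a single description."""
        parts = [self.metadata, self.spatial, self.temporal]
        return " ".join(p.strip() for p in parts if p.strip())


def build_training_texts(sample: MultiViewText,
                         enhance: Optional[Callable[[str], str]] = None) -> List[str]:
    """Return the combined description, plus an enhanced rewrite if a rewriter is given."""
    texts = [sample.combined()]
    if enhance is not None:
        # `enhance` stands in for an LLM rewriting step; any str -> str function works here.
        texts.append(enhance(texts[0]))
    return texts


sample = MultiViewText(
    metadata="Parakeet training #birds",
    spatial="A green parakeet on a wooden ladder.",
    temporal="The bird climbs the ladder step by step.",
)
print(build_training_texts(sample))
```

Passing a rewriting function as `enhance` yields a second description for the same clip, so one video can be paired with several texts of differing detail.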

📦 Installation

Python >= 3.8
PyTorch >= 1.13.1
CUDA version >= 11.6
Install the required packages:

git clone https://github.com/PKU-YuanGroup/LanguageBind
cd LanguageBind
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt
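
After installing, it can help to confirm that the environment meets the three version requirements listed above. The script below is a minimal sketch: `parse_version`, `meets` and `check_environment` are helper names introduced here, not part of LanguageBind, and the check simply reports `False` for PyTorch and CUDA if `torch` cannot be imported.

```python
import sys


def parse_version(text):
    """Turn '1.13.1+cu116' into (1, 13, 1), ignoring local tags such as '+cu116'."""
    core = text.split("+", 1)[0]
    parts = []
    for piece in core.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def meets(found, required):
    """True if version string `found` is at least `required`."""
    return parse_version(found) >= parse_version(required)


def check_environment():
    """Report whether Python, PyTorch and CUDA satisfy the stated minimums."""
    report = {"python": sys.version_info[:2] >= (3, 8)}
    try:
        import torch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets(torch.__version__, "1.13.1")
    cuda = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = cuda is not None and meets(cuda, "11.6")
    return report


if __name__ == "__main__":
    print(check_environment())
```

Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which is not necessarily the version of the driver or toolkit installed on the machine.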

💻 Usage examples

Basic usage

import torch
from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer

if __name__ == '__main__':
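    # Use the first GPU when available; otherwise fall back to CPU (slower, but the script still runs).
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')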
    clip_type = {
        'video': 'LanguageBind_Video_FT',  # also LanguageBind_Video
        'audio': 'LanguageBind_Audio_FT',  # also LanguageBind_Audio
        'thermal': 'LanguageBind_Thermal',
        'image': 'LanguageBind_Image',
        'depth': 'LanguageBind_Depth',
    }

    model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir')
    model = model.to(device)
    model.eval()
    pretrained_ckpt = 'LanguageBind/LanguageBind_Image'  # same repo id as the image branch example
    tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir/tokenizer_cache_dir')
    modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type.keys()}

    image = ['assets/image/0.jpg', 'assets/image/1.jpg']
    audio = ['assets/audio/0.wav', 'assets/audio/1.wav']
    video = ['assets/video/0.mp4', 'assets/video/1.mp4']
    depth = ['assets/depth/0.png', 'assets/depth/1.png']
    thermal = ['assets/thermal/0.jpg', 'assets/thermal/1.jpg']
    language = ["Training a parakeet to climb up a ladder.", 'A lion climbing a tree to catch a monkey.']

    inputs = {
        'image': to_device(modality_transform['image'](image), device),
        'video': to_device(modality_transform['video'](video), device),
        'audio': to_device(modality_transform['audio'](audio), device),
        'depth': to_device(modality_transform['depth'](depth), device),
        'thermal': to_device(modality_transform['thermal'](thermal), device),
    }
    inputs['language'] = to_device(tokenizer(language, max_length=77, padding='max_length',
                                             truncation=True, return_tensors='pt'), device)

    with torch.no_grad():
        embeddings = model(inputs)

    print("Video x Text: \n",
          torch.softmax(embeddings['video'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Image x Text: \n",
          torch.softmax(embeddings['image'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Depth x Text: \n",
          torch.softmax(embeddings['depth'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Audio x Text: \n",
          torch.softmax(embeddings['audio'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Thermal x Text: \n",
          torch.softmax(embeddings['thermal'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())

Running the code above returns the following results:

Video x Text: 
 [[9.9989331e-01 1.0667283e-04]
 [1.3255903e-03 9.9867439e-01]]
Image x Text: 
 [[9.9990666e-01 9.3292067e-05]
 [4.6132666e-08 1.0000000e+00]]
Depth x Text: 
 [[0.9954276  0.00457235]
 [0.12042473 0.8795753 ]]
Audio x Text: 
 [[0.97634876 0.02365119]
 [0.02917843 0.97082156]]
Thermal x Text: 
 [[0.9482511  0.0517489 ]
 [0.48746133 0.5125386 ]]
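
Each printed matrix has one row per input file and one column per caption, and because the softmax is taken over the last dimension, every row sums to 1. As a small sketch of how to read them, the snippet below takes the Thermal x Text values printed above and picks the best caption for each thermal image; `top_match` is a helper introduced here for illustration.

```python
captions = [
    "Training a parakeet to climb up a ladder.",
    "A lion climbing a tree to catch a monkey.",
]

# Values copied from the "Thermal x Text" output above.
thermal_x_text = [
    [0.9482511, 0.0517489],
    [0.48746133, 0.5125386],
]


def top_match(row):
    """Return (index, probability) of the largest entry in one softmax row."""
    best = max(range(len(row)), key=row.__getitem__)
    return best, row[best]


for i, row in enumerate(thermal_x_text):
    idx, prob = top_match(row)
    print(f"thermal/{i}.jpg -> caption {idx} (p={prob:.3f}): {captions[idx]}")
```

Both thermal images are matched to the intended caption, but the second one only with probability of about 0.51, so the diagonal being the largest entry does not by itself mean the match is confident.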

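Advanced usage

Emergent zero-shot

Because LanguageBind binds all modalities together, we also found emergent zero-shot capabilities between them. Usage is very simple: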

print("Video x Audio: \n", torch.softmax(embeddings['video'] @ embeddings['audio'].T, dim=-1).detach().cpu().numpy())
print("Image x Depth: \n", torch.softmax(embeddings['image'] @ embeddings['depth'].T, dim=-1).detach().cpu().numpy())
print("Image x Thermal: \n", torch.softmax(embeddings['image'] @ embeddings['thermal'].T, dim=-1).detach().cpu().numpy())

Running the code above gives:

Video x Audio: 
 [[1.0000000e+00 0.0000000e+00]
 [3.1150486e-32 1.0000000e+00]]
Image x Depth: 
 [[1. 0.]
 [0. 1.]]
Image x Thermal: 
 [[1. 0.]
 [0. 1.]]

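Different branches for X-Language tasks (X = any supported modality)

In addition, LanguageBind can be split into separate branches to handle different tasks. Note that we did not train the image branch; it is only initialized from OpenCLIP.

Thermal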

import torch
from languagebind import LanguageBindThermal, LanguageBindThermalTokenizer, LanguageBindThermalProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Thermal'
model = LanguageBindThermal.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindThermalTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
thermal_process = LanguageBindThermalProcessor(model.config, tokenizer)

model.eval()
data = thermal_process([r"your/thermal.jpg"], ['your text'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Depth

import torch
from languagebind import LanguageBindDepth, LanguageBindDepthTokenizer, LanguageBindDepthProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Depth'
model = LanguageBindDepth.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindDepthTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
depth_process = LanguageBindDepthProcessor(model.config, tokenizer)

model.eval()
data = depth_process([r"your/depth.png"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Video

import torch
from languagebind import LanguageBindVideo, LanguageBindVideoTokenizer, LanguageBindVideoProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Video_FT'  # also 'LanguageBind/LanguageBind_Video'
model = LanguageBindVideo.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
video_process = LanguageBindVideoProcessor(model.config, tokenizer)

model.eval()
data = video_process(["your/video.mp4"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Audio

import torch
from languagebind import LanguageBindAudio, LanguageBindAudioTokenizer, LanguageBindAudioProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Audio_FT'  # also 'LanguageBind/LanguageBind_Audio'
model = LanguageBindAudio.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindAudioTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
audio_process = LanguageBindAudioProcessor(model.config, tokenizer)

model.eval()
data = audio_process([r"your/audio.wav"], ['your audio.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Image

Note that our image encoder is the same as OpenCLIP's. Unlike the other modalities, it was not fine-tuned.

import torch
from languagebind import LanguageBindImage, LanguageBindImageTokenizer, LanguageBindImageProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Image'
model = LanguageBindImage.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
image_process = LanguageBindImageProcessor(model.config, tokenizer)

model.eval()
data = image_process([r"your/image.jpg"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)
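
The five branch snippets above differ only in the modality name, the checkpoint and the input path. The sketch below factors that pattern out. `branch_class_names` and `load_branch` are hypothetical helpers, not part of the `languagebind` package. They assume the class naming seen in the snippets (`LanguageBind<Modality>`, `...Tokenizer`, `...Processor`), and `load_branch` only works where `languagebind` is installed.

```python
import importlib

MODALITIES = ("Thermal", "Depth", "Video", "Audio", "Image")


def branch_class_names(modality):
    """Derive the model, tokenizer and processor class names for one modality."""
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality!r}")
    base = f"LanguageBind{modality}"
    return base, f"{base}Tokenizer", f"{base}Processor"


def load_branch(modality, pretrained_ckpt, cache_dir="./cache_dir"):
    """Hypothetical loader mirroring the per-modality snippets; needs `languagebind`."""
    pkg = importlib.import_module("languagebind")
    model_name, tokenizer_name, processor_name = branch_class_names(modality)
    model = getattr(pkg, model_name).from_pretrained(pretrained_ckpt, cache_dir=cache_dir)
    tokenizer = getattr(pkg, tokenizer_name).from_pretrained(pretrained_ckpt, cache_dir=cache_dir)
    processor = getattr(pkg, processor_name)(model.config, tokenizer)
    model.eval()
    return model, processor


print(branch_class_names("Video"))
```

A call such as `load_branch("Depth", "LanguageBind/LanguageBind_Depth")` would then return the same model and processor objects that the depth snippet builds by hand.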

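📚 Detailed documentation

🐳 Model zoo

The names in the table refer to different encoder models. For example, LanguageBind/LanguageBind_Video_FT is the fully fine-tuned version, while LanguageBind/LanguageBind_Video is the LoRA-tuned version.

You can swap them freely in the recommended API usage. We recommend the fully fine-tuned version, as it offers stronger performance.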

Modality	LoRA tuning	Full fine-tuning
Video	LanguageBind_Video	LanguageBind_Video_FT
Audio	LanguageBind_Audio	LanguageBind_Audio_FT
Depth	LanguageBind_Depth	-
Thermal	LanguageBind_Thermal	-
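
Following the recommendation to prefer the fully fine-tuned checkpoints, the `clip_type` mapping from the basic usage example can be derived from the table instead of written by hand. This is a small sketch; `MODEL_ZOO` and `pick_checkpoint` are names introduced here, and the image entry is added separately because the table does not list it.

```python
# Checkpoint names copied from the table above; None marks a missing variant.
MODEL_ZOO = {
    "video": {"lora": "LanguageBind_Video", "full": "LanguageBind_Video_FT"},
    "audio": {"lora": "LanguageBind_Audio", "full": "LanguageBind_Audio_FT"},
    "depth": {"lora": "LanguageBind_Depth", "full": None},
    "thermal": {"lora": "LanguageBind_Thermal", "full": None},
}


def pick_checkpoint(modality, prefer_full=True):
    """Return the fully fine-tuned name when it exists and is preferred, else the LoRA one."""
    entry = MODEL_ZOO[modality]
    if prefer_full and entry["full"] is not None:
        return entry["full"]
    return entry["lora"]


clip_type = {modality: pick_checkpoint(modality) for modality in MODEL_ZOO}
clip_type["image"] = "LanguageBind_Image"  # image branch, as in the basic usage example
print(clip_type)
```

The resulting dictionary has the same entries as the `clip_type` used in the basic usage example, and passing `prefer_full=False` switches video and audio to their LoRA-tuned variants.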

Version	Tuning	Model size	Frames	Hugging Face link	MSR-VTT	DiDeMo	ActivityNet	MSVD
LanguageBind_Video	LoRA	Large	8	Link	42.6	37.8	35.1	52.2
LanguageBind_Video_FT	Full fine-tuning	Large	8	Link	42.7	38.1	36.9	53.5
LanguageBind_Video_V1.5_FT	Full fine-tuning	Large	8	Link	42.8	39.7	38.4	54.1
LanguageBind_Video_V1.5_FT	Full fine-tuning	Large	12	Coming soon	-	-	-	-
LanguageBind_Video_Huge_V1.5_FT	Full fine-tuning	Huge	8	Link	44.8	39.9	41.0	53.7
LanguageBind_Video_Huge_V1.5_FT	Full fine-tuning	Huge	12	Coming soon	-	-	-	-
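
To compare the released checkpoints at a glance, the scored rows of the table can be put into a dictionary and queried per benchmark. The sketch below uses only the 8-frame rows, since the 12-frame rows have no scores yet. The table does not name the metric, so treating higher as better is an assumption, and `best_per_benchmark` is a helper introduced here.

```python
# Scores copied from the 8-frame rows of the table above.
RESULTS = {
    "LanguageBind_Video": {"MSR-VTT": 42.6, "DiDeMo": 37.8, "ActivityNet": 35.1, "MSVD": 52.2},
    "LanguageBind_Video_FT": {"MSR-VTT": 42.7, "DiDeMo": 38.1, "ActivityNet": 36.9, "MSVD": 53.5},
    "LanguageBind_Video_V1.5_FT": {"MSR-VTT": 42.8, "DiDeMo": 39.7, "ActivityNet": 38.4, "MSVD": 54.1},
    "LanguageBind_Video_Huge_V1.5_FT": {"MSR-VTT": 44.8, "DiDeMo": 39.9, "ActivityNet": 41.0, "MSVD": 53.7},
}


def best_per_benchmark(results):
    """Map each benchmark to the checkpoint with the highest score (assumes higher is better)."""
    benchmarks = next(iter(results.values())).keys()
    return {b: max(results, key=lambda name: results[name][b]) for b in benchmarks}


for benchmark, model_name in best_per_benchmark(RESULTS).items():
    print(f"{benchmark}: {model_name} ({RESULTS[model_name][benchmark]})")
```

On these numbers the Huge V1.5 checkpoint leads on MSR-VTT, DiDeMo and ActivityNet, while the Large V1.5 checkpoint has the highest MSVD score.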

💥 VIDAL-10M

For dataset details, see DATASETS.md.

🗝️ Training and validation

For training and validation instructions, see TRAIN_AND_VALIDATE.md.

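👍 Acknowledgements

OpenCLIP: an open-source pretraining framework.
CLIP4Clip: an open-source video-text retrieval framework.
sRGB-TIR: an open-source framework for generating infrared (thermal) images.
GLPN: an open-source framework for generating depth images.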

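📄 License

The majority of this project is released under the MIT license, as found in the LICENSE file.
The dataset of this project is released under the CC-BY-NC 4.0 license, as found in the DATASET_LICENSE file.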

✏️ Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil:.

@misc{zhu2023languagebind,
      title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment}, 
      author={Bin Zhu and Bin Lin and Munan Ning and Yang Yan and Jiaxi Cui and Wang HongFa and Yatian Pang and Wenhao Jiang and Junwu Zhang and Zongwei Li and Cai Wan Zhang and Zhifeng Li and Wei Liu and Li Yuan},
      year={2023},
      eprint={2310.01852},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}