LanguageBind_Video_V1.5_FT open-source model: multimodal semantic alignment with language as the bridge

Languagebind Video V1.5 FT

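Developed by LanguageBind

LanguageBind is a language-centric multimodal pretraining approach that uses language as the bridge between modalities to achieve multimodal semantic alignment.

Multimodal alignment

Transformers

License: MIT #multimodal semantic alignment #zero-shot learning #video-language pretraining

Downloads: 853

Released: 11/26/2023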

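Model overview

Using language as the bridge between modalities, LanguageBind extends video-language pretraining to multiple modalities (such as infrared, depth, and audio) and achieves high-performance multimodal semantic alignment.

Model features

Language-centric multimodal alignment

Uses language as the bridge between modalities, drawing on the rich semantic information of the language modality to align them.

Multimodal, fully aligned dataset

Provides the VIDAL-10M dataset, which contains 10 million samples covering video, infrared, depth, audio, and their corresponding language.

Multi-view enhanced training descriptions

Generates multi-view descriptions by combining metadata, spatial, and temporal information, and uses ChatGPT to enhance the language semantics.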

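Model capabilities

Multimodal semantic alignment

Video-language pretraining

Infrared-language alignment

Depth-language alignment

Audio-language alignment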

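Use cases

Multimodal understanding

Video content understanding

Joint pretraining on video and language enables in-depth understanding of video content.

Achieves state-of-the-art performance on multiple datasets.

Audio content understanding

Joint pretraining on audio and language enables semantic understanding of audio content.

Achieves state-of-the-art performance on 5 datasets.

Cross-modal retrieval

Video-text retrieval

Efficient retrieval between video content and text descriptions.

Audio-text retrieval

Efficient retrieval between audio content and text descriptions.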
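
As a rough illustration of how retrieval over aligned embeddings can work, the sketch below ranks text captions against one query embedding by cosine similarity. The three-dimensional vectors and the caption strings are made-up placeholders, not LanguageBind outputs, and `rank_by_cosine` is a hypothetical helper name; real embeddings would come from the model's encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_cosine(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumed values, for illustration only).
video_embedding = [0.9, 0.1, 0.0]
text_embeddings = {
    "a parakeet climbing a ladder": [0.8, 0.2, 0.1],
    "a lion climbing a tree": [0.1, 0.9, 0.2],
}

ranking = rank_by_cosine(video_embedding, text_embeddings)
for caption, score in ranking:
    print(f"{score:.3f}  {caption}")
```

The same ranking works in the other direction (text query, video or audio candidates), because all modalities are compared in one shared embedding space.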

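🚀 [ICLR 2024 🔥] LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

LanguageBind is a language-centric multimodal pretraining approach that binds different modalities through language and can be readily extended to tasks such as segmentation and detection. The project also introduces the VIDAL-10M dataset, which covers five modalities (video, infrared, depth, audio, and language), and applies multi-view enhancement to the language to improve training.

🚀 Quick start

If you like our project, please give us a star ⭐ on GitHub to get the latest updates.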

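✨ Highlights

💡 High performance, with no intermediate modality required

LanguageBind is a language-centric multimodal pretraining approach. It takes language as the bind across modalities because the language modality is well explored and carries rich semantic information.

The first figure shows the architecture of LanguageBind. LanguageBind can be easily extended to segmentation and detection tasks, and potentially to an unlimited number of modalities.

⚡️ A multimodal, fully aligned, and large-scale dataset

We propose VIDAL-10M, a dataset of 10 million samples covering Video, Infrared, Depth, Audio, and their corresponding Language, which greatly expands the data beyond the visual modalities.

The second figure shows the proposed VIDAL-10M dataset, which includes five modalities: video, infrared, depth, audio, and language.

🔥 Multi-view enhanced descriptions for training

We apply multi-view enhancement to the language. We produce multi-view descriptions that combine metadata, spatial, and temporal information to greatly enrich the semantic information of the language. In addition, we further enhance the language with ChatGPT to create a good semantic space for the language aligned with each modality.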

📦 Installation

Requirements

Python >= 3.8
PyTorch >= 1.13.1
CUDA Version >= 11.6
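
To check an environment against the minimum versions listed above, version strings need to be compared numerically rather than as text (as strings, "3.10" sorts before "3.8"). The following is a minimal sketch using only the standard library; `parse_version` and `meets_minimum` are hypothetical helper names, and the parser is an assumption that handles plain numeric versions plus a local tag such as `+cu116`, not pre-release suffixes.

```python
import re

def parse_version(text):
    """Turn '1.13.1+cu116' into (1, 13, 1); the '+cu116' local tag is ignored."""
    public = text.split("+", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", public))

def meets_minimum(installed, minimum):
    """Compare versions numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)

# Minimum versions copied from the requirements list above.
MINIMUMS = {"python": "3.8", "torch": "1.13.1", "cuda": "11.6"}

print(meets_minimum("3.10.4", MINIMUMS["python"]))       # numeric compare: 10 > 8
print(meets_minimum("1.13.1+cu116", MINIMUMS["torch"]))  # local tag is dropped
print(meets_minimum("11.3", MINIMUMS["cuda"]))           # older than 11.6
```

In a real environment the installed strings could be read from `platform.python_version()`, `torch.__version__`, and `torch.version.cuda`.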

Installation steps

git clone https://github.com/PKU-YuanGroup/LanguageBind
cd LanguageBind
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt

💻 Usage examples

Basic usage

We provide some sample data in assets to quickly show how LanguageBind works.

import torch
from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer

if __name__ == '__main__':
    device = 'cuda:0'
    device = torch.device(device)
    clip_type = {
        'video': 'LanguageBind_Video_FT',  # also LanguageBind_Video
        'audio': 'LanguageBind_Audio_FT',  # also LanguageBind_Audio
        'thermal': 'LanguageBind_Thermal',
        'image': 'LanguageBind_Image',
        'depth': 'LanguageBind_Depth',
    }

    model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir')
    model = model.to(device)
    model.eval()
    pretrained_ckpt = 'LanguageBind/LanguageBind_Image'  # same checkpoint as the Image branch example
    tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir/tokenizer_cache_dir')
    modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type.keys()}

    image = ['assets/image/0.jpg', 'assets/image/1.jpg']
    audio = ['assets/audio/0.wav', 'assets/audio/1.wav']
    video = ['assets/video/0.mp4', 'assets/video/1.mp4']
    depth = ['assets/depth/0.png', 'assets/depth/1.png']
    thermal = ['assets/thermal/0.jpg', 'assets/thermal/1.jpg']
    language = ["Training a parakeet to climb up a ladder.", 'A lion climbing a tree to catch a monkey.']

    inputs = {
        'image': to_device(modality_transform['image'](image), device),
        'video': to_device(modality_transform['video'](video), device),
        'audio': to_device(modality_transform['audio'](audio), device),
        'depth': to_device(modality_transform['depth'](depth), device),
        'thermal': to_device(modality_transform['thermal'](thermal), device),
    }
    inputs['language'] = to_device(tokenizer(language, max_length=77, padding='max_length',
                                             truncation=True, return_tensors='pt'), device)

    with torch.no_grad():
        embeddings = model(inputs)

    print("Video x Text: \n",
          torch.softmax(embeddings['video'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Image x Text: \n",
          torch.softmax(embeddings['image'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Depth x Text: \n",
          torch.softmax(embeddings['depth'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Audio x Text: \n",
          torch.softmax(embeddings['audio'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Thermal x Text: \n",
          torch.softmax(embeddings['thermal'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())

Running the code above returns the following results:

Video x Text: 
 [[9.9989331e-01 1.0667283e-04]
 [1.3255903e-03 9.9867439e-01]]
Image x Text: 
 [[9.9990666e-01 9.3292067e-05]
 [4.6132666e-08 1.0000000e+00]]
Depth x Text: 
 [[0.9954276  0.00457235]
 [0.12042473 0.8795753 ]]
Audio x Text: 
 [[0.97634876 0.02365119]
 [0.02917843 0.97082156]]
Thermal x Text: 
 [[0.9482511  0.0517489 ]
 [0.48746133 0.5125386 ]]
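
Each printed matrix is a row-wise softmax, so every row holds probabilities over the two text prompts and the largest entry in a row is the top-1 text match for that input. The sketch below reads two of the matrices printed above and reports the top-1 match together with its margin over the runner-up; `top1_with_margin` is a hypothetical helper, not part of the languagebind package.

```python
# Probability matrices copied from the printed results above
# (rows: inputs 0 and 1, columns: the two text prompts).
results = {
    "Depth x Text": [[0.9954276, 0.00457235], [0.12042473, 0.8795753]],
    "Thermal x Text": [[0.9482511, 0.0517489], [0.48746133, 0.5125386]],
}

def top1_with_margin(row):
    """Return (index of best column, gap between best and second-best)."""
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return order[0], row[order[0]] - row[order[1]]

summary = {}
for name, matrix in results.items():
    summary[name] = [top1_with_margin(row) for row in matrix]
    for i, (best, margin) in enumerate(summary[name]):
        print(f"{name}, input {i}: best text = {best}, margin = {margin:.3f}")
```

Read this way, every input in these two matrices is matched to the text with the same index, but the second thermal sample wins only narrowly (about 0.51 against 0.49).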

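Advanced usage

Emergent zero-shot

Since LanguageBind binds each modality together, we also found emergent zero-shot behavior across modalities. It is very simple to use: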

print("Video x Audio: \n", torch.softmax(embeddings['video'] @ embeddings['audio'].T, dim=-1).detach().cpu().numpy())
print("Image x Depth: \n", torch.softmax(embeddings['image'] @ embeddings['depth'].T, dim=-1).detach().cpu().numpy())
print("Image x Thermal: \n", torch.softmax(embeddings['image'] @ embeddings['thermal'].T, dim=-1).detach().cpu().numpy())

Running the code above gives:

Video x Audio: 
 [[1.0000000e+00 0.0000000e+00]
 [3.1150486e-32 1.0000000e+00]]
Image x Depth: 
 [[1. 0.]
 [0. 1.]]
Image x Thermal: 
 [[1. 0.]
 [0. 1.]]
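
Near one-hot rows like these are what a two-way softmax produces when the two raw similarity scores in a row are far apart. The sketch below works this out in plain Python; the logit values are assumptions chosen for illustration, since the source prints only the post-softmax probabilities.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed logit pairs (not taken from the model), with growing gaps.
for gap in (1.0, 10.0, 72.5):
    probs = softmax([gap, 0.0])
    print(f"gap = {gap:5.1f} -> {probs[0]:.6f}, {probs[1]:.3e}")
```

With two candidates, the smaller probability is roughly exp(-gap), so an entry on the order of 1e-32, like the one printed for Video x Audio, corresponds to a gap of roughly 72 between the two raw scores.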

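Different branches for X-Language tasks

In addition, LanguageBind can be disassembled into separate branches to handle different tasks. Note that we did not train the image branch; it is only initialized from OpenCLIP.

Thermal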

import torch
from languagebind import LanguageBindThermal, LanguageBindThermalTokenizer, LanguageBindThermalProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Thermal'
model = LanguageBindThermal.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindThermalTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
thermal_process = LanguageBindThermalProcessor(model.config, tokenizer)

model.eval()
data = thermal_process([r"your/thermal.jpg"], ['your text'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Depth

import torch
from languagebind import LanguageBindDepth, LanguageBindDepthTokenizer, LanguageBindDepthProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Depth'
model = LanguageBindDepth.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindDepthTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
depth_process = LanguageBindDepthProcessor(model.config, tokenizer)

model.eval()
data = depth_process([r"your/depth.png"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Video

import torch
from languagebind import LanguageBindVideo, LanguageBindVideoTokenizer, LanguageBindVideoProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Video_FT'  # also 'LanguageBind/LanguageBind_Video'
model = LanguageBindVideo.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
video_process = LanguageBindVideoProcessor(model.config, tokenizer)

model.eval()
data = video_process(["your/video.mp4"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Audio

import torch
from languagebind import LanguageBindAudio, LanguageBindAudioTokenizer, LanguageBindAudioProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Audio_FT'  # also 'LanguageBind/LanguageBind_Audio'
model = LanguageBindAudio.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindAudioTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
audio_process = LanguageBindAudioProcessor(model.config, tokenizer)

model.eval()
data = audio_process([r"your/audio.wav"], ['your audio.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Image

Note that our image encoder is the same as OpenCLIP's; unlike the other modalities, it was not fine-tuned.

import torch
from languagebind import LanguageBindImage, LanguageBindImageTokenizer, LanguageBindImageProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Image'
model = LanguageBindImage.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
image_process = LanguageBindImageProcessor(model.config, tokenizer)

model.eval()
data = image_process([r"your/image.jpg"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

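📚 Documentation

📰 News

[2024.01.27] 👀👀👀 Our MoE-LLaVA is released! A sparse model with 3B parameters outperforms a dense model with 7B parameters.
[2024.01.16] 🔥🔥🔥 Our LanguageBind has been accepted at ICLR 2024! We received scores of 6(3)8(6)6(6)6(6).
[2023.12.15] 💪💪💪 We have expanded the 💥💥💥 VIDAL dataset, which now has 10M video-text pairs. We have launched LanguageBind_Video 1.5; check our model zoo.
[2023.12.10] We have expanded the 💥💥💥 VIDAL dataset, which now has 10M depth samples and 10M thermal samples. We are uploading the thermal and depth data to Hugging Face and expect the whole process to take 1-2 months.
[2023.11.27] 🔥🔥🔥 We have updated our paper with emergent zero-shot results; see our ✨ results.
[2023.11.26] 💥💥💥 We have open-sourced all text sources and the corresponding YouTube IDs.
[2023.11.26] 📣📣📣 We have open-sourced fully fine-tuned video and audio models, with performance improved once again; check our model zoo.
[2023.11.22] We are about to release a fully fine-tuned version; the huge version is currently in training.
[2023.11.21] 💥 We have released sample data in DATASETS.md so that anyone interested can further modify the code and train on their own data.
[2023.11.20] 🚀🚀🚀 Video-LLaVA builds a large vision-language model on top of the LanguageBind encoders and achieves 🎉SOTA performance.
[2023.10.23] 🎶 LanguageBind-Audio achieves 🎉🎉🎉 state-of-the-art (SOTA) performance on 5 datasets; see our ✨ results!
[2023.10.14] 😱 Released a stronger LanguageBind-Video; see our ✨ results! The video checkpoint has been updated on the Hugging Face Model Hub!
[2023.10.10] We provide sample data, which can be found in assets, and describe the emergent zero-shot usage.
[2023.10.07] Checkpoints are available on 🤗 Hugging Face models.
[2023.10.04] Code and demo are now available! Watch 👀 this repository for the latest updates.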

🤗 Demo

Local demo: we strongly recommend trying our web demo, which includes all features currently supported by LanguageBind.

python gradio_app.py

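Online demo: we provide an online demo in Hugging Face Spaces. In this demo, you can compute the similarity between a modality and language, such as audio-to-language, video-to-language, and depth-to-image.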

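🐳 Model zoo

The names in the tables refer to different encoder models. For example, LanguageBind/LanguageBind_Video_FT is the fully fine-tuned version, while LanguageBind/LanguageBind_Video is the LoRA-tuned version.

You can freely swap them in the recommended API usage. We recommend the fully fine-tuned version, as it offers stronger performance.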

Modality	LoRA tuning	Full fine-tuning
Video	LanguageBind_Video	LanguageBind_Video_FT
Audio	LanguageBind_Audio	LanguageBind_Audio_FT
Depth	LanguageBind_Depth	-
Thermal	LanguageBind_Thermal	-
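
The table above can be turned into a small lookup that prefers the fully fine-tuned checkpoint and falls back to the LoRA-tuned one where no fully fine-tuned release is listed. This is only a sketch: `pick_checkpoint` is a hypothetical helper, and the `LanguageBind/` prefix follows the checkpoint paths used in the branch examples.

```python
# Checkpoint names copied from the table above; None means no such release is listed.
CHECKPOINTS = {
    "video": {"lora": "LanguageBind_Video", "full": "LanguageBind_Video_FT"},
    "audio": {"lora": "LanguageBind_Audio", "full": "LanguageBind_Audio_FT"},
    "depth": {"lora": "LanguageBind_Depth", "full": None},
    "thermal": {"lora": "LanguageBind_Thermal", "full": None},
}

def pick_checkpoint(modality, prefer_full=True):
    """Return 'LanguageBind/<name>', preferring the fully fine-tuned variant when listed."""
    variants = CHECKPOINTS[modality]
    name = variants["full"] if prefer_full and variants["full"] else variants["lora"]
    return f"LanguageBind/{name}"

for modality in CHECKPOINTS:
    print(modality, "->", pick_checkpoint(modality))
```

The returned string is in the form passed as `pretrained_ckpt` to the `from_pretrained` calls in the branch examples.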

Version	Tuning	Model size	Frames	Hugging Face link	MSR-VTT	DiDeMo	ActivityNet	MSVD
LanguageBind_Video	LoRA	Large	8	Link	42.6	37.8	35.1	52.2
LanguageBind_Video_FT	Full fine-tuning	Large	8	Link	42.7	38.1	36.9	53.5
LanguageBind_Video_V1.5_FT	Full fine-tuning	Large	8	Link	42.8	39.7	38.4	54.1
LanguageBind_Video_V1.5_FT	Full fine-tuning	Large	12	Coming soon	-	-	-	-
LanguageBind_Video_Huge_V1.5_FT	Full fine-tuning	Huge	8	Link	44.8	39.9	41.0	53.7
LanguageBind_Video_Huge_V1.5_FT	Full fine-tuning	Huge	12	Coming soon	-	-	-	-
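
One way to compare the checkpoints in the table above is to average the four benchmark columns. The sketch below does this for the 8-frame rows that report scores; the unweighted mean is an assumption made here for illustration, not a metric used by the authors.

```python
# Rows copied from the table above (8-frame checkpoints with reported scores).
# Column order: MSR-VTT, DiDeMo, ActivityNet, MSVD.
SCORES = {
    "LanguageBind_Video": [42.6, 37.8, 35.1, 52.2],
    "LanguageBind_Video_FT": [42.7, 38.1, 36.9, 53.5],
    "LanguageBind_Video_V1.5_FT": [42.8, 39.7, 38.4, 54.1],
    "LanguageBind_Video_Huge_V1.5_FT": [44.8, 39.9, 41.0, 53.7],
}

def mean(values):
    """Unweighted arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

averages = {name: mean(row) for name, row in SCORES.items()}
best = max(averages, key=averages.get)

# Print checkpoints from lowest to highest average.
for name, avg in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{avg:6.2f}  {name}")
print("highest average:", best)
```

An average hides per-benchmark differences: the huge model has the highest mean, yet LanguageBind_Video_V1.5_FT scores higher on MSVD (54.1 against 53.7).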

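💥 VIDAL-10M

For dataset details, see DATASETS.md.

🗝️ Training and validation

For training and validation instructions, see TRAIN_AND_VALIDATE.md.

👍 Acknowledgements

OpenCLIP: an open-source pretraining framework.
CLIP4Clip: an open-source video-text retrieval framework.
sRGB-TIR: an open-source framework for generating infrared (thermal) images.
GLPN: an open-source framework for generating depth images.

📄 License

The majority of this project is released under the MIT license, as found in the LICENSE file.
The dataset of this project is released under the CC-BY-NC 4.0 license, as found in the DATASET_LICENSE file.

✏️ Citation

If you find our paper and code useful in your research, please consider giving us a star ⭐ and a citation ✏️.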

@misc{zhu2023languagebind,
      title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment}, 
      author={Bin Zhu and Bin Lin and Munan Ning and Yang Yan and Jiaxi Cui and Wang HongFa and Yatian Pang and Wenhao Jiang and Junwu Zhang and Zongwei Li and Cai Wan Zhang and Zhifeng Li and Wei Liu and Li Yuan},
      year={2023},
      eprint={2310.01852},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}