Open-source LanguageBind_Thermal model: joint multimodal and language learning with straightforward semantic alignment

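LanguageBind Thermal

Developed by LanguageBind

LanguageBind is a pretraining framework that uses language as the bridge for multimodal semantic alignment. It supports joint learning between language and modalities such as video, infrared, depth, and audio.

Multimodal alignment

Transformers

License: MIT #multimodal-alignment #zero-shot-learning #semantic-enhancement

Downloads: 887

Released: 10/6/2023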

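Model overview

The model uses the language modality as the central bridge to align the semantic spaces of video, audio, infrared, depth, and other modalities, enabling cross-modal understanding and generation.

Model features

Language-centered multimodal alignment

Uses the language modality as the bridge to align the semantic spaces of video, audio, infrared, depth, and other modalities.

Large-scale multimodal dataset

Provides the VIDAL-10M dataset, which contains 10 million samples of video, infrared, depth, and audio data with corresponding language.

Multi-view language enhancement

Builds multi-view descriptions by combining metadata, spatial, and temporal information, then refines the wording with ChatGPT.

Flexible extensibility

The architecture extends readily to tasks such as segmentation and detection and, in principle, to an unlimited number of modalities.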

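Model capabilities

Cross-modal retrieval

Video-language understanding

Audio-language understanding

Infrared image understanding

Depth image understanding

Joint multimodal representation learning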

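Use cases

Intelligent surveillance

Multimodal anomaly detection

Combines video, infrared, and depth data to detect abnormal behavior

Improves detection accuracy in complex environments

Autonomous driving

Enhanced environment perception

Fuses visual, thermal, and depth data to understand road scenes

Improves perception at night and in adverse weather

Human-computer interaction

Multimodal instruction understanding

Processes voice commands and the visual scene at the same time

Enables a more natural human-computer interaction experience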

🚀 【ICLR 2024 🔥】LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

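LanguageBind is a language-centered multimodal pretraining approach. Through language-based semantic alignment, it extends video-language pretraining to N modalities and provides a strong foundation for multimodal tasks.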

📦 Installation

Requirements
- Python >= 3.8
- PyTorch >= 1.13.1
- CUDA Version >= 11.6

Installation steps

git clone https://github.com/PKU-YuanGroup/LanguageBind
cd LanguageBind
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt
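
Before installing, it can help to confirm that an existing environment meets the minimum versions listed under the requirements. The following is a minimal sketch using only the standard library; the helper names `parse_version` and `meets_minimum` are illustrative and not part of LanguageBind. It strips local build suffixes such as `+cu116` before comparing.

```python
import re
import sys


def parse_version(text):
    """Turn a version string like '1.13.1+cu116' into a tuple of ints (1, 13, 1)."""
    # Drop any local build tag (for example '+cu116') before parsing.
    public = text.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        # Keep only the leading integer of each dot-separated component.
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad to three components so that '2.0' compares equal to '2.0.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # Python itself can be checked without importing any third-party package.
    print("Python >= 3.8:", sys.version_info >= (3, 8))
    # The pinned wheel from the install command satisfies the PyTorch minimum.
    print("torch 1.13.1+cu116 ok:", meets_minimum("1.13.1+cu116", "1.13.1"))
```

To check an installed PyTorch, the string `torch.__version__` could be passed as the first argument of `meets_minimum`.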

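🚀 Quick start

📰 News

[2024.01.27] 👀👀👀 Our MoE-LLaVA is released! A sparse model with 3B parameters outperforms a dense model with 7B parameters.
[2024.01.16] 🔥🔥🔥 Our LanguageBind has been accepted at ICLR 2024! The review scores were 6(3)8(6)6(6)6(6).
[2023.12.15] 💪💪💪 We expanded the 💥💥💥 VIDAL dataset, which now has 10M video-text pairs. We released LanguageBind_Video 1.5; see our model zoo.
[2023.12.10] We expanded the 💥💥💥 VIDAL dataset, which now has 10M depth and 10M thermal samples. We are uploading the thermal and depth data to Hugging Face; the whole process is expected to take 1-2 months.
[2023.11.27] 🔥🔥🔥 We updated our paper with emergent zero-shot results; see our ✨ results.
[2023.11.26] 💥💥💥 We open-sourced all text sources and the corresponding YouTube IDs.
[2023.11.26] 📣📣📣 We open-sourced the fully fine-tuned video and audio models, which improve performance again; see our model zoo.
[2023.11.22] We are about to release a fully fine-tuned version, and a huge version is currently in training.
[2023.11.21] 💥 We released sample data in DATASETS.md so that anyone interested can adapt the code and train on their own data.
[2023.10.23] 🎶 LanguageBind-Audio achieves 🎉🎉🎉 state-of-the-art (SOTA) performance on 5 datasets; see our ✨ results!
[2023.10.14] 😱 Released a stronger LanguageBind-Video; see our ✨ results! The video checkpoint has been updated on the Huggingface Model Hub!
[2023.10.10] We provide sample data, which can be found in assets, and describe the emergent zero-shot usage.
[2023.10.07] The checkpoints are available on 🤗 Huggingface Model.
[2023.10.04] Code and demo are available now! Watch 👀 this repository for the latest updates.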

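✨ Highlights

💡 High performance without intermediate modalities

LanguageBind is a language-centered multimodal pretraining approach. It takes language as the bridge between different modalities because the language modality is well explored and rich in semantics.

The following figure shows the architecture of LanguageBind. LanguageBind can be readily extended to segmentation and detection tasks, and potentially to an unlimited number of modalities.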
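
To make the idea of language as the bridge more concrete, the sketch below computes a CLIP-style symmetric contrastive (InfoNCE) loss between a batch of modality embeddings and their paired text embeddings. It is an illustration under assumptions, not the authors' training code: the function names, the temperature of 0.07, and the toy vectors are all hypothetical, and plain Python is used instead of PyTorch so that the arithmetic stays visible.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    top = max(row)
    log_total = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_total for x in row]


def symmetric_contrastive_loss(modality_embs, text_embs, temperature=0.07):
    """Average of modality-to-text and text-to-modality cross-entropy.

    Item i in modality_embs is assumed to be paired with item i in text_embs.
    The temperature is an assumed value, not one taken from the source.
    """
    m = [l2_normalize(v) for v in modality_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(m)
    # logits[i][j] = scaled cosine similarity between sample i and text j.
    logits = [[sum(a * b for a, b in zip(m[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Cross-entropy with the diagonal as the correct class, in both directions.
    loss_m2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2m = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_m2t + loss_t2m)


# Toy 2-D embeddings: correctly paired data should give the smaller loss.
thermal = [[1.0, 0.1], [0.1, 1.0]]
text = [[0.9, 0.2], [0.2, 0.9]]
aligned_loss = symmetric_contrastive_loss(thermal, text)
swapped_loss = symmetric_contrastive_loss(thermal, text[::-1])
print(aligned_loss < swapped_loss)
```

Because every modality is pulled toward the same text embeddings under an objective of this kind, two non-language modalities can end up comparable to each other even though they were never trained against one another directly.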

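⚡️ A multimodal, fully aligned, and large-scale dataset

We propose VIDAL-10M, a dataset of 10 million samples covering video, infrared, depth, audio, and their corresponding language, which greatly expands the data available beyond the visual modalities.

The second figure shows the proposed VIDAL-10M dataset, which includes five modalities: video, infrared, depth, audio, and language.

🔥 Multi-view enhanced descriptions for training

We apply multi-view enhancement to the language. We generate multi-view descriptions that combine metadata, spatial, and temporal information, which greatly enriches the semantic content of the language. We additionally use ChatGPT to enhance the language and build a good semantic space for the language aligned with each modality.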

💻 Usage examples

Basic usage

import torch
from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer

if __name__ == '__main__':
    # Fall back to CPU when no CUDA device is available.
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    clip_type = {
        'video': 'LanguageBind_Video_FT',  # also LanguageBind_Video
        'audio': 'LanguageBind_Audio_FT',  # also LanguageBind_Audio
        'thermal': 'LanguageBind_Thermal',
        'image': 'LanguageBind_Image',
        'depth': 'LanguageBind_Depth',
    }

    model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir')
    model = model.to(device)
    model.eval()
    pretrained_ckpt = 'lb203/LanguageBind_Image'
    tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir/tokenizer_cache_dir')
    modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type.keys()}

    image = ['assets/image/0.jpg', 'assets/image/1.jpg']
    audio = ['assets/audio/0.wav', 'assets/audio/1.wav']
    video = ['assets/video/0.mp4', 'assets/video/1.mp4']
    depth = ['assets/depth/0.png', 'assets/depth/1.png']
    thermal = ['assets/thermal/0.jpg', 'assets/thermal/1.jpg']
    language = ["Training a parakeet to climb up a ladder.", 'A lion climbing a tree to catch a monkey.']

    inputs = {
        'image': to_device(modality_transform['image'](image), device),
        'video': to_device(modality_transform['video'](video), device),
        'audio': to_device(modality_transform['audio'](audio), device),
        'depth': to_device(modality_transform['depth'](depth), device),
        'thermal': to_device(modality_transform['thermal'](thermal), device),
    }
    inputs['language'] = to_device(tokenizer(language, max_length=77, padding='max_length',
                                             truncation=True, return_tensors='pt'), device)

    with torch.no_grad():
        embeddings = model(inputs)

    print("Video x Text: \n",
          torch.softmax(embeddings['video'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Image x Text: \n",
          torch.softmax(embeddings['image'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Depth x Text: \n",
          torch.softmax(embeddings['depth'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Audio x Text: \n",
          torch.softmax(embeddings['audio'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Thermal x Text: \n",
          torch.softmax(embeddings['thermal'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())

Running the code above prints the following results:

Video x Text: 
 [[9.9989331e-01 1.0667283e-04]
 [1.3255903e-03 9.9867439e-01]]
Image x Text: 
 [[9.9990666e-01 9.3292067e-05]
 [4.6132666e-08 1.0000000e+00]]
Depth x Text: 
 [[0.9954276  0.00457235]
 [0.12042473 0.8795753 ]]
Audio x Text: 
 [[0.97634876 0.02365119]
 [0.02917843 0.97082156]]
Thermal x Text: 
 [[0.9482511  0.0517489 ]
 [0.48746133 0.5125386 ]]
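
Each matrix above has one row per input sample and one column per text prompt, and each row sums to 1 because the softmax is taken over the text dimension. As a small sketch of how such a matrix might be read, the snippet below takes the Thermal x Text values printed above and reports the best-matching prompt and its margin over the runner-up. The helper name `best_match` is illustrative.

```python
def best_match(prob_row):
    """Return (index of the highest probability, gap to the second highest)."""
    order = sorted(range(len(prob_row)), key=lambda j: prob_row[j], reverse=True)
    top, runner_up = order[0], order[1]
    return top, prob_row[top] - prob_row[runner_up]


# Values copied from the "Thermal x Text" output above.
thermal_x_text = [
    [0.9482511, 0.0517489],
    [0.48746133, 0.5125386],
]

for i, row in enumerate(thermal_x_text):
    index, margin = best_match(row)
    print(f"thermal sample {i}: best prompt = {index}, margin = {margin:.4f}")
```

The second thermal sample is matched to the correct prompt, but only by a margin of about 0.025, so a probability near 0.5 is better read as a weak match than as a confident one.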

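Advanced usage

# Emergent zero-shot usage: because LanguageBind binds all modalities together, we also observed an "emergent zero-shot" ability, which is very simple to use.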
print("Video x Audio: \n", torch.softmax(embeddings['video'] @ embeddings['audio'].T, dim=-1).detach().cpu().numpy())
print("Image x Depth: \n", torch.softmax(embeddings['image'] @ embeddings['depth'].T, dim=-1).detach().cpu().numpy())
print("Image x Thermal: \n", torch.softmax(embeddings['image'] @ embeddings['thermal'].T, dim=-1).detach().cpu().numpy())

Running the code above gives the following results:

Video x Audio: 
 [[1.0000000e+00 0.0000000e+00]
 [3.1150486e-32 1.0000000e+00]]
Image x Depth: 
 [[1. 0.]
 [0. 1.]]
Image x Thermal: 
 [[1. 0.]
 [0. 1.]]
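
The values above are almost exactly 0 or 1, which points to a large gap between the matching and non-matching logits rather than to literal certainty. For a two-column softmax, the ratio of the two probabilities equals the exponential of the logit difference, so the gap can be recovered from a printed probability. A worked sketch using the Video x Audio value above follows; the function name is illustrative.

```python
import math


def two_way_logit_gap(p_small, p_large):
    """Logit difference implied by a two-column softmax row: ln(p_large / p_small)."""
    return math.log(p_large / p_small)


# Second row of the "Video x Audio" output above.
gap = two_way_logit_gap(3.1150486e-32, 1.0)
print(f"implied logit gap: {gap:.2f}")

# Sanity check: rebuilding the softmax from that gap gives back the small value.
rebuilt = 1.0 / (1.0 + math.exp(gap))
print(f"rebuilt probability: {rebuilt:.3e}")
```

The implied gap is roughly 72.5, which is why the softmax output saturates. When a ranking over many candidates is needed, comparing the raw similarity scores before the softmax may be more informative than comparing saturated probabilities.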

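Different branches for X-Language (modality-to-language) tasks

In addition, LanguageBind can be split into separate branches to handle different tasks. Note that our image encoder is not fine-tuned and is the same as OpenCLIP.

Thermal branch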

import torch
from languagebind import LanguageBindThermal, LanguageBindThermalTokenizer, LanguageBindThermalProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Thermal'
model = LanguageBindThermal.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindThermalTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
thermal_process = LanguageBindThermalProcessor(model.config, tokenizer)

model.eval()
data = thermal_process([r"your/thermal.jpg"], ['your text'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)
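
Note that the branch example prints `out.text_embeds @ out.image_embeds.T`, so rows index the texts and columns index the thermal images, which is the transpose of the layout in the basic usage example. The printed values are also similarity scores with no softmax applied. The sketch below shows one way to turn such a matrix into per-image probabilities over candidate prompts. The scores and prompt strings are made up for illustration, and any learned logit scale the model may apply is not modeled here.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def per_image_probabilities(text_by_image):
    """Transpose a text-by-image score matrix and softmax over texts for each image."""
    image_by_text = [list(column) for column in zip(*text_by_image)]
    return [softmax(row) for row in image_by_text]


# Hypothetical scores: 3 candidate prompts (rows) x 2 thermal images (columns).
prompts = ["a person walking", "a parked car", "an empty road"]
scores = [
    [4.0, 0.5],
    [1.0, 3.5],
    [0.5, 1.0],
]

for i, probs in enumerate(per_image_probabilities(scores)):
    best = max(range(len(probs)), key=lambda j: probs[j])
    print(f"image {i}: {prompts[best]} ({probs[best]:.3f})")
```

With real model output, the nested list would be replaced by the tensor converted with `.tolist()`, or the same transpose and softmax could be applied with `torch.softmax(scores.T, dim=-1)`.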

Depth branch

import torch
from languagebind import LanguageBindDepth, LanguageBindDepthTokenizer, LanguageBindDepthProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Depth'
model = LanguageBindDepth.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindDepthTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
depth_process = LanguageBindDepthProcessor(model.config, tokenizer)

model.eval()
data = depth_process([r"your/depth.png"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Video branch

import torch
from languagebind import LanguageBindVideo, LanguageBindVideoTokenizer, LanguageBindVideoProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Video_FT'  # also 'LanguageBind/LanguageBind_Video'
model = LanguageBindVideo.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
video_process = LanguageBindVideoProcessor(model.config, tokenizer)

model.eval()
data = video_process(["your/video.mp4"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Audio branch

import torch
from languagebind import LanguageBindAudio, LanguageBindAudioTokenizer, LanguageBindAudioProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Audio_FT'  # also 'LanguageBind/LanguageBind_Audio'
model = LanguageBindAudio.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindAudioTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
audio_process = LanguageBindAudioProcessor(model.config, tokenizer)

model.eval()
data = audio_process([r"your/audio.wav"], ['your audio.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Image branch

Note that our image encoder is the same as OpenCLIP and, unlike the other modalities, is not fine-tuned.

import torch
from languagebind import LanguageBindImage,  LanguageBindImageTokenizer,  LanguageBindImageProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Image'
model = LanguageBindImage.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
image_process = LanguageBindImageProcessor(model.config, tokenizer)

model.eval()
data = image_process([r"your/image.jpg"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

📚 Documentation

🤗 Demo

Local demo: we highly recommend trying our web demo, which integrates all the features LanguageBind currently supports.

python gradio_app.py

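Online demo: we provide an online demo on Huggingface Spaces. In this demo you can compute the similarity between a modality and language, such as audio-to-language, video-to-language, and depth-to-image.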

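🐳 Model zoo

The table below lists the encoder checkpoints for each modality; the names in the table identify different encoder models. For example, LanguageBind/LanguageBind_Video_FT is the fully fine-tuned version, while LanguageBind/LanguageBind_Video is the LoRA-tuned version. You can swap them freely in the recommended API usage. We recommend the fully fine-tuned version because it performs better.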

Modality	LoRA tuning	Full fine-tuning
Video	LanguageBind_Video	LanguageBind_Video_FT
Audio	LanguageBind_Audio	LanguageBind_Audio_FT
Depth	LanguageBind_Depth	-
Thermal	LanguageBind_Thermal	-
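
The table above can be turned into a small lookup that prefers the fully fine-tuned checkpoint when one exists and falls back to the LoRA checkpoint otherwise. This is only a sketch: the `MODEL_ZOO` dictionary and the `pick_checkpoint` helper are illustrative names, and the resulting dictionary is meant to be passed as the `clip_type` argument shown in the basic usage example.

```python
# Checkpoint names copied from the table above; None marks a missing variant.
MODEL_ZOO = {
    "video": {"lora": "LanguageBind_Video", "full": "LanguageBind_Video_FT"},
    "audio": {"lora": "LanguageBind_Audio", "full": "LanguageBind_Audio_FT"},
    "depth": {"lora": "LanguageBind_Depth", "full": None},
    "thermal": {"lora": "LanguageBind_Thermal", "full": None},
}


def pick_checkpoint(modality, prefer_full=True):
    """Return the preferred checkpoint name for a modality, falling back to LoRA."""
    variants = MODEL_ZOO[modality]
    if prefer_full and variants["full"] is not None:
        return variants["full"]
    return variants["lora"]


clip_type = {name: pick_checkpoint(name) for name in MODEL_ZOO}
print(clip_type)
```

Passing `prefer_full=False` selects the LoRA checkpoints throughout, which may be useful when comparing the two variants.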

The table below gives details of the different versions of the video model:

Version	Tuning	Model size	Frames	HF link	MSR-VTT	DiDeMo	ActivityNet	MSVD
LanguageBind_Video	LoRA	Large	8	Link	42.6	37.8	35.1	52.2
LanguageBind_Video_FT	Full fine-tuning	Large	8	Link	42.7	38.1	36.9	53.5
LanguageBind_Video_V1.5_FT	Full fine-tuning	Large	8	Link	42.8	39.7	38.4	54.1
LanguageBind_Video_V1.5_FT	Full fine-tuning	Large	12	Coming soon	-	-	-	-
LanguageBind_Video_Huge_V1.5_FT	Full fine-tuning	Huge	8	Link	44.8	39.9	41.0	53.7
LanguageBind_Video_Huge_V1.5_FT	Full fine-tuning	Huge	12	Coming soon	-	-	-	-

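💥 VIDAL-10M

See DATASETS.md for dataset details.

🗝️ Training and validation

See TRAIN_AND_VALIDATE.md for training and validation instructions.

👍 Acknowledgements

OpenCLIP: an open-source pretraining framework.
CLIP4Clip: an open-source video-text retrieval framework.
sRGB-TIR: an open-source framework for generating infrared (thermal) images.
GLPN: an open-source framework for generating depth images.

📄 License

The majority of this project is released under the MIT license; see the LICENSE file for details.
The dataset of this project is released under the CC-BY-NC 4.0 license; see the DATASET_LICENSE file for details.

✏️ Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil:.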

@misc{zhu2023languagebind,
      title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment}, 
      author={Bin Zhu and Bin Lin and Munan Ning and Yang Yan and Jiaxi Cui and Wang HongFa and Yatian Pang and Wenhao Jiang and Junwu Zhang and Zongwei Li and Cai Wan Zhang and Zhifeng Li and Wei Liu and Li Yuan},
      year={2023},
      eprint={2310.01852},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}