LanguageBind_Video_Huge_V1.5_FT Open-Source Model: Binding Multiple Modalities to Language for Cross-Modal Understanding and Retrieval

Languagebind Video Huge V1.5 FT

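Developed by LanguageBind

LanguageBind is a pretrained model that aligns multimodal semantics through language. It binds video, audio, depth, thermal imaging, and other modalities to language, enabling cross-modal understanding and retrieval.

Multimodal alignment

Transformers

License: MIT #multimodal-alignment #zero-shot-learning #video-language-pretraining

Downloads: 2,711

Released: 12/15/2023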

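Model Overview

LanguageBind follows a language-centric multimodal pretraining paradigm: language bridges the different modalities, which lets the model draw on the rich semantics of the language modality. The model supports interaction between language and video, audio, depth, thermal imaging, and other modalities.

Model Features

Language-centric multimodal alignment

Uses language as the bridge for semantic alignment across modalities, with no intermediate modality conversion

Support for multiple modalities

Handles video, audio, depth maps, thermal images, and other modality data

Large-scale training data

Trained on the VIDAL-10M dataset, which contains 10 million aligned multimodal samples

High-performance cross-modal retrieval

Reaches state-of-the-art performance on multiple benchmarks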

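Model Capabilities

Video-language retrieval

Audio-language retrieval

Depth-language retrieval

Thermal-language retrieval

Multimodal similarity computation

Cross-modal semantic understanding

Use Cases

Video understanding

Video content retrieval

Retrieve relevant video clips from a text description

Reaches 44.8% retrieval accuracy on the MSR-VTT dataset

Audio analysis

Audio event detection

Identify specific events in audio from a text description

Reaches state-of-the-art performance on multiple audio datasets

Special visual modalities

Thermal image analysis

Understand thermal images and align them with text descriptions

Depth map understanding

Parse depth map information and match it with language descriptions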

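🚀 【ICLR 2024 🔥】LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

LanguageBind is a language-centric multimodal pretraining approach that uses language as the link between modalities and extends readily to tasks such as segmentation and detection. The work also introduces VIDAL-10M, a large-scale dataset of video, infrared, depth, and audio data with corresponding language, and augments the language descriptions from multiple views to improve training.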

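📄 News

[2024.01.27] 👀👀👀 Our [MoE-LLaVA](https://github.com/PKU-YuanGroup/MoE-LLaVA) is released! A sparse model with 3B parameters outperforms a dense model with 7B parameters.
[2024.01.16] 🔥🔥🔥 LanguageBind has been accepted at ICLR 2024! It received review scores of 6(3)8(6)6(6)6(6).
[2023.12.15] 💪💪💪 We expanded the 💥💥💥 VIDAL dataset, which now has 10M video-text pairs. We launched LanguageBind_Video 1.5; see our model zoo.
[2023.12.10] We expanded the 💥💥💥 VIDAL dataset, which now has 10M depth samples and 10M thermal samples. We are uploading the thermal and depth data to [Hugging Face](https://huggingface.co/datasets/LanguageBind/VIDAL-Depth-Thermal) and expect the whole process to take 1-2 months.
[2023.11.27] 🔥🔥🔥 We updated our paper with emergent zero-shot results; see our ✨ results.
[2023.11.26] 💥💥💥 We open-sourced all text sources and the corresponding YouTube IDs.
[2023.11.26] 📣📣📣 We open-sourced the fully fine-tuned video and audio models, which improve performance again; see our model zoo.
[2023.11.22] The fully fine-tuned version is about to be released; the huge version is currently in training.
[2023.11.20] 🚀🚀🚀 [Video-LLaVA](https://github.com/PKU-YuanGroup/Video-LLaVA) builds a large vision-language model on the LanguageBind encoders and achieves 🎉 state-of-the-art performance.
[2023.10.23] 🎶 LanguageBind-Audio achieves 🎉🎉🎉 state-of-the-art (SOTA) performance on 5 datasets; see our ✨ results!
[2023.10.14] 😱 Released a stronger LanguageBind-Video; see our ✨ results! The video checkpoint has been updated on the Huggingface model hub!
[2023.10.10] We provide sample data, which can be found in assets, and describe emergent zero-shot usage.
[2023.10.07] Checkpoints are available on the 🤗 Huggingface model hub.
[2023.10.04] Code and demo are available now! Watch 👀 this repository for the latest updates.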

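✨ Highlights

💡 High performance without an intermediate modality

LanguageBind is a language-centric multimodal pretraining approach. It takes language as the link between modalities because the language modality has been thoroughly explored and carries rich semantic information.

The following figure shows the architecture of LanguageBind. LanguageBind extends readily to segmentation and detection tasks, and potentially to an unlimited number of modalities.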
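This page does not state the training objective. One common way to bind a modality encoder to a text encoder is a CLIP-style symmetric contrastive loss, and the sketch below assumes that objective purely for illustration. The function names, the toy 2-D embeddings, and the temperature of 0.07 are all hypothetical and are not taken from LanguageBind.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(modality_embs, text_embs, temperature=0.07):
    """CLIP-style loss (assumed objective): pull matched pairs together in both directions."""
    m = [normalize(v) for v in modality_embs]
    t = [normalize(v) for v in text_embs]
    logits = [[dot(a, b) / temperature for b in t] for a in m]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-modality direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))

# Toy embeddings: row i of `video` is meant to match row i of `text_matched`.
video = [[1.0, 0.1], [0.0, 1.0]]
text_matched = [[0.9, 0.0], [0.1, 1.0]]
text_swapped = text_matched[::-1]

loss_matched = symmetric_contrastive_loss(video, text_matched)
loss_swapped = symmetric_contrastive_loss(video, text_swapped)
print(loss_matched, loss_swapped)
```

Under this assumption, every non-language modality is trained against the same text space, which is why no intermediate modality is needed to compare two of them.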

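⚡️ A multimodal, fully aligned, large-scale dataset

We propose VIDAL-10M, a dataset of 10 million samples covering video, infrared, depth, and audio with their corresponding language, which greatly expands the data available beyond the visual modality.

The second figure shows the proposed VIDAL-10M dataset, which includes five modalities: video, infrared, depth, audio, and language.

🔥 Multi-view enhanced descriptions for training

We apply multi-view enhancement to the language. We generate multi-view descriptions that combine metadata, spatial, and temporal information to greatly enrich the semantics of the language. In addition, we use ChatGPT to further enhance the language, creating a good semantic space for the language aligned with each modality.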

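🤗 Demo

Local demo

We highly recommend trying our web demo, which integrates all features currently supported by LanguageBind.

python gradio_app.py

Online demo

We provide an online demo on Huggingface Spaces. In this demo you can compute the similarity between a modality and language, such as audio and language, video and language, and depth and image.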

📦 Installation

Python >= 3.8
Pytorch >= 1.13.1
CUDA Version >= 11.6
Install the required packages:

git clone https://github.com/PKU-YuanGroup/LanguageBind
cd LanguageBind
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt
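The version floors listed above can be checked from Python before running the examples. The following is a minimal sketch; `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helper names introduced here, and only `torch.__version__` and `torch.version.cuda` are real PyTorch attributes.

```python
import sys

# Minimum versions as listed in the installation requirements.
MINIMUMS = {"python": (3, 8), "torch": (1, 13, 1), "cuda": (11, 6)}

def parse_version(text):
    """Turn a string such as '1.13.1+cu116' into a tuple of ints, here (1, 13, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(found, minimum):
    return parse_version(found) >= minimum

def check_environment():
    """Return True/False per requirement, or None when it cannot be determined."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    report = {"python": meets_minimum(python_version, MINIMUMS["python"])}
    try:
        import torch
    except ImportError:
        report["torch"] = None
        report["cuda"] = None
        return report
    report["torch"] = meets_minimum(torch.__version__, MINIMUMS["torch"])
    cuda = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = meets_minimum(cuda, MINIMUMS["cuda"]) if cuda else None
    return report

if __name__ == "__main__":
    print(check_environment())
```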

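🐳 Model Zoo

The names in the tables refer to different encoder models. For example, LanguageBind/LanguageBind_Video_FT is the fully fine-tuned version, while LanguageBind/LanguageBind_Video is the LoRA-tuned version.

You can swap them freely in the recommended API usage. We recommend the fully fine-tuned version because it performs better.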

Modality	LoRA tuning	Full fine-tuning
Video	LanguageBind_Video	LanguageBind_Video_FT
Audio	LanguageBind_Audio	LanguageBind_Audio_FT
Depth	LanguageBind_Depth	-
Thermal	LanguageBind_Thermal	-
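The table above can be turned into a small lookup that prefers the fully fine-tuned checkpoint when one exists and otherwise falls back to the LoRA-tuned one. This is only a sketch: `MODEL_ZOO` and `pick_checkpoint` are hypothetical names, and the `LanguageBind/` prefix follows the checkpoint names used elsewhere on this page.

```python
# Checkpoint names copied from the modality table; None means no such variant is listed.
MODEL_ZOO = {
    "video": {"lora": "LanguageBind_Video", "full": "LanguageBind_Video_FT"},
    "audio": {"lora": "LanguageBind_Audio", "full": "LanguageBind_Audio_FT"},
    "depth": {"lora": "LanguageBind_Depth", "full": None},
    "thermal": {"lora": "LanguageBind_Thermal", "full": None},
}

def pick_checkpoint(modality, prefer_full=True, org="LanguageBind"):
    """Return 'org/name', preferring the fully fine-tuned variant when available."""
    variants = MODEL_ZOO[modality.lower()]
    if prefer_full and variants["full"]:
        name = variants["full"]
    else:
        name = variants["lora"]
    return f"{org}/{name}"

print(pick_checkpoint("video"))
print(pick_checkpoint("depth"))
```

The returned string can be passed as `pretrained_ckpt` to the per-modality `from_pretrained` calls shown in the API section.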

Version	Tuning	Model size	Frames	HF link	MSR-VTT	DiDeMo	ActivityNet	MSVD
LanguageBind_Video	LoRA	Large	8	Link	42.6	37.8	35.1	52.2
LanguageBind_Video_FT	Full fine-tuning	Large	8	Link	42.7	38.1	36.9	53.5
LanguageBind_Video_V1.5_FT	Full fine-tuning	Large	8	Link	42.8	39.7	38.4	54.1
LanguageBind_Video_V1.5_FT	Full fine-tuning	Large	12	Coming soon	-	-	-	-
LanguageBind_Video_Huge_V1.5_FT	Full fine-tuning	Huge	8	Link	44.8	39.9	41.0	53.7
LanguageBind_Video_Huge_V1.5_FT	Full fine-tuning	Huge	12	Coming soon	-	-	-	-
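The page does not say which metric the benchmark columns report. Text-to-video retrieval is commonly scored with Recall@K, so the sketch below assumes that metric to show how such a percentage is computed from a text-by-video similarity matrix. The `recall_at_k` helper and the toy scores are hypothetical and are not LanguageBind outputs.

```python
def recall_at_k(similarity, k=1):
    """similarity[i][j] scores text query i against video j; the correct video for query i is i.

    Returns the percentage of queries whose correct video is among the top-k scores.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)

# Toy 4x4 similarity matrix (made-up numbers).
toy_similarity = [
    [0.9, 0.1, 0.0, 0.2],  # query 0: correct video ranked first
    [0.3, 0.2, 0.8, 0.1],  # query 1: correct video ranked third
    [0.1, 0.2, 0.7, 0.3],  # query 2: correct video ranked first
    [0.2, 0.1, 0.6, 0.5],  # query 3: correct video ranked second
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy_similarity, k):.1f}")
```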

💻 Usage

Basic usage

We provide some sample data in assets so you can quickly see how LanguageBind works.

import torch
from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer

if __name__ == '__main__':
    device = 'cuda:0'
    device = torch.device(device)
    clip_type = {
        'video': 'LanguageBind_Video_FT',  # also LanguageBind_Video
        'audio': 'LanguageBind_Audio_FT',  # also LanguageBind_Audio
        'thermal': 'LanguageBind_Thermal',
        'image': 'LanguageBind_Image',
        'depth': 'LanguageBind_Depth',
    }

    model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir')
    model = model.to(device)
    model.eval()
    pretrained_ckpt = 'LanguageBind/LanguageBind_Image'  # same checkpoint name as in the API section
    tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir/tokenizer_cache_dir')
    modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type.keys()}

    image = ['assets/image/0.jpg', 'assets/image/1.jpg']
    audio = ['assets/audio/0.wav', 'assets/audio/1.wav']
    video = ['assets/video/0.mp4', 'assets/video/1.mp4']
    depth = ['assets/depth/0.png', 'assets/depth/1.png']
    thermal = ['assets/thermal/0.jpg', 'assets/thermal/1.jpg']
    language = ["Training a parakeet to climb up a ladder.", 'A lion climbing a tree to catch a monkey.']

    inputs = {
        'image': to_device(modality_transform['image'](image), device),
        'video': to_device(modality_transform['video'](video), device),
        'audio': to_device(modality_transform['audio'](audio), device),
        'depth': to_device(modality_transform['depth'](depth), device),
        'thermal': to_device(modality_transform['thermal'](thermal), device),
    }
    inputs['language'] = to_device(tokenizer(language, max_length=77, padding='max_length',
                                             truncation=True, return_tensors='pt'), device)

    with torch.no_grad():
        embeddings = model(inputs)

    print("Video x Text: \n",
          torch.softmax(embeddings['video'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Image x Text: \n",
          torch.softmax(embeddings['image'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Depth x Text: \n",
          torch.softmax(embeddings['depth'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Audio x Text: \n",
          torch.softmax(embeddings['audio'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Thermal x Text: \n",
          torch.softmax(embeddings['thermal'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())

Running the code above returns the following results:

Video x Text: 
 [[9.9989331e-01 1.0667283e-04]
 [1.3255903e-03 9.9867439e-01]]
Image x Text: 
 [[9.9990666e-01 9.3292067e-05]
 [4.6132666e-08 1.0000000e+00]]
Depth x Text: 
 [[0.9954276  0.00457235]
 [0.12042473 0.8795753 ]]
Audio x Text: 
 [[0.97634876 0.02365119]
 [0.02917843 0.97082156]]
Thermal x Text: 
 [[0.9482511  0.0517489 ]
 [0.48746133 0.5125386 ]]
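Each printed matrix is a row-wise softmax over the two captions, so row i gives the probability that sample i matches each caption and the diagonal should dominate. The sketch below reads the top-1 caption and the winning margin from two of the matrices printed above. `top1_matches` and `margins` are hypothetical helper names.

```python
# Values copied from the printed results above.
depth_x_text = [[0.9954276, 0.00457235], [0.12042473, 0.8795753]]
thermal_x_text = [[0.9482511, 0.0517489], [0.48746133, 0.5125386]]

def top1_matches(probs):
    """Index of the highest-probability caption for each sample (row)."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in probs]

def margins(probs):
    """Gap between the best and second-best probability in each row."""
    result = []
    for row in probs:
        best, second = sorted(row, reverse=True)[:2]
        result.append(best - second)
    return result

for name, matrix in (("depth", depth_x_text), ("thermal", thermal_x_text)):
    print(name, top1_matches(matrix), margins(matrix))
```

Both modalities pick the correct caption for both samples, but the second thermal sample wins by only about 0.025, which makes it the least confident match among the printed results.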

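Advanced usage

Because LanguageBind binds every modality to the same language space, we also find emergent zero-shot capability between modalities. It is very simple to use.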

print("Video x Audio: \n", torch.softmax(embeddings['video'] @ embeddings['audio'].T, dim=-1).detach().cpu().numpy())
print("Image x Depth: \n", torch.softmax(embeddings['image'] @ embeddings['depth'].T, dim=-1).detach().cpu().numpy())
print("Image x Thermal: \n", torch.softmax(embeddings['image'] @ embeddings['thermal'].T, dim=-1).detach().cpu().numpy())

Running the code above gives:

Video x Audio: 
 [[1.0000000e+00 0.0000000e+00]
 [3.1150486e-32 1.0000000e+00]]
Image x Depth: 
 [[1. 0.]
 [0. 1.]]
Image x Thermal: 
 [[1. 0.]
 [0. 1.]]
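For a two-way softmax, the log of the ratio of the two probabilities equals the difference between the underlying similarity logits. The sketch below uses that identity to compare how sharply a cross-modal row is separated against the weakest language row from the basic example. `logit_gap` is a hypothetical helper. Rows printed as exactly 0 and 1 cannot be inverted this way, because the smaller probability underflowed to zero.

```python
import math

def logit_gap(p_top, p_other):
    """Difference between the two logits behind a two-way softmax: ln(p_top / p_other)."""
    return math.log(p_top) - math.log(p_other)

# Second row of "Video x Audio" and second row of "Thermal x Text", copied from the outputs.
video_audio_row = [3.1150486e-32, 1.0]
thermal_text_row = [0.48746133, 0.5125386]

gap_video_audio = logit_gap(video_audio_row[1], video_audio_row[0])
gap_thermal_text = logit_gap(thermal_text_row[1], thermal_text_row[0])

print(f"video x audio logit gap:  {gap_video_audio:.3f}")
print(f"thermal x text logit gap: {gap_thermal_text:.3f}")
```

The much larger gap on the video-audio row is why those matrices print as near-exact identity matrices, whereas the thermal-text row is close to a tie.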

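🤖 API

We open-source the preprocessing code for all modalities. To load a model (for example LanguageBind/LanguageBind_Thermal) from the Huggingface model hub or from a local path, use the following code snippets.

Thermal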

import torch
from languagebind import LanguageBindThermal, LanguageBindThermalTokenizer, LanguageBindThermalProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Thermal'
model = LanguageBindThermal.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindThermalTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
thermal_process = LanguageBindThermalProcessor(model.config, tokenizer)

model.eval()
data = thermal_process([r"your/thermal.jpg"], ['your text'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Depth

import torch
from languagebind import LanguageBindDepth, LanguageBindDepthTokenizer, LanguageBindDepthProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Depth'
model = LanguageBindDepth.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindDepthTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
depth_process = LanguageBindDepthProcessor(model.config, tokenizer)

model.eval()
data = depth_process([r"your/depth.png"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Video

import torch
from languagebind import LanguageBindVideo, LanguageBindVideoTokenizer, LanguageBindVideoProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Video_FT'  # also 'LanguageBind/LanguageBind_Video'
model = LanguageBindVideo.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
video_process = LanguageBindVideoProcessor(model.config, tokenizer)

model.eval()
data = video_process(["your/video.mp4"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Audio

import torch
from languagebind import LanguageBindAudio, LanguageBindAudioTokenizer, LanguageBindAudioProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Audio_FT'  # also 'LanguageBind/LanguageBind_Audio'
model = LanguageBindAudio.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindAudioTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
audio_process = LanguageBindAudioProcessor(model.config, tokenizer)

model.eval()
data = audio_process([r"your/audio.wav"], ['your audio.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Image

Note that our image encoder is the same as OpenCLIP's. Unlike the other modalities, it is not fine-tuned.

import torch
from languagebind import LanguageBindImage,  LanguageBindImageTokenizer,  LanguageBindImageProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Image'
model = LanguageBindImage.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
image_process = LanguageBindImageProcessor(model.config, tokenizer)

model.eval()
data = image_process([r"your/image.jpg"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)
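The five snippets above share one pattern: a `LanguageBind<Modality>` model class, a matching `...Tokenizer`, and a matching `...Processor`. The sketch below folds that pattern into one loader. `api_class_names` and `load_modality` are hypothetical helpers, not part of the `languagebind` package, and `load_modality` assumes the repository is installed and that the class names follow the pattern shown above.

```python
import importlib

# Modalities for which the snippets above show a model/tokenizer/processor triple.
MODALITIES = ("Thermal", "Depth", "Video", "Audio", "Image")

def api_class_names(modality):
    """Return the (model, tokenizer, processor) class names for a modality."""
    key = modality.capitalize()
    if key not in MODALITIES:
        raise ValueError(f"unknown modality: {modality!r}")
    base = f"LanguageBind{key}"
    return base, f"{base}Tokenizer", f"{base}Processor"

def load_modality(modality, pretrained_ckpt, cache_dir="./cache_dir"):
    """Load the model and build its processor, mirroring the per-modality snippets."""
    lb = importlib.import_module("languagebind")  # needs the LanguageBind repo on the path
    model_cls, tok_cls, proc_cls = (getattr(lb, n) for n in api_class_names(modality))
    model = model_cls.from_pretrained(pretrained_ckpt, cache_dir=cache_dir)
    tokenizer = tok_cls.from_pretrained(pretrained_ckpt, cache_dir=cache_dir)
    processor = proc_cls(model.config, tokenizer)
    model.eval()
    return model, processor

print(api_class_names("thermal"))
```

With the package installed, `load_modality("video", "LanguageBind/LanguageBind_Video_FT")` would stand in for the first four lines of the video snippet.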

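💥 VIDAL-10M

See DATASETS.md for dataset details.

🗝️ Training and Validation

See TRAIN_AND_VALIDATE.md for training and validation instructions.

👍 Acknowledgements

OpenCLIP: an open-source pretraining framework.
CLIP4Clip: an open-source video-text retrieval framework.
[sRGB-TIR](https://github.com/rpmsnu/sRGB-TIR): an open-source framework for generating infrared (thermal) images.
GLPN: an open-source framework for generating depth images.

🔒 License

Most of this project is released under the MIT license; see the [LICENSE](https://github.com/PKU-YuanGroup/LanguageBind/blob/main/LICENSE) file.
The dataset of this project is released under the CC-BY-NC 4.0 license; see the [DATASET_LICENSE](https://github.com/PKU-YuanGroup/LanguageBind/blob/main/DATASET_LICENSE) file.

✏️ Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil:.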

@misc{zhu2023languagebind,
      title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment}, 
      author={Bin Zhu and Bin Lin and Munan Ning and Yang Yan and Jiaxi Cui and Wang HongFa and Yatian Pang and Wenhao Jiang and Junwu Zhang and Zongwei Li and Cai Wan Zhang and Zhifeng Li and Wei Liu and Li Yuan},
      year={2023},
      eprint={2310.01852},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

✨ Star History

[![Star History](https://api.star-history.com/svg?repos=PKU-YuanGroup/LanguageBind&type=Date)](https://star-history.com/#PKU-YuanGroup/LanguageBind&Date)