MoE-LLaVA-Qwen-1.8B-4e: Open-Source Vision-Language Model for Efficient Multimodal Learning


MoE-LLaVA-Qwen-1.8B-4e

Developed by LanguageBind

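MoE-LLaVA is a large vision-language model built on a mixture-of-experts (MoE) architecture. It achieves efficient multimodal learning by activating only a sparse subset of its parameters.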

Image-Text-to-Text

Transformers

Open Source License: Apache-2.0 #MixtureOfExperts #SparseActivation #MultimodalLearning

Downloads 176

Release Time : 1/23/2024

Model Overview

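MoE-LLaVA combines visual and language understanding and uses a mixture-of-experts architecture for efficient multimodal interaction, keeping performance high while reducing the number of parameters activated per token.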

Model Features

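Efficient parameter use

With only about 3B sparsely activated parameters, it matches the performance of 7B dense models.

Fast training

Training completes within 2 days on 8 V100 GPUs.

Strong performance

Outperforms larger models on several visual understanding tasks.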

Model Capabilities

Visual question answering

Image understanding

Multimodal reasoning

Object recognition

Image captioning

Use Cases

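Intelligent assistants

Image content Q&A

Answers a wide range of user questions about the content of an image.

Outperforms LLaVA-1.5-13B on the object hallucination benchmark.

Content understanding

Complex scene understanding

Understands images of complex scenes that contain multiple objects.

Reaches a level comparable to LLaVA-1.5-7B on several visual understanding datasets.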

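🚀 MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

MoE-LLaVA is a mixture-of-experts model for large vision-language models. It reaches high performance with fewer activated parameters by learning multimodal interactions through a simple baseline and sparse pathways.

🚀 Quick Start

Try the demo

The following commands launch our Web demo, which integrates all features currently supported by MoE-LLaVA. An online demo is also available on Huggingface Spaces.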

# Use phi2
deepspeed --include localhost:0 moellava/serve/gradio_web_server.py --model-path "LanguageBind/MoE-LLaVA-Phi2-2.7B-4e"
# Use qwen
deepspeed --include localhost:0 moellava/serve/gradio_web_server.py --model-path "LanguageBind/MoE-LLaVA-Qwen-1.8B-4e"
# Use stablelm
deepspeed --include localhost:0 moellava/serve/gradio_web_server.py --model-path "LanguageBind/MoE-LLaVA-StableLM-1.6B-4e"

Command-line inference

# Use phi2
deepspeed --include localhost:0 moellava/serve/cli.py --model-path "LanguageBind/MoE-LLaVA-Phi2-2.7B-4e" --image-file "image.jpg"
# Use qwen
deepspeed --include localhost:0 moellava/serve/cli.py --model-path "LanguageBind/MoE-LLaVA-Qwen-1.8B-4e" --image-file "image.jpg"
# Use stablelm
deepspeed --include localhost:0 moellava/serve/cli.py --model-path "LanguageBind/MoE-LLaVA-StableLM-1.6B-4e" --image-file "image.jpg"
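When inference is launched from a script rather than typed by hand, the command lines above can be assembled programmatically. The following is a minimal sketch: `build_cli_command` is a hypothetical helper, not part of MoE-LLaVA, and it uses only the flags that appear in the commands above.

```python
import shlex


def build_cli_command(model_path, image_file, gpu=0):
    """Assemble the argv list for the CLI inference command (hypothetical helper).

    Mirrors the documented command: deepspeed, the GPU selection, the CLI
    script, then the model path and image file.
    """
    return [
        "deepspeed",
        "--include", f"localhost:{gpu}",
        "moellava/serve/cli.py",
        "--model-path", model_path,
        "--image-file", image_file,
    ]


cmd = build_cli_command("LanguageBind/MoE-LLaVA-Qwen-1.8B-4e", "image.jpg")
# Print a shell-safe rendering; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Passing the list form to `subprocess.run(cmd, check=True)` avoids shell quoting problems when the image path contains spaces.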

Load the model locally

To load a model locally (e.g. LanguageBind/MoE-LLaVA), use the code snippet below. Save it as predict.py and run it with the following command:

deepspeed predict.py

import torch
from PIL import Image
from moellava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from moellava.conversation import conv_templates, SeparatorStyle
from moellava.model.builder import load_pretrained_model
from moellava.utils import disable_torch_init
from moellava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

def main():
    disable_torch_init()
    image = 'moellava/serve/examples/extreme_ironing.jpg'
    inp = 'What is unusual about this image?'
    model_path = 'LanguageBind/MoE-LLaVA-Phi2-2.7B-4e'  # LanguageBind/MoE-LLaVA-Qwen-1.8B-4e or LanguageBind/MoE-LLaVA-StableLM-1.6B-4e
    device = 'cuda'
    load_4bit, load_8bit = False, False  # FIXME: Deepspeed support 4bit or 8bit?
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device)
    image_processor = processor['image']
    conv_mode = "phi"  # use "qwen" or "stablelm" to match the chosen model_path
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles
    # Open the file as an RGB image before preprocessing, then move the tensor to the model's device in fp16.
    image_tensor = image_processor.preprocess(Image.open(image).convert('RGB'), return_tensors='pt')['pixel_values'].to(model.device, dtype=torch.float16)

    # Echo the user's question under the user role.
    print(f"{roles[0]}: {inp}")
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip()
    print(outputs)

if __name__ == '__main__':
    main()
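The snippet above hard-codes `conv_mode = "phi"` and notes that `qwen` or `stablelm` should be used for the other checkpoints. A small lookup keeps the conversation template in step with the checkpoint. This is a sketch only: `pick_conv_mode` and `CONV_MODES` are hypothetical names, and matching on a substring of the checkpoint name is an assumption about how the three published checkpoints are named.

```python
# Substring of the checkpoint name -> conversation template key.
# The three template keys are the ones mentioned in the snippet above.
CONV_MODES = {
    "phi2": "phi",
    "qwen": "qwen",
    "stablelm": "stablelm",
}


def pick_conv_mode(model_path):
    """Return the conversation template key for a checkpoint path (hypothetical helper)."""
    name = model_path.lower()
    for key, mode in CONV_MODES.items():
        if key in name:
            return mode
    raise ValueError(f"No known conversation template for {model_path!r}")


print(pick_conv_mode("LanguageBind/MoE-LLaVA-Qwen-1.8B-4e"))
```

Raising on an unknown name, rather than falling back to a default, avoids silently prompting a model with the wrong template.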

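✨ Key Features

🔥 High performance with fewer parameters

With only 3B sparsely activated parameters, MoE-LLaVA performs comparably to LLaVA-1.5-7B on a range of visual understanding datasets and even surpasses LLaVA-1.5-13B on the object hallucination benchmark.

🚀 Simple baseline, learning multimodal interactions with sparse pathways

By adding a simple MoE tuning stage, MoE-LLaVA can be trained on 8 V100 GPUs within 2 days.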
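To make "sparsely activated parameters" concrete, the sketch below counts total versus per-token activated parameters for a model in which some layers hold several expert feed-forward blocks and only the top-k experts run for each token. The function name and every number in the example are illustrative assumptions, not figures taken from MoE-LLaVA.

```python
def param_counts(shared, per_expert, num_moe_layers, num_experts, top_k):
    """Return (total, activated) parameter counts for a simple MoE layout.

    shared         -- parameters used by every token (attention, embeddings, ...)
    per_expert     -- parameters in one expert block
    num_moe_layers -- number of layers that contain experts
    num_experts    -- experts per MoE layer
    top_k          -- experts that actually run for each token
    """
    total = shared + num_moe_layers * num_experts * per_expert
    activated = shared + num_moe_layers * top_k * per_expert
    return total, activated


# Toy numbers chosen only to make the arithmetic easy to follow.
total, activated = param_counts(shared=100, per_expert=10,
                                num_moe_layers=4, num_experts=4, top_k=2)
print(total, activated)
```

With 4 experts and top-2 routing, half of the expert parameters are idle for any given token, which is why the activated count can sit well below the total.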

📦 Installation

Requirements

Python >= 3.10
PyTorch == 2.0.1
CUDA Version >= 11.7
Transformers == 4.36.2
Tokenizers == 0.15.1
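Because several of these pins are exact, it can help to check an existing environment before installing. The sketch below compares installed package versions against the list above using the standard library's `importlib.metadata`. The helper names are hypothetical, the version parser is deliberately simple (it keeps leading numeric components and ignores suffixes such as `+cu117`), and the CUDA version is not checked.

```python
import sys
from importlib import metadata

# Pins taken from the requirements list above (PyPI distribution names).
REQUIREMENTS = {
    "torch": ("==", "2.0.1"),
    "transformers": ("==", "4.36.2"),
    "tokenizers": ("==", "0.15.1"),
}


def parse_version(text):
    """Turn '2.0.1+cu117' into (2, 0, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(installed, op, required):
    """Compare two version strings with '==' or '>='."""
    a, b = parse_version(installed), parse_version(required)
    return a == b if op == "==" else a >= b


def check_environment(requirements=REQUIREMENTS):
    """Return {package: status string} for each pinned requirement."""
    report = {"python": "ok" if sys.version_info >= (3, 10) else "expected >=3.10"}
    for name, (op, required) in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        status = "ok" if satisfies(installed, op, required) else f"expected {op}{required}"
        report[name] = f"{installed} ({status})"
    return report


for package, status in check_environment().items():
    print(f"{package}: {status}")
```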

Installation steps

git clone https://github.com/PKU-YuanGroup/MoE-LLaVA
cd MoE-LLaVA
conda create -n moellava python=3.10 -y
conda activate moellava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

# The steps below are optional and apply to Qwen models
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# The step below is optional and may be slow to install
# pip install csrc/layer_norm
# The step below is not needed if the flash-attn version is higher than 2.1.1
# pip install csrc/rotary

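📚 Documentation

Training and validation

For training and validation instructions, see TRAIN.md and EVAL.md.

Custom models

For instructions on customizing MoE-LLaVA, see CUSTOM.md.

Visualization

For visualization instructions, see VISUALIZATION.md.

🤖 API

All of our code is open source. To load a model locally (e.g. LanguageBind/MoE-LLaVA), use the code snippet provided above.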

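🔧 Technical Details

MoE-LLaVA introduces a mixture-of-experts (MoE) mechanism so that only a subset of parameters is active for each token, which lets it reach high performance with fewer activated parameters. Training adds a simple MoE tuning stage and learns multimodal interactions through sparse pathways; it completes within 2 days on 8 V100 GPUs. Across a range of visual understanding datasets, MoE-LLaVA performs on par with, or better than, models with more parameters.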
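The released checkpoints are named "×4-Top2", i.e. four experts with two selected per token. The sketch below shows generic top-k expert routing in plain Python to illustrate the idea. It is not MoE-LLaVA's implementation: the function names, the toy experts, and the choice to renormalize the selected gate weights are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are the softmax probabilities of the selected experts,
    renormalized to sum to 1 (an assumption of this sketch).
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(token, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    output = [0.0] * len(token)
    for index, weight in top_k_route(gate_logits, k):
        expert_out = experts[index](token)
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return output


# Four toy experts; each simply scales its input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, -1.0, 2.0]  # the router prefers experts 1 and 3

routes = top_k_route(gate_logits, k=2)
mixed = moe_forward([1.0, 2.0], experts, gate_logits, k=2)
print(routes, mixed)
```

Only two of the four experts are evaluated for the token, which is the source of the compute savings: the other experts' parameters exist in the model but are not touched for this input.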

📦 Model Zoo

Model	LLM	Checkpoint	Avg	VQAv2	GQA	VizWiz	SQA	T-VQA	POPE	MM-Bench	LLaVA-Bench-Wild	MM-Vet
MoE-LLaVA-1.6B×4-Top2	1.6B	LanguageBind/MoE-LLaVA-StableLM-1.6B-4e	60.0	76.0	60.4	37.2	62.6	47.8	84.3	59.4	85.9	26.1
MoE-LLaVA-1.8B×4-Top2	1.8B	LanguageBind/MoE-LLaVA-Qwen-1.8B-4e	60.2	76.2	61.5	32.6	63.1	48.0	87.0	59.6	88.7	25.3
MoE-LLaVA-2.7B×4-Top2	2.7B	LanguageBind/MoE-LLaVA-Phi2-2.7B-4e	63.9	77.1	61.1	43.4	68.7	50.2	85.0	65.5	93.2	31.1
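The average column in the table above appears to be the unweighted mean of the nine benchmark scores in each row, rounded to one decimal place. The short check below recomputes it from the table values. The dictionary names and the helper are illustrative, and the averaging rule is inferred from the numbers rather than stated by the authors.

```python
# Scores copied from the table above, in column order:
# VQAv2, GQA, VizWiz, SQA, T-VQA, POPE, MM-Bench, LLaVA-Bench-Wild, MM-Vet
SCORES = {
    "MoE-LLaVA-StableLM-1.6B-4e": [76.0, 60.4, 37.2, 62.6, 47.8, 84.3, 59.4, 85.9, 26.1],
    "MoE-LLaVA-Qwen-1.8B-4e": [76.2, 61.5, 32.6, 63.1, 48.0, 87.0, 59.6, 88.7, 25.3],
    "MoE-LLaVA-Phi2-2.7B-4e": [77.1, 61.1, 43.4, 68.7, 50.2, 85.0, 65.5, 93.2, 31.1],
}

# Average column as reported in the table.
REPORTED_AVG = {
    "MoE-LLaVA-StableLM-1.6B-4e": 60.0,
    "MoE-LLaVA-Qwen-1.8B-4e": 60.2,
    "MoE-LLaVA-Phi2-2.7B-4e": 63.9,
}


def mean_score(values):
    """Unweighted mean, rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


for name, values in SCORES.items():
    print(f"{name}: computed {mean_score(values)}, reported {REPORTED_AVG[name]}")
```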

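🙌 Related Projects

Video-LLaVA: a framework that enables the model to make efficient use of unified visual tokens.
LanguageBind: an open-source, language-based retrieval framework covering five modalities.

👍 Acknowledgements

LLaVA: the codebase we built upon, an efficient large language and vision assistant.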

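📄 License

Most of this project is released under the Apache 2.0 license; see the LICENSE file for details.
This service is a research preview intended for non-commercial use only, and is subject to the model license of LLaMA, the terms of use for data generated by OpenAI, and the privacy policy of ShareGPT. Please contact us if you find any potential violation.

✏️ Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil:.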

@misc{lin2024moellava,
      title={MoE-LLaVA: Mixture of Experts for Large Vision-Language Models}, 
      author={Bin Lin and Zhenyu Tang and Yang Ye and Jiaxi Cui and Bin Zhu and Peng Jin and Junwu Zhang and Munan Ning and Li Yuan},
      year={2024},
      eprint={2401.15947},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@article{lin2023video,
  title={Video-LLaVA: Learning United Visual Representation by Alignment Before Projection},
  author={Lin, Bin and Zhu, Bin and Ye, Yang and Ning, Munan and Jin, Peng and Yuan, Li},
  journal={arXiv preprint arXiv:2311.10122},
  year={2023}
}