# Cephalo Idefics 2 Vision 8b Alpha
## 🚀 The Cephalo Model

Cephalo is a series of multimodal vision large language models (V-LLMs) focused on materials science, designed to integrate visual and linguistic data for advanced understanding and interaction in human-AI and multi-agent AI frameworks. It handles mixed inputs such as images and text, with broad applications in image captioning, visual question answering, and multimodal content generation.
## 🚀 Quick Start

### Environment Setup

Make sure the required libraries are installed: `torch`, `transformers`, `Pillow`, and `requests`.

### Example Code

Here is an example to get started quickly on a GPU:
```python
import torch
from PIL import Image
import requests
from transformers import AutoProcessor, Idefics2ForConditionalGeneration
from tqdm.notebook import tqdm

DEVICE = 'cuda:0'

model_id = 'lamm-mit/Cephalo-Idefics-2-vision-8b-alpha'

model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,                 # if your GPU allows
    _attn_implementation="flash_attention_2",   # make sure Flash Attention 2 is installed
    trust_remote_code=True,
).to(DEVICE)

processor = AutoProcessor.from_pretrained(
    f"{model_id}",
    do_image_splitting=True
)
```
See the sections below for more on model optimization, including quantization.

### Simple Inference Example
```python
from transformers.image_utils import load_image

image = load_image("https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg")

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."},
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Get inputs using the processor
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
```
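Note that `batch_decode` above returns the full sequence, including the echoed prompt. If you only want the model's answer, you can slice off the prompt tokens first (an optional tweak, mirroring what the convenience function below does):

```python
# Optional: decode only the newly generated tokens, dropping the echoed prompt
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
answer = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(answer)
```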
### Convenience Inference Function
```python
from IPython.display import display, Markdown, HTML

def ask_about_image(model, processor, question,
                    images_input=[],
                    verbatim=False,
                    temperature=0.1,
                    show_image=False,
                    system="You are a biomaterials scientist who responds accurately. ",
                    init_instr="",
                    show_conversation=True,
                    max_new_tokens=256,
                    messages=[],
                    images=[],
                    use_Markdown=False,
                    ):
    query = question
    images_input = ensure_list(images_input)

    # Load any URL inputs as PIL images on the first turn
    if len(images) == 0:
        if len(images_input) > 0:
            for image in tqdm(images_input):
                if is_url(image):
                    image = load_image(image)
                images.append(image)
                if show_image:
                    display(image)

    if len(messages) == 0:
        # First turn: system/instruction text, then images, then the question
        base_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": system + init_instr},
                # Image messages will be added dynamically here
                {"type": "text", "text": query}
            ]
        }
        # Ensure images_input is a list
        images_input = ensure_list(images_input)
        # Add image messages dynamically
        image_messages = [{"type": "image"} for _ in images_input]
        base_message["content"][1:1] = image_messages  # insert image messages before the final text message
        # Append the constructed message to the messages list
        messages.append(base_message)
    else:
        # Follow-up turn: text-only user message
        messages.append(
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": query}
                ]
            }
        )

    if verbatim:
        print(messages)

    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[text.strip()], images=images, return_tensors="pt", padding=True).to(DEVICE)

    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, temperature=temperature, do_sample=True)
    # Decode only the newly generated tokens
    generated_texts = processor.batch_decode(generated_ids[:, inputs["input_ids"].size(1):], skip_special_tokens=True)

    messages.append(
        {
            "role": "assistant",
            "content": [{"type": "text", "text": generated_texts[0]}]
        }
    )
    formatted_conversation = format_conversation(messages, images)

    # Display the formatted conversation, e.g. in a Jupyter notebook
    if show_conversation:
        if use_Markdown:
            display(Markdown(formatted_conversation))
        else:
            display(HTML(formatted_conversation))

    return generated_texts, messages, images
```
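The function relies on a few helpers (`ensure_list`, `is_url`, `format_conversation`) that are not defined in this card; the original repository ships its own versions, so treat the following as minimal illustrative stand-ins:

```python
# Minimal stand-ins for the undefined helpers used by ask_about_image
from urllib.parse import urlparse

def ensure_list(obj):
    """Wrap a single item in a list; pass lists through unchanged."""
    return obj if isinstance(obj, list) else [obj]

def is_url(s):
    """Heuristic check for an http(s) URL string."""
    return isinstance(s, str) and urlparse(s).scheme in ("http", "https")

def format_conversation(messages, images):
    """Render the message list as simple HTML for notebook display."""
    html = ""
    for msg in messages:
        texts = [c["text"] for c in msg["content"] if c["type"] == "text"]
        html += f"<p><b>{msg['role'].capitalize()}:</b> {' '.join(texts)}</p>"
    return html
```

With those defined, the function can be called directly: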
```python
question = "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."

url1 = "https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg"

response, messages, images = ask_about_image(model, processor, question,
                                             images_input=[url1],
                                             temperature=0.1,
                                             system='',
                                             init_instr='You carefully study the image, and respond accurately, but succinctly. Think step-by-step.\n\n',
                                             show_conversation=True,
                                             max_new_tokens=512, messages=[], images=[])
```
### Sample Output

*(Image credit: Vaishakh Manohar)*

> The image depicts a group of ants moving in a coordinated manner to climb a vertical surface. This behavior is known as cooperative climbing and involves the use of multiple agents working together to achieve a common goal. The relevance for materials design lies in the potential application of multi-agent AI in developing new materials with improved properties through the collaboration of multiple agents.
## ✨ Key Features

- **Multimodal fusion**: Integrates visual and linguistic data to understand and interact with complex scenes.
- **Novel dataset generation**: Uses advanced algorithms to extract images and text descriptions from complex PDF documents, producing high-quality image-text pairs for training.
- **Broad applications**: Suitable for image captioning, visual question answering, multimodal content generation, and more.
- **Flexible inputs**: Handles mixed inputs such as images and text.
## 📦 Installation

Make sure you have a working Python environment, then install the required libraries:

```bash
pip install torch transformers Pillow requests
```
## 💻 Usage Examples

### Basic Usage

The Quick Start code above shows how to load the model, prepare inputs, and run inference.

### Advanced Usage

#### Setting the Chat Template Manually
```python
from transformers import AutoTokenizer

IDEFICS2_CHAT_TEMPLATE = "{% for message in messages %}{{message['role'].capitalize()}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)  # model_id as defined in the Quick Start
tokenizer.chat_template = IDEFICS2_CHAT_TEMPLATE
processor.tokenizer = tokenizer
```
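To sanity-check the template, you can render a one-image message and confirm it matches the single-turn format shown below (a quick illustrative check, not part of the original card):

```python
# Render the chat template for a single-image user turn
demo = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ],
}]
print(processor.apply_chat_template(demo, add_generation_prompt=True))
# Expected shape: "User:<image>What is shown in this image?<end_of_utterance>\nAssistant:"
```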
#### Chat Format Examples

Single-turn conversation:

```
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant:
```

Multi-turn conversation:

```
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant: The image depicts ants climbing a vertical surface using their legs and claws. This behavior is observed in nature and can inspire the design of multi-agent AI systems that mimic the coordinated movement of these insects. The relevance lies in the potential application of such systems in robotics and materials science, where efficient and adaptive movement is crucial.<end_of_utterance>
User: How could this be used to design a fracture resistant material?<end_of_utterance>
Assistant:
```
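Multi-turn exchanges like the one above can be driven with the convenience function by passing the returned `messages` and `images` back in. A usage sketch continuing the earlier ant-image conversation:

```python
# Follow-up turn: reuse the conversation state returned by the first call
question_2 = "How could this be used to design a fracture resistant material?"
response, messages, images = ask_about_image(model, processor, question_2,
                                             images_input=[],
                                             temperature=0.1,
                                             system='', init_instr='',
                                             show_conversation=True,
                                             max_new_tokens=512,
                                             messages=messages, images=images)
```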
## 📚 Documentation

### Model Overview

Cephalo is a series of multimodal vision large language models focused on materials science. It combines a vision encoder model with an autoregressive transformer to handle complex natural-language understanding tasks. This model is built on HuggingFaceM4/idefics2-8b-chatty and trained on scientific text-image data extracted from Wikipedia and scientific papers.

### Dataset Generation

The dataset used to train the vision model was generated with advanced algorithms that accurately detect and separate images and their corresponding text descriptions from complex PDF documents. The process extracts images and captions from PDFs, uses large language models (LLMs) for natural language processing to create well-formed image-text pairs, and then refines and validates those pairs via LLM-based NLP processing, ensuring high-quality, contextually relevant training data.
### Model Architecture

The architecture combines a vision encoder model with an autoregressive transformer to handle complex natural-language understanding tasks.

### Model Applications

Cephalo can be used in a range of scenarios, such as:
- **Image captioning**: generating accurate text descriptions of images.
- **Visual question answering**: answering questions about images.
- **Multimodal content generation**: generating relevant multimodal content from image and text inputs.
### Model Optimization

#### Half-Precision Inference

If your GPU supports it, you can load and run inference in half precision (`torch.float16` or `torch.bfloat16`):
```diff
model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
+    torch_dtype=torch.float16,
).to(DEVICE)
```
#### Vision Encoder Efficiency

If you are limited in GPU memory, you can:

- **Deactivate image splitting**: pass `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`):

```python
processor = AutoProcessor.from_pretrained(
    f"{model_id}",
    do_image_splitting=False
)
```

- **Decrease the maximum image resolution**: pass `size={"longest_edge": 448, "shortest_edge": 378}` when initializing the processor; `longest_edge` can be adjusted as needed (the default is `980`), preferably to a multiple of 14:

```python
processor = AutoProcessor.from_pretrained(
    f"{model_id}",
    size={"longest_edge": 448, "shortest_edge": 378}
)
```
#### Speeding Up Generation with Flash Attention 2

Make sure `flash-attn` is installed and pass `_attn_implementation="flash_attention_2"` when loading the model:

```diff
model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
+    torch_dtype=torch.bfloat16,
+    _attn_implementation="flash_attention_2",
).to(DEVICE)
```
#### 4-Bit Quantization

Use the `bitsandbytes` library for 4-bit quantization; make sure `accelerate` and `bitsandbytes` are installed:

```diff
+ from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
+    torch_dtype=torch.bfloat16,
+    quantization_config=quantization_config,
)
```

Note that the trailing `.to(DEVICE)` used in the earlier examples is dropped here: bitsandbytes-quantized models do not support `.to()`, and the weights are placed on the GPU during loading.
## 🔧 Technical Details

### Model Architecture

Cephalo combines a vision encoder model with an autoregressive transformer for complex natural-language understanding tasks. The vision encoder processes the image inputs, while the autoregressive transformer generates the language output.

### Dataset Generation

Dataset generation uses advanced algorithms to extract images and text descriptions from complex PDF documents. The steps, sketched in code after this list, are:
- **Image and caption extraction**: extract images and their corresponding captions from PDFs.
- **Natural language processing**: use large language models (LLMs) to process the extracted text and create image-text pairs.
- **Refinement and validation**: refine and validate the image-text pairs via LLM-based NLP processing, ensuring high-quality, contextually relevant data.
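As a rough illustration of the first stage only (not the actual Cephalo pipeline; the use of PyMuPDF and all function names here are assumptions), the extraction step might look like:

```python
# Illustrative only: pull images plus surrounding page text from a PDF as
# candidate image-text pairs. A real pipeline would locate the true caption
# and refine/validate each pair with an LLM, as described above.
import fitz  # PyMuPDF

def extract_image_text_pairs(pdf_path):
    doc = fitz.open(pdf_path)
    pairs = []
    for page in doc:
        page_text = page.get_text()
        for xref, *_ in page.get_images(full=True):
            image_bytes = doc.extract_image(xref)["image"]
            pairs.append({"image": image_bytes, "context": page_text})
    return pairs
```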
### Training Data

The model was trained on scientific text-image data extracted from Wikipedia and scientific papers.

### Model Optimization

The available optimizations include half-precision inference, vision-encoder efficiency tweaks, faster generation with Flash Attention 2, and 4-bit quantization; these improve the model's performance and efficiency and can be combined, as sketched below.
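A sketch of a memory-constrained loading configuration assembling the settings from the optimization section above (requires `bitsandbytes`, `accelerate`, and `flash-attn`):

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig

model_id = "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha"

# 4-bit quantization with bfloat16 compute, plus Flash Attention 2
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
    quantization_config=quantization_config,
)

# Cheaper vision encoding: no image splitting, reduced resolution
processor = AutoProcessor.from_pretrained(
    model_id,
    do_image_splitting=False,
    size={"longest_edge": 448, "shortest_edge": 378},
)
```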
## 📄 License

This project is released under the Apache 2.0 license.

## 📖 Citation

Please cite the model as follows:
```bibtex
@article{Buehler_Cephalo_2024,
  title={Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design},
  author={Markus J. Buehler},
  journal={arXiv preprint arXiv:2405.19076},
  year={2024}
}
```