# Cephalo Idefics 2 Vision 8b Alpha
## 🚀 The Cephalo Model

Cephalo is a series of multimodal vision large language models (V-LLMs) focused on materials science, designed to integrate visual and linguistic data for advanced understanding and interaction in human-AI and multi-agent AI frameworks. It handles mixed inputs such as images and text, with broad applications in image captioning, visual question answering, and multimodal content generation.
## 🚀 Quick Start

### Environment Setup

Make sure the required libraries are installed: `torch`, `transformers`, `Pillow`, and `requests`.

### Example Code

Here is an example to get started quickly on a GPU:
```python
import torch
from PIL import Image
import requests
from transformers import AutoProcessor, Idefics2ForConditionalGeneration
from tqdm.notebook import tqdm

DEVICE = 'cuda:0'

model_id = 'lamm-mit/Cephalo-Idefics-2-vision-8b-alpha'

model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,                 # if your GPU allows
    _attn_implementation="flash_attention_2",   # make sure Flash Attention 2 is installed
    trust_remote_code=True,
).to(DEVICE)

processor = AutoProcessor.from_pretrained(
    f"{model_id}",
    do_image_splitting=True
)
```
See the sections below for more on model optimization, including quantization.

### Simple Inference Example
```python
from transformers.image_utils import load_image

image = load_image("https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg")

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."},
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Get inputs using the processor
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
```
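Note that `batch_decode` above returns the full sequence, including the echoed prompt. If you only want the model's answer, you can slice off the prompt tokens first (an optional tweak, mirroring what the convenience function below does):

```python
# Optional: decode only the newly generated tokens, dropping the echoed prompt
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
answer = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(answer)
```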
### Convenience Inference Function
```python
from IPython.display import display, Markdown, HTML

def ask_about_image(model, processor, question,
                    images_input=[],
                    verbatim=False,
                    temperature=0.1,
                    show_image=False,
                    system="You are a biomaterials scientist who responds accurately. ",
                    init_instr="",
                    show_conversation=True,
                    max_new_tokens=256,
                    messages=[],
                    images=[],
                    use_Markdown=False,
                    ):
    query = question
    images_input = ensure_list(images_input)

    # Load any URL inputs as PIL images on the first turn
    if len(images) == 0:
        if len(images_input) > 0:
            for image in tqdm(images_input):
                if is_url(image):
                    image = load_image(image)
                images.append(image)
                if show_image:
                    display(image)

    if len(messages) == 0:
        # First turn: system/instruction text, then images, then the question
        base_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": system + init_instr},
                # Image messages will be added dynamically here
                {"type": "text", "text": query}
            ]
        }
        # Ensure images_input is a list
        images_input = ensure_list(images_input)
        # Add image messages dynamically
        image_messages = [{"type": "image"} for _ in images_input]
        base_message["content"][1:1] = image_messages  # insert image messages before the final text message
        # Append the constructed message to the messages list
        messages.append(base_message)
    else:
        # Follow-up turn: text-only user message
        messages.append(
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": query}
                ]
            }
        )

    if verbatim:
        print(messages)

    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[text.strip()], images=images, return_tensors="pt", padding=True).to(DEVICE)

    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, temperature=temperature, do_sample=True)
    # Decode only the newly generated tokens
    generated_texts = processor.batch_decode(generated_ids[:, inputs["input_ids"].size(1):], skip_special_tokens=True)

    messages.append(
        {
            "role": "assistant",
            "content": [{"type": "text", "text": generated_texts[0]}]
        }
    )
    formatted_conversation = format_conversation(messages, images)

    # Display the formatted conversation, e.g. in a Jupyter notebook
    if show_conversation:
        if use_Markdown:
            display(Markdown(formatted_conversation))
        else:
            display(HTML(formatted_conversation))

    return generated_texts, messages, images
```
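The function relies on a few helpers (`ensure_list`, `is_url`, `format_conversation`) that are not defined in this card; the original repository ships its own versions, so treat the following as minimal illustrative stand-ins:

```python
# Minimal stand-ins for the undefined helpers used by ask_about_image
from urllib.parse import urlparse

def ensure_list(obj):
    """Wrap a single item in a list; pass lists through unchanged."""
    return obj if isinstance(obj, list) else [obj]

def is_url(s):
    """Heuristic check for an http(s) URL string."""
    return isinstance(s, str) and urlparse(s).scheme in ("http", "https")

def format_conversation(messages, images):
    """Render the message list as simple HTML for notebook display."""
    html = ""
    for msg in messages:
        texts = [c["text"] for c in msg["content"] if c["type"] == "text"]
        html += f"<p><b>{msg['role'].capitalize()}:</b> {' '.join(texts)}</p>"
    return html
```

With those defined, the function can be called directly: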
```python
question = "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."

url1 = "https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg"

response, messages, images = ask_about_image(model, processor, question,
                                             images_input=[url1],
                                             temperature=0.1,
                                             system='',
                                             init_instr='You carefully study the image, and respond accurately, but succinctly. Think step-by-step.\n\n',
                                             show_conversation=True,
                                             max_new_tokens=512, messages=[], images=[])
```
### Sample Output

*(Image credit: Vaishakh Manohar)*

> The image depicts a group of ants moving in a coordinated manner to climb a vertical surface. This behavior is known as cooperative climbing and involves the use of multiple agents working together to achieve a common goal. The relevance for materials design lies in the potential application of multi-agent AI in developing new materials with improved properties through the collaboration of multiple agents.
## ✨ Key Features

- **Multimodal fusion**: Integrates visual and linguistic data to understand and interact with complex scenes.
- **Novel dataset generation**: Uses advanced algorithms to extract images and text descriptions from complex PDF documents, producing high-quality image-text pairs for training.
- **Broad applications**: Suitable for image captioning, visual question answering, multimodal content generation, and more.
- **Flexible inputs**: Handles mixed inputs such as images and text.
## 📦 Installation

Make sure you have a working Python environment, then install the required libraries:

```bash
pip install torch transformers Pillow requests
```
## 💻 Usage Examples

### Basic Usage

The Quick Start code above shows how to load the model, prepare inputs, and run inference.

### Advanced Usage

#### Setting the Chat Template Manually
```python
from transformers import AutoTokenizer

IDEFICS2_CHAT_TEMPLATE = "{% for message in messages %}{{message['role'].capitalize()}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)  # model_id as defined in the Quick Start
tokenizer.chat_template = IDEFICS2_CHAT_TEMPLATE
processor.tokenizer = tokenizer
```
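To sanity-check the template, you can render a one-image message and confirm it matches the single-turn format shown below (a quick illustrative check, not part of the original card):

```python
# Render the chat template for a single-image user turn
demo = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ],
}]
print(processor.apply_chat_template(demo, add_generation_prompt=True))
# Expected shape: "User:<image>What is shown in this image?<end_of_utterance>\nAssistant:"
```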
#### Chat Format Examples

Single-turn conversation:

```
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant:
```

Multi-turn conversation:

```
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant: The image depicts ants climbing a vertical surface using their legs and claws. This behavior is observed in nature and can inspire the design of multi-agent AI systems that mimic the coordinated movement of these insects. The relevance lies in the potential application of such systems in robotics and materials science, where efficient and adaptive movement is crucial.<end_of_utterance>
User: How could this be used to design a fracture resistant material?<end_of_utterance>
Assistant:
```
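Multi-turn exchanges like the one above can be driven with the convenience function by passing the returned `messages` and `images` back in. A usage sketch continuing the earlier ant-image conversation:

```python
# Follow-up turn: reuse the conversation state returned by the first call
question_2 = "How could this be used to design a fracture resistant material?"
response, messages, images = ask_about_image(model, processor, question_2,
                                             images_input=[],
                                             temperature=0.1,
                                             system='', init_instr='',
                                             show_conversation=True,
                                             max_new_tokens=512,
                                             messages=messages, images=images)
```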
## 📚 Documentation

### Model Overview

Cephalo is a series of multimodal vision large language models focused on materials science. It combines a vision encoder model with an autoregressive transformer to handle complex natural-language understanding tasks. This model is built on HuggingFaceM4/idefics2-8b-chatty and trained on scientific text-image data extracted from Wikipedia and scientific papers.

### Dataset Generation

The dataset used to train the vision model was generated with advanced algorithms that accurately detect and separate images and their corresponding text descriptions from complex PDF documents. The process extracts images and captions from PDFs, uses large language models (LLMs) for natural language processing to create well-formed image-text pairs, and then refines and validates those pairs via LLM-based NLP processing, ensuring high-quality, contextually relevant training data.
### Model Architecture

The architecture combines a vision encoder model with an autoregressive transformer to handle complex natural-language understanding tasks.

### Model Applications

Cephalo can be used in a range of scenarios, such as:
- **Image captioning**: generating accurate text descriptions of images.
- **Visual question answering**: answering questions about images.
- **Multimodal content generation**: generating relevant multimodal content from image and text inputs.
### Model Optimization

#### Half-Precision Inference

If your GPU supports it, you can load and run inference in half precision (`torch.float16` or `torch.bfloat16`):
```diff
model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
+    torch_dtype=torch.float16,
).to(DEVICE)
```
#### Vision Encoder Efficiency

If you are limited in GPU memory, you can:

- **Deactivate image splitting**: pass `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`):

```python
processor = AutoProcessor.from_pretrained(
    f"{model_id}",
    do_image_splitting=False
)
```

- **Decrease the maximum image resolution**: pass `size={"longest_edge": 448, "shortest_edge": 378}` when initializing the processor; `longest_edge` can be adjusted as needed (the default is `980`), preferably to a multiple of 14:

```python
processor = AutoProcessor.from_pretrained(
    f"{model_id}",
    size={"longest_edge": 448, "shortest_edge": 378}
)
```
#### Speeding Up Generation with Flash Attention 2

Make sure `flash-attn` is installed and pass `_attn_implementation="flash_attention_2"` when loading the model:

```diff
model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
+    torch_dtype=torch.bfloat16,
+    _attn_implementation="flash_attention_2",
).to(DEVICE)
```
#### 4-Bit Quantization

Use the `bitsandbytes` library for 4-bit quantization; make sure `accelerate` and `bitsandbytes` are installed:

```diff
+ from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
+    torch_dtype=torch.bfloat16,
+    quantization_config=quantization_config,
)
```

Note that the trailing `.to(DEVICE)` used in the earlier examples is dropped here: bitsandbytes-quantized models do not support `.to()`, and the weights are placed on the GPU during loading.
## 🔧 Technical Details

### Model Architecture

Cephalo combines a vision encoder model with an autoregressive transformer for complex natural-language understanding tasks. The vision encoder processes the image inputs, while the autoregressive transformer generates the language output.

### Dataset Generation

Dataset generation uses advanced algorithms to extract images and text descriptions from complex PDF documents. The steps, sketched in code after this list, are:
- **Image and caption extraction**: extract images and their corresponding captions from PDFs.
- **Natural language processing**: use large language models (LLMs) to process the extracted text and create image-text pairs.
- **Refinement and validation**: refine and validate the image-text pairs via LLM-based NLP processing, ensuring high-quality, contextually relevant data.
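As a rough illustration of the first stage only (not the actual Cephalo pipeline; the use of PyMuPDF and all function names here are assumptions), the extraction step might look like:

```python
# Illustrative only: pull images plus surrounding page text from a PDF as
# candidate image-text pairs. A real pipeline would locate the true caption
# and refine/validate each pair with an LLM, as described above.
import fitz  # PyMuPDF

def extract_image_text_pairs(pdf_path):
    doc = fitz.open(pdf_path)
    pairs = []
    for page in doc:
        page_text = page.get_text()
        for xref, *_ in page.get_images(full=True):
            image_bytes = doc.extract_image(xref)["image"]
            pairs.append({"image": image_bytes, "context": page_text})
    return pairs
```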
### Training Data

The model was trained on scientific text-image data extracted from Wikipedia and scientific papers.

### Model Optimization

The available optimizations include half-precision inference, vision-encoder efficiency tweaks, faster generation with Flash Attention 2, and 4-bit quantization; these improve the model's performance and efficiency and can be combined, as sketched below.
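A sketch of a memory-constrained loading configuration assembling the settings from the optimization section above (requires `bitsandbytes`, `accelerate`, and `flash-attn`):

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig

model_id = "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha"

# 4-bit quantization with bfloat16 compute, plus Flash Attention 2
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
    quantization_config=quantization_config,
)

# Cheaper vision encoding: no image splitting, reduced resolution
processor = AutoProcessor.from_pretrained(
    model_id,
    do_image_splitting=False,
    size={"longest_edge": 448, "shortest_edge": 378},
)
```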
## 📄 License

This project is released under the Apache 2.0 license.

## 📖 Citation

Please cite the model as follows:
```bibtex
@article{Buehler_Cephalo_2024,
  title={Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design},
  author={Markus J. Buehler},
  journal={arXiv preprint arXiv:2405.19076},
  year={2024}
}
```