Cephalo Idefics 2 Vision 8b Alpha
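🚀 Cephalo Models
Cephalo is a series of multimodal vision large language models (V-LLMs) focused on materials science, designed to integrate visual and linguistic data for advanced understanding and interaction in human-AI or multi-agent AI frameworks. It can process multiple input types, such as images and text, and has broad applications in image captioning, visual question answering, and multimodal content generation.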
🚀 Quick Start
Environment Setup
Make sure you have installed the required libraries, such as torch, transformers, Pillow, and requests.
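Before loading the model, it can help to confirm that the libraries listed above are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages and the REQUIRED mapping are illustrative and not part of the model's tooling.

```python
from importlib.util import find_spec

# Map pip package names to import names (Pillow is imported as PIL).
REQUIRED = {
    "torch": "torch",
    "transformers": "transformers",
    "Pillow": "PIL",
    "requests": "requests",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, import_name in required.items()
        if find_spec(import_name) is None
    ]


print("Missing packages:", missing_packages() or "none")
```

Anything reported as missing can be installed with pip before continuing.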
Example Code
The following code shows how to get started quickly on a GPU:
import torch
import requests
from PIL import Image
from tqdm.notebook import tqdm
from transformers import AutoProcessor, Idefics2ForConditionalGeneration

DEVICE = 'cuda:0'
model_id = 'lamm-mit/Cephalo-Idefics-2-vision-8b-alpha'

model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # if your GPU allows
    _attn_implementation="flash_attention_2",  # make sure Flash Attention 2 is installed
    trust_remote_code=True,
).to(DEVICE)

processor = AutoProcessor.from_pretrained(
    model_id,
    do_image_splitting=True,
)
For more on model optimization, including quantization, see the later sections.
Simple Inference Example
from transformers.image_utils import load_image

image = load_image("https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg")

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Get inputs using the processor
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
Convenience Inference Function
from IPython.display import display, Markdown, HTML

# Note: ensure_list, is_url and format_conversation are helper functions
# that are not defined in this document; define them before calling this function.

def ask_about_image(model, processor, question,
                    images_input=[],
                    verbatim=False,
                    temperature=0.1,
                    show_image=False,
                    system="You are a biomaterials scientist who responds accurately. ",
                    init_instr="",
                    show_conversation=True,
                    max_new_tokens=256,
                    messages=None,
                    images=None,
                    use_Markdown=False,
                    ):
    # Use fresh lists per call; mutable default arguments would leak state between calls
    messages = [] if messages is None else messages
    images = [] if images is None else images

    query = question
    images_input = ensure_list(images_input)

    # Load images only when none were passed in from a previous turn
    if len(images) == 0:
        if len(images_input) > 0:
            for image in tqdm(images_input):
                if is_url(image):
                    image = load_image(image)
                images.append(image)
                if show_image:
                    display(image)

    if len(messages) == 0:
        # First turn: system/instruction text, then the images, then the question
        base_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": system + init_instr},
                # Image messages will be added dynamically here
                {"type": "text", "text": query}
            ]
        }
        # Add image messages dynamically
        image_messages = [{"type": "image"} for _ in images_input]
        base_message["content"][1:1] = image_messages  # Insert image messages before the last text message
        # Append the constructed message to messages list
        messages.append(base_message)
    else:
        # Follow-up turn: append the new question only
        messages.append(
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": query}
                ]
            }
        )

    if verbatim:
        print(messages)

    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[text.strip()], images=images, return_tensors="pt", padding=True).to(DEVICE)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, temperature=temperature, do_sample=True)
    # Decode only the newly generated tokens, skipping the prompt
    generated_texts = processor.batch_decode(generated_ids[:, inputs["input_ids"].size(1):], skip_special_tokens=True)

    messages.append(
        {
            "role": "assistant",
            "content": [{"type": "text", "text": generated_texts[0]}]
        }
    )
    formatted_conversation = format_conversation(messages, images)

    # Display the formatted conversation, e.g. in Jupyter Notebook
    if show_conversation:
        if use_Markdown:
            display(Markdown(formatted_conversation))
        else:
            display(HTML(formatted_conversation))

    return generated_texts, messages, images
question = "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."
url1 = "https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg"

response, messages, images = ask_about_image(
    model, processor, question,
    images_input=[url1],
    temperature=0.1,
    system='',
    init_instr='You carefully study the image, and respond accurately, but succinctly. Think step-by-step.\n\n',
    show_conversation=True,
    max_new_tokens=512,
    messages=[],
    images=[],
)
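ask_about_image calls three helpers, ensure_list, is_url and format_conversation, that are not defined in this document. The following is a minimal sketch of what they might look like, not the original implementations; in particular, the HTML layout produced by format_conversation is an assumption. They need to be defined before ask_about_image is called.

```python
from html import escape
from urllib.parse import urlparse


def ensure_list(obj):
    """Wrap a single item in a list; return lists unchanged."""
    return obj if isinstance(obj, list) else [obj]


def is_url(value):
    """Return True if value is a string that looks like an http(s) URL."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def format_conversation(messages, images=None):
    """Render the message list as simple HTML (hypothetical layout).

    Image entries are shown as an "[image]" placeholder; the images
    argument is accepted for signature compatibility but not rendered.
    """
    rows = []
    for message in messages:
        parts = []
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(escape(item["text"]))
            elif item["type"] == "image":
                parts.append("[image]")
        role = escape(message["role"].capitalize())
        rows.append(f"<p><b>{role}:</b> {' '.join(parts)}</p>")
    return "\n".join(rows)
```

Because is_url returns False for anything that is not a string, already loaded PIL images can be passed in images_input alongside URLs.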
Example Output
Image credit: Vaishakh Manohar
The image depicts a group of ants moving in a coordinated manner to climb a vertical surface. This behavior is known as cooperative climbing and involves the use of multiple agents working together to achieve a common goal. The relevance for materials design lies in the potential application of multi-agent AI in developing new materials with improved properties through the collaboration of multiple agents.
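✨ Key Features
- Multimodal fusion: integrates visual and linguistic data to understand and interact with complex scenes.
- Novel dataset generation: uses advanced algorithms to extract images and text descriptions from complex PDF documents, producing high-quality image-text pairs for training.
- Broad applicability: can be used for image captioning, visual question answering, multimodal content generation, and more.
- Flexible input: handles multiple input types, such as images and text.
📦 Installation Guide
Make sure you have a Python environment installed, then install the required libraries with the following command: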
pip install torch transformers Pillow requests
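💻 Usage Examples
Basic Usage
The code examples in the Quick Start section above show how to load the model, process inputs, and run inference.
Advanced Usage
Setting the Chat Template Manually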
IDEFICS2_CHAT_TEMPLATE = "{% for message in messages %}{{message['role'].capitalize()}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
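from transformers import AutoTokenizer
# base_model_id is not defined in this document; set it to the ID of the model whose tokenizer you want to load
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)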
tokenizer.chat_template = IDEFICS2_CHAT_TEMPLATE
processor.tokenizer = tokenizer
Chat Format Examples
Single-turn conversation:
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant:
Multi-turn conversation:
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant: The image depicts ants climbing a vertical surface using their legs and claws. This behavior is observed in nature and can inspire the design of multi-agent AI systems that mimic the coordinated movement of these insects. The relevance lies in the potential application of such systems in robotics and materials science, where efficient and adaptive movement is crucial.<end_of_utterance>
User: How could this be used to design a fracture resistant material?<end_of_utterance>
Assistant:
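The formats above follow directly from the chat template. As an illustration, here is a plain-Python sketch that mirrors the template's logic for a list of messages. The function name render_idefics2_chat is hypothetical; in practice processor.apply_chat_template does this work.

```python
def render_idefics2_chat(messages, add_generation_prompt=True):
    """Mirror the Jinja chat template: role, separator, content, end marker."""
    parts = []
    for message in messages:
        parts.append(message["role"].capitalize())
        # The template emits ":" when the turn starts with an image, ": " otherwise
        first_is_image = message["content"][0]["type"] == "image"
        parts.append(":" if first_is_image else ": ")
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(item["text"])
            elif item["type"] == "image":
                parts.append("<image>")
        parts.append("<end_of_utterance>\n")
    if add_generation_prompt:
        parts.append("Assistant:")
    return "".join(parts)


example = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Think step-by-step.\n\n"},
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
print(render_idefics2_chat(example))
```

Each turn ends with <end_of_utterance>, and the trailing "Assistant:" cues the model to generate the next reply.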
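📚 Detailed Documentation
Model Overview
The Cephalo models are a series of multimodal vision large language models focused on materials science. They combine a vision encoder model with an autoregressive transformer to handle complex natural language understanding tasks. This model is based on HuggingFaceM4/idefics2-8b-chatty and was trained on scientific text-image data extracted from Wikipedia and scientific papers.
Dataset Generation
The dataset used to train the vision model is generated with advanced algorithms that accurately detect and separate images and their corresponding text descriptions in complex PDF documents. The steps are: extract images and captions from the PDFs, use large language models (LLMs) for natural language processing to create well-reasoned image-text pairs, and then refine and validate those pairs with LLM-based NLP processing to ensure the training data is high quality and contextually relevant.
Model Architecture
The architecture combines a vision encoder model with an autoregressive transformer to handle complex natural language understanding tasks.
Model Applications
Cephalo models can be used in a range of scenarios, such as:
- Image captioning: generating accurate text descriptions for images.
- Visual question answering: answering questions about images.
- Multimodal content generation: generating relevant multimodal content from image and text inputs.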
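Model Optimization
Half-Precision Inference
If your GPU supports it, you can load the model and run inference in half precision (torch.float16 or torch.bfloat16):
import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
    torch_dtype=torch.float16,  # added: load weights in half precision
).to(DEVICE)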
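Whether float16 or bfloat16 is the better choice depends on the hardware. The sketch below shows one way to pick a dtype at runtime; the helper name choose_dtype_name and the float32 fallback on machines without CUDA are assumptions for illustration, not part of the model card. The torch import is guarded so the selection logic can be read and run on its own.

```python
def choose_dtype_name(cuda_available, bf16_supported):
    """Pick a dtype name: bfloat16 if supported, else float16; float32 without CUDA."""
    if not cuda_available:
        return "float32"
    return "bfloat16" if bf16_supported else "float16"


try:
    import torch
except ImportError:  # torch not installed; only the selection logic is usable
    torch = None

if torch is not None:
    has_cuda = torch.cuda.is_available()
    # Only query bf16 support when a CUDA device is present
    has_bf16 = has_cuda and torch.cuda.is_bf16_supported()
    torch_dtype = getattr(torch, choose_dtype_name(has_cuda, has_bf16))
    print("Selected dtype:", torch_dtype)
```

The resulting torch_dtype can then be passed to from_pretrained in place of a hard-coded value.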
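Vision Encoder Efficiency
If your GPU memory is limited, you can take the following measures:
- Disable image splitting: add do_image_splitting=False when initializing the processor (AutoProcessor.from_pretrained).
processor = AutoProcessor.from_pretrained(
    f"{model_id}",
    do_image_splitting=False
)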
- Lower the maximum image resolution: add size= {"longest_edge": 448, "shortest_edge": 378} when initializing the processor, and adjust the value of longest_edge as needed (the default is 980). Multiples of 14 are recommended.
processor = AutoProcessor.from_pretrained(
    f"{model_id}",
    size= {"longest_edge": 448, "shortest_edge": 378}
)
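Since multiples of 14 are recommended for the edge lengths, a small helper can round arbitrary target values down to the nearest valid size. This is a sketch only; the names snap_down and make_size are hypothetical.

```python
def snap_down(value, base=14):
    """Round value down to the nearest positive multiple of base."""
    if value < base:
        raise ValueError(f"value must be at least {base}")
    return (value // base) * base


def make_size(longest_edge, shortest_edge, base=14):
    """Build a size dict for the processor with both edges snapped to multiples of base."""
    return {
        "longest_edge": snap_down(longest_edge, base),
        "shortest_edge": snap_down(shortest_edge, base),
    }


print(make_size(500, 400))
```

The returned dictionary can be passed as the size argument of AutoProcessor.from_pretrained.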
Faster Generation with Flash Attention 2
Make sure flash-attn is installed, and add _attn_implementation="flash_attention_2" when loading the model:
import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
    torch_dtype=torch.bfloat16,  # added
    _attn_implementation="flash_attention_2",  # added
).to(DEVICE)
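If flash-attn may not be present on every machine, the attention setting can be chosen at runtime. The following sketch checks for the flash_attn package and otherwise falls back to "eager"; the helper name and the choice of "eager" as the fallback are assumptions.

```python
from importlib.util import find_spec


def pick_attn_implementation(flash_attn_installed=None):
    """Return "flash_attention_2" if flash_attn is importable, else "eager"."""
    if flash_attn_installed is None:
        flash_attn_installed = find_spec("flash_attn") is not None
    return "flash_attention_2" if flash_attn_installed else "eager"


print("Attention implementation:", pick_attn_implementation())
```

The returned string can be used as the value of the attention implementation argument when loading the model.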
4-Bit Quantization
Use the bitsandbytes library for 4-bit quantization. Make sure accelerate and bitsandbytes are installed:
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
    torch_dtype=torch.bfloat16,  # added
    quantization_config=quantization_config,  # added
).to(DEVICE)  # some transformers/bitsandbytes versions reject .to() on quantized models; drop it if it raises
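To get a feel for what quantization saves, the weight memory can be estimated from the parameter count and the bits per parameter. This is a back-of-the-envelope sketch: the 8e9 parameter count is an assumption taken from the "8b" in the model name, and the estimate ignores quantization metadata, activations, and the KV cache.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters, per the model name

for label, bits in [("float32", 32), ("bfloat16/float16", 16), ("4-bit", 4)]:
    print(f"{label:>18}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Actual usage will be higher than these figures, but the ratio between precisions gives a useful first guide when choosing a configuration for a given GPU.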
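🔧 Technical Details
Model Architecture
Cephalo models combine a vision encoder model with an autoregressive transformer to handle complex natural language understanding tasks. The vision encoder processes image inputs, and the autoregressive transformer generates the language output.
Dataset Generation
The dataset generation process uses advanced algorithms to extract images and text descriptions from complex PDF documents. The steps are:
- Image and caption extraction: extract images and their corresponding captions from PDFs.
- Natural language processing: use large language models (LLMs) to process the extracted text and create image-text pairs.
- Refinement and validation: refine and validate the image-text pairs with LLM-based NLP processing to ensure the data is high quality and contextually relevant.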
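The three steps above can be sketched as a small pipeline. This is not the authors' implementation: the input records are assumed to have been extracted from PDFs already, the describe and is_valid callables are stand-ins for the LLM-based steps, and all names and the sample data are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ImageTextPair:
    image_id: str
    text: str


def build_pairs(figures, describe):
    """Pair each extracted figure with a description derived from its caption.

    describe stands in for the LLM step that turns a raw caption into text.
    Figures without a caption are skipped.
    """
    pairs = []
    for fig in figures:
        caption = fig.get("caption", "").strip()
        if caption:
            pairs.append(ImageTextPair(fig["image_id"], describe(caption)))
    return pairs


def refine_and_validate(pairs, is_valid):
    """Keep only pairs accepted by is_valid (a stand-in for LLM-based validation)."""
    return [pair for pair in pairs if is_valid(pair)]


# Toy input, invented for illustration only
figures = [
    {"image_id": "fig1", "caption": " Fig. 1:  SEM image of a spider silk fiber. "},
    {"image_id": "fig2", "caption": ""},
]
pairs = build_pairs(figures, describe=lambda c: " ".join(c.split()))
dataset = refine_and_validate(pairs, is_valid=lambda p: len(p.text.split()) >= 5)
print(dataset)
```

Passing the LLM steps in as callables keeps the pipeline structure independent of whichever model is used for description and validation.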
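Training Data
The model was trained on scientific text-image data extracted from Wikipedia and scientific papers.
Model Optimization
Model optimization includes half-precision inference, vision encoder efficiency settings, faster generation with Flash Attention 2, and 4-bit quantization, all aimed at improving the model's performance and efficiency.
📄 License
This project is released under the Apache 2.0 license.
📖 Citation
Please cite this model as follows: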
@article{Buehler_Cephalo_2024,
title={Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design},
author={Markus J. Buehler},
journal={arXiv preprint arXiv:2405.19076},
year={2024}
}








