🚀 Cephalo - Multimodal Vision Large Language Model
Cephalo is a series of multimodal, materials-science-focused vision large language models (V-LLMs). It integrates visual and linguistic data, enabling advanced understanding and interaction in human-AI or multi-agent AI frameworks. It can handle diverse inputs such as images and text, facilitating applications like image captioning, visual question answering, and multimodal content generation.
✨ Features
- Innovative Dataset Generation: Employs advanced algorithms to extract images and their corresponding textual descriptions from complex PDF documents, creating high-quality image-text pairs for training.
- Multimodal Interaction: Interprets complex visual scenes, generates contextually accurate language descriptions, and answers queries.
- Broad Application Scope: Supports various applications in materials science, including the development of generative pipelines for 2D and 3D renderings of material microstructures.
📦 Installation
The original README does not provide dedicated installation steps; the model is loaded directly through the Hugging Face transformers library, as shown in the usage examples below.
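A minimal environment that covers the usage examples below would likely be the following (an assumption, not taken from the source):
pip install transformers torch pillow requests tqdm
pip install flash-attn   # optional, only needed for _attn_implementation="flash_attention_2"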
💻 Usage Examples
Basic Usage
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, Idefics2ForConditionalGeneration
from tqdm.notebook import tqdm

DEVICE = 'cuda:0'
model_id = 'lamm-mit/Cephalo-Idefics-2-vision-8b-alpha'

model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
).to(DEVICE)

processor = AutoProcessor.from_pretrained(
    model_id,
    do_image_splitting=True,
)
Advanced Usage
from transformers.image_utils import load_image
from IPython.display import display, Markdown, HTML

# Note: ensure_list, is_url, and format_conversation are small helper functions
# assumed to be defined elsewhere in the original notebook.
def ask_about_image(model, processor, question,
                    images_input=[],
                    verbatim=False,
                    temperature=0.1,
                    show_image=False,
                    system="You are a biomaterials scientist who responds accurately. ",
                    init_instr="",
                    show_conversation=True,
                    max_new_tokens=256,
                    messages=[],
                    images=[],
                    use_Markdown=False,
                    ):
    query = question
    images_input = ensure_list(images_input)

    # Load any image URLs unless a list of PIL images is already provided
    if len(images) == 0:
        if len(images_input) > 0:
            for image in tqdm(images_input):
                if is_url(image):
                    image = load_image(image)
                images.append(image)
                if show_image:
                    display(image)

    if len(messages) == 0:
        # First turn: system prompt, one image placeholder per input image, then the query
        base_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": system + init_instr},
                {"type": "text", "text": query},
            ],
        }
        images_input = ensure_list(images_input)
        image_messages = [{"type": "image"} for _ in images_input]
        base_message["content"][1:1] = image_messages
        messages.append(base_message)
    else:
        # Follow-up turn: append only the new question
        messages.append(
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": query},
                ],
            }
        )

    if verbatim:
        print(messages)

    # Build the prompt from the chat template and run generation
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[text.strip()], images=images, return_tensors="pt",
                       padding=True).to(DEVICE)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                   temperature=temperature, do_sample=True)
    # Decode only the newly generated tokens (drop the prompt)
    generated_texts = processor.batch_decode(generated_ids[:, inputs["input_ids"].size(1):],
                                             skip_special_tokens=True)

    # Record the assistant response so the conversation can be continued
    messages.append(
        {
            "role": "assistant",
            "content": [{"type": "text", "text": generated_texts[0]}],
        }
    )

    formatted_conversation = format_conversation(messages, images)
    if show_conversation:
        if use_Markdown:
            display(Markdown(formatted_conversation))
        else:
            display(HTML(formatted_conversation))

    return generated_texts, messages, images
question = "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."
url1 = "https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg"

response, messages, images = ask_about_image(model, processor, question,
                                              images_input=[url1],
                                              temperature=0.1,
                                              system='',
                                              init_instr='You carefully study the image, and respond accurately, but succinctly. Think step-by-step.\n\n',
                                              show_conversation=True,
                                              max_new_tokens=512,
                                              messages=[], images=[])
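Because ask_about_image returns the running messages and images lists, they can be passed back in to continue the conversation. A follow-up turn (an illustrative sketch using the same helper) could look like:
follow_up = "How could this be used to design a fracture resistant material?"
response, messages, images = ask_about_image(model, processor, follow_up,
                                              temperature=0.1,
                                              show_conversation=True,
                                              max_new_tokens=512,
                                              messages=messages, images=images)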
📚 Documentation
Model Summary
Cephalo is a series of multimodal, materials-science-focused vision large language models (V-LLMs). It is designed to integrate visual and linguistic data for advanced understanding and interaction in human-AI or multi-agent AI frameworks.
The model can interpret complex visual scenes, generate contextually accurate language descriptions, and answer queries. It combines a vision encoder with an autoregressive transformer, enabling complex natural language understanding grounded in visual input.
This version, lamm-mit/Cephalo-Idefics-2-vision-8b-alpha, is based on the HuggingFaceM4/idefics2-8b-chatty model and was trained on a combination of scientific text-image data from Wikipedia and scientific papers. See the base model for further details.

Chat Format
The lamm-mit/Cephalo-Idefics-2-vision-8b-alpha model supports one or more image inputs, with prompts in the following chat format:
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant:
For multi-turn conversations:
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant: The image depicts ants climbing a vertical surface using their legs and claws. This behavior is observed in nature and can inspire the design of multi-agent AI systems that mimic the coordinated movement of these insects. The relevance lies in the potential application of such systems in robotics and materials science, where efficient and adaptive movement is crucial.<end_of_utterance>
User: How could this be used to design a fracture resistant material?<end_of_utterance>
Assistant:
If you need to manually set the chat template:
IDEFICS2_CHAT_TEMPLATE = "{% for message in messages %}{{message['role'].capitalize()}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
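If you do set the template manually, one way (assuming a recent transformers version where the processor carries a chat_template attribute) is to assign it before calling apply_chat_template, or to pass it explicitly for a single call:
# Assign the template to the processor (assumed attribute on recent transformers versions)
processor.chat_template = IDEFICS2_CHAT_TEMPLATE

# ...or pass it explicitly for one call
prompt = processor.apply_chat_template(messages, chat_template=IDEFICS2_CHAT_TEMPLATE,
                                        add_generation_prompt=True)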
Sample Inference Code
This code shows how to get started quickly on a GPU:
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, Idefics2ForConditionalGeneration
from tqdm.notebook import tqdm

DEVICE = 'cuda:0'
model_id = 'lamm-mit/Cephalo-Idefics-2-vision-8b-alpha'

model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
).to(DEVICE)

processor = AutoProcessor.from_pretrained(
    model_id,
    do_image_splitting=True,
)
Simple inference example:
from transformers.image_utils import load_image
image = load_image("https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg")
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."},
]
},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
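Note that batch_decode here returns the prompt together with the generated answer. To keep only the newly generated text, slice the ids past the prompt length, as done in the ask_about_image helper above:
# Keep only the tokens generated after the prompt
generated_texts = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].size(1):],
    skip_special_tokens=True,
)
print(generated_texts[0])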
Dataset Generation
The extraction process uses advanced algorithms to accurately detect and separate images and their corresponding textual descriptions from complex PDF documents. It extracts images and captions from PDFs to create well-reasoned image-text pairs, using large language models (LLMs) for natural language processing. These pairs are refined and validated through LLM-based processing, ensuring high-quality and contextually relevant data for training.
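As a rough illustration of the extraction step only (a hedged sketch: it assumes PyMuPDF and a simple regex heuristic for captions, and omits the LLM-based refinement and validation described above):
import re
import fitz  # PyMuPDF

def extract_image_caption_pairs(pdf_path):
    """Collect (image bytes, caption text) pairs from a PDF - illustrative only."""
    doc = fitz.open(pdf_path)
    pairs = []
    for page in doc:
        text = page.get_text("text")
        # Naive heuristic: lines that look like figure captions, e.g. "Figure 3. ..."
        captions = re.findall(r"(Fig(?:ure)?\.?\s*\d+[.:].*)", text)
        for i, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            image_bytes = doc.extract_image(xref)["image"]
            caption = captions[i] if i < len(captions) else ""
            pairs.append({"image": image_bytes, "caption": caption})
    return pairs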

Further Model Optimizations
If your GPU allows, load and run inference in half precision (torch.float16 or torch.bfloat16):
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
    torch_dtype=torch.float16,
).to(DEVICE)
Vision encoder efficiency:
Given the high resolution supported, the vision part of the model can be memory-hungry. If you are GPU-memory-constrained, you can:
- Deactivate image splitting by passing do_image_splitting=False when initializing the processor (AutoProcessor.from_pretrained), for example:
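processor = AutoProcessor.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
    do_image_splitting=False,  # fewer image tokens, lower GPU memory use
)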
🔧 Technical Details
The novel aspect of Cephalo's development is its dataset generation method. As outlined under Dataset Generation above, advanced algorithms detect and separate images and their corresponding textual descriptions from complex PDF documents; the extracted images and captions are paired into well-reasoned image-text pairs, then refined and validated through LLM-based natural language processing to ensure high-quality, contextually relevant training data.
📄 License
The model is licensed under the Apache-2.0 license.