Cephalo - Multimodal Vision Large Language Model
Cephalo is a series of multimodal vision large language models (V-LLMs) focused on materials science. It integrates visual and linguistic data, enabling advanced understanding and interaction in human-AI or multi-agent AI frameworks.
Features
- Innovative Dataset Generation: Utilizes advanced algorithms to extract images and corresponding textual descriptions from complex PDF documents, creating high-quality image-text pairs for training (a minimal extraction sketch follows this feature list).
- Multimodal Processing: Can interpret complex visual scenes, generate contextually accurate language descriptions, and answer queries. It can process diverse inputs like images and text, supporting a wide range of applications such as image captioning, visual question answering, and multimodal content generation.
- Robust Framework: Provides a framework for multimodal interaction and understanding, including the development of generative pipelines for 2D and 3D renderings of material microstructures for additive manufacturing.
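The dataset-generation idea can be illustrated with a short, self-contained sketch. This is not the released Cephalo pipeline; it assumes PyMuPDF (the fitz module) as the PDF library and uses the surrounding page text as a crude caption stand-in, whereas the actual pairs are further refined with LLM-based NLP.

import fitz  # PyMuPDF; assumed dependency for this sketch, not part of the Cephalo release

def extract_image_text_pairs(pdf_path):
    # Walk the PDF page by page, pulling embedded images and the page text
    doc = fitz.open(pdf_path)
    pairs = []
    for page in doc:
        page_text = page.get_text().strip()
        for img in page.get_images(full=True):
            xref = img[0]  # cross-reference id of the embedded image
            image_bytes = doc.extract_image(xref)["image"]
            # Use the page text as a rough caption candidate; the real pipeline
            # refines and filters these pairs with LLM-based processing.
            pairs.append({"image": image_bytes, "text": page_text})
    return pairs

pairs = extract_image_text_pairs("paper.pdf")  # hypothetical input file
print(len(pairs), "candidate image-text pairs")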
Installation
No dedicated installation steps are provided. The usage examples below assume a working PyTorch environment with the Hugging Face transformers library and pillow installed, plus flash-attn if you keep the flash_attention_2 attention implementation shown in the code.
Usage Examples
Basic Usage
from PIL import Image
import requests
import torch
from tqdm.notebook import tqdm
from transformers import AutoProcessor, Idefics2ForConditionalGeneration

DEVICE = 'cuda:0'
model_id = 'lamm-mit/Cephalo-Idefics-2-vision-10b-alpha'

# Load the model in bfloat16 with FlashAttention-2 and move it to the GPU
model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
).to(DEVICE)

# The processor handles image splitting, tokenization, and chat templating
processor = AutoProcessor.from_pretrained(
    model_id,
    do_image_splitting=True
)
Advanced Usage
from transformers.image_utils import load_image

# Load an example image from a URL
image = load_image("https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg")

# Build a single-turn conversation with one image and one text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."},
        ]
    },
]

# Render the chat template, preprocess, and move the tensors to the GPU
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate and decode the response
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
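The decoded output includes the rendered prompt as well as the model's answer. If you only want the newly generated text, one option (not part of the original example) is to slice off the prompt tokens before decoding:

# Optional: decode only the newly generated tokens (uses the inputs dict from above)
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
answer_only = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(answer_only)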
Documentation
Model Summary
Cephalo is a series of multimodal, materials-science-focused vision large language models (V-LLMs). The innovative dataset generation method extracts images and captions from PDFs to create image-text pairs, which are refined through LLM-based NLP processing.
The model can interpret complex visual scenes, generate accurate language descriptions, and answer queries. It combines a vision encoder model and an autoregressive transformer for natural language understanding.
This version, lamm-mit/Cephalo-Idefics-2-vision-10b-alpha, is based on a merged expansion of https://huggingface.co/lamm-mit/Cephalo-Idefics-2-vision-8b-beta and the HuggingFaceM4/idefics2-8b-chatty model.
The model was trained in three stages:
- Train https://huggingface.co/lamm-mit/Cephalo-Idefics-2-vision-8b-beta by fine-tuning the HuggingFaceM4/idefics2-8b-chatty model.
- Combine the decoder of https://huggingface.co/lamm-mit/Cephalo-Idefics-2-vision-8b-beta with the last 8 layers of the HuggingFaceM4/idefics2-8b-chatty decoder (see the sketch after this list).
- Fine-tune the merged model, which has 40 decoder layers and a total of 10b parameters.
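The layer merge in step 2 can be pictured as stacking one model's full decoder on top of the last few decoder layers of another. The following is only a conceptual toy sketch with generic torch modules, not the authors' merging script; the 32 + 8 = 40 split is implied by the stated 40-layer total.

import torch.nn as nn

# Toy stand-ins for the two source decoders; in reality these would be the
# decoder layer stacks of Cephalo-Idefics-2-vision-8b-beta and idefics2-8b-chatty.
hidden = 64  # small dummy width, for illustration only
decoder_a = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(32)])
decoder_b = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(32)])

# Step 2: keep all of decoder A and append the last 8 layers of decoder B,
# giving the 40-layer merged decoder that is then fine-tuned in step 3.
merged_layers = nn.ModuleList(list(decoder_a) + list(decoder_b)[-8:])
assert len(merged_layers) == 40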
The model was trained on scientific text-image data from Wikipedia and scientific papers. For more details on the base model, see: https://huggingface.co/HuggingFaceM4/idefics2-8b-chatty.
Chat Format
The lamm-mit/Cephalo-Idefics-2-vision-10b-alpha model supports one or more image inputs with prompts in the chat format.
Single-turn example:
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant:
Multi-turn example:
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant: The image depicts ants climbing a vertical surface using their legs and claws. This behavior is observed in nature and can inspire the design of multi-agent AI systems that mimic the coordinated movement of these insects. The relevance lies in the potential application of such systems in robotics and materials science, where efficient and adaptive movement is crucial.<end_of_utterance>
User: How could this be used to design a fracture resistant material?<end_of_utterance>
Assistant:
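The same multi-turn exchange can be reproduced programmatically with the messages structure used in the usage examples. This snippet assumes the model, processor, image, and DEVICE objects defined earlier; the assistant turn simply mirrors the transcript above.

# Multi-turn conversation: the prior assistant reply is passed back as context
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "You carefully study the image, and respond accurately, but succinctly. Think step-by-step."},
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "The image depicts ants climbing a vertical surface using their legs and claws. This behavior is observed in nature and can inspire the design of multi-agent AI systems that mimic the coordinated movement of these insects. The relevance lies in the potential application of such systems in robotics and materials science, where efficient and adaptive movement is crucial."},
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "How could this be used to design a fracture resistant material?"},
    ]},
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
generated_ids = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))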
If you need to manually set the chat template:
IDEFICS2_CHAT_TEMPLATE = "{% for message in messages %}{{message['role'].capitalize()}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
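Assigning the template to the processor (using the chat_template attribute, the usual transformers hook; shown here as an assumed usage pattern, not taken from the original card) makes apply_chat_template render with it:

# Make apply_chat_template use the template defined above (assumed attribute assignment)
processor.chat_template = IDEFICS2_CHAT_TEMPLATE
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)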
Technical Details
The model architecture combines a vision encoder with an autoregressive transformer decoder. The dataset generation method extracts images and captions from PDFs and refines the resulting image-text pairs through LLM-based natural language processing.
The training process includes fine-tuning and merging models, and it uses scientific text-image data drawn from Wikipedia and scientific papers.
License
The model is licensed under the Apache-2.0 license.