🚀 Cephalo - Multimodal Vision Large Language Model
Cephalo is a series of multimodal, materials-science-focused vision large language models (V-LLMs). It integrates visual and linguistic data, enabling advanced understanding and interaction in human-AI or multi-agent AI frameworks. It can handle diverse inputs such as images and text, facilitating applications like image captioning, visual question answering, and multimodal content generation.
✨ Features
- Innovative Dataset Generation: Employs advanced algorithms to extract images and their corresponding textual descriptions from complex PDF documents, creating high-quality image-text pairs for training.
- Multimodal Interaction: Interprets complex visual scenes, generates contextually accurate language descriptions, and answers queries.
- Broad Application Scope: Supports various applications in materials science, including the development of generative pipelines for 2D and 3D renderings of material microstructures.
📦 Installation
The original README does not provide dedicated installation steps; the model is loaded directly through the Hugging Face transformers library, as shown in the usage examples below.
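A minimal environment that covers the usage examples below would likely be the following (an assumption, not taken from the source):
pip install transformers torch pillow requests tqdm
pip install flash-attn   # optional, only needed for _attn_implementation="flash_attention_2"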
💻 Usage Examples
Basic Usage
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, Idefics2ForConditionalGeneration
from tqdm.notebook import tqdm

DEVICE = 'cuda:0'
model_id = 'lamm-mit/Cephalo-Idefics-2-vision-8b-alpha'

model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
).to(DEVICE)

processor = AutoProcessor.from_pretrained(
    model_id,
    do_image_splitting=True,
)
Advanced Usage
from transformers.image_utils import load_image
from IPython.display import display, Markdown, HTML

# Note: ensure_list, is_url, and format_conversation are small helper functions
# assumed to be defined elsewhere in the original notebook.
def ask_about_image(model, processor, question,
                    images_input=[],
                    verbatim=False,
                    temperature=0.1,
                    show_image=False,
                    system="You are a biomaterials scientist who responds accurately. ",
                    init_instr="",
                    show_conversation=True,
                    max_new_tokens=256,
                    messages=[],
                    images=[],
                    use_Markdown=False,
                    ):
    query = question
    images_input = ensure_list(images_input)

    # Load any image URLs unless a list of PIL images is already provided
    if len(images) == 0:
        if len(images_input) > 0:
            for image in tqdm(images_input):
                if is_url(image):
                    image = load_image(image)
                images.append(image)
                if show_image:
                    display(image)

    if len(messages) == 0:
        # First turn: system prompt, one image placeholder per input image, then the query
        base_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": system + init_instr},
                {"type": "text", "text": query},
            ],
        }
        images_input = ensure_list(images_input)
        image_messages = [{"type": "image"} for _ in images_input]
        base_message["content"][1:1] = image_messages
        messages.append(base_message)
    else:
        # Follow-up turn: append only the new question
        messages.append(
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": query},
                ],
            }
        )

    if verbatim:
        print(messages)

    # Build the prompt from the chat template and run generation
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[text.strip()], images=images, return_tensors="pt",
                       padding=True).to(DEVICE)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                   temperature=temperature, do_sample=True)
    # Decode only the newly generated tokens (drop the prompt)
    generated_texts = processor.batch_decode(generated_ids[:, inputs["input_ids"].size(1):],
                                             skip_special_tokens=True)

    # Record the assistant response so the conversation can be continued
    messages.append(
        {
            "role": "assistant",
            "content": [{"type": "text", "text": generated_texts[0]}],
        }
    )

    formatted_conversation = format_conversation(messages, images)
    if show_conversation:
        if use_Markdown:
            display(Markdown(formatted_conversation))
        else:
            display(HTML(formatted_conversation))

    return generated_texts, messages, images
question = "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."
url1 = "https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg"

response, messages, images = ask_about_image(model, processor, question,
                                              images_input=[url1],
                                              temperature=0.1,
                                              system='',
                                              init_instr='You carefully study the image, and respond accurately, but succinctly. Think step-by-step.\n\n',
                                              show_conversation=True,
                                              max_new_tokens=512,
                                              messages=[], images=[])
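Because ask_about_image returns the running messages and images lists, they can be passed back in to continue the conversation. A follow-up turn (an illustrative sketch using the same helper) could look like:
follow_up = "How could this be used to design a fracture resistant material?"
response, messages, images = ask_about_image(model, processor, follow_up,
                                              temperature=0.1,
                                              show_conversation=True,
                                              max_new_tokens=512,
                                              messages=messages, images=images)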
📚 Documentation
Model Summary
Cephalo is a series of multimodal, materials-science-focused vision large language models (V-LLMs). It is designed to integrate visual and linguistic data for advanced understanding and interaction in human-AI or multi-agent AI frameworks.
The model can interpret complex visual scenes, generate contextually accurate language descriptions, and answer queries. It combines a vision encoder with an autoregressive transformer, enabling complex natural language understanding grounded in visual input.
This version, lamm-mit/Cephalo-Idefics-2-vision-8b-alpha, is based on the HuggingFaceM4/idefics2-8b-chatty model and was trained on a combination of scientific text-image data from Wikipedia and scientific papers. See the base model for further details.

Chat Format
The lamm-mit/Cephalo-Idefics-2-vision-8b-alpha model supports one or more image inputs, with prompts in the following chat format:
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant:
For multi-turn conversations:
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant: The image depicts ants climbing a vertical surface using their legs and claws. This behavior is observed in nature and can inspire the design of multi-agent AI systems that mimic the coordinated movement of these insects. The relevance lies in the potential application of such systems in robotics and materials science, where efficient and adaptive movement is crucial.<end_of_utterance>
User: How could this be used to design a fracture resistant material?<end_of_utterance>
Assistant:
If you need to manually set the chat template:
IDEFICS2_CHAT_TEMPLATE = "{% for message in messages %}{{message['role'].capitalize()}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
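If you do set the template manually, one way (assuming a recent transformers version where the processor carries a chat_template attribute) is to assign it before calling apply_chat_template, or to pass it explicitly for a single call:
# Assign the template to the processor (assumed attribute on recent transformers versions)
processor.chat_template = IDEFICS2_CHAT_TEMPLATE

# ...or pass it explicitly for one call
prompt = processor.apply_chat_template(messages, chat_template=IDEFICS2_CHAT_TEMPLATE,
                                        add_generation_prompt=True)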
Sample Inference Code
This code shows how to get started quickly on a GPU:
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, Idefics2ForConditionalGeneration
from tqdm.notebook import tqdm

DEVICE = 'cuda:0'
model_id = 'lamm-mit/Cephalo-Idefics-2-vision-8b-alpha'

model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
).to(DEVICE)

processor = AutoProcessor.from_pretrained(
    model_id,
    do_image_splitting=True,
)
Simple inference example:
from transformers.image_utils import load_image
image = load_image("https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg")
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."},
]
},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
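Note that batch_decode here returns the prompt together with the generated answer. To keep only the newly generated text, slice the ids past the prompt length, as done in the ask_about_image helper above:
# Keep only the tokens generated after the prompt
generated_texts = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].size(1):],
    skip_special_tokens=True,
)
print(generated_texts[0])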
Dataset Generation
The extraction process uses advanced algorithms to accurately detect and separate images and their corresponding textual descriptions from complex PDF documents. It extracts images and captions from PDFs to create well-reasoned image-text pairs, using large language models (LLMs) for natural language processing. These pairs are refined and validated through LLM-based processing, ensuring high-quality and contextually relevant data for training.
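As a rough illustration of the extraction step only (a hedged sketch: it assumes PyMuPDF and a simple regex heuristic for captions, and omits the LLM-based refinement and validation described above):
import re
import fitz  # PyMuPDF

def extract_image_caption_pairs(pdf_path):
    """Collect (image bytes, caption text) pairs from a PDF - illustrative only."""
    doc = fitz.open(pdf_path)
    pairs = []
    for page in doc:
        text = page.get_text("text")
        # Naive heuristic: lines that look like figure captions, e.g. "Figure 3. ..."
        captions = re.findall(r"(Fig(?:ure)?\.?\s*\d+[.:].*)", text)
        for i, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            image_bytes = doc.extract_image(xref)["image"]
            caption = captions[i] if i < len(captions) else ""
            pairs.append({"image": image_bytes, "caption": caption})
    return pairs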

Further Model Optimizations
If your GPU allows, load and run inference in half precision (torch.float16 or torch.bfloat16):
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
    torch_dtype=torch.float16,
).to(DEVICE)
Vision encoder efficiency:
Given the high resolution supported, the vision part of the model can be memory-hungry. If you are GPU-memory-constrained, you can:
- Deactivate image splitting by passing do_image_splitting=False when initializing the processor (AutoProcessor.from_pretrained), for example:
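processor = AutoProcessor.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
    do_image_splitting=False,  # fewer image tokens, lower GPU memory use
)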
🔧 Technical Details
The novel aspect of Cephalo's development is its dataset generation method. As outlined under Dataset Generation above, advanced algorithms detect and separate images and their corresponding textual descriptions from complex PDF documents; the extracted images and captions are paired into well-reasoned image-text pairs, then refined and validated through LLM-based natural language processing to ensure high-quality, contextually relevant training data.
📄 License
The model is licensed under the Apache-2.0 license.