🚀 LLaVA-llama-3-8B
llava-llama-3-8b is a large multimodal model (LMM) that combines a powerful language backbone with a vision encoder, enabling it to handle both text and image inputs.
🚀 Quick Start
This section provides a quick guide on how to start using the llava-llama-3-8b model. Note that we only offer the trained weight difference; you need to download the base meta-llama/Meta-Llama-3-8B-Instruct model separately, and the code below adds it back in before inference.
```python
# Copyright 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import requests
import torch
import transformers
from PIL import Image
from transformers import AutoProcessor, AutoModelForPreTraining


def expand2square(pil_img, background_color):
    # Pad the image to a square canvas, keeping the original content centered
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result


def add_model_a_to_b(model_a, model_b):
    # Merge model_a's weights into model_b in place
    state_dict_a = model_a.state_dict()
    state_dict_b = model_b.state_dict()

    # Ensure keys match before merging
    if set(state_dict_a.keys()) != set(state_dict_b.keys()):
        raise ValueError("Model state dicts do not have the same keys.")

    for key in state_dict_a:
        if state_dict_a[key].shape != state_dict_b[key].shape:
            raise ValueError(f"Shape mismatch for key '{key}': {state_dict_a[key].shape} vs {state_dict_b[key].shape}")
        # Add model_a's weights to model_b for the matching key
        state_dict_b[key] = state_dict_b[key] + state_dict_a[key]

    # Update model_b with the merged weights
    model_b.load_state_dict(state_dict_b)


output_checkpoint = ""  # set if you don't want to merge every time
hf_checkpoint = "Intel/llava-llama-3-8b"

processor = AutoProcessor.from_pretrained(hf_checkpoint)
model = AutoModelForPreTraining.from_pretrained(hf_checkpoint)

# The released checkpoint stores only the weight difference; if the language-model
# embeddings are still zero, the base Llama 3 weights have not been merged in yet.
if model.language_model.model.embed_tokens.weight[-1].sum() == 0:
    print("adding llama3 weights")
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="cpu",
    )
    llama3 = pipeline.model
    add_model_a_to_b(llama3, model.language_model)
    if output_checkpoint:
        print("saving weights, so no adding is needed again")
        model.save_pretrained(output_checkpoint)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "<image>\nWhat's the content of the image?"}],
    tokenize=False,
    add_generation_prompt=True
)

url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# original LLaVA pads with the image mean, HF LLaVA pads with zeros
image = expand2square(image, tuple(int(x * 255) for x in processor.image_processor.image_mean))

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

# Generate
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
```
✨ Features
- Multimodal Capability: The model can handle both text and image inputs, making it suitable for a wide range of multimodal tasks.
- Trained with LLaVA-v1.5: Utilizes the LLaVA-v1.5 framework and data mixture for training, enhancing its performance.
📦 Installation
The model card does not list specific installation steps. The Quick Start example requires the `torch`, `transformers`, `Pillow` (imported as `PIL`), and `requests` Python packages, plus access to the `meta-llama/Meta-Llama-3-8B-Instruct` weights on Hugging Face.
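As a quick sanity check (not part of the original card), the snippet below only verifies that the packages imported by the Quick Start example are available and prints their versions; the card does not pin specific versions.

```python
# Minimal environment check for the Quick Start example; package names are taken
# from its imports. The model card does not specify minimum versions.
import importlib

for package in ("torch", "transformers", "PIL", "requests"):
    module = importlib.import_module(package)
    print(f"{package}: {getattr(module, '__version__', 'unknown version')}")
```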
💻 Usage Examples
Basic Usage
The code in the Quick Start section above shows the basic usage of the model.
Advanced Usage
The model card itself does not include an advanced usage example; one possible extension is sketched below.
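A minimal sketch, assuming you have already saved the merged weights with `model.save_pretrained(output_checkpoint)` as in the Quick Start: load the model from that local path to skip the merge step on later runs. The directory name below is hypothetical (use whatever you set `output_checkpoint` to), and the `expand2square` padding from the Quick Start is omitted here for brevity.

```python
# Sketch only: assumes merged weights were previously saved locally with
# model.save_pretrained(output_checkpoint). The path below is hypothetical.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForPreTraining

merged_checkpoint = "./llava-llama-3-8b-merged"  # hypothetical: your output_checkpoint path
device = "cuda" if torch.cuda.is_available() else "cpu"

# The processor (tokenizer + image preprocessing) still comes from the Hub repo;
# only the model weights are loaded from the local merged checkpoint.
processor = AutoProcessor.from_pretrained("Intel/llava-llama-3-8b")
model = AutoModelForPreTraining.from_pretrained(
    merged_checkpoint,
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,  # half precision saves GPU memory
).to(device)

prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "<image>\nDescribe this image in detail."}],
    tokenize=False,
    add_generation_prompt=True,
)
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# For best results, apply the same expand2square padding as in the Quick Start.

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
# max_new_tokens allows a longer answer than the max_length=30 used in the Quick Start
generate_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```

Loading from the merged checkpoint avoids downloading and re-merging the 8B base weights on every run, which is why the Quick Start exposes `output_checkpoint` in the first place.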
📚 Documentation
Model Details
Property | Details |
---|---|
Model Type | Large multimodal model (LMM) |
Authors | Intel: Musashi Hinck*, Matthew L. Olson*, Vasudev Lal |
Date | May 2024 |
Version | 1 |
Paper or Other Resources | Improved Baselines with Visual Instruction Tuning |
License | Intel Research Use License. All usage code is licensed under Apache 2.0. |
Questions or Comments | Community Tab and Intel DevHub Discord |
Intended Use
Property | Details |
---|---|
Primary intended uses | The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot. |
Primary intended users | Anyone using or evaluating multimodal models. |
Out-of-scope uses | This model is not intended for uses that require high levels of factuality, high stakes situations, mental health or medical applications, generating misinformation or disinformation, impersonating others, facilitating or inciting harassment or violence, any use that could lead to the violation of a human right under the UN Declaration of Human Rights. |
Factors
Property | Details |
---|---|
Environment | Trained on a 4-node cluster with a total of 32 Gaudi 2 accelerators |
Card Prompts | Model training and deployment on alternate hardware and software will change model performance |
Training Data
The model was trained using the LLaVA-v1.5 data mixture, which includes:
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.
Ethical Considerations
Intel is committed to respecting human rights and avoiding causing or contributing to adverse impacts on human rights. See Intel’s Global Human Rights Principles. Intel’s products and software are intended only to be used in applications that do not cause or contribute to adverse impacts on human rights.
Property | Details |
---|---|
Data | The model was trained using the LLaVA-v1.5 data mixture as described above. |
Human life | The model is not intended to inform decisions central to human life or flourishing. |
Mitigations | No additional risk mitigation strategies were considered during model development. |
Risks and harms | This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm. |
Use cases | - |
Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.
🔧 Technical Details
The model follows the LLaVA-v1.5 recipe: a pretrained vision encoder is connected to the meta-llama/Meta-Llama-3-8B-Instruct language backbone through a projection layer, and the combined model is instruction-tuned on the LLaVA-v1.5 data mixture listed above. Training ran on the Gaudi 2 cluster described under Factors; see the paper linked in Model Details for the full training setup.
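A small sketch for inspecting how these pieces fit together. It assumes the standard Hugging Face LLaVA wrapper class that `AutoModelForPreTraining` resolves to for this checkpoint; attribute names such as `vision_tower` and `multi_modal_projector` come from that wrapper, not from the model card.

```python
# Hedged sketch: inspect the components of the loaded model. Attribute names
# (vision_tower, multi_modal_projector, language_model) are those of the
# Hugging Face LLaVA model class, not something stated in the model card.
from transformers import AutoModelForPreTraining

model = AutoModelForPreTraining.from_pretrained("Intel/llava-llama-3-8b")

print(type(model).__name__)                         # LLaVA-style wrapper
print(type(model.vision_tower).__name__)            # vision encoder
print(type(model.multi_modal_projector).__name__)   # projector between modalities
print(type(model.language_model).__name__)          # Llama-3-8B-Instruct backbone

# Note: until the Quick Start merge step has been run, the language-model
# weights in this checkpoint hold only the difference from the base model.
num_params = sum(p.numel() for p in model.parameters())
print(f"total parameters: {num_params / 1e9:.2f}B")
```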
📄 License
The model is released under the Intel Research Use License. All usage code is licensed under Apache 2.0.
Results
Task | Metric | Value |
---|---|---|
Large Language Model | GQA | 60.6138 |
Large Language Model | MMVP | 36 |
Large Language Model | Pope Acc | 87.33 |
Large Language Model | Pope F1 | 86.5 |
Large Language Model | MMVet | 31.9725 |
Large Language Model | ScienceQA | 72.9797 |
Large Language Model | llavaw (1) | 56.9 |
Large Language Model | llavaw (2) | 61.9 |
Large Language Model | llavaw (3) | 73.6 |
Large Language Model | llavaw (4) | 65.7 |