llava-gemma-2b open-source multimodal model: combining vision and language for a range of applications

Llava Gemma 2b

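Developed by Intel

LLaVA-Gemma-2b is a large multimodal model trained with the LLaVA-v1.5 framework. It uses the 2-billion-parameter Gemma-2b-it as its language backbone, combined with a CLIP vision encoder.

Image-to-Text

Transformers

English #Multimodal chat #Compact vision-language #Instruction following

Downloads: 1,503

Release date: 3/14/2024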

Model Overview

The model is fine-tuned for multimodal benchmark evaluation and can also be used as a multimodal chatbot that supports image and text interaction.

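Model Features

Compact and efficient

Uses the 2-billion-parameter Gemma-2b-it as its language backbone, reducing compute requirements while maintaining performance.

Multimodal understanding

Combines a CLIP vision encoder with the language model to process image and text inputs together, enabling cross-modal understanding.

Fast training

Training completed in just 4 hours on 8 Intel Gaudi 2 AI accelerators.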

Model Capabilities

Image captioning

Visual question answering

Multimodal dialogue

Text summarization

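Use Cases

Multimodal chatbot

Question answering about image content

A user uploads an image and asks questions about it, and the model generates descriptions and answers.

Scores 70.7 on the VQAv2 benchmark.

Academic research

Multimodal model research

Gives researchers a compact model for exploring the balance between computational efficiency and multimodal understanding.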

language:
- en
license_name: intel-research-use-license
license_link: LICENSE.md
base_model: google/gemma-2b-it
tags:
- LLM
- Intel
model-index:
- name: llava-gemma-2b
  results:
  - task:
      type: Large Language Model
      name: Large Language Model
    metrics:
    - type: GQA
      name: GQA
      value: 0.531
    - type: MME Cog.
      name: MME Cog.
      value: 236
    - type: MME Per.
      name: MME Per.
      value: 1130
    - type: MM-Vet
      name: MM-Vet
      value: 17.7
    - type: POPE Acc.
      name: POPE Acc.
      value: 0.850
    - type: POPE F1
      name: POPE F1
      value: 0.839
    - type: VQAv2
      name: VQAv2
      value: 70.7
    - type: MMVP
      name: MMVP
      value: 0.287
    - type: ScienceQA Image
      name: ScienceQA Image
      value: 0.564
library_name: transformers
pipeline_tag: image-text-to-text

Model Details: LLaVA-Gemma-2b

llava-gemma-2b is a large multimodal model (LMM) trained using the LLaVA-v1.5 framework, with the 2-billion-parameter google/gemma-2b-it model as the language backbone and a CLIP-based vision encoder.

Model Details	Description
Authors	Intel: Musashi Hinck, Matthew L. Olson, David Cobbley, Shao-Yen Tseng, Vasudev Lal
Date	March 2024
Version	1
Type	Large multimodal model (LMM)
Paper or Other Resources	LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
License	Gemma
Questions or Comments	Community Tab and Intel DevHub Discord

This model card was created by Benjamin Consolvo and the authors listed above.

Intended Use

Intended Use	Description
Primary intended uses	The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot.
Primary intended users	Anyone using or evaluating multimodal models.
Out-of-scope uses	This model is not intended for uses that require high levels of factuality, high-stakes situations, mental health or medical applications, generating misinformation or disinformation, impersonating others, facilitating or inciting harassment or violence, or any use that could lead to the violation of a human right under the UN Declaration of Human Rights.

How to use

Using llava-gemma requires a modified preprocessor if your transformers version is earlier than 4.41.1.
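The version condition above can be checked programmatically. The following is a minimal sketch, not part of the repository: the helper name `needs_modified_processor` is hypothetical, and it compares only the leading numeric release segment, so suffixes such as `.dev0` or `rc1` are ignored.

```python
import re
from importlib import metadata

# Threshold taken from the note above: versions below this need the
# modified processor shipped in the repo.
MIN_VERSION = (4, 41, 1)


def parse_release(version: str) -> tuple:
    """Return the leading numeric part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def needs_modified_processor(version: str) -> bool:
    """Hypothetical helper: True if `version` is older than 4.41.1."""
    release = parse_release(version)
    # Pad short versions such as "4.41" to three components before comparing.
    release = release + (0,) * (len(MIN_VERSION) - len(release))
    return release < MIN_VERSION


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        print(installed, "-> modified processor needed:",
              needs_modified_processor(installed))
```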

For current usage, see usage.py or the following code block:

import requests
from PIL import Image
from transformers import (
    LlavaForConditionalGeneration,
    AutoTokenizer,
    AutoProcessor,
    CLIPImageProcessor,
)

checkpoint = "Intel/llava-gemma-2b"

# For transformers < 4.41.1, use the modified processor from this repo
# instead of AutoProcessor:
# from processing_llavagemma import LlavaGemmaProcessor
# processor = LlavaGemmaProcessor(
#     tokenizer=AutoTokenizer.from_pretrained(checkpoint),
#     image_processor=CLIPImageProcessor.from_pretrained(checkpoint),
# )

# Load model and processor
model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

# Prepare inputs using the Gemma chat template
prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "<image>\nWhat's the content of the image?"}],
    tokenize=False,
    add_generation_prompt=True
)
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate
# max_new_tokens limits only the generated reply, so the prompt length
# (including image tokens) does not count against the limit.
generate_ids = model.generate(**inputs, max_new_tokens=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)

For straightforward use as a chatbot (without images), you can modify the last portion of code to the following:

# Prepare inputs
# Use gemma chat template
prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "Summarize the following paragraph: In this paper, we introduced LLaVA-Gemma, a compact vision-language model leveraging the Gemma Large Language Model in two variants, Gemma-2B and Gemma-7B. Our work provides a unique opportunity for researchers to explore the trade-offs between computational efficiency and multimodal understanding in small-scale models. The availability of both variants allows for a comparative analysis that sheds light on how model size impacts performance in various tasks. Our evaluations demonstrate the versatility and effectiveness of LLaVA-Gemma across a range of datasets, highlighting its potential as a benchmark for future research in small-scale vision-language models. With these models, future practitioners can optimize the performance of small-scale multimodal models more directly."}],
    tokenize=False,
    add_generation_prompt=True
)
# url = "https://www.ilankelman.org/stopsigns/australia.jpg"
# image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=None, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=300)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
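Both examples above decode the full sequence returned by `generate`, which for decoder-only models in transformers begins with the prompt tokens. If only the reply is wanted, the prompt portion can be sliced off before decoding. The following is a minimal sketch, assuming the prompt occupies the first `prompt_len` positions of every row; `strip_prompt` is a hypothetical helper, and the token ids are toy values rather than real model output.

```python
def strip_prompt(generated, prompt_len):
    """Drop the first `prompt_len` token ids from each generated sequence.

    Works on any nested sequence that supports slicing, such as a list of
    lists. A 2-D torch tensor can be sliced directly instead, as
    generate_ids[:, prompt_len:].
    """
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    return [row[prompt_len:] for row in generated]


# Toy token ids standing in for real model output: 3 prompt tokens
# followed by 3 newly generated tokens.
toy_generate_ids = [[2, 106, 1645, 651, 2416, 4918]]
reply_ids = strip_prompt(toy_generate_ids, prompt_len=3)
print(reply_ids)  # [[651, 2416, 4918]]
```

With the code above, this would likely correspond to taking `prompt_len = inputs["input_ids"].shape[1]` and passing `generate_ids[:, prompt_len:]` to `processor.batch_decode`.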

Factors

Factors	Description
Groups	-
Instrumentation	-
Environment	Trained for 4 hours on 8 Intel Gaudi 2 AI accelerators.
Card Prompts	Model training and deployment on alternate hardware and software will change model performance.

Metrics

Metrics	Description
Model performance measures	We evaluate the LLaVA-Gemma models on a collection of benchmarks similar to those used in other LMM work: GQA; MME; MM-Vet; POPE (accuracy and F1); VQAv2; MMVP; and the image subset of ScienceQA. Our experiments provide insights into the efficacy of various design choices within the LLaVA framework.
Decision thresholds	-
Approaches to uncertainty and variability	-
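POPE is reported above as both accuracy and F1. To illustrate how the two numbers can differ, the sketch below computes them for binary yes/no answers, treating "yes" as the positive class. It uses made-up answers and is not the evaluation code used by the authors; the function name is hypothetical.

```python
def accuracy_and_f1(predictions, labels, positive="yes"):
    """Compute accuracy and F1 for binary answers (illustrative only)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")

    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and g == positive for p, g in pairs)
    fp = sum(p == positive and g != positive for p, g in pairs)
    fn = sum(p != positive and g == positive for p, g in pairs)
    correct = sum(p == g for p, g in pairs)

    accuracy = correct / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Made-up answers, not taken from any benchmark.
preds = ["yes", "yes", "yes", "no", "no"]
golds = ["yes", "no", "yes", "no", "yes"]
acc, f1 = accuracy_and_f1(preds, golds)
print(f"accuracy={acc:.3f} f1={f1:.3f}")  # accuracy=0.600 f1=0.667
```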

Training Data

The model was trained using the LLaVA-v1.5 data mixture, which consists of:

558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
158K GPT-generated multimodal instruction-following data.
450K academic-task-oriented VQA data mixture.
40K ShareGPT data.
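For a sense of proportion, the listed sizes can be summed and converted to shares. This is simple arithmetic on the rounded figures above, so the total is approximate, and it says nothing about how the data is split across training stages.

```python
# Sizes in thousands of examples, as listed above (rounded figures).
mixture_k = {
    "LAION/CC/SBU image-text pairs (BLIP captions)": 558,
    "GPT-generated multimodal instruction data": 158,
    "academic-task-oriented VQA": 450,
    "ShareGPT": 40,
}

total_k = sum(mixture_k.values())
shares = {name: size / total_k for name, size in mixture_k.items()}

print(f"total: about {total_k}K examples")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```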

Quantitative Analyses

Performance of LLaVA-Gemma models across seven benchmarks. In the original table, a highlighted box marks the strongest result among the LLaVA-Gemma models. The bottom two rows show the self-reported performance of Llava Phi-2 and LLaVA-v1.5, respectively. The first row (gemma-2b-it with CLIP and a pretrained connector) is the model described in this model card.

LM Backbone	Vision Model	Pretrained Connector	GQA	MME cognition	MME perception	MM-Vet	POPE accuracy	POPE F1	VQAv2	ScienceQA Image	MMVP
gemma-2b-it	CLIP	Yes	0.531	236	1130	17.7	0.850	0.839	70.65	0.564	0.287
gemma-2b-it	CLIP	No	0.481	248	935	13.1	0.784	0.762	61.74	0.549	0.180
gemma-2b-it	DinoV2	Yes	0.587	307	1133	19.1	0.853	0.838	71.37	0.555	0.227
gemma-2b-it	DinoV2	No	0.501	309	959	14.5	0.793	0.772	61.65	0.568	0.180

gemma-7b-it	CLIP	Yes	0.472	253	895	18.2	0.848	0.829	68.7	0.625	0.327
gemma-7b-it	CLIP	No	0.472	278	857	19.1	0.782	0.734	65.1	0.636	0.240
gemma-7b-it	DinoV2	Yes	0.519	257	1021	14.3	0.794	0.762	65.2	0.628	0.327
gemma-7b-it	DinoV2	No	0.459	226	771	12.2	0.693	0.567	57.4	0.598	0.267

Phi-2b	CLIP	Yes	-	-	1335	28.9	-	0.850	71.4	0.684	-
Llama-2-7b	CLIP	Yes	0.620	348	1511	30.6	0.850	0.859	78.5	0.704	46.1
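One way to read the table is to compare each configuration with and without a pretrained connector. The sketch below does this for the VQAv2 column only, using values copied from the table; the dictionary layout and the helper name `connector_gain` are illustrative choices, not part of the original analysis.

```python
# (LM backbone, vision model) -> VQAv2 with and without a pretrained
# connector, copied from the table above.
vqav2 = {
    ("gemma-2b-it", "CLIP"): {"pretrained": 70.65, "not_pretrained": 61.74},
    ("gemma-2b-it", "DinoV2"): {"pretrained": 71.37, "not_pretrained": 61.65},
    ("gemma-7b-it", "CLIP"): {"pretrained": 68.7, "not_pretrained": 65.1},
    ("gemma-7b-it", "DinoV2"): {"pretrained": 65.2, "not_pretrained": 57.4},
}


def connector_gain(scores):
    """VQAv2 difference between the pretrained and non-pretrained connector."""
    return round(scores["pretrained"] - scores["not_pretrained"], 2)


gains = {config: connector_gain(scores) for config, scores in vqav2.items()}
for (backbone, vision), gain in gains.items():
    print(f"{backbone} + {vision}: {gain:+.2f} VQAv2")
```

On this column, every configuration scores higher with the pretrained connector, with differences ranging from 3.6 to 9.72 points.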

Ethical Considerations

Intel is committed to respecting human rights and avoiding causing or contributing to adverse impacts on human rights. See Intel’s Global Human Rights Principles. Intel’s products and software are intended only to be used in applications that do not cause or contribute to adverse impacts on human rights.

Ethical Considerations	Description
Data	The model was trained using the LLaVA-v1.5 data mixture as described above.
Human life	The model is not intended to inform decisions central to human life or flourishing.
Mitigations	No additional risk mitigation strategies were considered during model development.
Risks and harms	This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.
Use cases	-

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Citation details

@misc{hinck2024llavagemma,
      title={LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model}, 
      author={Musashi Hinck and Matthew L. Olson and David Cobbley and Shao-Yen Tseng and Vasudev Lal},
      year={2024},
      eprint={2404.01331},
      url={https://arxiv.org/abs/2404.01331},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}