InternLM-XComposer2-4KHD-7B: Open-Source Vision-Language Large Model

Developed by internlm

InternLM-XComposer2-4KHD is a general-purpose vision-language large model based on InternLM2 that can understand images at up to 4K resolution.

Image-to-text

Transformers

Open Source License: Other #4K image understanding #Multi-turn visual dialogue #High-resolution visual question answering

Downloads 1,180

Release Time : 4/7/2024

Model Overview

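InternLM-XComposer2-4KHD is a general-purpose vision-language large model (VLLM) that processes high-resolution (4K) images, understands their content, and supports tasks such as visual question answering.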

Model Features

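4K-resolution image understanding

Supports understanding and analysis of high-definition image content at up to 4K resolution.

Multi-turn visual dialogue

Supports multi-turn dialogue grounded in an image, retaining context across turns so the conversation stays consistent.

High-precision image description

Generates detailed, accurate image descriptions that capture fine details in the image.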
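
The quick-start code later on this page passes `hd_num=55` to `model.chat` without explaining it. The sketch below is one plausible reading, not the model's actual preprocessing: it assumes the image is split into square tiles of 336 pixels and that `hd_num` caps the total tile count. Both the tile size and the function name `estimate_tile_grid` are assumptions introduced for illustration.

```python
import math

TILE = 336  # assumed tile edge in pixels; not stated on this page


def estimate_tile_grid(width, height, hd_num=55):
    """Return (cols, rows) of the largest tile grid within the hd_num budget.

    Assumption: the long side gets `scale` tiles, the short side gets
    ceil(scale / aspect_ratio) tiles, and their product must not exceed hd_num.
    """
    if width <= 0 or height <= 0 or hd_num < 1:
        raise ValueError("width, height and hd_num must be positive")

    long_side, short_side = max(width, height), min(width, height)
    ratio = long_side / short_side

    # Grow the grid while the next larger grid still fits in the budget.
    scale = 1
    while (scale + 1) * math.ceil((scale + 1) / ratio) <= hd_num:
        scale += 1

    long_tiles = scale
    short_tiles = math.ceil(scale / ratio)
    if width >= height:
        return long_tiles, short_tiles
    return short_tiles, long_tiles


if __name__ == "__main__":
    cols, rows = estimate_tile_grid(3840, 1600, hd_num=55)
    print(cols, rows, cols * TILE, rows * TILE)
```

Under these assumptions a 3840x1600 image with `hd_num=55` maps to an 11x5 grid, which uses the full budget of 55 tiles. A smaller `hd_num` would trade resolution for fewer tiles.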

Model Capabilities

High-resolution image understanding

Visual question answering

Image content description

Multi-turn visual dialogue

Use Cases

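Image analysis

Infographic interpretation

Analyzes the content and trends of complex infographics.

Accurately identifies each part of an infographic and describes its content in detail.

Visual assistance

Image content description

Provides detailed descriptions of image content for visually impaired users.

Generates accurate, detailed image descriptions.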

🚀 InternLM-XComposer2-4KHD

InternLM-XComposer2-4KHD is a general-purpose vision-language large model (VLLM) based on InternLM2 with 4K-resolution image understanding capability.

[💻Github Repo](https://github.com/InternLM/InternLM-XComposer) [Paper](https://arxiv.org/abs/2401.16420)

🚀 Quick Start

The following simple example shows how to use InternLM-XComposer with 🤗 Transformers.

Basic usage

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2-4khd-7b', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2-4khd-7b', trust_remote_code=True)

###############
# First Round
###############

# Named `query` to match the `query=query` argument in the chat call below
query = '<ImageHere>Illustrate the fine details present in the image'
image = './example.webp'
with torch.cuda.amp.autocast():
  response, his = model.chat(tokenizer, query=query, image=image, hd_num=55, history=[], do_sample=False, num_beams=3)
print(response)
# The image is a vibrant and colorful infographic that showcases 7 graphic design trends that will dominate in 2021. The infographic is divided into 7 sections, each representing a different trend. 
# Starting from the top, the first section focuses on "Muted Color Palettes", highlighting the use of muted colors in design.
# The second section delves into "Simple Data Visualizations", emphasizing the importance of easy-to-understand data visualizations. 
# The third section introduces "Geometric Shapes Everywhere", showcasing the use of geometric shapes in design. 
# The fourth section discusses "Flat Icons and Illustrations", explaining how flat icons and illustrations are being used in design. 
# The fifth section is dedicated to "Classic Serif Fonts", illustrating the resurgence of classic serif fonts in design.
# The sixth section explores "Social Media Slide Decks", illustrating how slide decks are being used on social media. 
# Finally, the seventh section focuses on "Text Heavy Videos", illustrating the trend of using text-heavy videos in design. 
# Each section is filled with relevant images and text, providing a comprehensive overview of the 7 graphic design trends that will dominate in 2021.

###############
# Second Round
###############
query1 = 'what is the detailed explanation of the third part.'
with torch.cuda.amp.autocast():
  response, _ = model.chat(tokenizer, query=query1, image=image, hd_num=55, history=his, do_sample=False, num_beams=3)
print(response)
# The third part of the infographic is about "Geometric Shapes Everywhere". It explains that last year, designers used a lot of
# flowing and abstract shapes in their designs. However, this year, they have been replaced with rigid, hard-edged geometric
# shapes and patterns. The hard edges of a geometric shape create a great contrast against muted colors.
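
The two rounds above follow one pattern: call `model.chat`, keep the returned history, and pass it into the next call. A minimal sketch of that loop is shown below. `run_dialogue` is a hypothetical helper, not part of the model's API; it only uses the `model.chat` keyword arguments that appear in the example above, and it assumes (as the example suggests) that the `<ImageHere>` placeholder belongs in the first query only.

```python
IMAGE_TOKEN = "<ImageHere>"


def run_dialogue(model, tokenizer, image, queries, hd_num=55, num_beams=3):
    """Ask `queries` in order about one image, carrying the history forward.

    Hypothetical helper: returns the list of responses, one per query.
    """
    history = []
    responses = []
    for turn, query in enumerate(queries):
        # The example above puts the image placeholder in the first query only.
        if turn == 0 and IMAGE_TOKEN not in query:
            query = IMAGE_TOKEN + query
        response, history = model.chat(
            tokenizer,
            query=query,
            image=image,
            hd_num=hd_num,
            history=history,
            do_sample=False,
            num_beams=num_beams,
        )
        responses.append(response)
    return responses
```

With the model and tokenizer loaded as above, a call such as `run_dialogue(model, tokenizer, './example.webp', ['Illustrate the fine details present in the image', 'what is the detailed explanation of the third part.'])` inside the same `torch.cuda.amp.autocast()` context would reproduce the two rounds.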

Import from Transformers

To load the InternLM-XComposer2-4KHD model with Transformers, use the following code.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
ckpt_path = "internlm/internlm-xcomposer2-4khd-7b"
# Tokenizers run on the CPU and have no .cuda() method; only the model is moved to the GPU.
tokenizer = AutoTokenizer.from_pretrained(ckpt_path, trust_remote_code=True)
# Set `torch_dtype=torch.bfloat16` to load the model in bfloat16; otherwise it is loaded as float32 and might cause an OOM error.
model = AutoModelForCausalLM.from_pretrained(ckpt_path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
model = model.eval()
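
The comment in the loading code warns that float32 may run out of memory. A rough back-of-the-envelope check illustrates why: float32 stores 4 bytes per parameter and bfloat16 stores 2. The sketch below assumes a nominal 7 billion parameters, taken only from the "7B" in the model name; the real parameter count, activations, and any other runtime overhead are not included, so treat the numbers as a lower bound.

```python
GIB = 1024 ** 3
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

# Assumption: nominal size inferred from the "7B" in the model name.
NOMINAL_PARAMS = 7_000_000_000


def weight_memory_gib(num_params, dtype):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


if __name__ == "__main__":
    for dtype in ("float32", "bfloat16"):
        print(f"{dtype}: {weight_memory_gib(NOMINAL_PARAMS, dtype):.1f} GiB")
```

Under this assumption the weights alone need roughly 26 GiB in float32 and roughly 13 GiB in bfloat16, which is why loading in bfloat16 is the safer default on a single GPU.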

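📄 License

The code is licensed under Apache 2.0. The model weights are fully open for academic research, and free commercial use is also permitted. To apply for a commercial license, fill in the application form (English) / application form (Chinese). For other questions or collaboration, contact internlm@pjlab.org.cn.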