Internlm Xcomposer2d5 7b Chat
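InternLM-XComposer2.5-Chat is a chat model trained on InternLM-XComposer2.5-7B, with substantially improved multimodal instruction following and open-ended dialogue capabilities.
Downloads: 87
Release date: 1/21/2025
Model overview
This is a large multimodal chat model that supports visual question answering and open-ended dialogue. It can understand and analyze image and video content and interact in natural language.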
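Model features
Multimodal understanding
Processes and understands image, video, and text information together
Video content analysis
Analyzes the content of video frames to understand the actions and scenes in a video
High-resolution image analysis
Parses fine-grained details in high-resolution images
Multi-turn dialogue
Supports context understanding based on the dialogue history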
Model capabilities
Video content understanding
Image analysis
Multi-turn dialogue
Multimodal instruction following
Open-ended question answering
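Use cases
Content analysis
Sports video analysis
Analyzes video of a sports event and identifies the athletes' actions and the result
Accurately identifies key information such as bib numbers and race results
Vehicle analysis
Compares the strengths and weaknesses of different vehicles
Analyzes the characteristics and suitable use scenarios of several car models in detail
Information extraction
Infographic analysis
Extracts structured data from complex infographics
Accurately extracts the key data and facts in an infographic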
🚀 InternLM-XComposer-2.5-Chat
InternLM-XComposer2.5-Chat is a chat model trained on internlm/internlm-xcomposer2d5-7b. It offers improved multimodal instruction following and open-ended dialogue capabilities.
[💻Github Repo](https://github.com/InternLM/InternLM-XComposer)
[Paper](https://huggingface.co/papers/2501.12368)
🚀 Quick start
Import from Transformers
To load the InternLM-XComposer2.5-Chat model with Transformers, use the following code.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
ckpt_path = "internlm/internlm-xcomposer2d5-7b-chat"
# The tokenizer is not a torch module, so it is not moved to the GPU with `.cuda()`.
tokenizer = AutoTokenizer.from_pretrained(ckpt_path, trust_remote_code=True)
# Set `torch_dtype=torch.bfloat16` to load the model in bfloat16; otherwise it is loaded as float32 and might cause an OOM error.
model = AutoModelForCausalLM.from_pretrained(ckpt_path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
model = model.eval()
🤗 Using InternLM-XComposer2.5 with Transformers
Video understanding
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b-chat', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-7b-chat', trust_remote_code=True)
model.tokenizer = tokenizer
query = 'Here are some frames of a video. Describe this video in detail'
image = ['./examples/liuxiang.mp4',]
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, his = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
# The video begins with a man in a red and yellow uniform standing on the starting line of a track, preparing to compete in the 110-meter hurdles at the Athens 2004 Olympic Games. He is identified as Liu Xiang, a Chinese athlete, and his bib number is 1363. The scene is set in a stadium filled with spectators, indicating the significance of the event.
# As the race begins, all the athletes start running, but Liu Xiang quickly takes the lead. However, he encounters a hurdle and knocks it over. Despite this setback, he quickly recovers and continues to run. The race is intense, with athletes from various countries competing fiercely. In the end, Liu Xiang emerges as the winner with a time of 12.91 seconds, securing the gold medal for China.
# The video then transitions to a slow-motion replay of the race, focusing on Liu Xiang's performance and the knockdown of the hurdle. This allows viewers to appreciate the skill and determination of the athlete.
# Following the race, Liu Xiang is seen lying on the track, possibly exhausted from the intense competition. He then stands up and begins to celebrate his victory, waving his arms in the air and running around the track. The crowd cheers and celebrates with him, creating a joyful atmosphere.
# The video concludes with a replay of Liu Xiang's gold medal-winning moment, emphasizing the significance of his achievement at the Athens 2004 Olympic Games.
# Throughout the video, the Olympic logo is prominently displayed, reminding viewers of the global significance of the event and the athletes' dedication and perseverance in their pursuit of victory.
query = 'tell me the athlete code of Liu Xiang'
image = ['./examples/liuxiang.mp4',]
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, history=his, do_sample=False, num_beams=3, use_meta=True)
print(response)
# The athlete code of Liu Xiang is 1363.
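The two calls above pass the history returned by the first turn into the second. A minimal sketch of that pattern follows, assuming only that `model.chat` accepts the arguments used above and returns a `(response, history)` pair. `ChatSession` is a hypothetical helper introduced here for illustration, not part of the model's API.

```python
from typing import Any, List, Optional


class ChatSession:
    """Carries the history returned by `model.chat` into the next turn.

    Assumption: `model.chat` takes the arguments used in the example above
    and returns `(response, history)`. This class is a hypothetical helper.
    """

    def __init__(self, model: Any, tokenizer: Any, media: List[str]) -> None:
        self.model = model
        self.tokenizer = tokenizer
        self.media = list(media)
        self.history: Optional[list] = None  # nothing to pass before the first turn

    def ask(self, query: str) -> str:
        # Same generation settings as in the example above.
        kwargs = dict(do_sample=False, num_beams=3, use_meta=True)
        if self.history is not None:
            # Follow-up turns pass the previous history, as in the second call above.
            kwargs["history"] = self.history
        response, self.history = self.model.chat(
            self.tokenizer, query, self.media, **kwargs
        )
        return response
```

With the real model, calls to `ask` would sit inside the same `torch.autocast` context used in the example above.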
Multi-image multi-turn dialogue
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b-chat', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-7b-chat', trust_remote_code=True)
model.tokenizer = tokenizer
query = 'Image1 <ImageHere>; Image2 <ImageHere>; Image3 <ImageHere>; I want to buy a car from the three given cars, analyze their advantages and weaknesses one by one'
image = ['./examples/cars1.jpg',
         './examples/cars2.jpg',
         './examples/cars3.jpg',]
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, his = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
# Certainly! Let's analyze the advantages and disadvantages of each car:
#
# 1. **Mercedes-Benz G-Class (SUV)**:
# - **Advantages**:
# - **Off-Road Capability**: The G-Class is renowned for its off-road prowess. It can handle a variety of terrains, making it ideal for adventurous driving.
# - **Reliability**: Mercedes-Benz is known for producing reliable vehicles, and the G-Class is no exception. It has a reputation for durability and longevity.
# - **Luxury Features**: As a Mercedes-Benz, the G-Class comes with a host of luxury features, including high-quality materials and advanced technology.
# - **Disadvantages**:
# - **Fuel Efficiency**: The G-Class is not known for its fuel efficiency. It consumes a significant amount of gasoline, which can be a disadvantage for those concerned with fuel economy.
# - **Size and Weight**: The G-Class is large and heavy, which can affect its handling and maneuverability, especially in urban environments.
# - **Cost**: The G-Class is generally more expensive compared to other SUVs, which can be a deterrent for some buyers.
#
# 2. **Bugatti Chiron (Sports Car)**:
# - **Advantages**:
# - **Performance**: The Bugatti Chiron is one of the fastest production cars available. It boasts impressive acceleration and top speed, making it a thrilling driving experience.
# - **Design**: The Chiron has a sleek and futuristic design that is both aesthetically pleasing and aerodynamically efficient.
# - **Status Symbol**: Owning a Bugatti is often seen as a status symbol, attracting attention and admiration.
# - **Disadvantages**:
# - **Cost**: The Bugatti Chiron is one of the most expensive cars in the world, making it out of reach for many potential buyers.
# - **Maintenance**: Bugatti vehicles require specialized maintenance, which can be costly and difficult to find.
# - **Road Legality**: In some regions, the Chiron may not be road-legal due to its high top speed and powerful engine.
#
# 3. **BMW i Vision Dynamics (Electric Concept Car)**:
# - **Advantages**:
# - **Environmental Impact**: As an electric vehicle, the i Vision Dynamics has a lower environmental impact compared to traditional gasoline-powered cars.
# - **Design**: The i Vision Dynamics has a unique and futuristic design that sets it apart from conventional cars.
# - **Performance**: Despite being an electric vehicle, the i Vision Dynamics boasts impressive performance, including a top speed of over 300 km/h.
# - **Disadvantages**:
# - **Cost**: Like the Bugatti Chiron, the i Vision Dynamics is an expensive concept car, making it inaccessible to most buyers.
# - **Range**: Electric vehicles typically have a limited range compared to gasoline-powered cars, which can be a concern for long-distance travel.
# - **Infrastructure**: The availability of charging stations can vary widely by region, which can limit the practicality of owning an electric car.
# In summary, the Mercedes-Benz G-Class is a reliable and capable off-road vehicle with luxury features, but it may not be the most fuel-efficient or affordable option. The Bugatti Chiron offers unparalleled performance and status, but it is extremely expensive and requires specialized maintenance. The BMW i Vision Dynamics is an innovative and environmentally friendly concept car with impressive performance, but it is also costly and limited by range and charging infrastructure. Each car has its own unique set of advantages and disadvantages, and the choice ultimately depends on individual preferences and priorities.
query = 'Image4 <ImageHere>; How about the car in Image4'
image.append('./examples/cars4.jpg')
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, history=his, use_meta=True)
print(response)
# The car in Image4 is a **Ferrari LaFerrari**. The LaFerrari is a hybrid supercar produced by Ferrari, combining a V12 engine with an electric motor. Here's a detailed analysis of its advantages and disadvantages:
#
# ### **Advantages:**
# 1. **Performance**: The LaFerrari is one of the fastest production cars available, with a top speed of over 217 mph (349 km/h). It boasts impressive acceleration and handling, making it a thrilling driving experience.
# 2. **Design**: The LaFerrari has a distinctive and aggressive design that sets it apart from other supercars. Its aerodynamic features and sleek lines contribute to its performance and visual appeal.
# 3. **Hybrid Technology**: The LaFerrari uses a hybrid powertrain, combining a 6.3-liter V12 engine with an electric motor. This hybrid system provides a balance of power and efficiency, reducing emissions compared to traditional gasoline engines.
# 4. **Status Symbol**: Owning a LaFerrari is often seen as a status symbol, attracting attention and admiration. It represents a pinnacle of automotive engineering and luxury.
# 5. **Reliability**: Ferrari is known for producing high-quality, reliable vehicles. The LaFerrari benefits from the brand's reputation for excellence in engineering and craftsmanship.
#
# ### **Disadvantages:**
# 1. **Cost**: The LaFerrari is one of the most expensive cars in the world, making it inaccessible to most potential buyers. Its high price can be a significant deterrent.
# 2. **Maintenance**: Ferrari vehicles require specialized maintenance, which can be costly and difficult to find. The hybrid system may also add to the complexity and expense of servicing the car.
# 3. **Road Legality**: In some regions, the LaFerrari may not be road-legal due to its high top speed and powerful engine. This can limit its usability and appeal.
# 4. **Fuel Efficiency**: Despite the hybrid system, the LaFerrari consumes a significant amount of fuel, which can be a disadvantage for those concerned with fuel economy.
# 5. **Size and Weight**: The LaFerrari is a large and heavy vehicle, which can affect its handling and maneuverability, especially in urban environments.
# In summary, the Ferrari LaFerrari is a high-performance hybrid supercar with a distinctive design and impressive capabilities. However, its high cost, specialized maintenance requirements, and limited road legality can be significant disadvantages for some buyers. The LaFerrari is best suited for those who prioritize performance, luxury, and status over practicality and affordability.
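The queries in this example prefix the question with one `ImageN <ImageHere>;` tag per image, and the follow-up turn continues the numbering at `Image4`. The sketch below builds such a prefix. `build_image_query` and its `start_index` parameter are hypothetical names introduced here, and the tag format is inferred only from the two queries shown above.

```python
def build_image_query(question: str, num_images: int, start_index: int = 1) -> str:
    """Prefix `question` with one 'ImageN <ImageHere>;' tag per image.

    Hypothetical helper: the tag format mirrors the example queries above.
    """
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    tags = [
        f"Image{i} <ImageHere>;"
        for i in range(start_index, start_index + num_images)
    ]
    return " ".join(tags + [question])


# First turn: three images, numbered from 1.
first = build_image_query("Compare the three cars", 3)
# Follow-up turn: one newly appended image, numbered 4.
follow_up = build_image_query("How about the car in Image4", 1, start_index=4)
```

Keeping the numbering continuous across turns matches how the example appends `cars4.jpg` to the same `image` list before the second call.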
High-resolution image understanding
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b-chat', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-7b-chat', trust_remote_code=True)
model.tokenizer = tokenizer
query = 'Analyze the given image in a detail manner'
image = ['./examples/dubai.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
# The image is an infographic titled "Amazing Facts About Dubai." Here's a detailed analysis of its content:
#
# 1. **Title and Introduction:**
# - The title is prominently displayed at the top of the image in bold, blue letters.
# - The image features a colorful skyline of Dubai, highlighting the city's modern architecture.
#
# 2. **Facts About Palm Jumeirah:**
# - Palm Jumeirah is the largest artificial island and is visible from space.
# - In 1968, there were only 1.5 million cars in Dubai.
#
# 3. **Dubai's Gold Chain:**
# - Dubai has the world's largest Gold Chain, which is 4.2 km long.
# - 7 out of the 10 tallest hotels in the world are located in Dubai.
#
# 4. **Crime Rate and Income Tax:**
# - The crime rate is near 0%.
# - The income tax rate is 0%.
#
# 5. **Dubai Mall:**
# - Dubai Mall is the largest shopping mall in the world with 1200 stores.
# - 17% of the population is Emirati, and 83% are immigrants.
#
# 6. **Dubai's Address System:**
# - Dubai has no standard address system, with no zip codes, area codes, or postal services.
#
# 7. **Dispense Gold:**
# - Dubai is building a climate-controlled City, 2.25 times as big as Monaco.
# - The Royal Suite at Burj Al Arab is $24,000 per night.
#
# 8. **License and Billionaires:**
# - You need a license to drink alcohol even at home.
# - The net worth of the four listed billionaires is roughly equal to the GDP of Honduras.
#
# 9. **Sources:**
# - The infographic cites sources from Wikipedia, Forbes, Gulf News, and The Guardian.
#
# 10. **Design and Compilation:**
# - The image is designed and compiled by FMEXtensions, a company based in the United Arab Emirates.
#
# The infographic uses a combination of text, icons, and images to convey interesting facts about Dubai, emphasizing its modernity, wealth, and unique features.
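The sample response above is laid out as numbered bold headings followed by `-` bullets. If structured data is wanted, a post-processing step along the following lines could turn such text into a dictionary. This is a sketch that assumes the response follows that layout, which the model does not guarantee; `parse_sections` is a hypothetical helper, not part of the model's API.

```python
import re
from typing import Dict, List, Optional

# "1. **Some Heading:**" -> captures "Some Heading"
_HEADING = re.compile(r"^\s*\d+\.\s+\*\*(.+?):?\*\*\s*$")
# "- some fact" -> captures "some fact"
_BULLET = re.compile(r"^\s*-\s+(.*\S)\s*$")


def parse_sections(response: str) -> Dict[str, List[str]]:
    """Group bullet lines under the numbered heading that precedes them.

    Assumes the markdown-like layout of the sample output above; lines that
    are neither headings nor bullets (intro, summary) are ignored.
    """
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in response.splitlines():
        heading = _HEADING.match(line)
        if heading:
            current = heading.group(1).strip()
            sections[current] = []
            continue
        bullet = _BULLET.match(line)
        if bullet and current is not None:
            sections[current].append(bullet.group(1))
    return sections
```

Because generated text can deviate from this layout, the result is best treated as a convenience view and checked against the raw response.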
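📄 License
The code is licensed under Apache-2.0. The model weights are fully open for academic research, and free commercial use is also permitted. To apply for a commercial license, fill in the application form (available in English and Chinese). For other questions or collaborations, contact internlm@pjlab.org.cn.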