InternVL-14B-224px: an open-source vision-language model that is free to deploy and supports many kinds of vision-language tasks

InternVL-14B-224px

Developed by OpenGVLab

InternVL-14B-224px is a 14B-parameter vision-language foundation model that supports a range of vision-language tasks.

Text-to-Image

Transformers

Open-source license: MIT #multimodal-vision-language #zero-shot-learning #multilingual-support

Downloads: 521

Release date: 12/22/2023

Model overview

This is a vision-language foundation model that supports tasks such as zero-shot image/video classification, image-text/video retrieval, and image captioning.

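Model features

Multi-task support

Supports a range of vision-language tasks, including zero-shot image/video classification, image-text/video retrieval, and image captioning.

Multilingual support

Accepts text input in multiple languages, including English, Chinese, and Japanese.

High performance

Performs well on multiple benchmarks, with strong zero-shot results.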

Model capabilities

Zero-shot image classification

Zero-shot video classification

Image-text retrieval

Video retrieval

Image captioning

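Use cases

Content understanding

Image classification

Classifies images without fine-tuning

Performs well across multiple datasets

Image captioning

Generates a natural-language description of an input image

Produces accurate, fluent descriptions

Information retrieval

Cross-modal retrieval

Retrieves relevant images or videos from a text query

High retrieval accuracy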
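
As a rough illustration of text-to-image retrieval, the sketch below ranks a small gallery of image embeddings against one text embedding by cosine similarity. This is a common scoring scheme for contrastive vision-language models, and this page does not confirm it as InternVL's exact scoring. The function names and the 3-dimensional vectors are invented for the example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_images(text_vec: Sequence[float],
                image_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return gallery indices sorted from most to least similar to the text."""
    scores = [cosine(text_vec, v) for v in image_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings, invented for illustration; real ones would come from the model.
query = [1.0, 0.0, 0.0]
gallery = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_images(query, gallery))
```

In practice the embeddings would be produced by the model rather than written by hand; only the ranking step is shown here.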

🚀 InternVL-14B-224px

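InternVL-14B-224px is a vision-language foundation model specialized for image feature extraction. It supports tasks such as zero-shot image/video classification, image-text/video retrieval, and image captioning.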

[📂 GitHub] [📜 InternVL 1.0] [📜 InternVL 1.5] [📜 Mini-InternVL] [📜 InternVL 2.5]

[🆕 Blog] [🗨️ Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]

🚀 Quick Start

⚠️ Important

The prefix 'summarize:' and the setting tokenizer.pad_token_id = 0 are both required. Without them, the results will be abnormal.
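
One way to avoid forgetting the prefix is to apply it in a single place before tokenization. The helper below is a minimal sketch of that idea; `with_prefix` and `PREFIX` are hypothetical names and not part of the model's API.

```python
from typing import List

PREFIX = "summarize:"  # required prefix, as stated in the note above


def with_prefix(texts: List[str], prefix: str = PREFIX) -> List[str]:
    """Prepend the prefix to each text unless it already starts with it."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


print(with_prefix(["a photo of a red panda", "summarize:a photo of a cat"]))
```

The helper leaves already-prefixed strings unchanged, so it can be applied more than once without doubling the prefix.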

Basic usage

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer


model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0  # set pad_token_id to 0

images = [
    Image.open('./examples/image1.jpg').convert('RGB'),
    Image.open('./examples/image2.jpg').convert('RGB'),
    Image.open('./examples/image3.jpg').convert('RGB')
]
prefix = 'summarize:'
texts = [
    prefix + 'a photo of a red panda',  # English
    prefix + '一张熊猫的照片',  # Chinese
    prefix + '二匹の猫の写真'  # Japanese
]

pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

# InternVL-C
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
#         [2.2949e-02, 9.7656e-01, 5.9903e-06],
#         [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# InternVL-G
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-G')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
#         [8.6060e-03, 9.9219e-01, 2.8759e-06],
#         [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# Image captioning: set add_eos_token to False before generation
tokenizer.add_eos_token = False
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

tokenized = tokenizer("English caption:", return_tensors='pt')
pred = model.generate(
    pixel_values=pixel_values,
    input_ids=tokenized.input_ids.cuda(),
    attention_mask=tokenized.attention_mask.cuda(),
    num_beams=5,
    min_new_tokens=8,
)
caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
# English caption: a red panda sitting on top of a wooden platform
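
To read the probability matrices printed in the example above, note that each row corresponds to one image and each column to one text prompt. The sketch below copies the InternVL-C values from the printed output into plain Python lists and picks the best-matching text for each image; `best_text_per_image` is a hypothetical helper, not part of the model's API.

```python
from typing import List

# Values copied from the InternVL-C output printed in the example above:
# rows are images, columns are the three text prompts.
probs: List[List[float]] = [
    [9.9609e-01, 5.2185e-03, 6.0070e-08],
    [2.2949e-02, 9.7656e-01, 5.9903e-06],
    [3.2932e-06, 7.4863e-05, 1.0000e+00],
]
texts = ["a photo of a red panda", "一张熊猫的照片", "二匹の猫の写真"]


def best_text_per_image(p: List[List[float]]) -> List[int]:
    """For each image (row), return the index of the highest-probability text."""
    return [max(range(len(row)), key=row.__getitem__) for row in p]


for i, j in enumerate(best_text_per_image(probs)):
    print(f"image {i} -> text {j}: {texts[j]} (p={probs[i][j]:.4f})")
```

With these values, each image is matched to the prompt at the same index, which is the diagonal pattern visible in the printed tensor.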

✨ Key features

Zero-shot image/video classification
Image-text/video retrieval
Image captioning

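📚 Documentation

Model details

Attribute	Details
Model type	Vision-language foundation model
Supported tasks	Zero-shot image/video classification, image-text/video retrieval, image captioning
Parameters	14B
Image size	224 x 224
Pretraining datasets	LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, LAION-multi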

Zero-shot performance

For details on the zero-shot evaluation, see this document.

📄 License

This project is released under the MIT License.

Citation

If this project is useful in your research, please cite the following papers.

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{gao2024mini,
  title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
  journal={arXiv preprint arXiv:2410.16261},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}