InternViT-300M Open-Source Vision Model: Supports Multiple Vision Tasks, Free to Use

vit_intern300m_patch14_448.ogvl_dist

Developed by timm

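InternViT-300M is a vision Transformer model developed by the OpenGVLab team. It was obtained by distillation pretraining from InternViT-6B and supports multiple vision tasks.

Image Classification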

Transformers

Open Source License: MIT #multimodal-visual-features #high-resolution-448px #OCR-enhanced

Downloads 147

Release Time : 10/16/2024

Model Overview

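This is an image feature extraction model based on the ViT architecture. It is mainly used for image classification and feature extraction, and accepts 448x448 image input.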

Model Features

High-resolution support

Accepts high-resolution 448x448 image input, which suits tasks that need fine-grained visual features.
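
As a rough illustration of what a 448x448 input means for a patch-14 ViT, the sketch below works out the patch grid and the token count. The function name `vit_token_count` is hypothetical, and the single prefix (class) token is an assumption inferred from the (1, 1025, 1024) shape quoted in the image embedding example on this page.

```python
def vit_token_count(image_size: int = 448, patch_size: int = 14,
                    num_prefix_tokens: int = 1) -> tuple[int, int]:
    """Return (grid_side, total_tokens) for a square image and square patches.

    num_prefix_tokens=1 assumes one class token; this is an assumption,
    not something the model card states explicitly.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid_side = image_size // patch_size          # patches per side
    total_tokens = grid_side * grid_side + num_prefix_tokens
    return grid_side, total_tokens


grid_side, total_tokens = vit_token_count()
print(grid_side, total_tokens)  # 32 1025
```

The 32x32 grid matches the spatial size of the feature maps in the feature map extraction example, and 1025 matches the token dimension of the unpooled output.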

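Multi-dataset pretraining

Pretrained on multiple large datasets, including LAION-en/zh, COYO, and GRIT, for strong generalization.

Distilled model

Distilled from the larger InternViT-6B model, which reduces model size while aiming to preserve performance.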

Model Capabilities

Image classification

Visual feature extraction

Image embedding generation

Use Cases

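Computer vision

Image classification

Classifies an input image, identifying the main object or scene in it.

Reported to perform well on multiple benchmark datasets

Visual feature extraction

Extracts deep visual features from an image for use in downstream tasks such as object detection and image retrieval.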
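
To make the image retrieval use case concrete, here is a minimal sketch of ranking a gallery by cosine similarity to a query embedding. It uses only the standard library and toy 4-dimensional vectors as stand-ins for real embeddings. The helper names (`l2_normalize`, `cosine_similarity`, `rank_by_similarity`) and the gallery contents are hypothetical, and cosine similarity is one common choice rather than something this model card prescribes.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_by_similarity(query, gallery):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, cosine_similarity(query, emb)) for name, emb in gallery.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for image embeddings (hypothetical data).
gallery = {
    "img_a": [1.0, 0.0, 0.0, 0.0],
    "img_b": [0.6, 0.8, 0.0, 0.0],
    "img_c": [0.0, 0.0, 1.0, 0.0],
}
query = [0.9, 0.1, 0.0, 0.0]

ranking = rank_by_similarity(query, gallery)
print([name for name, _ in ranking])  # ['img_a', 'img_b', 'img_c']
```

In practice the vectors would come from the model's pooled output, converted to lists or handled with a tensor library instead of plain Python.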

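🚀 Model card for vit_intern300m_patch14_448.ogvl_dist

This is an InternViT image feature model. The paper authors pretrained it by distillation from InternViT-6B using a wide variety of image-text data. The model weights have been converted from the original format at OpenGVLab/InternViT-300M-448px to the timm vit format. NOTE: this vit has no final norm layer before the features / head.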

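🚀 Quick Start

This model can be used for image classification, feature map extraction, and image embeddings. See the usage examples section below for details.

✨ Key Features

Based on the InternViT architecture, it extracts image features effectively.
Distillation pretraining on a variety of image-text data gives it good generalization.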

💻 Usage Examples

Basic usage

Image classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # required for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_intern300m_patch14_448.ogvl_dist', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)

Advanced usage

Feature map extraction

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_intern300m_patch14_448.ogvl_dist',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 1024, 32, 32])
    #  torch.Size([1, 1024, 32, 32])
    #  torch.Size([1, 1024, 32, 32])

    print(o.shape)

Image embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_intern300m_patch14_448.ogvl_dist',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1025, 1024) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor

📚 Documentation

Model Details

Property	Details
Model type	Image classification / feature backbone
Model stats	Params (M): 304.0; GMACs: 362.0; Activations (M): 656.4; Image size: 448 x 448
Papers	InternVL2: Better than the Best: https://internvl.github.io/blog/2024-07-02-InternVL-2.0/; InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks: https://arxiv.org/abs/2312.14238
Original repository	https://github.com/OpenGVLab/InternVL
Training datasets	LAION-en, LAION-zh, COYO, GRIT, COCO, TextCaps, Objects365, OpenImages, All-Seeing, Wukong-OCR, LaionCOCO-OCR, other-OCR
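
For a sense of scale, the parameter count in the table can be turned into an approximate weight-storage figure. This is a back-of-the-envelope sketch only: `weight_memory_gib` is a hypothetical helper, the bytes-per-parameter values assume plain float32 and float16 storage, and the result covers weights alone, not activations or framework overhead.

```python
def weight_memory_gib(params_millions: float = 304.0, bytes_per_param: int = 4) -> float:
    """Approximate memory needed to hold the weights, in GiB.

    Assumes every parameter is stored at the same width (an assumption);
    activations and runtime overhead are not included.
    """
    total_bytes = params_millions * 1e6 * bytes_per_param
    return total_bytes / 1024**3


print(f"float32: {weight_memory_gib(304.0, 4):.2f} GiB")  # float32: 1.13 GiB
print(f"float16: {weight_memory_gib(304.0, 2):.2f} GiB")  # float16: 0.57 GiB
```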

Citation

@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}