clip-vit-base-patch32_lego-brick: an open-source model for high-accuracy matching of LEGO bricks and their descriptions

Clip Vit Base Patch32 Lego Brick

Developed by armaggheddon97

An image-text matching model for LEGO bricks, fine-tuned from CLIP and designed to identify LEGO bricks and their descriptions.

Text-to-image

Transformers

Language: English. Open-source license: MIT. #LEGO brick identification #zero-shot classification #high-accuracy matching

Downloads: 44

Release date: 1/24/2025

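Model overview

This is a CLIP model fine-tuned on a dataset of LEGO brick descriptions. It matches LEGO brick images to their text descriptions, helping users find a specific brick from either a description or an image.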

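Model features

High-accuracy matching

The fine-tuned model matches LEGO brick images to text descriptions accurately and with high confidence.

Zero-shot classification

Supports zero-shot image classification, so new categories can be classified without additional training.

Multimodal processing

Accepts both image and text inputs and generates an embedding vector for each.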

Model capabilities

Image classification

Text-image matching

Image embedding generation

Text embedding generation

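Use cases

LEGO brick identification

Brick search

Search for a specific LEGO brick with a text description or an uploaded image.

The model returns the best-matching brick with high confidence.

Zero-shot classification

Classify new LEGO brick categories without additional training.

Accuracy on the test dataset is 99.23%.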

🚀 clip-vit-base-patch32_lego-brick

This model is a CLIP (Contrastive Language-Image Pretraining) model specialized in matching images of LEGO bricks with their corresponding text descriptions.

🚀 Quick start

This model is a version of the openai/clip-vit-base-patch32 CLIP model fine-tuned on the lego_brick_captions dataset.

⚠️ Important note

If you are interested in the code, see the fine-tuning script on my GitHub.

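✨ Key features

The model matches images of LEGO bricks with their text descriptions. For example, enter a description such as "blue curved slope brick" or upload a picture of a piece, and it finds the closest match. It is a good fit for LEGO enthusiasts, builders, and anyone who enjoys hunting for treasure in a pile of bricks.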

Web UI

Try the live demo on Colab!

📚 Documentation

Model details

Developed by: The base model was developed by OpenAI; the fine-tuned model was developed by me, armaggheddon97.
Model type: CLIP (Contrastive Language-Image Pretraining).
Language: The model expects English text as input.
License: MIT.
Fine-tuned from: openai/clip-vit-base-patch32, fine-tuned on the lego_brick_captions dataset for 7 epochs with an 80-20 train-validation split. For details on the fine-tuning script, see the code on my GitHub.

Attribute	Details
Model type	CLIP (Contrastive Language-Image Pretraining) model
Training data	`lego_brick_captions` dataset
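
The 80-20 train-validation split mentioned in the model details can be illustrated with a minimal sketch. This is not the fine-tuning script itself: the helper name `train_val_split`, the seed, and the sample count of 1000 are assumptions chosen for illustration, and the actual script may split the data differently.

```python
import random


def train_val_split(n_samples, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices) for a shuffled split.

    Hypothetical helper: the name, the seed, and the default fraction
    are illustrative and not taken from the fine-tuning script.
    """
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    # The first n_val shuffled indices form the validation set.
    return indices[n_val:], indices[:n_val]


# 1000 is a placeholder sample count, not the real dataset size.
train_idx, val_idx = train_val_split(1000)
print(len(train_idx), len(val_idx))
```

Fixing the seed matters when comparing several fine-tuning runs, because every run then sees the same validation samples.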

📦 Installation

To use this model, install the 🤗 transformers library. The snippets below load the model and the processor.

💻 Usage examples

Basic usage

import torch
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the weights, then move the model to the selected device
model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
# The processor only preprocesses inputs, so it is not moved to a device
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

Advanced usage

# Using the Auto classes
from transformers import AutoModelForZeroShotImageClassification, AutoProcessor

model = AutoModelForZeroShotImageClassification.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
processor = AutoProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

# Using a pipeline
from transformers import pipeline

model = "armaggheddon97/clip-vit-base-patch32_lego-brick"
clip_classifier = pipeline("zero-shot-image-classification", model=model)

Loading in float16 precision

The provided model is in float32 precision. To load it in float16 for faster inference, use the following snippet.

import torch
from transformers import CLIPProcessor, CLIPModel

# Note: older transformers releases name this argument torch_dtype instead of dtype
model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick", dtype=torch.float16)
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

Alternatively, convert the model with torch directly:

import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
model_fp16 = model.to(torch.float16)
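
The trade-off behind float16 can be shown without loading the model. The sketch below uses only the Python standard library to round-trip a single number through half and single precision; the helper names and the sample value 0.1 are illustrative assumptions and have nothing to do with the model's actual weights.

```python
import struct


def to_float16(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_float32(x):
    """Round-trip a Python float through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 0.1  # arbitrary sample value, not a model weight
half = to_float16(value)
single = to_float32(value)

# Storage size per value in bytes for each format.
bytes_half = struct.calcsize("e")
bytes_single = struct.calcsize("f")

print("bytes per value:", bytes_half, "vs", bytes_single)
print("rounding error:", abs(half - value), "vs", abs(single - value))
```

Each value takes half the storage in float16, at the cost of a larger rounding error, which is the usual reason to check results after switching precision.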

Use cases

Generating embeddings

# Embedding text only
import torch
from transformers import CLIPTokenizerFast, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the weights, then move the model to the selected device
model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
tokenizer = CLIPTokenizerFast.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

text = ["a photo of a lego brick"]
tokens = tokenizer(text, return_tensors="pt", padding=True).to(device)
outputs = model.get_text_features(**tokens)

# Embedding an image only
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the weights, then move the model to the selected device
model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
# The processor only preprocesses inputs, so it is not moved to a device
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

image = Image.open("path_to_image.jpg")
inputs = processor(images=image, return_tensors="pt").to(device)
outputs = model.get_image_features(**inputs)
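
Once text and image embeddings exist, a brick search reduces to ranking candidates by cosine similarity. The following is a minimal sketch under stated assumptions: the toy 4-dimensional vectors stand in for real embeddings, and the helper names are hypothetical, not part of transformers.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """Dot product of the two unit-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar, plus the scores."""
    scores = [cosine_similarity(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy vectors standing in for real embeddings (illustrative values only).
text_embedding = [0.9, 0.1, 0.0, 0.2]
image_embeddings = [
    [0.1, 0.8, 0.3, 0.0],     # mostly unrelated direction
    [0.8, 0.2, 0.1, 0.3],     # close to the query
    [-0.9, -0.1, 0.0, -0.2],  # exactly opposite to the query
]

order, scores = rank_by_similarity(text_embedding, image_embeddings)
print(order)
```

The same ranking works in the other direction, with an image embedding as the query and a list of text embeddings as candidates.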

Zero-shot image classification

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the weights, then move the model to the selected device
model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
# The processor only preprocesses inputs, so it is not moved to a device
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

dataset = load_dataset("armaggheddon97/lego_brick_captions", split="test")

captions = [
    "a photo of a lego brick with a 2x2 plate",
    "a photo of gray minifigure legs",
    "a photo of a brick with a curved slope",
]
image = dataset[0]["image"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True).to(device)
outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probabilities = logits_per_image.softmax(dim=1)
max_prob_idx = torch.argmax(logits_per_image, dim=1)
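
The softmax step that turns per-image logits into the percentages reported for each caption can be sketched in plain Python. The logits below are hypothetical values picked for illustration; they were not produced by the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


captions = [
    "a photo of a lego brick with a 2x2 plate",
    "a photo of gray minifigure legs",
    "a photo of a brick with a curved slope",
]
# Hypothetical logits for a single image, one per caption.
logits = [18.0, 30.5, 21.2]

probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])

# Print the captions from most to least likely.
for p, caption in sorted(zip(probs, captions), reverse=True):
    print(f'{p:.2%}: "{caption}"')
```

Because softmax is exponential in the logit differences, a gap of a few units between the top two logits is already enough for the top caption to take nearly all of the probability.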

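Results

The goal was a model that distinguishes brick images more accurately from their text descriptions. In terms of accuracy, the fine-tuned model and the base model perform similarly. However, when the classification task is tested with the code in the zero-shot image classification section, the fine-tuned model classifies images correctly with much higher confidence.

For example, given the following inputs, the fine-tuned model outputs: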

100.00%: "A sand green 2x2 minifigure legs piece with two axle holes on top. The legs feature a printed design depicting wrapped fabric, in shades of light grey, orange, and beige. The piece is solid and has no additional connection points besides the axle holes."
0.00%: "A medium-green 1x1 round minifigure head features a printed design: two yellow eyes, pink floral elements, and a toothy grin. It has a standard top stud for attachment, and no other connection points are visible. The printed details are detailed and cover a majority of the surface."
0.00%: "A white 2x2 brick with four studs, each imprinted with the LEGO logo. The brick is a standard 2x2 size, with no additional holes or features. The color is a bright, slightly off-white"

The base model, in contrast, outputs the following for the same inputs:

98.7%: "A sand green 2x2 minifigure legs piece with two axle holes on top. The legs feature a printed design depicting wrapped fabric, in shades of light grey, orange, and beige. The piece is solid and has no additional connection points besides the axle holes."
1.24%: "A medium-green 1x1 round minifigure head features a printed design: two yellow eyes, pink floral elements, and a toothy grin. It has a standard top stud for attachment, and no other connection points are visible. The printed details are detailed and cover a majority of the surface."
0.00%: "A white 2x2 brick with four studs, each imprinted with the LEGO logo. The brick is a standard 2x2 size, with no additional holes or features. The color is a bright, slightly off-white"

This shows that the fine-tuned model classifies the image correctly from the text descriptions. The base model also classifies it correctly, but with slightly lower confidence.

Running the same task over the whole dataset gives the metrics shown in the figure.

[Figure: results]

The plot visualizes the normalized text logits produced by the fine-tuned model and the base model.

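Fine-tuning on `short_caption`

As an exercise, I also fine-tuned the model on the dataset's short_caption column and, using the same method as before, compared it with the model fine-tuned on the caption column and with the base model. Using the same sample image with the short_caption labels, the results are as follows.

Model fine-tuned on short_caption: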

100.00%: " Hips and Dark Tan Legs with Robe and Dark Orange Strap Print"
0.00% (2.32e-21): " Minifig Head Slizer, Yellow Eyes, Pointed Teeth and Bubbles Print [Blocked Open Stud]"
0.00% (5.91e-18): "Brick 2 x 2 without Inside Ridges"

Model fine-tuned on caption:

100.00% (1): " Hips and Dark Tan Legs with Robe and Dark Orange Strap Print"
0.00% (3.38e-14): " Minifig Head Slizer, Yellow Eyes, Pointed Teeth and Bubbles Print [Blocked Open Stud]"
0.00% (2.9e-8): "Brick 2 x 2 without Inside Ridges"

Base model:

0.00%: " Hips and Dark Tan Legs with Robe and Dark Orange Strap Print"
22.07%: " Minifig Head Slizer, Yellow Eyes, Pointed Teeth and Bubbles Print [Blocked Open Stud]"
77.79%: "Brick 2 x 2 without Inside Ridges"

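The model fine-tuned on the short_caption column gives results quite similar to the model fine-tuned on the caption column. The only difference is that it leaves a wider gap between the values for the correct caption and the incorrect ones. In this case the base model does considerably worse than it did with the caption column and assigns the wrong caption.

Running the same task over the whole dataset, the model fine-tuned on short_caption reaches 99.99% accuracy and the model fine-tuned on caption reaches 98.48%. Both accuracies are high; the trade-off is a difference in confidence for the correct caption. The model fine-tuned on the caption column also allows more flexible text search thanks to its more comprehensive captions, and it is the model uploaded here.

When looping over the whole dataset, the base model again performs as before, with an overall accuracy of about 97%. This suggests that the selected sample may have been an outlier for the base model.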
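
How the dataset-wide accuracy was computed is not spelled out here. One plausible reading, sketched below as an assumption, scores each image against a group of captions and counts a hit when the highest logit lands on the image's own caption. The function name and the toy logit matrix are hypothetical.

```python
def top1_accuracy(logit_matrix):
    """Fraction of rows whose largest logit sits on the diagonal.

    Assumption: row i holds image i's logits against every caption,
    and caption i is the correct caption for image i.
    """
    correct = 0
    for i, row in enumerate(logit_matrix):
        predicted = max(range(len(row)), key=lambda j: row[j])
        correct += int(predicted == i)
    return correct / len(logit_matrix)


# Toy logits for four images against four captions (illustrative values only).
toy_logits = [
    [9.0, 1.0, 0.5, 0.2],
    [0.3, 7.5, 2.0, 0.1],
    [0.2, 6.0, 5.0, 0.4],  # image 2 is assigned caption 1, a miss
    [0.1, 0.3, 0.2, 8.0],
]

accuracy = top1_accuracy(toy_logits)
print(f"{accuracy:.2%}")
```

Accuracy alone hides the confidence difference discussed above, since it only checks which caption ranks first and not by how much.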