Open-source clip-vit-base-patch32_lego-brick model: accurate matching of LEGO bricks and their descriptions

Clip Vit Base Patch32 Lego Brick

Developed by armaggheddon97

A CLIP model fine-tuned for LEGO brick image-text matching, designed to identify LEGO bricks from images and their descriptions.

Zero-shot image classification

Transformers

Language: English
License: MIT
Tags: #lego-brick-recognition #zero-shot-classification #high-precision-matching

Downloads: 44

Released: 1/24/2025

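Model Overview

This is a CLIP model fine-tuned on a dataset of LEGO brick captions. It matches LEGO brick images to their text descriptions, so users can find a specific brick from either a description or a picture.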

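Model Features

High-precision matching

After fine-tuning, the model matches LEGO brick images to text descriptions accurately and with high confidence.

Zero-shot classification

Supports zero-shot image classification: new categories can be classified without additional training.

Multimodal processing

Accepts both image and text inputs and produces an embedding vector for each.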

Model Capabilities

Image classification

Text-image matching

Image embedding generation

Text embedding generation

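Use Cases

LEGO brick recognition

Brick search

Find a specific LEGO brick from a text description or an uploaded image.

The model returns the best-matching bricks with high confidence.

Zero-shot classification

Classify new LEGO brick categories without additional training.

Reaches 99.23% accuracy on the test dataset.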

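🚀 clip-vit-base-patch32_lego-brick model

This model is built on the CLIP (Contrastive Language-Image Pre-training) architecture and matches images of LEGO bricks to their text descriptions. It helps LEGO enthusiasts identify bricks more accurately and efficiently.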

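🚀 Quick Start

This model is a fine-tuned version of the openai/clip-vit-base-patch32 CLIP (Contrastive Language-Image Pre-training) model, trained on the lego_brick_captions dataset to match images of LEGO bricks with their text descriptions.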

⚠️ Important Note

If you are interested in the code used, see the fine-tuning script on my GitHub.

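✨ Key Features

Struggling to recall the name of an elusive LEGO brick? Or do you have only a vague idea or a picture, but no exact part number? BricksFinder is here to help.

Type a description such as "blue curved slope", or upload a picture of a brick, and the model finds the closest matches. It returns a set of images that look like the brick you had in mind, or possibly an even better fit.

The model suits LEGO enthusiasts, builders, and anyone who enjoys hunting for treasure in a pile of bricks. A live demo of the web UI can be tried on Colab.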

📦 Installation Guide

Loading the model with 🤗 transformers

Use the following snippet to load the model and the processor:

import torch
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the weights, then move the model to the selected device.
model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
# The processor only prepares inputs, so it has no .to() method and stays on the CPU.
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

Using the Auto classes:

from transformers import AutoModelForZeroShotImageClassification, AutoProcessor

model = AutoModelForZeroShotImageClassification.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
processor = AutoProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

Using a pipeline:

from transformers import pipeline

model = "armaggheddon97/clip-vit-base-patch32_lego-brick"
clip_classifier = pipeline("zero-shot-image-classification", model=model)
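
The pipeline above is only constructed, not called. As a minimal sketch of how its output could be consumed: a zero-shot image classification pipeline is called with an image and a `candidate_labels` list and returns a list of dictionaries with `score` and `label` keys. The helper `best_prediction` and the `predictions` values below are hypothetical stand-ins, used so the snippet runs without downloading the model.

```python
def best_prediction(predictions):
    """Pick the highest-scoring entry from pipeline-style predictions."""
    return max(predictions, key=lambda item: item["score"])

# With the classifier loaded above, predictions would come from a call such as:
#   predictions = clip_classifier("path_to_image.jpg", candidate_labels=labels)
# The list below is a hypothetical stand-in with the same shape.
predictions = [
    {"score": 0.91, "label": "a photo of gray minifigure legs"},
    {"score": 0.06, "label": "a photo of a brick with a curved slope"},
    {"score": 0.03, "label": "a photo of a lego brick with a 2x2 plate"},
]

top = best_prediction(predictions)
print(f'{top["label"]} ({top["score"]:.2%})')
```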

Loading the model in float16 precision

The provided model is stored in float32 precision. To load it in float16 for faster inference, use the following snippet:

import torch
from transformers import CLIPProcessor, CLIPModel

# Older transformers releases name this argument torch_dtype instead of dtype.
model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick", dtype=torch.float16)
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

Or cast the model directly with torch:

import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
model_fp16 = model.to(torch.float16)

💻 Usage Examples

Basic Usage

Generating embeddings

Embedding text only:

import torch
from transformers import CLIPTokenizerFast, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the weights, then move the model to the selected device.
model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
tokenizer = CLIPTokenizerFast.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

text = ["a photo of a lego brick"]
tokens = tokenizer(text, return_tensors="pt", padding=True).to(device)
outputs = model.get_text_features(**tokens)

Embedding an image only:

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
# The processor has no .to() method; only its output tensors are moved to the device.
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

image = Image.open("path_to_image.jpg")
inputs = processor(images=image, return_tensors="pt").to(device)
outputs = model.get_image_features(**inputs)
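
The embeddings returned by `get_text_features` and `get_image_features` can be compared to search for bricks. The following is a minimal sketch of that idea using cosine similarity, not the search code behind the demo. The tiny three-dimensional vectors, the file names, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical and chosen only so the snippet runs without the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, gallery, top_k=3):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical 3-dimensional embeddings; real CLIP embeddings have many more dimensions.
gallery = {
    "blue_curved_slope.jpg": [0.9, 0.1, 0.0],
    "gray_minifig_legs.jpg": [0.0, 1.0, 0.2],
    "white_2x2_brick.jpg": [0.1, 0.0, 1.0],
}
query_embedding = [1.0, 0.2, 0.1]  # stands in for a text embedding

results = rank_by_similarity(query_embedding, gallery, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In practice the gallery vectors would be image embeddings computed once and stored, and the query would be either a text embedding or the embedding of an uploaded picture.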

Zero-shot image classification

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
# The processor has no .to() method; only its output tensors are moved to the device.
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

dataset = load_dataset("armaggheddon97/lego_brick_captions", split="test")

captions = [
    "a photo of a lego brick with a 2x2 plate",
    "a photo of gray minifigure legs",
    "a photo of a brick with a curved slope",
]
image = dataset[0]["image"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True).to(device)
outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probabilities = logits_per_image.softmax(dim=1)
max_prob_idx = torch.argmax(logits_per_image, dim=1)
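
To make the last three lines concrete, here is a small pure-Python sketch of the same softmax and argmax step, followed by mapping the result back to its caption. The values in `example_logits` are hypothetical stand-ins for one row of `logits_per_image`; they are not real model output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

captions = [
    "a photo of a lego brick with a 2x2 plate",
    "a photo of gray minifigure legs",
    "a photo of a brick with a curved slope",
]
# Hypothetical logits for one image, standing in for logits_per_image[0].tolist().
example_logits = [18.0, 25.0, 20.0]

probabilities = softmax(example_logits)
best_index = max(range(len(captions)), key=lambda i: probabilities[i])

for caption, probability in zip(captions, probabilities):
    print(f"{probability:.2%}: {caption}")
print("Best match:", captions[best_index])
```

Because softmax is exponential in the logit differences, a gap of only a few points between the best and second-best logit is already enough to push the top probability close to 100%.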

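📚 Detailed Documentation

Model Description

Developers: the base model was developed by OpenAI; the fine-tuned model was developed by armaggheddon97.
Model type: a CLIP (Contrastive Language-Image Pre-training) model.
Language: the model expects English text as input.
License: the model is released under the MIT license.
Fine-tuned from clip-vit-base-patch32: this model is a fine-tuned version of openai/clip-vit-base-patch32 on the lego_brick_captions dataset. It was fine-tuned for 7 epochs on an 80-20 train-validation split of that dataset. For more details on the fine-tuning script, see the code on my GitHub.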
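
As an illustration of what an 80-20 train-validation split means, here is a minimal sketch over a list of sample indices. It is not the author's fine-tuning code: the function name `train_validation_split`, the shuffling, and the seed value are assumptions made for this example.

```python
import random

def train_validation_split(items, validation_fraction=0.2, seed=42):
    """Shuffle a copy of items and split it into (train, validation) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    validation_size = round(len(shuffled) * validation_fraction)
    split_point = len(shuffled) - validation_size
    return shuffled[:split_point], shuffled[split_point:]

sample_ids = list(range(10))
train_ids, validation_ids = train_validation_split(sample_ids)
print(len(train_ids), len(validation_ids))
```

When working with a 🤗 datasets `Dataset` object, its `train_test_split(test_size=0.2)` method performs a comparable split; whether the fine-tuning script uses it is not stated here.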

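Results

The goal was a model that distinguishes brick images more accurately from their text descriptions. In terms of accuracy, the two models perform similarly. However, when tested on a classification task with the code from the zero-shot image classification section, the fine-tuned model classifies images more accurately and with higher confidence. For example, given the following captions: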

A sand green 2x2 minifigure legs piece with two axle holes on top. The legs feature a printed design depicting wrapped fabric, in shades of light grey, orange, and beige. The piece is solid and has no additional connection points besides the axle holes.
A medium-green 1x1 round minifigure head features a printed design: two yellow eyes, pink floral elements, and a toothy grin. It has a standard top stud for attachment, and no other connection points are visible. The printed details are detailed and cover a majority of the surface.
A white 2x2 brick with four studs, each imprinted with the LEGO logo. The brick is a standard 2x2 size, with no additional holes or features. The color is a bright, slightly off-white

and the image of the brick described by the first caption as input, the fine-tuned model outputs:

100.00%: "A sand green 2x2 minifigure legs piece with two axle holes on top. The legs feature a printed design depicting wrapped fabric, in shades of light grey, orange, and beige. The piece is solid and has no additional connection points besides the axle holes."
0.00%: "A medium-green 1x1 round minifigure head features a printed design: two yellow eyes, pink floral elements, and a toothy grin. It has a standard top stud for attachment, and no other connection points are visible. The printed details are detailed and cover a majority of the surface."
0.00%: "A white 2x2 brick with four studs, each imprinted with the LEGO logo. The brick is a standard 2x2 size, with no additional holes or features. The color is a bright, slightly off-white"

whereas the base model gives, for the same input:

98.7%: "A sand green 2x2 minifigure legs piece with two axle holes on top. The legs feature a printed design depicting wrapped fabric, in shades of light grey, orange, and beige. The piece is solid and has no additional connection points besides the axle holes."
1.24%: "A medium-green 1x1 round minifigure head features a printed design: two yellow eyes, pink floral elements, and a toothy grin. It has a standard top stud for attachment, and no other connection points are visible. The printed details are detailed and cover a majority of the surface."
0.00%: "A white 2x2 brick with four studs, each imprinted with the LEGO logo. The brick is a standard 2x2 size, with no additional holes or features. The color is a bright, slightly off-white"

This shows that the fine-tuned model classifies the image correctly from the text descriptions. The base model also classifies it correctly, only with slightly lower confidence.

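Running the same task over the entire dataset, with one correct caption per sample (always the first) and two randomly sampled captions, produces the metrics shown in the results figure.

The figure visualizes the normalized text logits produced by the fine-tuned model and the base model: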

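Input: for each sample, one LEGO brick image and three captions are selected:
- Correct caption: the caption that matches the image (position 0).
- Two randomly sampled incorrect captions (positions 1 and 2).
Output: the model produces a text logit for each caption, reflecting the similarity between the image embedding and that caption's embedding. The logits are then normalized for visualization.
Heatmap visualization: the normalized logits are displayed as a heatmap, where:
- Each row represents a different input sample.
- Each column represents one of the three captions: the correct one (0, first column) and the two random ones (1 and 2, second and third columns).
- Color intensity represents the normalized logit score assigned to each caption. Darker colors indicate a higher score and therefore higher confidence (the greater the contrast between the first column and the second and third columns, the better).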
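
The exact normalization used for the heatmap is not stated. As a sketch under that caveat, the snippet below assumes a min-max rescaling over the whole logit matrix and also computes the contrast between the first column and the best wrong caption for each row. The logit values and the helper names `normalize_matrix` and `row_contrast` are hypothetical.

```python
def normalize_matrix(logits):
    """Min-max rescale every value in the matrix to the [0, 1] range (assumed scheme)."""
    flat = [value for row in logits for value in row]
    low, high = min(flat), max(flat)
    span = high - low
    if span == 0:
        return [[0.0 for _ in row] for row in logits]  # no contrast to show
    return [[(value - low) / span for value in row] for row in logits]

def row_contrast(row):
    """Gap between the correct caption (column 0) and the best wrong caption."""
    return row[0] - max(row[1:])

# Hypothetical logits: one row per sample, columns are captions 0, 1 and 2.
logits = [
    [30.0, 12.0, 10.0],  # confident sample: large gap to the wrong captions
    [22.0, 20.0, 18.0],  # hesitant sample: scores close together
]

heatmap_rows = normalize_matrix(logits)
contrasts = [row_contrast(row) for row in heatmap_rows]
print(heatmap_rows)
print(contrasts)
```

A row with a large contrast corresponds to a dark first cell next to two light cells in the heatmap; a small contrast corresponds to three cells of similar shade.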

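As expected, the base model (right) does not show high confidence for any class. It separates image and text samples poorly, and the differences between label scores are much smaller. In terms of accuracy, however, it still assigns the correct caption for 97.46% of the samples.

The fine-tuned model (left) shows much higher confidence in the correct caption and clearly separates correct from incorrect captions: it assigns higher scores to the correct caption and lower scores to the incorrect ones. In terms of accuracy, the fine-tuned model is similar to, and slightly higher than, the base model, at 99.23%.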
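
The evaluation protocol described above (correct caption first, two random captions after it, accuracy as the share of samples where the first caption scores highest) can be sketched as follows. This is an illustrative reconstruction, not the author's evaluation script: `build_candidates`, `evaluate`, and the word-overlap `toy_score_fn` are hypothetical, and in a real run the scoring function would return the model's `logits_per_image` for the image and its three captions.

```python
import random

def build_candidates(index, all_captions, rng, num_negatives=2):
    """Correct caption first, followed by randomly sampled wrong captions."""
    other_indices = [i for i in range(len(all_captions)) if i != index]
    negatives = [all_captions[i] for i in rng.sample(other_indices, num_negatives)]
    return [all_captions[index]] + negatives

def evaluate(samples, all_captions, score_fn, seed=0):
    """Fraction of samples whose highest-scoring caption is the correct one."""
    rng = random.Random(seed)
    correct = 0
    for index, image in enumerate(samples):
        candidates = build_candidates(index, all_captions, rng)
        scores = score_fn(image, candidates)
        if scores.index(max(scores)) == 0:  # position 0 holds the correct caption
            correct += 1
    return correct / len(samples)

def toy_score_fn(image, candidates):
    """Toy scorer: counts the words shared by the 'image' string and each caption."""
    image_words = set(image.split())
    return [len(image_words & set(caption.split())) for caption in candidates]

toy_captions = ["blue curved slope", "gray minifigure legs", "white 2x2 brick", "red round plate"]
toy_images = list(toy_captions)  # each toy image "looks like" its own caption

accuracy = evaluate(toy_images, toy_captions, toy_score_fn)
print(f"accuracy: {accuracy:.2%}")
```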

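Fine-tuning on `short_caption`

As an experiment, the model was also fine-tuned on the dataset's short_caption column and, using the same method as before, compared with the model fine-tuned on the caption column and with the base model. With the same sample image and the labels taken from short_caption, the results are:

Model fine-tuned on short_caption: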

100.00%: " Hips and Dark Tan Legs with Robe and Dark Orange Strap Print"
0.00% (2.32e-21): " Minifig Head Slizer, Yellow Eyes, Pointed Teeth and Bubbles Print [Blocked Open Stud]"
0.00% (5.91e-18): "Brick 2 x 2 without Inside Ridges"

Model fine-tuned on caption:

100.00% (1): " Hips and Dark Tan Legs with Robe and Dark Orange Strap Print"
0.00% (3.38e-14): " Minifig Head Slizer, Yellow Eyes, Pointed Teeth and Bubbles Print [Blocked Open Stud]"
0.00% (2.9e-8): "Brick 2 x 2 without Inside Ridges"

Base model:

0.00%: " Hips and Dark Tan Legs with Robe and Dark Orange Strap Print"
22.07%: " Minifig Head Slizer, Yellow Eyes, Pointed Teeth and Bubbles Print [Blocked Open Stud]"
77.79%: "Brick 2 x 2 without Inside Ridges"

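Although it was fine-tuned on the short_caption column, this model gives results very similar to those of the model fine-tuned on the caption column. The only difference is a wider spread between the values for the correct and incorrect captions. The base model, in this case, performs markedly worse than it did with the caption column and assigns the wrong caption.

Running the same task over the entire dataset, with one correct caption and two random captions per sample, gives the following results.

Comparing the models fine-tuned on short_caption and on caption: the model fine-tuned on the short_caption column reaches 99.99% accuracy, while the model fine-tuned on the caption column reaches 98.48%.

Although the model fine-tuned on short_caption is more accurate, the trade-off between the two lies in the confidence assigned to the correct caption. Because the model fine-tuned on the caption column is more flexible for text search, it is the one uploaded here.

Over the entire dataset, the base model performs much as before, with overall accuracy still around 97%. This also suggests that the selected sample may be an outlier for the base model, since it classifies most other image-text pairs correctly.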

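🔧 Technical Details

The model is based on the CLIP architecture and, through fine-tuning on the lego_brick_captions dataset, learns the association between LEGO brick images and text descriptions. It was trained for 7 epochs on an 80-20 train-validation split of the dataset to improve its ability to classify LEGO brick images. At inference time, it computes the similarity between the input images and text descriptions and returns the best match.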