# 🚀 clip-vit-base-patch32_lego-brick

A fine-tuned CLIP model for matching LEGO brick images with textual descriptions.
## 🚀 Quick Start
This model is a fine-tuned version of the `openai/clip-vit-base-patch32` CLIP model, specialized for matching images of LEGO bricks with their corresponding textual descriptions.
> ⚠️ **Important Note**: If you are interested in the code used, refer to the fine-tuning script on my GitHub.
## ✨ Features
### 🔍 Discover the Power of This Model
Ever struggled to figure out the name of that one elusive LEGO brick? Or maybe you’ve got a vague idea or a picture, but the exact part number’s a mystery? That’s where BricksFinder comes in!
Drop in a description like "blue curved slope" or upload an image of the piece, and our model will work its magic to find the closest matches. It’ll show you a list of images with bricks that look just like the one you’re thinking about—or maybe even better!
Perfect for LEGO enthusiasts, builders, or anyone who loves a good ol’ treasure hunt among bricks. Check out the live demo on Colab and give it a try!
## 📦 Installation
This model can be used with the 🤗 `transformers` library. You can load the model and processor using the following code snippets:
### Load the model and processor

```python
import torch
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
# The processor has no weights, so it does not need to be moved to a device
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
```
### Using `Auto` classes

```python
from transformers import AutoModelForZeroShotImageClassification, AutoProcessor

model = AutoModelForZeroShotImageClassification.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
processor = AutoProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
```
### Using with `pipeline`

```python
from transformers import pipeline

model = "armaggheddon97/clip-vit-base-patch32_lego-brick"
clip_classifier = pipeline("zero-shot-image-classification", model=model)
```
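Once created, the pipeline can be called directly on an image together with a set of candidate captions. The snippet below is a minimal usage sketch: the image path is a placeholder, and the candidate labels are just example captions.

```python
from PIL import Image

# Placeholder path: replace with an image of a Lego brick
image = Image.open("path_to_image.jpg")

results = clip_classifier(
    image,
    candidate_labels=[
        "a photo of a lego brick with a 2x2 plate",
        "a photo of a brick with a curved slope",
    ],
)
print(results)  # list of {"score": ..., "label": ...} entries, sorted by score
```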
### Load in float16 precision

The provided weights are in float32 precision. To load the model in float16 precision and speed up inference, you can use the following code snippets:

```python
import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick", torch_dtype=torch.float16)
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
```
or alternatively, using `torch` directly:

```python
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
model_fp16 = model.to(torch.float16)
```
## 💻 Usage Examples

### Basic Usage

#### Generating embeddings
**Embed only the text**

```python
import torch
from transformers import CLIPTokenizerFast, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
tokenizer = CLIPTokenizerFast.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

text = ["a photo of a lego brick"]
tokens = tokenizer(text, return_tensors="pt", padding=True).to(device)

outputs = model.get_text_features(**tokens)
```
**Embed only the image**

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

image = Image.open("path_to_image.jpg")
inputs = processor(images=image, return_tensors="pt").to(device)

outputs = model.get_image_features(**inputs)
```
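Since the text and image embeddings live in the same space, a text query can be matched against a set of pre-computed image embeddings with cosine similarity. The sketch below is only an illustration of this kind of retrieval (not the BricksFinder implementation itself); it reuses the `model`, `processor`, and `device` objects from the snippet above and uses the test split of `lego_brick_captions` as an example image pool.

```python
import torch
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Example image pool: the first 32 images from the dataset's test split
dataset = load_dataset("armaggheddon97/lego_brick_captions", split="test")
brick_images = [dataset[i]["image"] for i in range(32)]

with torch.no_grad():
    image_inputs = processor(images=brick_images, return_tensors="pt").to(device)
    image_features = model.get_image_features(**image_inputs)

    text_inputs = processor(text=["blue curved slope"], return_tensors="pt", padding=True).to(device)
    text_features = model.get_text_features(**text_inputs)

# Rank the images by cosine similarity to the text query
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (text_features @ image_features.T).squeeze(0)
best_match_idx = similarity.argmax().item()
print(f"best matching image index: {best_match_idx}")
```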
### Advanced Usage

#### Zero-shot image classification
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

dataset = load_dataset("armaggheddon97/lego_brick_captions", split="test")

captions = [
    "a photo of a lego brick with a 2x2 plate",
    "a photo of gray minifigure legs",
    "a photo of a brick with a curved slope",
]

image = dataset[0]["image"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True).to(device)
outputs = model(**inputs)

# Image-text similarity scores; softmax turns them into per-caption probabilities
logits_per_image = outputs.logits_per_image
probabilities = logits_per_image.softmax(dim=1)
max_prob_idx = probabilities.argmax(dim=1)
```
## 📚 Documentation

### Model Description
- **Developed by**: The base model was developed by OpenAI; the fine-tuned model was developed by me, armaggheddon97.
- **Model type**: CLIP (Contrastive Language-Image Pretraining).
- **Language**: The model expects English text as input.
- **License**: MIT.
- **Fine-tuned from model `clip-vit-base-patch32`**: This model is a fine-tuned version of `openai/clip-vit-base-patch32` on the `lego_brick_captions` dataset. It was fine-tuned for 7 epochs on an 80-20 train-validation split of the dataset. For more details on the fine-tuning script, take a look at the code on my GitHub.
### Results
The goal was to obtain a model that can more accurately distinguish brick images based on their textual descriptions. In terms of raw accuracy, both models perform similarly; however, when tested on the classification task from the Zero-shot image classification section above, the fine-tuned model classifies the images more accurately and with a much greater level of confidence.
Running the same task across the whole dataset, with one correct caption (always the first) and two randomly sampled ones, gives the following metrics:

| Model | Accuracy |
|---|---|
| Base (`openai/clip-vit-base-patch32`) | 97.46% |
| Fine-tuned (this model) | 99.23% |

The base model shows poor discrimination capability between the image and text samples, but is still able to assign the correct caption to 97.46% of the samples. The fine-tuned model shows much higher confidence in the correct caption, with an accuracy of 99.23%.
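As a reference, an evaluation of this kind can be sketched as follows. This is not the exact script used to produce the reported numbers; it only illustrates the described setup, pairing each image with its own caption (always first) and two captions sampled from random other rows.

```python
import random

import torch
from datasets import load_dataset
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

dataset = load_dataset("armaggheddon97/lego_brick_captions", split="test")

correct = 0
for sample in dataset:
    # One correct caption (always first) plus two distractors sampled at random
    # (for simplicity, the distractors are not guaranteed to differ from the correct one)
    distractors = [dataset[random.randrange(len(dataset))]["caption"] for _ in range(2)]
    captions = [sample["caption"]] + distractors

    inputs = processor(
        text=captions, images=sample["image"], return_tensors="pt", padding=True
    ).to(device)
    with torch.no_grad():
        logits_per_image = model(**inputs).logits_per_image

    correct += int(logits_per_image.argmax(dim=1).item() == 0)

accuracy = correct / len(dataset)
print(f"accuracy: {accuracy:.2%}")
```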
### Fine-tuning on `short_caption`

As an exercise, the model was also fine-tuned on the `short_caption` column of the dataset. Compared to the model fine-tuned on the `caption` column, the results are quite similar: the model fine-tuned on `short_caption` reaches an accuracy of 99.99%, while the one fine-tuned on `caption` reaches 98.48%. The base model performs significantly worse in this case, although when looping through the entire dataset it behaved similarly to before, with an overall accuracy of ~97%.
## 🔧 Technical Details

The model is a fine-tuned version of the `openai/clip-vit-base-patch32` CLIP model on the `lego_brick_captions` dataset. The fine-tuning was done for 7 epochs on an 80-20 train-validation split of the dataset.
The plot visualizes the normalized text logits produced by the fine-tuned and base models. The input consists of an image of a Lego brick and three captions (one correct and two incorrect). The model generates text logits for each caption, which are then normalized for visualization.
## 📄 License
The model is licensed under the MIT license.