# 🚀 clip-vit-base-patch32_lego-brick

A fine-tuned CLIP model for matching LEGO brick images with textual descriptions.
## 🚀 Quick Start
This model is a fine-tuned version of the `openai/clip-vit-base-patch32` CLIP model, specialized for matching images of LEGO bricks with their corresponding textual descriptions.
> ⚠️ **Important Note**: If you are interested in the code used, refer to the fine-tuning script on my GitHub.
## ✨ Features
### 🔍 Discover the Power of This Model
Ever struggled to figure out the name of that one elusive LEGO brick? Or maybe you’ve got a vague idea or a picture, but the exact part number’s a mystery? That’s where BricksFinder comes in!
Drop in a description like "blue curved slope" or upload an image of the piece, and our model will work its magic to find the closest matches. It’ll show you a list of images with bricks that look just like the one you’re thinking about—or maybe even better!
Perfect for LEGO enthusiasts, builders, or anyone who loves a good ol’ treasure hunt among bricks. Check out the live demo on Colab and give it a try!
## 📦 Installation
This model can be used with the 🤗 `transformers` library. You can load the model and processor using the following code snippets:
### Load the model and processor

```python
import torch
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
# The processor has no weights, so it does not need to be moved to a device
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
```
### Using `Auto` classes

```python
from transformers import AutoModelForZeroShotImageClassification, AutoProcessor

model = AutoModelForZeroShotImageClassification.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
processor = AutoProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
```
### Using with `pipeline`

```python
from transformers import pipeline

model = "armaggheddon97/clip-vit-base-patch32_lego-brick"
clip_classifier = pipeline("zero-shot-image-classification", model=model)
```
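Once created, the pipeline can be called directly on an image together with a set of candidate captions. The snippet below is a minimal usage sketch: the image path is a placeholder, and the candidate labels are just example captions.

```python
from PIL import Image

# Placeholder path: replace with an image of a Lego brick
image = Image.open("path_to_image.jpg")

results = clip_classifier(
    image,
    candidate_labels=[
        "a photo of a lego brick with a 2x2 plate",
        "a photo of a brick with a curved slope",
    ],
)
print(results)  # list of {"score": ..., "label": ...} entries, sorted by score
```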
### Load in float16 precision

The provided weights are in float32 precision. To load the model in float16 precision and speed up inference, you can use the following code snippets:

```python
import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick", torch_dtype=torch.float16)
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
```
or alternatively, using `torch` directly:

```python
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
model_fp16 = model.to(torch.float16)
```
## 💻 Usage Examples

### Basic Usage

#### Generating embeddings
**Embed only the text**

```python
import torch
from transformers import CLIPTokenizerFast, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
tokenizer = CLIPTokenizerFast.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

text = ["a photo of a lego brick"]
tokens = tokenizer(text, return_tensors="pt", padding=True).to(device)

outputs = model.get_text_features(**tokens)
```
**Embed only the image**

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

image = Image.open("path_to_image.jpg")
inputs = processor(images=image, return_tensors="pt").to(device)

outputs = model.get_image_features(**inputs)
```
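Since the text and image embeddings live in the same space, a text query can be matched against a set of pre-computed image embeddings with cosine similarity. The sketch below is only an illustration of this kind of retrieval (not the BricksFinder implementation itself); it reuses the `model`, `processor`, and `device` objects from the snippet above and uses the test split of `lego_brick_captions` as an example image pool.

```python
import torch
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Example image pool: the first 32 images from the dataset's test split
dataset = load_dataset("armaggheddon97/lego_brick_captions", split="test")
brick_images = [dataset[i]["image"] for i in range(32)]

with torch.no_grad():
    image_inputs = processor(images=brick_images, return_tensors="pt").to(device)
    image_features = model.get_image_features(**image_inputs)

    text_inputs = processor(text=["blue curved slope"], return_tensors="pt", padding=True).to(device)
    text_features = model.get_text_features(**text_inputs)

# Rank the images by cosine similarity to the text query
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (text_features @ image_features.T).squeeze(0)
best_match_idx = similarity.argmax().item()
print(f"best matching image index: {best_match_idx}")
```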
### Advanced Usage

#### Zero-shot image classification
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

dataset = load_dataset("armaggheddon97/lego_brick_captions", split="test")

captions = [
    "a photo of a lego brick with a 2x2 plate",
    "a photo of gray minifigure legs",
    "a photo of a brick with a curved slope",
]

image = dataset[0]["image"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True).to(device)
outputs = model(**inputs)

# Image-text similarity scores; softmax turns them into per-caption probabilities
logits_per_image = outputs.logits_per_image
probabilities = logits_per_image.softmax(dim=1)
max_prob_idx = probabilities.argmax(dim=1)
```
## 📚 Documentation

### Model Description
- **Developed by**: The base model was developed by OpenAI; the fine-tuned model was developed by me, armaggheddon97.
- **Model type**: CLIP (Contrastive Language-Image Pretraining).
- **Language**: The model expects English text as input.
- **License**: MIT.
- **Fine-tuned from model `clip-vit-base-patch32`**: This model is a fine-tuned version of `openai/clip-vit-base-patch32` on the `lego_brick_captions` dataset. It was fine-tuned for 7 epochs on an 80-20 train-validation split of the dataset. For more details on the fine-tuning script, take a look at the code on my GitHub.
### Results
The goal was to obtain a model that can more accurately distinguish brick images based on their textual descriptions. In terms of raw accuracy, both models perform similarly; however, when tested on the classification task from the Zero-shot image classification section above, the fine-tuned model classifies the images more accurately and with a much greater level of confidence.
Running the same task across the whole dataset, with one correct caption (always the first) and two randomly sampled ones, gives the following metrics:

| Model | Accuracy |
|---|---|
| Base (`openai/clip-vit-base-patch32`) | 97.46% |
| Fine-tuned (this model) | 99.23% |

The base model shows poor discrimination capability between the image and text samples, but is still able to assign the correct caption to 97.46% of the samples. The fine-tuned model shows much higher confidence in the correct caption, with an accuracy of 99.23%.
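As a reference, an evaluation of this kind can be sketched as follows. This is not the exact script used to produce the reported numbers; it only illustrates the described setup, pairing each image with its own caption (always first) and two captions sampled from random other rows.

```python
import random

import torch
from datasets import load_dataset
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

dataset = load_dataset("armaggheddon97/lego_brick_captions", split="test")

correct = 0
for sample in dataset:
    # One correct caption (always first) plus two distractors sampled at random
    # (for simplicity, the distractors are not guaranteed to differ from the correct one)
    distractors = [dataset[random.randrange(len(dataset))]["caption"] for _ in range(2)]
    captions = [sample["caption"]] + distractors

    inputs = processor(
        text=captions, images=sample["image"], return_tensors="pt", padding=True
    ).to(device)
    with torch.no_grad():
        logits_per_image = model(**inputs).logits_per_image

    correct += int(logits_per_image.argmax(dim=1).item() == 0)

accuracy = correct / len(dataset)
print(f"accuracy: {accuracy:.2%}")
```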
### Fine-tuning on `short_caption`

As an exercise, the model was also fine-tuned on the `short_caption` column of the dataset. Compared to the model fine-tuned on the `caption` column, the results are quite similar: the model fine-tuned on `short_caption` reaches an accuracy of 99.99%, while the one fine-tuned on `caption` reaches 98.48%. The base model performs significantly worse in this case, although when looping through the entire dataset it behaved similarly to before, with an overall accuracy of ~97%.
## 🔧 Technical Details

The model is a fine-tuned version of the `openai/clip-vit-base-patch32` CLIP model on the `lego_brick_captions` dataset. The fine-tuning was done for 7 epochs on an 80-20 train-validation split of the dataset.
The plot visualizes the normalized text logits produced by the fine-tuned and base models. The input consists of an image of a Lego brick and three captions (one correct and two incorrect). The model generates text logits for each caption, which are then normalized for visualization.
## 📄 License
The model is licensed under the MIT license.