# Marqo-FashionSigLIP Model Card
Marqo-FashionSigLIP is a multimodal embedding model that delivers up to a 57% improvement in MRR and recall over FashionCLIP. It is trained with Generalised Contrastive Learning (GCL), which uses not only text descriptions but also categories, style, colors, materials, keywords, and fine details, so it returns highly relevant search results for fashion products. The model was fine-tuned from ViT-B-16-SigLIP (webli).
- GitHub page: Marqo-FashionCLIP
- Blog: Marqo Blog
## Quick Start
### Features
- Multimodal embedding model for fashion product search.
- Trained with Generalised Contrastive Learning (GCL) on text descriptions as well as categories, style, colors, materials, keywords, and fine details.
- Fine-tuned from ViT-B-16-SigLIP (webli).
### Installation
The usage examples below rely on `transformers`, `torch`, and `pillow` for the Hugging Face interface, `open_clip_torch` for the OpenCLIP interface, and the `@huggingface/transformers` NPM package for Transformers.js; install whichever you plan to use.
### Usage Examples
#### Basic Usage
The model can be used through Hugging Face Transformers, OpenCLIP, or Transformers.js, as shown below.
##### Hugging Face
The model can be loaded with `AutoModel`:
```python
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

# Load the model and processor (custom code is required)
model = AutoModel.from_pretrained('Marqo/marqo-fashionSigLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('Marqo/marqo-fashionSigLIP', trust_remote_code=True)

image = [Image.open("docs/fashion-hippo.png")]
text = ["a hat", "a t-shirt", "shoes"]
processed = processor(text=text, images=image, padding='max_length', return_tensors="pt")

# Compute normalized image and text embeddings
with torch.no_grad():
    image_features = model.get_image_features(processed['pixel_values'], normalize=True)
    text_features = model.get_text_features(processed['input_ids'], normalize=True)

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
```
##### OpenCLIP
The model can be used seamlessly with OpenCLIP:
```python
import open_clip
import torch
from PIL import Image

# Load the model, preprocessing transforms, and tokenizer from the Hugging Face Hub
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionSigLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionSigLIP')

image = preprocess_val(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
##### Transformers.js
You can also run the model in JavaScript with the Transformers.js library. First, install it from NPM:
```bash
npm i @huggingface/transformers
```
Then, compute embeddings as follows:
```js
import { SiglipTextModel, SiglipVisionModel, AutoTokenizer, AutoProcessor, RawImage, softmax, dot } from '@huggingface/transformers';

const model_id = 'Marqo/marqo-fashionSigLIP';

// Load tokenizer, processor, and the text and vision models
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const text_model = await SiglipTextModel.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const vision_model = await SiglipVisionModel.from_pretrained(model_id);

// Compute text embeddings
const texts = ['a hat', 'a t-shirt', 'shoes'];
const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
const { text_embeds } = await text_model(text_inputs);

// Compute image embeddings
const image = await RawImage.read('https://raw.githubusercontent.com/marqo-ai/marqo-FashionCLIP/main/docs/fashion-hippo.png');
const image_inputs = await processor(image);
const { image_embeds } = await vision_model(image_inputs);

// Normalize embeddings and compute label probabilities
const normalized_text_embeds = text_embeds.normalize().tolist();
const normalized_image_embeds = image_embeds.normalize().tolist()[0];

const text_probs = softmax(normalized_text_embeds.map((text_embed) =>
    100.0 * dot(normalized_image_embeds, text_embed)
));
console.log(text_probs);
```
## Documentation
The model was benchmarked on six public multimodal fashion datasets (Atlas, DeepFashion (In-shop), DeepFashion (Multimodal), Fashion200k, KAGL, and Polyvore). The averaged evaluation results are as follows:
### Text-To-Image (Averaged across 6 datasets)

| Model | AvgRecall | Recall@1 | Recall@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.231 | 0.121 | 0.340 | 0.239 |
| FashionCLIP2.0 | 0.163 | 0.077 | 0.249 | 0.165 |
| OpenFashionCLIP | 0.132 | 0.060 | 0.204 | 0.135 |
| ViT-B-16-laion2b_s34b_b88k | 0.174 | 0.088 | 0.261 | 0.180 |
| ViT-B-16-SigLIP-webli | 0.212 | 0.111 | 0.314 | 0.214 |
### Category-To-Product (Averaged across 5 datasets)

| Model | AvgP | P@1 | P@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.737 | 0.758 | 0.716 | 0.812 |
| FashionCLIP2.0 | 0.684 | 0.681 | 0.686 | 0.741 |
| OpenFashionCLIP | 0.646 | 0.653 | 0.639 | 0.720 |
| ViT-B-16-laion2b_s34b_b88k | 0.662 | 0.673 | 0.652 | 0.743 |
| ViT-B-16-SigLIP-webli | 0.688 | 0.690 | 0.685 | 0.751 |
### Sub-Category-To-Product (Averaged across 4 datasets)

| Model | AvgP | P@1 | P@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.725 | 0.767 | 0.683 | 0.811 |
| FashionCLIP2.0 | 0.657 | 0.676 | 0.638 | 0.733 |
| OpenFashionCLIP | 0.598 | 0.619 | 0.578 | 0.689 |
| ViT-B-16-laion2b_s34b_b88k | 0.638 | 0.651 | 0.624 | 0.712 |
| ViT-B-16-SigLIP-webli | 0.643 | 0.643 | 0.643 | 0.726 |
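For reference, Recall@K, Precision@K, and MRR above follow their standard information-retrieval definitions; MRR is the mean of the reciprocal rank across all queries. The sketch below is an illustrative reimplementation of those definitions, not the evaluation code used to produce the tables.

```python
# Standard retrieval metrics (illustrative; not the official evaluation script).
def recall_at_k(ranked_relevant: list[bool], total_relevant: int, k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    return sum(ranked_relevant[:k]) / max(total_relevant, 1)

def precision_at_k(ranked_relevant: list[bool], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(ranked_relevant[:k]) / k

def reciprocal_rank(ranked_relevant: list[bool]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, relevant in enumerate(ranked_relevant, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

# Example: for one query, the 2nd and 7th retrieved items are the only relevant ones.
ranked = [False, True, False, False, False, False, True, False, False, False]
print(recall_at_k(ranked, total_relevant=2, k=10))  # 1.0
print(precision_at_k(ranked, k=10))                 # 0.2
print(reciprocal_rank(ranked))                      # 0.5
```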
## License
This model is released under the Apache 2.0 license.