Marqo-FashionSigLIP Model Card
Marqo-FashionSigLIP is a multimodal embedding model that delivers up to a 57% improvement in MRR and recall over FashionCLIP. It leverages Generalised Contrastive Learning (GCL), which allows it to be trained not only on text descriptions but also on categories, style, colors, materials, keywords, and fine details, so it can return highly relevant search results for fashion products. The model was fine-tuned from ViT-B-16-SigLIP (webli).

GitHub Page: Marqo-FashionCLIP
Blog: Marqo Blog
Quick Start
Marqo-FashionSigLIP is a powerful multimodal embedding model for fashion product search. You can start using it through different libraries as shown below.
Features
- High performance: provides up to a 57% improvement in MRR and recall over FashionCLIP.
- Multimodal training: leverages Generalised Contrastive Learning (GCL) to train on multiple types of data for more relevant search results.
- Fine-tuned model: fine-tuned from ViT-B-16-SigLIP (webli).
Installation
The model itself has no dedicated installation steps; install the dependencies of whichever library you use to load it. For example, to use the model with Transformers.js, install the library from NPM:
npm i @huggingface/transformers
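For the Python examples below, you will likely also need transformers, torch, and Pillow (plus open_clip_torch for the OpenCLIP example), for instance via `pip install transformers torch pillow open_clip_torch`.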
Usage Examples
Basic Usage
Hugging Face
The model can be loaded with AutoModel as follows:
from transformers import AutoModel, AutoProcessor
model = AutoModel.from_pretrained('Marqo/marqo-fashionSigLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('Marqo/marqo-fashionSigLIP', trust_remote_code=True)
import torch
from PIL import Image
image = [Image.open("docs/fashion-hippo.png")]
text = ["a hat", "a t-shirt", "shoes"]
processed = processor(text=text, images=image, padding='max_length', return_tensors="pt")
with torch.no_grad():
    # Encode the image and candidate texts into normalized embeddings,
    # then turn scaled cosine similarities into per-label probabilities
    image_features = model.get_image_features(processed['pixel_values'], normalize=True)
    text_features = model.get_text_features(processed['input_ids'], normalize=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
Advanced Usage
OpenCLIP
The model can be seamlessly used with OpenCLIP:
import open_clip
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionSigLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionSigLIP')
import torch
from PIL import Image
image = preprocess_val(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])
with torch.no_grad(), torch.cuda.amp.autocast():
    # Encode image and text into normalized embeddings and compare them
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
Transformers.js
You can also run the model in JavaScript with the Transformers.js library:
import { SiglipTextModel, SiglipVisionModel, AutoTokenizer, AutoProcessor, RawImage, softmax, dot } from '@huggingface/transformers';
const model_id = 'Marqo/marqo-fashionSigLIP';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const text_model = await SiglipTextModel.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const vision_model = await SiglipVisionModel.from_pretrained(model_id);
const texts = ['a hat', 'a t-shirt', 'shoes'];
const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
const { text_embeds } = await text_model(text_inputs);
const image = await RawImage.read('https://raw.githubusercontent.com/marqo-ai/marqo-FashionCLIP/main/docs/fashion-hippo.png');
const image_inputs = await processor(image);
const { image_embeds } = await vision_model(image_inputs);
const normalized_text_embeds = text_embeds.normalize().tolist();
const normalized_image_embeds = image_embeds.normalize().tolist()[0];
const text_probs = softmax(normalized_text_embeds.map((text_embed) =>
    100.0 * dot(normalized_image_embeds, text_embed)
));
console.log(text_probs);
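As in the Python examples, text_probs is an array of probabilities over the three candidate labels for the query image.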
Documentation
Benchmark Results
Average evaluation results on 6 public multimodal fashion datasets (Atlas, DeepFashion (In-shop), DeepFashion (Multimodal), Fashion200k, KAGL, and Polyvore) are reported below; a short sketch of how the retrieval metrics are computed follows the tables.
Text-To-Image (Averaged across 6 datasets)

| Model | AvgRecall | Recall@1 | Recall@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.231 | 0.121 | 0.340 | 0.239 |
| FashionCLIP2.0 | 0.163 | 0.077 | 0.249 | 0.165 |
| OpenFashionCLIP | 0.132 | 0.060 | 0.204 | 0.135 |
| ViT-B-16-laion2b_s34b_b88k | 0.174 | 0.088 | 0.261 | 0.180 |
| ViT-B-16-SigLIP-webli | 0.212 | 0.111 | 0.314 | 0.214 |
Category-To-Product (Averaged across 5 datasets)

| Model | AvgP | P@1 | P@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.737 | 0.758 | 0.716 | 0.812 |
| FashionCLIP2.0 | 0.684 | 0.681 | 0.686 | 0.741 |
| OpenFashionCLIP | 0.646 | 0.653 | 0.639 | 0.720 |
| ViT-B-16-laion2b_s34b_b88k | 0.662 | 0.673 | 0.652 | 0.743 |
| ViT-B-16-SigLIP-webli | 0.688 | 0.690 | 0.685 | 0.751 |
Sub-Category-To-Product (Averaged across 4 datasets)

| Model | AvgP | P@1 | P@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.725 | 0.767 | 0.683 | 0.811 |
| FashionCLIP2.0 | 0.657 | 0.676 | 0.638 | 0.733 |
| OpenFashionCLIP | 0.598 | 0.619 | 0.578 | 0.689 |
| ViT-B-16-laion2b_s34b_b88k | 0.638 | 0.651 | 0.624 | 0.712 |
| ViT-B-16-SigLIP-webli | 0.643 | 0.643 | 0.643 | 0.726 |
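The tables report standard retrieval metrics. As a rough illustration (not the exact evaluation code used for these benchmarks), Recall@K and MRR for a single query can be computed as follows; dataset-level numbers average these values over all queries, and P@K is the analogous precision within the top K results:

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical ranking for a single query
ranked = ["p3", "p1", "p7", "p2"]
relevant = {"p1"}
print(recall_at_k(ranked, relevant, k=1))   # 0.0
print(recall_at_k(ranked, relevant, k=10))  # 1.0
print(reciprocal_rank(ranked, relevant))    # 0.5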
License
This project is licensed under the Apache-2.0 license.