Marqo-FashionSigLIP Model Card
Marqo-FashionSigLIP is a multimodal embedding model that delivers up to a 57% improvement in MRR and recall over FashionCLIP. It leverages Generalised Contrastive Learning (GCL), which allows it to be trained not only on text descriptions but also on categories, style, colors, materials, keywords, and fine details, so it can return highly relevant search results for fashion products. The model was fine-tuned from ViT-B-16-SigLIP (webli).

GitHub Page: Marqo-FashionCLIP
Blog: Marqo Blog
Quick Start
Marqo-FashionSigLIP is a powerful multimodal embedding model for fashion product search. You can start using it through different libraries as shown below.
Features
- High performance: provides up to a 57% improvement in MRR and recall over FashionCLIP.
- Multimodal training: leverages Generalised Contrastive Learning (GCL) to train on multiple types of data for more relevant search results.
- Fine-tuned model: fine-tuned from ViT-B-16-SigLIP (webli).
Installation
The model itself has no dedicated installation steps; install the dependencies of whichever library you use to load it. For example, to use the model with Transformers.js, install the library from NPM:
npm i @huggingface/transformers
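For the Python examples below, you will likely also need transformers, torch, and Pillow (plus open_clip_torch for the OpenCLIP example), for instance via `pip install transformers torch pillow open_clip_torch`.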
Usage Examples
Basic Usage
Hugging Face
The model can be loaded with AutoModel as follows:
from transformers import AutoModel, AutoProcessor
model = AutoModel.from_pretrained('Marqo/marqo-fashionSigLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('Marqo/marqo-fashionSigLIP', trust_remote_code=True)
import torch
from PIL import Image
image = [Image.open("docs/fashion-hippo.png")]
text = ["a hat", "a t-shirt", "shoes"]
processed = processor(text=text, images=image, padding='max_length', return_tensors="pt")
with torch.no_grad():
    # Encode the image and candidate texts into normalized embeddings,
    # then turn scaled cosine similarities into per-label probabilities
    image_features = model.get_image_features(processed['pixel_values'], normalize=True)
    text_features = model.get_text_features(processed['input_ids'], normalize=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
Advanced Usage
OpenCLIP
The model can be seamlessly used with OpenCLIP:
import open_clip
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionSigLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionSigLIP')
import torch
from PIL import Image
image = preprocess_val(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])
with torch.no_grad(), torch.cuda.amp.autocast():
    # Encode image and text into normalized embeddings and compare them
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
Transformers.js
You can also run the model in JavaScript with the Transformers.js library:
import { SiglipTextModel, SiglipVisionModel, AutoTokenizer, AutoProcessor, RawImage, softmax, dot } from '@huggingface/transformers';
const model_id = 'Marqo/marqo-fashionSigLIP';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const text_model = await SiglipTextModel.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const vision_model = await SiglipVisionModel.from_pretrained(model_id);
const texts = ['a hat', 'a t-shirt', 'shoes'];
const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
const { text_embeds } = await text_model(text_inputs);
const image = await RawImage.read('https://raw.githubusercontent.com/marqo-ai/marqo-FashionCLIP/main/docs/fashion-hippo.png');
const image_inputs = await processor(image);
const { image_embeds } = await vision_model(image_inputs);
const normalized_text_embeds = text_embeds.normalize().tolist();
const normalized_image_embeds = image_embeds.normalize().tolist()[0];
const text_probs = softmax(normalized_text_embeds.map((text_embed) =>
    100.0 * dot(normalized_image_embeds, text_embed)
));
console.log(text_probs);
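As in the Python examples, text_probs is an array of probabilities over the three candidate labels for the query image.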
Documentation
Benchmark Results
Average evaluation results on 6 public multimodal fashion datasets (Atlas, DeepFashion (In-shop), DeepFashion (Multimodal), Fashion200k, KAGL, and Polyvore) are reported below; a short sketch of how the retrieval metrics are computed follows the tables.
Text-To-Image (Averaged across 6 datasets)

| Model | AvgRecall | Recall@1 | Recall@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.231 | 0.121 | 0.340 | 0.239 |
| FashionCLIP2.0 | 0.163 | 0.077 | 0.249 | 0.165 |
| OpenFashionCLIP | 0.132 | 0.060 | 0.204 | 0.135 |
| ViT-B-16-laion2b_s34b_b88k | 0.174 | 0.088 | 0.261 | 0.180 |
| ViT-B-16-SigLIP-webli | 0.212 | 0.111 | 0.314 | 0.214 |
Category-To-Product (Averaged across 5 datasets)

| Model | AvgP | P@1 | P@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.737 | 0.758 | 0.716 | 0.812 |
| FashionCLIP2.0 | 0.684 | 0.681 | 0.686 | 0.741 |
| OpenFashionCLIP | 0.646 | 0.653 | 0.639 | 0.720 |
| ViT-B-16-laion2b_s34b_b88k | 0.662 | 0.673 | 0.652 | 0.743 |
| ViT-B-16-SigLIP-webli | 0.688 | 0.690 | 0.685 | 0.751 |
Sub-Category-To-Product (Averaged across 4 datasets)

| Model | AvgP | P@1 | P@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.725 | 0.767 | 0.683 | 0.811 |
| FashionCLIP2.0 | 0.657 | 0.676 | 0.638 | 0.733 |
| OpenFashionCLIP | 0.598 | 0.619 | 0.578 | 0.689 |
| ViT-B-16-laion2b_s34b_b88k | 0.638 | 0.651 | 0.624 | 0.712 |
| ViT-B-16-SigLIP-webli | 0.643 | 0.643 | 0.643 | 0.726 |
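The tables report standard retrieval metrics. As a rough illustration (not the exact evaluation code used for these benchmarks), Recall@K and MRR for a single query can be computed as follows; dataset-level numbers average these values over all queries, and P@K is the analogous precision within the top K results:

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical ranking for a single query
ranked = ["p3", "p1", "p7", "p2"]
relevant = {"p1"}
print(recall_at_k(ranked, relevant, k=1))   # 0.0
print(recall_at_k(ranked, relevant, k=10))  # 1.0
print(reciprocal_rank(ranked, relevant))    # 0.5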
License
This project is licensed under the Apache-2.0 license.