# Marqo-FashionSigLIP Model Card
Marqo-FashionSigLIP is a multimodal embedding model that delivers up to a 57% improvement in MRR and recall over FashionCLIP. It is trained with Generalised Contrastive Learning (GCL), which uses not only text descriptions but also categories, style, colors, materials, keywords, and fine details, so it returns highly relevant search results for fashion products. The model was fine-tuned from ViT-B-16-SigLIP (webli).
- GitHub page: Marqo-FashionCLIP
- Blog: Marqo Blog
## Quick Start
### Features
- Multimodal embedding model for fashion product search.
- Trained with Generalised Contrastive Learning (GCL) on text descriptions as well as categories, style, colors, materials, keywords, and fine details.
- Fine-tuned from ViT-B-16-SigLIP (webli).
### Installation
The usage examples below rely on `transformers`, `torch`, and `pillow` for the Hugging Face interface, `open_clip_torch` for the OpenCLIP interface, and the `@huggingface/transformers` NPM package for Transformers.js; install whichever you plan to use.
### Usage Examples
#### Basic Usage
The model can be used through Hugging Face Transformers, OpenCLIP, or Transformers.js, as shown below.
##### Hugging Face
The model can be loaded with `AutoModel`:
```python
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

# Load the model and processor (custom code is required)
model = AutoModel.from_pretrained('Marqo/marqo-fashionSigLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('Marqo/marqo-fashionSigLIP', trust_remote_code=True)

image = [Image.open("docs/fashion-hippo.png")]
text = ["a hat", "a t-shirt", "shoes"]
processed = processor(text=text, images=image, padding='max_length', return_tensors="pt")

# Compute normalized image and text embeddings
with torch.no_grad():
    image_features = model.get_image_features(processed['pixel_values'], normalize=True)
    text_features = model.get_text_features(processed['input_ids'], normalize=True)

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
```
##### OpenCLIP
The model can be used seamlessly with OpenCLIP:
```python
import open_clip
import torch
from PIL import Image

# Load the model, preprocessing transforms, and tokenizer from the Hugging Face Hub
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionSigLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionSigLIP')

image = preprocess_val(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
##### Transformers.js
You can also run the model in JavaScript with the Transformers.js library. First, install it from NPM:
```bash
npm i @huggingface/transformers
```
Then, compute embeddings as follows:
```js
import { SiglipTextModel, SiglipVisionModel, AutoTokenizer, AutoProcessor, RawImage, softmax, dot } from '@huggingface/transformers';

const model_id = 'Marqo/marqo-fashionSigLIP';

// Load tokenizer, processor, and the text and vision models
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const text_model = await SiglipTextModel.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const vision_model = await SiglipVisionModel.from_pretrained(model_id);

// Compute text embeddings
const texts = ['a hat', 'a t-shirt', 'shoes'];
const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
const { text_embeds } = await text_model(text_inputs);

// Compute image embeddings
const image = await RawImage.read('https://raw.githubusercontent.com/marqo-ai/marqo-FashionCLIP/main/docs/fashion-hippo.png');
const image_inputs = await processor(image);
const { image_embeds } = await vision_model(image_inputs);

// Normalize embeddings and compute label probabilities
const normalized_text_embeds = text_embeds.normalize().tolist();
const normalized_image_embeds = image_embeds.normalize().tolist()[0];

const text_probs = softmax(normalized_text_embeds.map((text_embed) =>
    100.0 * dot(normalized_image_embeds, text_embed)
));
console.log(text_probs);
```
## Documentation
The model was benchmarked on six public multimodal fashion datasets (Atlas, DeepFashion (In-shop), DeepFashion (Multimodal), Fashion200k, KAGL, and Polyvore). The averaged evaluation results are as follows:
### Text-To-Image (Averaged across 6 datasets)

| Model | AvgRecall | Recall@1 | Recall@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.231 | 0.121 | 0.340 | 0.239 |
| FashionCLIP2.0 | 0.163 | 0.077 | 0.249 | 0.165 |
| OpenFashionCLIP | 0.132 | 0.060 | 0.204 | 0.135 |
| ViT-B-16-laion2b_s34b_b88k | 0.174 | 0.088 | 0.261 | 0.180 |
| ViT-B-16-SigLIP-webli | 0.212 | 0.111 | 0.314 | 0.214 |
### Category-To-Product (Averaged across 5 datasets)

| Model | AvgP | P@1 | P@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.737 | 0.758 | 0.716 | 0.812 |
| FashionCLIP2.0 | 0.684 | 0.681 | 0.686 | 0.741 |
| OpenFashionCLIP | 0.646 | 0.653 | 0.639 | 0.720 |
| ViT-B-16-laion2b_s34b_b88k | 0.662 | 0.673 | 0.652 | 0.743 |
| ViT-B-16-SigLIP-webli | 0.688 | 0.690 | 0.685 | 0.751 |
### Sub-Category-To-Product (Averaged across 4 datasets)

| Model | AvgP | P@1 | P@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.725 | 0.767 | 0.683 | 0.811 |
| FashionCLIP2.0 | 0.657 | 0.676 | 0.638 | 0.733 |
| OpenFashionCLIP | 0.598 | 0.619 | 0.578 | 0.689 |
| ViT-B-16-laion2b_s34b_b88k | 0.638 | 0.651 | 0.624 | 0.712 |
| ViT-B-16-SigLIP-webli | 0.643 | 0.643 | 0.643 | 0.726 |
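For reference, Recall@K, Precision@K, and MRR above follow their standard information-retrieval definitions; MRR is the mean of the reciprocal rank across all queries. The sketch below is an illustrative reimplementation of those definitions, not the evaluation code used to produce the tables.

```python
# Standard retrieval metrics (illustrative; not the official evaluation script).
def recall_at_k(ranked_relevant: list[bool], total_relevant: int, k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    return sum(ranked_relevant[:k]) / max(total_relevant, 1)

def precision_at_k(ranked_relevant: list[bool], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(ranked_relevant[:k]) / k

def reciprocal_rank(ranked_relevant: list[bool]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, relevant in enumerate(ranked_relevant, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

# Example: for one query, the 2nd and 7th retrieved items are the only relevant ones.
ranked = [False, True, False, False, False, False, True, False, False, False]
print(recall_at_k(ranked, total_relevant=2, k=10))  # 1.0
print(precision_at_k(ranked, k=10))                 # 0.2
print(reciprocal_rank(ranked))                      # 0.5
```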
## License
This model is released under the Apache 2.0 license.