marqo-fashionSigLIP Open-Source Fashion Multimodal Retrieval Model - Precise Support for Fashion Product Search

Marqo-FashionSigLIP

Developed by Styld

A fine-tuned fashion multimodal retrieval model based on ViT-B-16-SigLIP, specializing in fashion product search

Language: English

Open Source License: Apache-2.0

Tags: #Fashion Multimodal Retrieval #Zero-shot Classification #E-commerce Search Optimization

Downloads 39

Release Time: 8/21/2024

Model Overview

This model uses Generalized Contrastive Learning (GCL), which lets it be trained on several kinds of features (text descriptions, categories, styles, colors, and so on) so that it returns highly relevant search results for fashion products.

Model Features

Generalized Contrastive Learning

Supports training based on various features including text descriptions, categories, styles, colors, materials, etc.

Fashion Domain Optimization

Specially fine-tuned for fashion product search scenarios, delivering more accurate retrieval results

Multimodal Support

Supports both image and text inputs, enabling cross-modal retrieval

Model Capabilities

Zero-shot image classification

Text-to-image retrieval

Image-to-text retrieval

Fashion product search

Multimodal feature extraction

Use Cases

E-commerce

Fashion Product Search

Search for relevant fashion products based on text descriptions or categories

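In the benchmark results below, it scores higher than FashionCLIP2.0, OpenFashionCLIP, and the two ViT-B-16 baselines on every averaged text-to-image, category-to-product, and sub-category-to-product metric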

Visual Similarity Search

Find fashion products with similar styles based on example images
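
Visual similarity search of this kind usually reduces to a nearest-neighbour lookup over image embeddings. The sketch below is a minimal illustration under stated assumptions, not Marqo's implementation: the function name `top_k_similar` and the toy 3-dimensional vectors are hypothetical stand-ins for embeddings that would in practice come from `model.encode_image`.

```python
import numpy as np

def top_k_similar(query_vec, catalog_vecs, k=3):
    """Return (indices, scores) of the k catalog rows closest to the query."""
    # L2-normalise so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = catalog_vecs / np.linalg.norm(catalog_vecs, axis=1, keepdims=True)
    scores = c @ q
    # argsort is ascending, so sort on the negated scores.
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores[order].tolist()

# Hypothetical toy embeddings, one row per catalog product.
catalog = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])  # embedding of the example image

indices, scores = top_k_similar(query, catalog, k=2)
print(indices)  # [0, 1]
```

For a large catalog, the brute-force matrix product would typically be replaced by an approximate nearest-neighbour index; the ranking logic stays the same.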

🚀 Marqo-FashionSigLIP Model Card

Marqo-FashionSigLIP utilizes Generalised Contrastive Learning (GCL). This enables the model to be trained on not only text descriptions but also categories, styles, colors, materials, keywords, and fine details. As a result, it can provide highly relevant search results for fashion products. The model is fine-tuned from ViT-B-16-SigLIP (webli).

GitHub Page: Marqo-FashionCLIP

Blog: Marqo Blog

🚀 Quick Start

The model loads directly through OpenCLIP (the open_clip package) from the Hugging Face Hub.

💻 Usage Examples

Basic Usage

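import open_clip
import torch
from PIL import Image

# Load the model, its preprocessing transforms, and the matching tokenizer from the Hugging Face Hub.
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionSigLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionSigLIP')
model.eval()

# Preprocess one image (adds a batch dimension) and tokenize the candidate labels.
image = preprocess_val(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])

# torch.autocast replaces the deprecated torch.cuda.amp.autocast; it is a no-op without CUDA.
with torch.no_grad(), torch.autocast(device_type="cuda", enabled=torch.cuda.is_available()):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalise so the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale the similarities and turn them into a distribution over the labels.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)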
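
To make the last step of the snippet concrete, the following sketch reproduces the "normalise, scale by 100, softmax" computation with NumPy on hand-made vectors, so it runs without downloading the model. The 2-dimensional embeddings and the helper name `label_probs` are hypothetical; only the scale factor of 100.0 and the label strings come from the example above.

```python
import numpy as np

def label_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and N texts."""
    img = image_vec / np.linalg.norm(image_vec)
    txt = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    logits = scale * (txt @ img)
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

labels = ["a hat", "a t-shirt", "shoes"]
# Hypothetical toy embeddings; real ones come from encode_text / encode_image.
text_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
image_vec = np.array([0.9, 0.1])

probs = label_probs(image_vec, text_vecs)
best = labels[int(np.argmax(probs))]
print(best)  # a hat
```

The large scale factor sharpens the distribution: even modest gaps in cosine similarity become a near one-hot probability vector, which is why the top label usually dominates the printed output.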

📚 Documentation

Benchmark Results

The following are the average evaluation results on 6 public multimodal fashion datasets (Atlas, DeepFashion (In-shop), DeepFashion (Multimodal), Fashion200k, KAGL, and Polyvore):

Text-To-Image (Averaged across 6 datasets)

Model	AvgRecall	Recall@1	Recall@10	MRR
Marqo-FashionSigLIP	0.231	0.121	0.340	0.239
FashionCLIP2.0	0.163	0.077	0.249	0.165
OpenFashionCLIP	0.132	0.060	0.204	0.135
ViT-B-16-laion2b_s34b_b88k	0.174	0.088	0.261	0.180
ViT-B-16-SigLIP-webli	0.212	0.111	0.314	0.214
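
For readers unfamiliar with the column names, here is a minimal sketch of how Recall@k and MRR are commonly computed from ranked retrieval results. It uses the textbook definitions and made-up image IDs; the exact evaluation protocol behind the table is not described here, so treat this as an assumption. The AvgRecall line is likewise an inference: the table values are numerically consistent with the mean of Recall@1 and Recall@10 (for example, (0.121 + 0.340) / 2 ≈ 0.231), but the source does not state this.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical results: (ranked image IDs, ground-truth image IDs) per text query.
results = [
    (["img3", "img1", "img7"], ["img1"]),  # correct image at rank 2
    (["img2", "img9", "img4"], ["img2"]),  # correct image at rank 1
]

recall_1 = mean(recall_at_k(r, rel, 1) for r, rel in results)    # 0.5
recall_10 = mean(recall_at_k(r, rel, 10) for r, rel in results)  # 1.0
mrr = mean(reciprocal_rank(r, rel) for r, rel in results)        # 0.75
avg_recall = (recall_1 + recall_10) / 2                          # 0.75 (assumed definition)
print(recall_1, recall_10, mrr, avg_recall)
```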

Category-To-Product (Averaged across 5 datasets)

Model	AvgP	P@1	P@10	MRR
Marqo-FashionSigLIP	0.737	0.758	0.716	0.812
FashionCLIP2.0	0.684	0.681	0.686	0.741
OpenFashionCLIP	0.646	0.653	0.639	0.720
ViT-B-16-laion2b_s34b_b88k	0.662	0.673	0.652	0.743
ViT-B-16-SigLIP-webli	0.688	0.690	0.685	0.751

Sub-Category-To-Product (Averaged across 4 datasets)

Model	AvgP	P@1	P@10	MRR
Marqo-FashionSigLIP	0.725	0.767	0.683	0.811
FashionCLIP2.0	0.657	0.676	0.638	0.733
OpenFashionCLIP	0.598	0.619	0.578	0.689
ViT-B-16-laion2b_s34b_b88k	0.638	0.651	0.624	0.712
ViT-B-16-SigLIP-webli	0.643	0.643	0.643	0.726
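
The category and sub-category tables report precision rather than recall, since a category query has many correct products. The sketch below shows the standard P@k definition on made-up product IDs; it is an illustration, not the benchmark's evaluation code. Reading AvgP as the mean of P@1 and P@10 is an assumption that matches the table arithmetic (for example, (0.758 + 0.716) / 2 = 0.737), and the toy example uses k = 4 in place of 10 only to keep the list short.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant (divides by k)."""
    relevant = set(relevant_ids)
    hits = sum(1 for item in ranked_ids[:k] if item in relevant)
    return hits / k

# Hypothetical ranking returned for one category query, e.g. "dresses".
ranked = ["p1", "p5", "p2", "p9"]
relevant = ["p1", "p2", "p3"]  # products that truly belong to the category

p_at_1 = precision_at_k(ranked, relevant, 1)  # 1.0
p_at_4 = precision_at_k(ranked, relevant, 4)  # 0.5
avg_p = (p_at_1 + p_at_4) / 2                 # 0.75 (assumed definition)
print(p_at_1, p_at_4, avg_p)
```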

📄 License

This model is licensed under the Apache-2.0 license.

📋 Information Table

Property	Details
Tags	clip, e-commerce, fashion, multimodal retrieval, siglip
Library Name	open_clip
Pipeline Tag	zero-shot-image-classification
License	apache-2.0
Datasets	Marqo/atlas, Marqo/deepfashion-inshop, Marqo/deepfashion-multimodal, Marqo/fashion200k, Marqo/iMaterialist, Marqo/KAGL, Marqo/polyvore
Language	en
Metrics	precision, recall, MRR
