Marqo-FashionCLIP Open-Source Fashion Multimodal Retrieval Model

Developed by Marqo

Marqo-FashionCLIP is a fashion-domain multimodal retrieval model based on the CLIP architecture, achieving state-of-the-art performance in fashion product search tasks through generalized contrastive learning.

Text-to-Image

Transformers

Language: English

Open Source License: Apache-2.0

Tags: Fashion Multimodal Retrieval, Zero-shot Classification, E-commerce Search Optimization

Downloads 8,376

Release Time: 8/8/2024

Model Overview

This model is optimized for the fashion domain. It accepts both image and text inputs and supports zero-shot image classification and multimodal retrieval. On multiple fashion datasets it surpasses previous state-of-the-art (SOTA) models.

Model Features

Generalized Contrastive Learning

Uses the Generalized Contrastive Learning (GCL) method to train not only on text descriptions but also on other product attributes such as categories, styles, and colors.

Fashion Domain Optimization

Specifically fine-tuned for fashion product search tasks, demonstrating excellent performance on multiple fashion datasets.

Multi-framework Support

Supports various usage methods including Hugging Face, OpenCLIP, and Transformers.js.

Model Capabilities

Zero-shot image classification

Text-to-image retrieval

Image-to-text retrieval

Multimodal feature extraction

Use Cases

E-commerce

Fashion Product Search

Find relevant fashion products based on text descriptions or categories

Surpasses previous state-of-the-art models on multiple fashion datasets

Visual Similarity Search

Find visually similar fashion products based on images
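
Visual similarity search of this kind usually reduces to a nearest-neighbour lookup over image embeddings. The sketch below is a minimal, dependency-free illustration of that idea. The three-dimensional vectors and product IDs are made-up placeholders standing in for embeddings that would in practice come from the model's image encoder, and `top_k_similar` is a hypothetical helper, not part of any Marqo API.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k_similar(query, catalog, k=2):
    """Return the k (product_id, score) pairs most similar to the query.

    `catalog` maps a product ID to its (unnormalized) embedding.
    """
    q = normalize(query)
    scored = []
    for product_id, emb in catalog.items():
        e = normalize(emb)
        score = sum(a * b for a, b in zip(q, e))  # cosine similarity
        scored.append((product_id, score))
    # Highest similarity first
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy embeddings (placeholders, not real model output)
catalog = {
    "red-dress": [0.9, 0.1, 0.0],
    "blue-jeans": [0.0, 0.2, 0.9],
    "pink-dress": [0.8, 0.3, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]  # stands in for the embedding of a query photo

results = top_k_similar(query_embedding, catalog, k=2)
print([product_id for product_id, _ in results])
```

For a large catalog, the linear scan would normally be replaced by a vector index, but the ranking criterion stays the same.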

Content Management

Automatic Product Tagging

Automatically generate labels and descriptions for fashion product images
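
One simple way to approach tagging with a CLIP-style model is to score an image embedding against candidate label prompts, grouped by attribute, and keep the best label in each group. The sketch below assumes the similarity scores have already been computed. The numbers, the attribute groups, and the `tag_product` helper are illustrative assumptions rather than anything defined by this model card.

```python
def best_label(similarities):
    """Return the label with the highest similarity score."""
    return max(similarities, key=similarities.get)

def tag_product(scores_by_attribute):
    """Pick one label per attribute group (for example category or color)."""
    return {attr: best_label(sims) for attr, sims in scores_by_attribute.items()}

# Made-up cosine similarities between one image embedding and each label prompt
scores_by_attribute = {
    "category": {"dress": 0.31, "jeans": 0.12, "sneakers": 0.10},
    "color": {"red": 0.27, "blue": 0.14, "black": 0.18},
}

tags = tag_product(scores_by_attribute)
print(tags)
```

Grouping labels by attribute keeps unrelated labels such as "red" and "dress" from competing with each other for the same slot.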

🚀 Marqo-FashionCLIP Model Card

Marqo-FashionCLIP and Marqo-FashionSigLIP outperform previous state-of-the-art fashion CLIP models. Marqo-FashionCLIP uses Generalised Contrastive Learning (GCL) to train on various fashion-related information, providing highly relevant search results for fashion products. It is fine-tuned from ViT-B-16 (laion2b_s34b_b88k).

Github Page: Marqo-FashionCLIP

Blog: Marqo Blog

🚀 Quick Start

✨ Features

Tags: clip, e-commerce, fashion, multimodal retrieval, transformers.js, transformers
Library Name: open_clip
Pipeline Tag: zero-shot-image-classification
License: apache-2.0
Language: en
Metrics: precision, recall, MRR

Property	Details
Model Type	Marqo-FashionCLIP
Training Data	Not specified

💻 Usage Examples

Basic Usage

Hugging Face

The model can be loaded through the Transformers AutoModel and AutoProcessor classes:

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# trust_remote_code=True allows Transformers to run the custom model code hosted in the repository
model = AutoModel.from_pretrained('Marqo/marqo-fashionCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('Marqo/marqo-fashionCLIP', trust_remote_code=True)

# One image and three candidate labels
image = [Image.open("docs/fashion-hippo.png")]
text = ["a hat", "a t-shirt", "shoes"]
processed = processor(text=text, images=image, padding='max_length', return_tensors="pt")

with torch.no_grad():
    # normalize=True is passed so that the dot product below acts as a cosine similarity
    image_features = model.get_image_features(processed['pixel_values'], normalize=True)
    text_features = model.get_text_features(processed['input_ids'], normalize=True)

    # Scale the similarities by 100 and apply softmax over the labels
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [0.99990773, 0.00006382, 0.00002847]
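
The factor of 100.0 in the example acts as an inverse temperature: cosine similarities lie in [-1, 1], so applying softmax to them directly gives a nearly uniform distribution. The sketch below illustrates the effect with made-up similarity values (they are not outputs of the model) and a small pure-Python `softmax` helper written for this illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up cosine similarities between one image and three label prompts
similarities = [0.30, 0.20, 0.19]

unscaled = softmax(similarities)                      # nearly uniform
scaled = softmax([100.0 * s for s in similarities])   # sharply peaked

print("unscaled:", [round(p, 3) for p in unscaled])
print("scaled:  ", [round(p, 6) for p in scaled])
```

With the scale applied, a similarity gap of 0.1 becomes a logit gap of 10, which is why the top label in the example above receives almost all of the probability mass.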

Advanced Usage

OpenCLIP

The model can also be used directly with OpenCLIP:

import open_clip
import torch
from PIL import Image

# Load the model, its preprocessing transforms, and the tokenizer from the Hugging Face Hub
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionCLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionCLIP')

# Use the validation transform for inference; unsqueeze(0) adds the batch dimension
image = preprocess_val(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)

    # Scaled similarities, converted to probabilities over the three labels
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [0.9998498302475922, 0.000119267522939106, 0.000030902229468640687]

Transformers.js

You can also run the model in JavaScript with the Transformers.js library.

First, install it from NPM using:

npm i @huggingface/transformers

Then, compute embeddings as follows:

import { CLIPTextModelWithProjection, CLIPVisionModelWithProjection, AutoTokenizer, AutoProcessor, RawImage, softmax, dot } from '@huggingface/transformers';

const model_id = 'Marqo/marqo-fashionCLIP';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const text_model = await CLIPTextModelWithProjection.from_pretrained(model_id);

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained(model_id);
const vision_model = await CLIPVisionModelWithProjection.from_pretrained(model_id);

// Run tokenization
const texts = ['a hat', 'a t-shirt', 'shoes'];
const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });

// Compute text embeddings
const { text_embeds } = await text_model(text_inputs);

// Read image and run processor
const image = await RawImage.read('https://raw.githubusercontent.com/marqo-ai/marqo-FashionCLIP/main/docs/fashion-hippo.png');
const image_inputs = await processor(image);

// Compute vision embeddings
const { image_embeds } = await vision_model(image_inputs);

// Compute similarity scores
const normalized_text_embeds = text_embeds.normalize().tolist();
const normalized_image_embeds = image_embeds.normalize().tolist()[0];

const text_probs = softmax(normalized_text_embeds.map((text_embed) => 
    100.0 * dot(normalized_image_embeds, text_embed)
));
console.log(text_probs);
// [0.9998498302475922, 0.000119267522939106, 0.000030902229468640687]

📚 Documentation

Benchmark Results

Average evaluation results on 6 public multimodal fashion datasets (Atlas, DeepFashion (In-shop), DeepFashion (Multimodal), Fashion200k, KAGL, and Polyvore) are reported below:

Text-To-Image (Averaged across 6 datasets)

Model	AvgRecall	Recall@1	Recall@10	MRR
Marqo-FashionCLIP	0.192	0.094	0.290	0.200
FashionCLIP2.0	0.163	0.077	0.249	0.165
OpenFashionCLIP	0.132	0.060	0.204	0.135
ViT-B-16-laion2b_s34b_b88k	0.174	0.088	0.261	0.180
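
For readers unfamiliar with the retrieval metrics in the table above, the following sketch shows how Recall@k and MRR are commonly computed from ranked results. It uses standard textbook definitions and toy data; the exact evaluation protocol behind the reported numbers is not specified in this card, and the function names are hypothetical. The AvgRecall column is numerically consistent with the mean of Recall@1 and Recall@10 (for example, (0.094 + 0.290) / 2 = 0.192), although the card does not define it explicitly.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Toy example: two text queries, each with a ranked list of image IDs
queries = [
    {"ranked": ["img3", "img1", "img7"], "relevant": ["img1"]},
    {"ranked": ["img2", "img9", "img4"], "relevant": ["img2"]},
]

recall_1 = mean(recall_at_k(q["ranked"], q["relevant"], 1) for q in queries)
recall_3 = mean(recall_at_k(q["ranked"], q["relevant"], 3) for q in queries)
mrr = mean(reciprocal_rank(q["ranked"], q["relevant"]) for q in queries)

print(recall_1, recall_3, mrr)
```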

Category-To-Product (Averaged across 5 datasets)

Model	AvgP	P@1	P@10	MRR
Marqo-FashionCLIP	0.705	0.734	0.676	0.776
FashionCLIP2.0	0.684	0.681	0.686	0.741
OpenFashionCLIP	0.646	0.653	0.639	0.720
ViT-B-16-laion2b_s34b_b88k	0.662	0.673	0.652	0.743

Sub-Category-To-Product (Averaged across 4 datasets)

Model	AvgP	P@1	P@10	MRR
Marqo-FashionCLIP	0.707	0.747	0.667	0.772
FashionCLIP2.0	0.657	0.676	0.638	0.733
OpenFashionCLIP	0.598	0.619	0.578	0.689
ViT-B-16-laion2b_s34b_b88k	0.638	0.651	0.624	0.712
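
The category and sub-category tables report precision rather than recall, because many products can be correct for a single category query. The sketch below shows the usual definition of Precision@k on toy data. The product IDs and the `precision_at_k` helper are illustrative assumptions, and the card does not state the exact protocol used. The AvgP column is numerically consistent with the mean of P@1 and P@10 (for example, (0.734 + 0.676) / 2 = 0.705).

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant.

    Divides by k, so a result list shorter than k is penalized.
    """
    relevant = set(relevant_ids)
    top_k = ranked_ids[:k]
    return sum(1 for item in top_k if item in relevant) / k

# Toy example: products returned for the category query "dress"
ranked = ["p1", "p2", "p3", "p4", "p5"]
relevant = ["p1", "p3", "p4", "p8"]  # products actually labeled "dress"

p_at_1 = precision_at_k(ranked, relevant, 1)
p_at_5 = precision_at_k(ranked, relevant, 5)
avg_p = (p_at_1 + p_at_5) / 2  # mean of the two cut-offs, analogous to AvgP

print(p_at_1, p_at_5, avg_p)
```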

📄 License

This project is licensed under the Apache-2.0 license.
