Open-Source FashionCLIP Vision-Language Model - Fine-Tuned Specifically for the Fashion Industry to Generate General Product Representations

Fashion Embedder

Developed by McClain

FashionCLIP is a vision-language model based on CLIP, specifically fine-tuned for the fashion domain, capable of generating universal fashion product representations.

Text-to-Image

Transformers

English | Open Source License: MIT | #Fashion Product Representation #Zero-shot Transfer #E-commerce Visual Search

Release Time: 5/16/2024

Model Overview

The model is trained on a dataset of 800,000 fashion products through contrastive learning, aiming to generate transferable product representations for fashion concepts, supporting zero-shot transfer to new datasets and tasks.

Model Features

Fashion Domain Optimization

Fine-tuned on a specialized dataset containing 800,000 fashion products, significantly improving performance on fashion-related tasks

Zero-shot Transfer Capability

The learned representations can be directly transferred to new fashion datasets and tasks without additional training

Improved Version

FashionCLIP 2.0 is fine-tuned from the laion/CLIP-ViT-B-32-laion2B-s34B-b79K checkpoint and outperforms the original FashionCLIP on all three reported benchmarks (FMNIST, KAGL, DEEP)

Model Capabilities

Fashion product image classification

Image-text matching

Fashion concept representation generation

Cross-domain zero-shot transfer

Use Cases

E-commerce

Product Search

Match relevant fashion product images through text queries

Improves search accuracy and user experience
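
As a rough illustration of how text-to-image product search can work on top of the model's embeddings, the sketch below ranks catalog images by cosine similarity to a text query embedding. The tiny 3-dimensional vectors and the `search` helper are hypothetical stand-ins; real embeddings would come from the model's text and image encoders.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def search(query_embedding, image_embeddings, top_k=2):
    """Return (index, score) pairs for the top_k images most similar to the query."""
    q = normalize(query_embedding)
    scores = []
    for idx, emb in enumerate(image_embeddings):
        e = normalize(emb)
        scores.append((idx, sum(a * b for a, b in zip(q, e))))
    # Highest cosine similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# Hypothetical embeddings (assumption: made-up values for illustration only).
catalog = [
    [0.9, 0.1, 0.0],  # e.g. image of a red dress
    [0.0, 1.0, 0.1],  # e.g. image of blue sneakers
    [0.8, 0.2, 0.1],  # e.g. image of a red skirt
]
query = [1.0, 0.0, 0.0]  # e.g. text embedding for "red dress"

results = search(query, catalog)
print(results)
```

Because image embeddings do not depend on the query, they can be computed once and stored, so only the query text needs to be encoded at search time.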

Automatic Tag Generation

Automatically generate descriptive tags for fashion product images

Reduces manual labeling costs
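
One way to approach automatic tagging is zero-shot classification: embed each candidate tag as text, compare it with the image embedding, and turn the similarities into probabilities with a softmax. The sketch below assumes this approach; the embeddings, the tag names, and the `temperature` value are illustrative assumptions, not values taken from the model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_tags(image_embedding, tag_embeddings, temperature=0.01):
    """Return {tag: probability} via a softmax over temperature-scaled similarities."""
    logits = {
        tag: cosine(image_embedding, emb) / temperature
        for tag, emb in tag_embeddings.items()
    }
    # Subtract the maximum logit for numerical stability.
    max_logit = max(logits.values())
    exps = {tag: math.exp(value - max_logit) for tag, value in logits.items()}
    total = sum(exps.values())
    return {tag: value / total for tag, value in exps.items()}

# Hypothetical text embeddings for candidate tags (assumption: made-up values).
tag_embeddings = {
    "dress": [1.0, 0.0, 0.0],
    "sneakers": [0.0, 1.0, 0.0],
    "handbag": [0.0, 0.0, 1.0],
}
image_embedding = [0.9, 0.2, 0.1]  # hypothetical embedding of a product photo

probs = zero_shot_tags(image_embedding, tag_embeddings)
best_tag = max(probs, key=probs.get)
print(best_tag, probs)
```

No tag-specific training is needed here: adding a new tag only requires embedding its text.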

Fashion Recommendation

Visual Similarity Recommendation

Recommend similar fashion products based on image similarity

Increases conversion rates and user satisfaction

🚀 Model Card: Fashion CLIP

Fashion CLIP is a CLIP-based model designed to generate general product representations for fashion concepts, leveraging domain-specific fine-tuning for better performance.

🚀 Quick Start

This README provides detailed information about the Fashion CLIP model, including its details, training data, limitations, and citation.

✨ Features

Built on the CLIP model to generate general product representations for fashion concepts.
Fine-tuned on a large-scale, high-quality fashion dataset to improve zero-shot transferability.
FashionCLIP 2.0 achieves the highest weighted macro F1 score among the four compared models on the FMNIST, KAGL, and DEEP benchmarks.

📚 Documentation

Model Details

UPDATE (10/03/23): We have updated the model! We found that the laion/CLIP-ViT-B-32-laion2B-s34B-b79K checkpoint (thanks Bin!) worked better than the original OpenAI CLIP on fashion. We thus fine-tuned a newer (and better!) version of FashionCLIP (henceforth FashionCLIP 2.0), while keeping the architecture the same. We postulate that the performance gains afforded by laion/CLIP-ViT-B-32-laion2B-s34B-b79K are due to the increased training data (5x OpenAI CLIP data). Our thesis, however, remains the same: fine-tuning laion/CLIP on our fashion dataset improved zero-shot performance across our benchmarks. The table below compares weighted macro F1 scores across models.

Model	FMNIST	KAGL	DEEP
OpenAI CLIP	0.66	0.63	0.45
FashionCLIP	0.74	0.67	0.48
Laion CLIP	0.78	0.71	0.58
FashionCLIP 2.0	0.83	0.73	0.62
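
The source does not spell out how the weighted macro F1 score in the table above is computed. One common reading, assumed here, is the per-class F1 score averaged with weights proportional to each class's support (what scikit-learn calls `average="weighted"`). The labels in the sketch are invented for illustration.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1
    return score

# Hypothetical labels (assumption: not taken from any benchmark).
y_true = ["shirt", "shirt", "shirt", "bag"]
y_pred = ["shirt", "shirt", "bag", "bag"]

score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Weighting by support means frequent classes dominate the score, which matters when a benchmark's classes are imbalanced.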

FashionCLIP is a CLIP-based model developed to produce general product representations for fashion concepts. Leveraging the pre-trained checkpoint (ViT-B/32) released by OpenAI, we train FashionCLIP on a large, high-quality novel fashion dataset to study whether domain-specific fine-tuning of CLIP-like models is sufficient to produce product representations that are zero-shot transferable to entirely new datasets and tasks. FashionCLIP was not developed for model deployment; to deploy it, researchers will first need to carefully study its capabilities in relation to the specific context it is being deployed within.

Model Date

March 2023

Model Type

The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained, starting from a pre-trained checkpoint, to maximize the similarity of (image, text) pairs via a contrastive loss on a fashion dataset containing 800K products.
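
The contrastive objective described above can be sketched as a symmetric cross-entropy over the image-text similarity matrix, in the style of CLIP. This is a minimal pure-Python illustration, not the authors' training code; the function names, the 2-dimensional embeddings, and the `temperature` value are assumptions.

```python
import math

def mean_diagonal_nll(logits):
    """Mean negative log-probability of the diagonal entry in each row."""
    loss = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        # Numerically stable log-sum-exp of the row.
        log_sum = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        loss += log_sum - row[i]
    return loss / len(logits)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over image-to-text and text-to-image similarities.

    Assumes embeddings are already L2-normalized and that row i of each list
    forms a matching (image, text) pair.
    """
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (mean_diagonal_nll(logits) + mean_diagonal_nll(logits_t))

# Hypothetical unit-length embeddings (assumption: made-up values).
images = [[1.0, 0.0], [0.0, 1.0]]
texts_matched = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = clip_contrastive_loss(images, texts_matched)
swapped_loss = clip_contrastive_loss(images, texts_swapped)
print(aligned_loss, swapped_loss)
```

The loss is close to zero when each image is most similar to its own caption and grows when captions are mismatched, which is the signal that pulls matching pairs together during fine-tuning.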

Data

The model was trained on (image, text) pairs obtained from the Farfetch dataset (awaiting official release), an English dataset comprising over 800K fashion products, with more than 3K brands across dozens of object types. The image used for encoding is the standard product image, which is a picture of the item over a white background, with no humans. The text used is a concatenation of the highlight (e.g., “stripes”, “long sleeves”, “Armani”) and short description (“80s styled t-shirt”) available in the Farfetch dataset.

Limitations, Bias and Fairness

We acknowledge certain limitations of FashionCLIP and expect that it inherits certain limitations and biases present in the original CLIP model. We do not expect our fine-tuning to significantly augment these limitations: we acknowledge that the fashion data we use makes explicit assumptions about the notion of gender as in "blue shoes for a woman" that inevitably associate aspects of clothing with specific people.

Our investigations also suggest that the data used introduces certain limitations in FashionCLIP. On the textual side, given that most captions derived from the Farfetch dataset are long, we observe that FashionCLIP may perform better on longer queries than on shorter ones. On the image side, FashionCLIP is also biased towards standard product images (centered, white background).

Model selection, i.e. selecting an appropriate stopping criterion during fine-tuning, remains an open challenge. We observed that using loss on an in-domain (i.e. same distribution as test) validation dataset is a poor selection criterion when out-of-domain generalization (i.e. across different datasets) is desired, even when the dataset used is relatively diverse and large.

🔧 Technical Details

The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. It is trained on an 800K-product fashion dataset using contrastive loss to maximize the similarity of (image, text) pairs. The choice of the laion/CLIP-ViT-B-32-laion2B-s34B-b79K checkpoint improved the model's performance, likely due to the increased training data.

📄 License

The model is released under the MIT license.

📄 Citation

@Article{Chia2022,
    title="Contrastive language and vision learning of general fashion concepts",
    author="Chia, Patrick John
            and Attanasio, Giuseppe
            and Bianchi, Federico
            and Terragni, Silvia
            and Magalh{\~a}es, Ana Rita
            and Goncalves, Diogo
            and Greco, Ciro
            and Tagliabue, Jacopo",
    journal="Scientific Reports",
    year="2022",
    month="Nov",
    day="08",
    volume="12",
    number="1",
    abstract="The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from general and transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model adapted for the fashion industry. We demonstrate the effectiveness of the representations learned by FashionCLIP with extensive tests across a variety of tasks, datasets and generalization probes. We argue that adaptations of large pre-trained models such as CLIP offer new perspectives in terms of scalability and sustainability for certain types of players in the industry. Finally, we detail the costs and environmental impact of training, and release the model weights and code as open source contribution to the community.",
    issn="2045-2322",
    doi="10.1038/s41598-022-23052-9",
    url="https://doi.org/10.1038/s41598-022-23052-9"
}
