# 🚀 Marqo-FashionSigLIP Model Card
Marqo-FashionSigLIP leverages Generalised Contrastive Learning (GCL), which allows the model to be trained not only on text descriptions but also on categories, style, colors, materials, keywords, and fine details, so it delivers highly relevant search results for fashion products. The model was fine-tuned from ViT-B-16-SigLIP (webli).
GitHub page: Marqo-FashionCLIP
Blog: Marqo Blog
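Because the model is trained against structured attributes (category, style, material, and so on) in addition to free-text descriptions, one natural inference-time pattern is to score an image against several attribute strings and blend the similarities. The sketch below is purely illustrative: the field names, example strings, and weights are assumptions made for demonstration, not Marqo's actual GCL training procedure.

```python
import torch
import open_clip

# Load the model; the preprocessing transforms are unused in this sketch.
model, _, _ = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionSigLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionSigLIP')

# Hypothetical attribute fields and weights, invented for illustration only.
fields = {
    "description": ("red floral summer dress", 0.5),
    "category":    ("dress",                   0.3),
    "material":    ("cotton",                  0.2),
}

def multi_field_score(image_features: torch.Tensor) -> torch.Tensor:
    """Blend per-field cosine similarities into one relevance score.

    Expects L2-normalised image features, e.g. from model.encode_image.
    """
    scores = torch.zeros(image_features.shape[0])
    with torch.no_grad():
        for text, weight in fields.values():
            t = model.encode_text(tokenizer([text]))
            t = t / t.norm(dim=-1, keepdim=True)
            scores += weight * (image_features @ t.T).squeeze(-1)
    return scores
```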
## 🚀 Quick Start
The sections below cover installation and basic usage with OpenCLIP.

## ✨ Key Features
- Tags: clip, e-commerce, fashion, multimodal retrieval, siglip
- Supported library: OpenCLIP
- Task: zero-shot image classification
- License: Apache-2.0
- Training datasets: Marqo/atlas, Marqo/deepfashion-inshop, Marqo/deepfashion-multimodal, Marqo/fashion200k, Marqo/iMaterialist, Marqo/KAGL, Marqo/polyvore
- Evaluation metrics: precision, recall, MRR
## 📦 Installation
The model can be used seamlessly through OpenCLIP. Install the library from PyPI with `pip install open_clip_torch`; a complete usage example follows in the next section.
## 💻 Usage Example
### Basic usage
```python
import torch
import open_clip
from PIL import Image

# Load the fine-tuned model and its preprocessing transforms from the Hugging Face Hub
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionSigLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionSigLIP')

# Preprocess one example image and tokenize the candidate labels
image = preprocess_val(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalise so the dot product below is a cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scaled cosine similarities -> probability distribution over the labels
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
## 📚 Documentation
### Benchmark Results
The tables below report the model's average evaluation results on 6 public multimodal fashion datasets (Atlas, [DeepFashion (In-shop)](https://huggingface.co/datasets/Marqo/deepfashion-inshop), [DeepFashion (Multimodal)](https://huggingface.co/datasets/Marqo/deepfashion-multimodal), Fashion200k, KAGL, and Polyvore):
**Text-to-Image (averaged over 6 datasets)**

| Model | Avg. Recall | Recall@1 | Recall@10 | MRR |
| --- | --- | --- | --- | --- |
| Marqo-FashionSigLIP | 0.231 | 0.121 | 0.340 | 0.239 |
| FashionCLIP2.0 | 0.163 | 0.077 | 0.249 | 0.165 |
| OpenFashionCLIP | 0.132 | 0.060 | 0.204 | 0.135 |
| ViT-B-16-laion2b_s34b_b88k | 0.174 | 0.088 | 0.261 | 0.180 |
| ViT-B-16-SigLIP-webli | 0.212 | 0.111 | 0.314 | 0.214 |
**Category-to-Product (averaged over 5 datasets)**

| Model | Avg. Precision | Precision@1 | Precision@10 | MRR |
| --- | --- | --- | --- | --- |
| Marqo-FashionSigLIP | 0.737 | 0.758 | 0.716 | 0.812 |
| FashionCLIP2.0 | 0.684 | 0.681 | 0.686 | 0.741 |
| OpenFashionCLIP | 0.646 | 0.653 | 0.639 | 0.720 |
| ViT-B-16-laion2b_s34b_b88k | 0.662 | 0.673 | 0.652 | 0.743 |
| ViT-B-16-SigLIP-webli | 0.688 | 0.690 | 0.685 | 0.751 |
**Sub-Category-to-Product (averaged over 4 datasets)**

| Model | Avg. Precision | Precision@1 | Precision@10 | MRR |
| --- | --- | --- | --- | --- |
| Marqo-FashionSigLIP | 0.725 | 0.767 | 0.683 | 0.811 |
| FashionCLIP2.0 | 0.657 | 0.676 | 0.638 | 0.733 |
| OpenFashionCLIP | 0.598 | 0.619 | 0.578 | 0.689 |
| ViT-B-16-laion2b_s34b_b88k | 0.638 | 0.651 | 0.624 | 0.712 |
| ViT-B-16-SigLIP-webli | 0.643 | 0.643 | 0.643 | 0.726 |
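For reference, the retrieval metrics reported above can be computed as in the generic sketch below (Recall@K, Precision@K, and MRR over a ranked result list); this is the standard formulation of these metrics, not the exact evaluation harness behind the numbers above.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results (assumes relevant is non-empty)."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return len(set(ranked[:k]) & relevant) / k

def mrr(ranked: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result, 0.0 if none is retrieved."""
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

# Tiny usage example with one query's ranked results.
ranked = ["item-7", "item-2", "item-9"]
relevant = {"item-2"}
print(recall_at_k(ranked, relevant, 10), precision_at_k(ranked, relevant, 1), mrr(ranked, relevant))
```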
## 📄 License
This project is licensed under the Apache-2.0 license.