Open-Source MobileCLIP S2 Lightweight Model - Efficient Image Feature Extraction and Zero-Shot Classification

MobileCLIP S2

Developed by Xenova

MobileCLIP S2 is a lightweight vision-language model focused on image feature extraction and zero-shot image classification.

Zero-shot image classification

Transformers

License: Other #zero-shot image classification #mobile optimization #multi-label recognition

Downloads: 86

Released: 4/24/2024

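Model overview

MobileCLIP S2 is an efficient vision-language model that supports image feature extraction and zero-shot image classification. It is based on the CLIP architecture but optimized for deployment on mobile devices.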

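Model features

Lightweight design

Optimized for mobile devices, with a small model size and efficient computation.

Zero-shot classification

Classifies images into new categories without task-specific training; the candidate categories are supplied as text labels at inference time.

ONNX compatibility

Provides weights in ONNX format for easier deployment across platforms.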
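
To make the zero-shot classification feature above more concrete, here is a minimal sketch of CLIP-style scoring: normalize the embeddings, take scaled dot products between the image and each label, and apply a softmax. The helper names (`l2Normalize`, `dotProduct`, `softmaxOf`, `zeroShotScores`) and the 3-dimensional vectors are assumptions made up for illustration; they are not MobileCLIP outputs or part of any library.

```javascript
// Toy illustration of zero-shot scoring; all vectors are invented values.
function l2Normalize(v) {
  const norm = Math.hypot(...v);
  return v.map(x => x / norm);
}

function dotProduct(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function softmaxOf(scores) {
  const max = Math.max(...scores); // subtract the max for numerical stability
  const exps = scores.map(s => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / total);
}

// Returns one probability per label for a single image embedding.
function zeroShotScores(imageEmbedding, labelEmbeddings, scale = 100) {
  const img = l2Normalize(imageEmbedding);
  const logits = labelEmbeddings.map(t => scale * dotProduct(img, l2Normalize(t)));
  return softmaxOf(logits);
}

const toyLabels = ['cats', 'dogs', 'birds'];
const toyLabelEmbeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]];
const toyImageEmbedding = [0.8, 0.2, 0.1];

const toyScores = zeroShotScores(toyImageEmbedding, toyLabelEmbeddings);
const bestLabel = toyLabels[toyScores.indexOf(Math.max(...toyScores))];
console.log(bestLabel, toyScores);
```

Because the label set is just a list of strings turned into embeddings, changing the categories only means changing the text inputs; no retraining is involved.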

Model capabilities

Image feature extraction

Zero-shot image classification

Cross-modal retrieval

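Use cases

Image classification

Animal recognition

Identifies the animal category in an image (for example cats, dogs, or birds)

High-accuracy zero-shot classification (the usage example on this page assigns a probability above 0.9999 to "cats" for a photo of cats)

Content retrieval

Text-based image search

Retrieves relevant images from a text description

Efficient cross-modal retrieval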
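
The text-based image search use case can be sketched as a ranking problem: embed the query text, then sort precomputed image embeddings by cosine similarity. The snippet below is only an illustration under stated assumptions: `cosineSimilarity`, `rankImages`, the file names, and the 3-dimensional vectors are all hypothetical, and real embeddings would come from the text and vision models.

```javascript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank indexed images against one query embedding, best match first.
function rankImages(queryEmbedding, imageIndex, topK = 3) {
  return imageIndex
    .map(({ id, embedding }) => ({ id, score: cosineSimilarity(queryEmbedding, embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Hypothetical precomputed image embeddings (toy 3-d values).
const imageIndex = [
  { id: 'beach.jpg', embedding: [0.1, 0.9, 0.2] },
  { id: 'kitten.jpg', embedding: [0.9, 0.1, 0.1] },
  { id: 'sparrow.jpg', embedding: [0.2, 0.2, 0.9] },
];
// Stands in for the text embedding of a query such as "a photo of a cat".
const queryEmbedding = [0.8, 0.2, 0.0];

const searchResults = rankImages(queryEmbedding, imageIndex, 2);
console.log(searchResults.map(r => r.id));
```

Image embeddings only need to be computed once and can be stored, so each new query costs one text-model pass plus the similarity comparisons.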

🚀 transformers.js

This model is apple/ml-mobileclip with ONNX weights, made compatible with Transformers.js. It can be used for tasks such as zero-shot image classification and image feature extraction.

🚀 Quick start

Installation

If you have not yet installed the Transformers.js JavaScript library, you can install it from NPM with:

npm i @huggingface/transformers

💻 Usage examples

Basic usage

The following example performs zero-shot image classification with Transformers.js:

import {
  AutoTokenizer,
  CLIPTextModelWithProjection,
  AutoProcessor,
  CLIPVisionModelWithProjection,
  RawImage,
  dot,
  softmax,
} from '@huggingface/transformers';

const model_id = 'Xenova/mobileclip_s2';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const text_model = await CLIPTextModelWithProjection.from_pretrained(model_id);

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained(model_id);
const vision_model = await CLIPVisionModelWithProjection.from_pretrained(model_id);

// Run tokenization
const texts = ['cats', 'dogs', 'birds'];
const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });

// Compute text embeddings
const { text_embeds } = await text_model(text_inputs);
const normalized_text_embeds = text_embeds.normalize().tolist();

// Read image and run processor
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
const image = await RawImage.read(url);
const image_inputs = await processor(image);

// Compute vision embeddings
const { image_embeds } = await vision_model(image_inputs);
const normalized_image_embeds = image_embeds.normalize().tolist();

// Compute probabilities
const probabilities = normalized_image_embeds.map(
  x => softmax(normalized_text_embeds.map(y => 100 * dot(x, y)))
);
console.log(probabilities); // [[ 0.9999973851268408, 0.000002399646544186113, 2.1522661499262862e-7 ]]
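
The logged array holds one row per image, with one probability per entry of `texts` in the same order. As a small follow-up sketch, the rows can be paired with their labels and sorted; `rankLabels` is a hypothetical helper, and the sample values are copied from the logged output above so the snippet runs without loading the model.

```javascript
// Pair each label with its probability and sort from most to least likely.
function rankLabels(labels, probs) {
  return labels
    .map((label, i) => ({ label, score: probs[i] }))
    .sort((a, b) => b.score - a.score);
}

// Sample values copied from the logged output of the example above.
const sampleTexts = ['cats', 'dogs', 'birds'];
const sampleProbabilities = [
  [0.9999973851268408, 0.000002399646544186113, 2.1522661499262862e-7],
];

const ranked = sampleProbabilities.map(row => rankLabels(sampleTexts, row));
console.log(ranked[0][0].label, ranked[0][0].score);
```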