clip-vit-base-patch16 open-source model: cross-modal image and text understanding, free to use

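Clip Vit Base Patch16

Developed by Xenova

The CLIP model open-sourced by OpenAI. It is based on the Vision Transformer architecture and supports cross-modal understanding of images and text.

Text-to-Image

Transformers

#Zero-shot image classification #Multimodal embeddings #Cross-modal retrieval

Downloads: 32.99k

Release date: 5/19/2023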

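Model overview

A multimodal model based on the Vision Transformer architecture. It understands image and text content jointly, which enables tasks such as zero-shot image classification and cross-modal retrieval.

Model features

Zero-shot learning

Performs image classification directly, without task-specific training.

Cross-modal understanding

Processes visual and textual information together and computes image-text similarity.

Efficient visual encoding

Processes image input with a Vision Transformer that splits each image into 16x16-pixel patches.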
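
To make the 16x16 patching concrete, the following minimal sketch counts the tokens the vision encoder would see. The 224x224 input resolution and the single extra class token are assumptions about this checkpoint, and `countPatchTokens` is a hypothetical helper, not part of any library.

```javascript
// Sketch: number of patch tokens produced by a ViT with square patches.
// Assumption: the image is resized to imageSize x imageSize before patching.
function countPatchTokens(imageSize, patchSize) {
  if (imageSize % patchSize !== 0) {
    throw new Error('image size must be divisible by patch size');
  }
  const patchesPerSide = imageSize / patchSize;
  return patchesPerSide * patchesPerSide;
}

const patchTokens = countPatchTokens(224, 16); // 14 patches per side
const sequenceLength = patchTokens + 1; // assumed: one extra class token
console.log(patchTokens, sequenceLength); // prints: 196 197
```

A smaller patch size yields more tokens per image, which is why patch size is part of the model name.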

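Model capabilities

Zero-shot image classification

Image-text matching

Cross-modal embedding computation

Visual content understanding

Text content understanding

Use cases

Content retrieval

Image-text matching search

Retrieves relevant images from a text description.

Intelligent classification

Dynamic image classification

Classifies images into custom categories without additional task-specific training.

In the usage example, the tiger image receives a score of about 0.999 for the label 'tiger'.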

🚀 CLIP-ViT-Base-Patch16 with Transformers.js

This project provides openai/clip-vit-base-patch16 converted to ONNX format so that it is compatible with Transformers.js.

🚀 Quick start

To use Transformers.js, first install the Transformers.js JavaScript library from NPM:

npm i @xenova/transformers

💻 Usage examples

Basic usage

Zero-shot image classification with the pipeline API:

import { pipeline } from '@xenova/transformers';

// Create a zero-shot image classification pipeline backed by the ONNX weights
const classifier = await pipeline('zero-shot-image-classification', 'Xenova/clip-vit-base-patch16');

// Score the image against each candidate label
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/tiger.jpg';
const output = await classifier(url, ['tiger', 'horse', 'dog']);
// [
//   { score: 0.9993917942047119, label: 'tiger' },
//   { score: 0.0003519294841680676, label: 'horse' },
//   { score: 0.0002562698791734874, label: 'dog' }
// ]
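
CLIP compares an image against sentences rather than bare class names, so the candidate labels are typically wrapped in a prompt before tokenization. The sketch below illustrates that step only. `buildPrompts` is a hypothetical helper, and the template string is an assumption about the pipeline's default, so check the library documentation before relying on it.

```javascript
// Sketch: turn candidate labels into text prompts.
// Assumption: 'This is a photo of {}' mirrors the pipeline's default template.
function buildPrompts(labels, template = 'This is a photo of {}') {
  // A replacer function avoids special handling of '$' inside labels.
  return labels.map((label) => template.replace('{}', () => label));
}

const prompts = buildPrompts(['tiger', 'horse', 'dog']);
console.log(prompts);
```

If the default wording does not suit your categories, the pipeline call is believed to accept a `hypothesis_template` option as its third argument, which is worth trying when scores look poorly separated.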

Advanced usage

Zero-shot image classification with `CLIPModel`

import { AutoTokenizer, AutoProcessor, CLIPModel, RawImage } from '@xenova/transformers';

// Load tokenizer, processor, and model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clip-vit-base-patch16');
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch16');
const model = await CLIPModel.from_pretrained('Xenova/clip-vit-base-patch16');

// Run tokenization
const texts = ['a photo of a car', 'a photo of a football match'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Read image and run processor
const image = await RawImage.read('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/football-match.jpg');
const image_inputs = await processor(image);

// Run model with both text and pixel inputs
const output = await model({ ...text_inputs, ...image_inputs });
// {
//   logits_per_image: Tensor {
//     dims: [ 1, 2 ],
//     data: Float32Array(2) [ 18.579734802246094, 24.31830596923828 ],
//   },
//   logits_per_text: Tensor {
//     dims: [ 2, 1 ],
//     data: Float32Array(2) [ 18.579734802246094, 24.31830596923828 ],
//   },
//   text_embeds: Tensor {
//     dims: [ 2, 512 ],
//     data: Float32Array(1024) [ ... ],
//   },
//   image_embeds: Tensor {
//     dims: [ 1, 512 ],
//     data: Float32Array(512) [ ... ],
//   }
// }
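
The values in `logits_per_image` are raw similarity logits, not probabilities. A minimal sketch of converting them with a softmax follows. The `softmax` helper is written here for illustration and is not imported from the library; the input values are copied from the output shown above.

```javascript
// Sketch: numerically stable softmax over a plain array of logits.
function softmax(logits) {
  const max = Math.max(...logits); // subtract the max to avoid overflow
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Logits for ['a photo of a car', 'a photo of a football match']
const logitsPerImage = [18.579734802246094, 24.31830596923828];
const probs = softmax(logitsPerImage);
console.log(probs); // the second entry (football match) dominates
```

With a real model output, something like `Array.from(output.logits_per_image.data)` should give a plain array suitable for this helper.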

Computing text embeddings with `CLIPTextModelWithProjection`

import { AutoTokenizer, CLIPTextModelWithProjection } from '@xenova/transformers';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clip-vit-base-patch16');
const text_model = await CLIPTextModelWithProjection.from_pretrained('Xenova/clip-vit-base-patch16');

// Run tokenization
const texts = ['a photo of a car', 'a photo of a football match'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Compute embeddings
const { text_embeds } = await text_model(text_inputs);
// Tensor {
//   dims: [ 2, 512 ],
//   type: 'float32',
//   data: Float32Array(1024) [ ... ],
//   size: 1024
// }

Computing vision embeddings with `CLIPVisionModelWithProjection`

import { AutoProcessor, CLIPVisionModelWithProjection, RawImage } from '@xenova/transformers';

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch16');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('Xenova/clip-vit-base-patch16');

// Read image and run processor
const image = await RawImage.read('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/football-match.jpg');
const image_inputs = await processor(image);

// Compute embeddings
const { image_embeds } = await vision_model(image_inputs);
// Tensor {
//   dims: [ 1, 512 ],
//   type: 'float32',
//   data: Float32Array(512) [ ... ],
//   size: 512
// }
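
Because text and image embeddings live in the same 512-dimensional space, they can be compared directly, which is the basis of cross-modal retrieval. The sketch below ranks two captions against one image by cosine similarity. The 4-dimensional vectors are made-up stand-ins for real embeddings, and `cosineSimilarity` is a hypothetical helper rather than a library function.

```javascript
// Sketch: cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy values only: real embeddings would come from image_embeds.data and text_embeds.data.
const imageEmbedding = [0.2, 0.9, 0.1, 0.4];
const textEmbeddings = {
  'a photo of a car': [0.9, 0.1, 0.3, 0.0],
  'a photo of a football match': [0.1, 0.8, 0.2, 0.5],
};

// Rank captions from most to least similar to the image.
const ranked = Object.entries(textEmbeddings)
  .map(([text, embedding]) => ({ text, score: cosineSimilarity(imageEmbedding, embedding) }))
  .sort((x, y) => y.score - x.score);
console.log(ranked);
```

The same ranking works in the other direction: embed one text query, then sort a collection of precomputed image embeddings by similarity to it.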

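📚 Documentation

⚠️ Important note

Keeping a separate repository for ONNX weights is a temporary solution until WebML gains wider adoption. To make a model web-ready, we recommend converting it to ONNX with 🤗 Optimum and structuring the repository like this one, with the ONNX weights placed in a subfolder named onnx.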