
CLIP ViT Base Patch16

Developed by Xenova
OpenAI's open-source CLIP model, based on the Vision Transformer architecture, supporting cross-modal understanding of images and text
Downloads 32.99k
Release Time: 5/19/2023

Model Overview

A multimodal model based on the Vision Transformer architecture that understands both image and text content, enabling tasks such as zero-shot image classification and cross-modal retrieval
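The core mechanism behind cross-modal retrieval is comparing image and text embeddings in a shared space. A minimal sketch with toy vectors (the embeddings and labels below are illustrative, not real CLIP outputs, which are 512-dimensional for this model):

```python
import math

def cosine_similarity(a, b):
    # CLIP scores an image-text pair by the cosine of the angle
    # between the image embedding and the text embedding.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings standing in for real CLIP outputs.
image_emb = [0.1, 0.9, 0.2, 0.4]
text_emb_cat = [0.1, 0.8, 0.3, 0.5]   # hypothetical "a photo of a cat"
text_emb_car = [0.9, 0.1, 0.7, 0.0]   # hypothetical "a photo of a car"

print(cosine_similarity(image_emb, text_emb_cat))  # higher: text matches the image
print(cosine_similarity(image_emb, text_emb_car))  # lower: text does not match
```

Retrieval then reduces to ranking candidate texts (or images) by this similarity score.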

Model Features

Zero-shot Learning Capability
Performs image classification directly, without task-specific training
Cross-modal Understanding
Processes both visual and textual information and computes image-text similarity
Efficient Visual Encoding
Uses a 16x16 patch-based Vision Transformer to encode image input

Model Capabilities

Zero-shot image classification
Image-text matching
Cross-modal embedding computation
Visual content understanding
Text content understanding
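Zero-shot classification combines the capabilities above: the image-text similarities for each candidate label are scaled and passed through a softmax to yield class probabilities. A sketch under the assumption of CLIP's usual logit scaling (a learned temperature, around 100 in the released model); the similarity values are made up:

```python
import math

def zero_shot_probs(similarities, logit_scale=100.0):
    # Scale cosine similarities by the learned temperature, then
    # softmax over the candidate labels to get probabilities.
    logits = [logit_scale * s for s in similarities]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy similarities for labels ["tiger", "cat", "car"]
probs = zero_shot_probs([0.31, 0.24, 0.05])
print(probs)  # the first ("tiger") label dominates
```

Because the temperature sharpens the distribution, even modest similarity gaps produce confident predictions.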

Use Cases

Content Retrieval
Image-Text Matching Search
Search for relevant images based on text descriptions
Intelligent Classification
Dynamic Image Classification
Classify images into custom categories without any additional training
Example shows 99.9% confidence when classifying a tiger image
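Dynamic classification works by supplying the candidate category names at inference time. A common practice (an assumption here, not something this page documents) is to wrap each label in a caption-like template, since CLIP was trained on natural-language captions rather than bare label words:

```python
def build_prompts(labels, template="a photo of a {}"):
    # Turn user-supplied category names into caption-style prompts;
    # these are then embedded by CLIP's text encoder and compared
    # against the image embedding.
    return [template.format(label) for label in labels]

prompts = build_prompts(["tiger", "cat", "car"])
print(prompts)  # ['a photo of a tiger', 'a photo of a cat', 'a photo of a car']
```

Changing the label list changes the classifier, with no retraining involved.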