Open-source nomic-embed-vision-v1.5 Visual Embedding Model - The High-performance Choice for Supporting Multimodal Applications

Nomic Embed Vision V1.5

Developed by nomic-ai

High-performance visual embedding model, sharing the same embedding space with nomic-embed-text-v1.5, supporting multimodal applications

Text-to-Image

Transformers

English | Open Source License: Apache-2.0 | #Multimodal Embedding #Zero-shot Learning #Cross-modal Retrieval

Downloads 27.85k

Release Time: 6/1/2024

Model Overview

nomic-embed-vision-v1.5 is a high-performance visual embedding model that converts images into embedding vectors aligned with the text embedding space of nomic-embed-text-v1.5, so images and text can be retrieved and analyzed together.

Model Features

Multimodal Support

Shares the same embedding space with nomic-embed-text-v1.5, enabling joint retrieval of text and images

High Performance

Scores higher than OpenAI CLIP ViT B/16 and Jina CLIP v1 on the ImageNet zero-shot and Datacomp benchmarks

Easy Integration

Provides a simple API and Transformers integration for rapid deployment

Model Capabilities

Image feature extraction

Multimodal retrieval

Text-to-image search

Image similarity calculation

Use Cases

Information Retrieval

Multimodal RAG

Retrieve relevant images using text queries

Achieves efficient cross-modal retrieval

Data Analysis

Data Visualization

Project image and text embeddings into the same space for visual analysis

A Nomic Atlas map of a 100K-sample subset of CC3M compares the vision and text embedding spaces
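The text-to-image retrieval use case listed above comes down to ranking image embeddings by cosine similarity to a query embedding. The following is a minimal pure-Python sketch under stated assumptions: the function names `l2_normalize` and `top_k_images`, the file names, and the 3-dimensional toy vectors are all hypothetical stand-ins, and real embeddings would come from the vision and text models. In practice you would likely use NumPy, PyTorch, or a vector database instead of plain lists.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def top_k_images(query_emb, image_embs, k=2):
    """Return (image_id, cosine_similarity) pairs for the k best matches."""
    q = l2_normalize(query_emb)
    scored = []
    for image_id, emb in image_embs.items():
        e = l2_normalize(emb)
        scored.append((image_id, sum(a * b for a, b in zip(q, e))))
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional vectors standing in for real image embeddings (hypothetical data).
image_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.7, 0.7, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}
# Stand-in for a text query embedding (hypothetical data).
query = [1.0, 0.2, 0.0]

results = top_k_images(query, image_index, k=2)
print(results)
```

Because both sides are normalized first, the dot product equals the cosine similarity, which is why the same ranking works for text queries and image queries alike.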

🚀 nomic-embed-vision-v1.5: Expanding the Latent Space

nomic-embed-vision-v1.5 is a high-performing vision embedding model that shares the same embedding space as nomic-embed-text-v1.5. All Nomic Embed Text models are now multimodal!

🚀 Quick Start

Blog | Technical Report | AWS SageMaker | Atlas Embedding and Unstructured Data Analytics Platform

✨ Features

Name	ImageNet 0-shot	Datacomp (Avg. 38)	MTEB
`nomic-embed-vision-v1.5`	71.0	56.8	62.28
`nomic-embed-vision-v1`	70.7	56.7	62.39
OpenAI CLIP ViT B/16	68.3	56.3	43.82
Jina CLIP v1	59.1	52.2	60.1

💻 Usage Examples

Basic Usage

The easiest way to get started with Nomic Embed is through the Nomic Embedding API. Generating embeddings with the nomic Python client is as easy as:

from nomic import embed
import numpy as np

output = embed.image(
    images=[
        "image_path_1.jpeg",
        "image_path_2.png",
    ],
    model='nomic-embed-vision-v1.5',
)

print(output['usage'])
embeddings = np.array(output['embeddings'])
print(embeddings.shape)

For more information, see the API reference

Advanced Usage

Transformers

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
from PIL import Image
import requests

processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
vision_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)
vision_model.eval()  # inference mode, matching how the text model is used below

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(image, return_tensors="pt")

# Gradients are not needed for inference
with torch.no_grad():
    img_emb = vision_model(**inputs).last_hidden_state

# Take the first token's vector for each image and L2-normalize it
img_embeddings = F.normalize(img_emb[:, 0], p=2, dim=1)
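To make the last two steps of the snippet above concrete, here is a small pure-Python sketch of what selecting position 0 of `last_hidden_state` and applying `F.normalize(..., p=2, dim=1)` amount to. The helper names `first_token` and `l2_normalize_rows` and the toy tensor are hypothetical. Treating position 0 as a class-token-style summary vector is an assumption based on common ViT conventions, not something the README states.

```python
import math

def first_token(hidden_states):
    """Pick the vector at sequence position 0 for each item in the batch."""
    return [sequence[0] for sequence in hidden_states]

def l2_normalize_rows(rows, eps=1e-12):
    """Divide each row by its Euclidean norm (clamped below by eps)."""
    out = []
    for row in rows:
        norm = max(math.sqrt(sum(x * x for x in row)), eps)
        out.append([x / norm for x in row])
    return out

# Toy batch: 1 image, 3 token positions, hidden size 2 (hypothetical values).
hidden = [[[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]]

pooled = first_token(hidden)       # keeps only [3.0, 4.0]
unit = l2_normalize_rows(pooled)   # row now has Euclidean norm 1
print(unit)
```

Once every embedding has unit length, a plain dot product between an image vector and a text vector is their cosine similarity.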

Multimodal Retrieval

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['search_query: What are cute animals to cuddle with?', 'search_query: What do cats look like?']

tokenizer = AutoTokenizer.from_pretrained('nomic-ai/nomic-embed-text-v1.5')
text_model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)
text_model.eval()

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = text_model(**encoded_input)

text_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
text_embeddings = F.layer_norm(text_embeddings, normalized_shape=(text_embeddings.shape[1],))
text_embeddings = F.normalize(text_embeddings, p=2, dim=1)

print(torch.matmul(img_embeddings, text_embeddings.T))
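The text-side post-processing above (masked mean pooling, then layer normalization, then L2 normalization) can be illustrated on plain Python lists. This is only a sketch for one sentence with toy values. The helper names are hypothetical, and `layer_norm` here mirrors the call above in using no learned weight or bias.

```python
import math

def masked_mean(token_embeddings, attention_mask):
    """Average token vectors where the mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    # Same guard against division by zero as the clamp in mean_pooling
    return [t / max(count, 1e-9) for t in total]

def layer_norm(vec, eps=1e-5):
    """Shift to zero mean and scale to unit variance (no learned weight or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def l2_normalize(vec):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# One sentence with 3 tokens, the last of which is padding (hypothetical values).
tokens = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)   # the padding row does not affect the mean
text_embedding = l2_normalize(layer_norm(pooled))
print(pooled, text_embedding)
```

The padding row is deliberately extreme to show that masked positions do not leak into the pooled vector.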

📚 Documentation

Remember that nomic-embed-text requires prefixes. When using Nomic Embed in multimodal RAG scenarios (e.g. text-to-image retrieval), prepend the search_query: prefix to each text query.
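A small helper can make the prefix hard to forget. The sketch below is hypothetical (the names `SEARCH_QUERY_PREFIX` and `with_search_query_prefix` are not part of any Nomic library). It adds the prefix only when a query does not already start with it.

```python
SEARCH_QUERY_PREFIX = "search_query: "

def with_search_query_prefix(queries):
    """Prepend the search_query: prefix unless a query already carries it."""
    return [
        q if q.startswith(SEARCH_QUERY_PREFIX) else SEARCH_QUERY_PREFIX + q
        for q in queries
    ]

queries = with_search_query_prefix([
    "What are cute animals to cuddle with?",
    "search_query: What do cats look like?",
])
print(queries)
```

The resulting list can be passed to the tokenizer in place of the hand-written `sentences` list.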

🔧 Technical Details

We align our vision embedder to the text embedder using a technique similar to LiT (Locked-image Tuning), except that we lock the text embedder instead of the image encoder and train the vision embedder against it!
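As a rough illustration of the idea, the sketch below computes a CLIP-style symmetric contrastive loss in pure Python, with the text embeddings treated as fixed targets. This is an assumption for illustration only: the README does not state the exact loss, the temperature of 0.07 is a placeholder, and the function name `contrastive_loss` is hypothetical. In a real locked-text setup only the vision model's parameters would receive gradient updates.

```python
import math

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs.

    Row i of image_embs is assumed to match row i of text_embs, and both are
    assumed to be L2-normalized already.
    """
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], with the matching pair on the diagonal.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))

# Text embeddings play the role of the locked tower: fixed targets (toy values).
text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # vision outputs that match their captions
swapped = [[0.0, 1.0], [1.0, 0.0]]   # vision outputs that match the wrong caption

loss_aligned = contrastive_loss(aligned, text_embs)
loss_swapped = contrastive_loss(swapped, text_embs)
print(loss_aligned, loss_swapped)
```

The loss is near zero when each image embedding sits on its own caption's embedding and large when the pairs are swapped, which is the pressure that pulls the vision embedder into the frozen text space.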

For more details, see the Nomic Embed Vision Technical Report (soon to be released!) and the corresponding blog post.

Training code is released in the contrastors repository.

📄 License

This project is licensed under the Apache-2.0 license.

🌐 Data Visualization

Click the Nomic Atlas map below to explore a 100,000-sample subset of CC3M that compares the vision and text embedding spaces!

👥 Join the Nomic Community
