đ nomic-embed-vision-v1.5: Expanding the Latent Space
nomic-embed-vision-v1.5
is a high - performing vision embedding model. It shares the same embedding space as nomic-embed-text-v1.5. All Nomic Embed Text models are now multimodal!
đ Quick Start
Blog | Technical Report | AWS SageMaker | Atlas Embedding and Unstructured Data Analytics Platform
⨠Features
Name |
Imagenet 0 - shot |
Datacomp (Avg. 38) |
MTEB |
nomic-embed-vision-v1.5 |
71.0 |
56.8 |
62.28 |
nomic-embed-vision-v1 |
70.7 |
56.7 |
62.39 |
OpenAI CLIP ViT B/16 |
68.3 |
56.3 |
43.82 |
Jina CLIP v1 |
59.1 |
52.2 |
60.1 |
đĻ Installation
No specific installation steps are provided in the original README, so this section is skipped.
đģ Usage Examples
Basic Usage
The easiest way to get started with Nomic Embed is through the Nomic Embedding API. Generating embeddings with the nomic
Python client is as easy as:
from nomic import embed
import numpy as np
output = embed.image(
images=[
"image_path_1.jpeg",
"image_path_2.png",
],
model='nomic-embed-vision-v1.5',
)
print(output['usage'])
embeddings = np.array(output['embeddings'])
print(embeddings.shape)
For more information, see the API reference
Advanced Usage
Transformers
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
from PIL import Image
import requests
processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
vision_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(image, return_tensors="pt")
img_emb = vision_model(**inputs).last_hidden_state
img_embeddings = F.normalize(img_emb[:, 0], p=2, dim=1)
Multimodal Retrieval
def mean_pooling(model_output, attention_mask):
token_embeddings = model_output[0]
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
sentences = ['search_query: What are cute animals to cuddle with?', 'search_query: What do cats look like?']
tokenizer = AutoTokenizer.from_pretrained('nomic-ai/nomic-embed-text-v1.5')
text_model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)
text_model.eval()
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
model_output = text_model(**encoded_input)
text_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
text_embeddings = F.layer_norm(text_embeddings, normalized_shape=(text_embeddings.shape[1],))
text_embeddings = F.normalize(text_embeddings, p=2, dim=1)
print(torch.matmul(img_embeddings, text_embeddings.T))
đ Documentation
Remember nomic-embed-text
requires prefixes and so, when using Nomic Embed in multimodal RAG scenarios (e.g. text to image retrieval), you should use the search_query:
prefix.
đ§ Technical Details
We align our vision embedder to the text embedding by employing a technique similar to LiT but instead lock the text embedder!
For more details, see the Nomic Embed Vision Technical Report (soon to be released!) and corresponding blog post
Training code is released in the contrastors
repository
đ License
This project is licensed under the apache-2.0
license.
đ Data Visualization
Click the Nomic Atlas map below to visualize a 100,000 sample CC3M comparing the Vision and Text Embedding Space!

đĨ Join the Nomic Community