Nomic Embed Code GGUF

Developed by nomic-ai

The Nomic code embedding model is a top-tier code retrieval tool that supports multiple programming languages and excels in code retrieval tasks.

Text Embedding | Open Source | License: Apache-2.0 | #Code Semantic Retrieval #Multilingual Code Embedding #High-Precision Quantization

Release Time : 4/30/2025

Model Overview

The Nomic code embedding model is a high-performance code retrieval tool that supports multiple programming languages, including Python, Java, Ruby, PHP, JavaScript, and Go. Optimized through quantization technology, it is suitable for code retrieval and feature extraction tasks.

Model Features

High-Performance Code Retrieval

Outperforms Voyage Code 3 and OpenAI Embed 3 Large on CodeSearchNet.

Multilingual Support

Supports multiple programming languages, including Python, Java, Ruby, PHP, JavaScript, and Go.

Advanced Architecture

Utilizes a 7B-parameter code embedding model trained with dual consistency filtering and progressive hard negative mining.

Fully Open-Source

Publicly available model weights, training data, and evaluation code for easy research and application.

Model Capabilities

Code Retrieval

Sentence Similarity Calculation

Feature Extraction

Use Cases

Code Retrieval

Code Retrieval in RAG Applications

In Retrieval-Augmented Generation (RAG) applications, this model retrieves code snippets relevant to user queries.

Accurately retrieves code snippets related to queries, such as functions for calculating factorials.

Code Similarity Analysis

Code Similarity Comparison

Compares similarity between different code snippets for clone detection or code recommendation.

Accurately calculates similarity between code snippets, distinguishing functionally different code.

🚀 Llama.cpp Quantizations of Nomic Embed Code: A State-of-the-Art Code Retriever

This project provides llama.cpp quantizations of Nomic Embed Code, a state-of-the-art code retriever, for efficient code embedding and retrieval.

Blog | Technical Report | AWS SageMaker | Atlas Embedding and Unstructured Data Analytics Platform

Quantized with llama.cpp at commit 11683f579.

Original model: nomic-embed-code

🚀 Quick Start

This model can be used with the llama.cpp server and other software that supports llama.cpp embedding models.

Queries embedded with nomic-embed-code must begin with the following prefix:

Represent this query for searching relevant code:
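
A minimal sketch of how the prefix rule might be applied in client code. The helper names `format_query` and `format_document` are hypothetical and not part of any library; the only fact taken from this document is that queries need the prefix, while the usage example further down embeds code documents without it.

```python
# Prefix required for queries, including the trailing space used in the usage example.
QUERY_PREFIX = "Represent this query for searching relevant code: "


def format_query(query: str) -> str:
    """Prepend the required prefix to a natural-language query."""
    return QUERY_PREFIX + query.strip()


def format_document(code: str) -> str:
    """Code documents are embedded as-is, without the query prefix."""
    return code
```

Keeping the prefix in one constant makes it harder to send an unprefixed query by accident.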

✨ Features

High Performance: Outperforms Voyage Code 3 and OpenAI Embed 3 Large on CodeSearchNet.
Multilingual Code Support: Trained for multiple programming languages (Python, Java, Ruby, PHP, JavaScript, Go).
Advanced Architecture: 7B parameter code embedding model.
Fully Open-Source: Model weights, training data, and evaluation code released.

📦 Installation

The original model card gives no dedicated installation steps. The usage examples assume a working llama.cpp build and a downloaded GGUF file.

💻 Usage Examples

Basic Usage

Start a llama.cpp server:

llama-server -m nomic-embed-code.Q4_0.gguf --embeddings --pooling last

Advanced Usage

The following code shows how to use the prefix to embed user questions, e.g., in a RAG application.

import requests
from textwrap import dedent

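def dot(va, vb):
    # Dot product of two equal-length vectors, used here as the similarity score.
    return sum(a * b for a, b in zip(va, vb))


def embed(texts):
    # Request embeddings for a list of strings from the local llama.cpp server.
    resp = requests.post('http://localhost:8080/v1/embeddings', json={'input': texts})
    resp.raise_for_status()
    return [d['embedding'] for d in resp.json()['data']]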

docs = [
    dedent("""\
    def fn(n):
        if n < 0:
            raise ValueError
        return 1 if n == 0 else n * fn(n - 1)
    """).strip(),
    dedent("""\
    def fn(n):
        print(("Fizz" * (n % 3 == 0) + "Buzz" * (n % 5 == 0)) or n)
    """).strip(),
]
docs_embed = embed(docs)

query = 'Calculate the n-th factorial'
query_embed = embed(['Represent this query for searching relevant code: ' + query])[0]
print(f'query: {query!r}')
for d, e in zip(docs, docs_embed):
    print(f'\nsimilarity {dot(query_embed, e):.2f}:\n{d}')

You should see output similar to this:

query: 'Calculate the n-th factorial'

similarity 0.49:
def fn(n):
    if n < 0:
        raise ValueError
    return 1 if n == 0 else n * fn(n - 1)

similarity 0.32:
def fn(n):
    print(("Fizz" * (n % 3 == 0) + "Buzz" * (n % 5 == 0)) or n)
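
The example above scores with a raw dot product, which matches cosine similarity only when the vectors are unit length; this document does not state whether the server normalizes its output. The following sketch normalizes explicitly and ranks documents. The names `cosine` and `rank` are hypothetical helpers for illustration.

```python
import math


def cosine(va, vb):
    """Cosine similarity; equals the dot product when both vectors are unit length."""
    dot = sum(a * b for a, b in zip(va, vb))
    norm = math.sqrt(sum(a * a for a in va)) * math.sqrt(sum(b * b for b in vb))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs, top_k=None):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]  # top_k=None keeps every document
```

With the variables from the example above, `rank(query_embed, docs_embed, top_k=1)` would return the index and score of the best-matching snippet.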

📚 Documentation

Download a file (not the whole branch)

Property	Details
Filename	nomic-embed-code.f32.gguf, nomic-embed-code.f16.gguf, etc.
Quant Type	f32, f16, bf16, Q8_0, Q6_K, etc.
File Size	Ranging from 2.64GiB to 26.35GiB
Description	Various descriptions for different quant types, e.g., Full FP32 weights, Full FP16 weights, etc.
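
A sketch of fetching a single quantization instead of the whole branch, assuming the files are hosted in a Hugging Face repository and that the `huggingface_hub` package is installed. The repository ID is not given in this document, so it is left as a caller-supplied parameter; `gguf_filename` and `download_quant` are hypothetical helper names. The file name pattern follows the table above and the server command in the usage section.

```python
def gguf_filename(quant: str) -> str:
    """Build a file name following the pattern in the table, e.g. quant='Q4_0'."""
    return f"nomic-embed-code.{quant}.gguf"


def download_quant(repo_id: str, quant: str, local_dir: str = ".") -> str:
    """Download one GGUF file and return its local path.

    repo_id is an assumption supplied by the caller; it is not stated in this document.
    """
    # Imported lazily so the file-name helper works without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=repo_id,
        filename=gguf_filename(quant),
        local_dir=local_dir,
    )
```

The returned path can be passed to `llama-server -m`.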

Model Overview

nomic-embed-code is a state-of-the-art code embedding model that excels at code retrieval tasks.

Model	Python	Java	Ruby	PHP	JavaScript	Go
Nomic Embed Code	81.7	80.5	81.8	72.3	77.1	93.8
Voyage Code 3	80.8	80.5	84.6	71.7	79.2	93.2
OpenAI Embed 3 Large	70.8	72.9	75.3	59.6	68.1	87.6
Nomic CodeRankEmbed-137M	78.4	76.9	79.3	68.8	71.4	92.7
CodeSage Large v2 (1B)	74.2	72.3	76.7	65.2	72.5	84.6
CodeSage Large (1B)	70.8	70.2	71.9	61.3	69.5	83.7
Qodo Embed 1 7B	59.9	61.6	68.4	48.5	57.0	81.4

Model Architecture

Total Parameters: 7B
Training Approach: Trained on the CoRNStack dataset with dual-consistency filtering and progressive hard negative mining.
Supported Languages: Python, Java, Ruby, PHP, JavaScript, and Go.

CoRNStack Dataset Curation

Starting with the deduplicated Stackv2, we create text-code pairs from function docstrings and their respective code. We filtered out low-quality pairs where the docstring was not in English, was too short, or contained URLs, HTML tags, or invalid characters. We additionally kept docstrings with text lengths of 256 tokens or longer to help the model learn long-range dependencies.
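
The heuristic filtering described above can be sketched roughly as follows. This is not the authors' pipeline: the function name `keep_pair`, the word-count threshold, the regular expressions, and the ASCII check used as a crude stand-in for language detection are all assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HTML_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def keep_pair(docstring: str, min_words: int = 3) -> bool:
    """Return True if a docstring passes these rough quality heuristics."""
    text = docstring.strip()
    if len(text.split()) < min_words:  # too short (threshold is an assumption)
        return False
    if URL_RE.search(text) or HTML_TAG_RE.search(text):
        return False
    if not text.isascii():  # crude proxy for non-English text or invalid characters
        return False
    # Reject control characters other than ordinary whitespace.
    if any(not (ch.isprintable() or ch.isspace()) for ch in text):
        return False
    return True
```

A production filter would likely use a real language identifier rather than an ASCII check, which wrongly rejects English text containing accented names or symbols.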

After the initial filtering, we used dual-consistency filtering to remove potentially noisy examples. We embed each docstring and code pair and compute the similarity between each docstring and every code example. We remove pairs from the dataset if the corresponding code example is not found in the top-2 most similar examples for a given docstring.
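
The top-2 rule described above can be sketched as a brute-force pass over precomputed embeddings. This illustrates the description in this section and is not the authors' implementation; the function names and the use of a plain dot product as the similarity measure are assumptions.

```python
def dot(va, vb):
    """Dot product, used here as the similarity between two embeddings."""
    return sum(a * b for a, b in zip(va, vb))


def dual_consistency_filter(doc_vecs, code_vecs, top_k=2):
    """Return the indices i whose paired code i ranks in the top_k for docstring i.

    doc_vecs[i] and code_vecs[i] are the embeddings of the i-th docstring/code pair.
    """
    kept = []
    for i, doc_vec in enumerate(doc_vecs):
        sims = [dot(doc_vec, code_vec) for code_vec in code_vecs]
        # Indices of the top_k most similar code examples for this docstring.
        top = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:top_k]
        if i in top:
            kept.append(i)
    return kept
```

This version compares every docstring with every code example, so it is quadratic in the dataset size; at real scale an approximate nearest-neighbor index would probably be needed.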

During training, we employ a novel curriculum-based hard negative mining strategy to ensure the model learns from challenging examples. We use a softmax-based sampling strategy to progressively sample hard negatives with increasing difficulty over time.
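
One way to picture softmax-based sampling with increasing difficulty is a temperature that shrinks as training progresses, so that probability mass moves toward the negatives most similar to the query. The sketch below assumes that mechanism; the linear schedule, its start and end values, and all function names are hypothetical and are not taken from the paper.

```python
import math
import random


def negative_sampling_probs(sims, temperature):
    """Softmax over similarity scores; a lower temperature favors harder negatives."""
    scaled = [s / temperature for s in sims]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_at(step, total_steps, start=1.0, end=0.05):
    """Hypothetical linear schedule from the start to the end temperature."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def sample_negatives(sims, k, step, total_steps, rng=random):
    """Sample k negative indices (with replacement) for the current training step."""
    probs = negative_sampling_probs(sims, temperature_at(step, total_steps))
    return rng.choices(range(len(sims)), weights=probs, k=k)
```

Early in training the distribution is close to uniform over candidates, and late in training it concentrates on the highest-similarity negatives.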

🔧 Technical Details

The model is trained on the CoRNStack dataset. The dataset curation process involves multiple steps of filtering to ensure high-quality text-code pairs. During training, dual-consistency filtering and progressive hard negative mining are used to improve the model's performance on code retrieval tasks.

📄 License

This project is licensed under the Apache-2.0 license.

Join the Nomic Community

Nomic Embed Ecosystem: https://www.nomic.ai/embed
Website: https://nomic.ai
Twitter: https://twitter.com/nomic_ai
Discord: https://discord.gg/myY5YDR8z8

Citation

If you find the model, dataset, or training code useful, please cite our work:

@misc{suresh2025cornstackhighqualitycontrastivedata,
      title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
      author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
      year={2025},
      eprint={2412.01007},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.01007},
}
