🚀 SentenceTransformer based on Shuu12121/CodeModernBERT-Snake🐍
This model is a sentence-transformers model fine-tuned from Shuu12121/CodeModernBERT-Snake, a ModernBERT model designed for code and pre-trained from scratch by the author. It is tailored for code search and efficiently computes semantic similarity between code snippets and their documentation. A key feature is its maximum sequence length of 8192 tokens, which lets it handle very long code snippets and documents. Despite its relatively small size of about 75 million parameters, it achieves strong results on code search benchmarks.
🚀 Quick Start
This SentenceTransformer model is designed to excel in code search tasks. You can quickly start using it by following the installation and inference steps below.
✨ Features
- Fine-tuned for Code Search: Specifically optimized for calculating semantic similarity in code search scenarios.
- Long Sequence Handling: Supports a maximum sequence length of 8192 tokens, suitable for long code snippets and documentation.
- High Performance with Small Size: Achieves competitive results on the CodeSearchNet benchmark despite having only about 75 million parameters.
📦 Installation
To install Sentence Transformers, run the following command:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Snake")

sentences = [
    'Encrypts the zip file',
    'def freeze_encrypt(dest_dir, zip_filename, config, opt):\n \n pgp_keys = grok_keys(config)\n icefile_prefix = "aomi-%s" % \\\n os.path.basename(os.path.dirname(opt.secretfile))\n if opt.icefile_prefix:\n icefile_prefix = opt.icefile_prefix\n\n timestamp = time.strftime("%H%M%S-%m-%d-%Y",\n datetime.datetime.now().timetuple())\n ice_file = "%s/%s-%s.ice" % (dest_dir, icefile_prefix, timestamp)\n if not encrypt(zip_filename, ice_file, pgp_keys):\n raise aomi.exceptions.GPG("Unable to encrypt zipfile")\n\n return ice_file',
    'def transform(self, sents):\n \n\n def convert(tokens):\n return torch.tensor([self.vocab.stoi[t] for t in tokens], dtype=torch.long)\n\n if self.vocab is None:\n raise Exception(\n "Must run .fit() for .fit_transform() before " "calling .transform()."\n )\n\n seqs = sorted([convert(s) for s in sents], key=lambda x: -len(x))\n X = torch.LongTensor(pad_sequence(seqs, batch_first=True))\n return X',
]

# Encode the sentences into 512-dimensional embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 512)

# Compute the pairwise similarity matrix
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
```
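Once embeddings are computed, code search reduces to ranking snippets by similarity to the query embedding. The sketch below illustrates this retrieval step with small hypothetical vectors standing in for the model's real 512-dimensional outputs (the names and values are illustrative only):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical embeddings standing in for model.encode() outputs.
query_emb = [0.9, 0.1, 0.0]  # e.g. "Encrypts the zip file"
code_embs = {
    "freeze_encrypt": [0.8, 0.2, 0.1],
    "transform":      [0.1, 0.9, 0.3],
}

# Rank candidate snippets by similarity to the query, best first.
ranked = sorted(code_embs, key=lambda name: cosine(query_emb, code_embs[name]),
                reverse=True)
print(ranked)  # ['freeze_encrypt', 'transform']
```

In practice you would precompute and cache the embeddings of your code corpus, then encode each incoming query and rank as above.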
📚 Documentation
Model Evaluation
This model achieved a score of 72.12 on the CodeSearchNet benchmark despite its small size, comparable to Salesforce/SFR-Embedding-Code-400M_R (72.53), a model with 400 million parameters. Since this model focuses on code search, it does not support other tasks, and evaluation scores for other tasks are not provided. The following table compares it with well-known models, showing that it achieves a high score despite its compact size.
| Model Name | CodeSearchNet Score |
|------------|--------------------:|
| Shuu12121/CodeModernBERT-Snake | 72.12 |
| Salesforce/SFR-Embedding-Code-2B_R | 73.5 |
| CodeSage-large-v2 | 94.26 |
| Salesforce/SFR-Embedding-Code-400M_R | 72.53 |
| CodeSage-large | 90.58 |
| Voyage-Code-002 | 81.79 |
| E5-Mistral | 54.25 |
| E5-Base-v2 | 67.99 |
| OpenAI-Ada-002 | 74.21 |
| BGE-Base-en-v1.5 | 69.6 |
| BGE-M3 | 43.23 |
| UniXcoder | 60.2 |
| GTE-Base-en-v1.5 | 43.35 |
| Contriever | 34.72 |
Model Details
| Property | Details |
|----------|---------|
| Model Type | Sentence Transformer |
| Base Model | Shuu12121/CodeModernBERT-Snake |
| Maximum Sequence Length | 8192 tokens |
| Output Dimensions | 512 dimensions |
| Similarity Function | Cosine Similarity |
| License | Apache-2.0 |
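Since the similarity function is cosine similarity, the matrix returned by `model.similarity(embeddings, embeddings)` can be reproduced by L2-normalizing the embedding rows and taking a dot product. A minimal sketch with random stand-in vectors (real embeddings would come from `model.encode`):

```python
import numpy as np

# Random stand-ins for model.encode(sentences): 3 sentences, 512 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 512))

# Cosine similarity matrix: normalize each row to unit length, then dot.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit = embeddings / norms
similarities = unit @ unit.T

print(similarities.shape)  # (3, 3); diagonal entries are 1.0 (self-similarity)
```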
Library Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.50.0
- PyTorch: 2.6.0+cu124
- Accelerate: 1.5.2
- Datasets: 3.4.1
- Tokenizers: 0.21.1
📄 License
This model is released under the Apache-2.0 license.
📚 Citation
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```