🚀 SentenceTransformer based on Shuu12121/CodeModernBERT-Owl 🦉
This model is a sentence-transformers model fine-tuned from Shuu12121/CodeModernBERT-Owl, a ModernBERT encoder pre-trained from scratch specifically for code. It is tailored for code search and efficiently computes semantic similarity between code snippets and documentation. A key feature is its maximum sequence length of 2048 tokens, which lets it handle moderately long code snippets and documentation. Despite having only about 150 million parameters, it performs remarkably well on code search tasks.
🚀 Quick Start
This model is a fine-tuned sentence-transformers model that can be set up quickly and used for code search and semantic similarity calculation.
✨ Features
- Code-Specific Design: Built on a code-specialized pre-trained model, it is well suited to code search tasks.
- Long Sequence Handling: With a maximum sequence length of 2048 tokens, it can process moderately long code snippets and documentation (see the sketch after this list).
- High Performance: Despite its relatively small size (about 150 million parameters), it achieves strong results on code search benchmarks.
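As a quick illustration of the long-sequence support, the snippet below checks the model's configured maximum sequence length and encodes a longer-than-average docstring. This is a minimal sketch; the example docstring is invented for illustration.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# The tokenizer truncates anything beyond this limit
print(model.max_seq_length)  # -> 2048

# A hypothetical long documentation string; inputs up to 2048 tokens
# are encoded without truncation
long_doc = "Parses the configuration file and validates every section. " * 200
embedding = model.encode(long_doc)
print(embedding.shape)  # -> (768,)
```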
📦 Installation
To install Sentence Transformers, run:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# One natural-language query and two code snippets
sentences = [
    'Encrypts the zip file',
    'def freeze_encrypt(dest_dir, zip_filename, config, opt):\n \n pgp_keys = grok_keys(config)\n icefile_prefix = "aomi-%s" % \\\n os.path.basename(os.path.dirname(opt.secretfile))\n if opt.icefile_prefix:\n icefile_prefix = opt.icefile_prefix\n\n timestamp = time.strftime("%H%M%S-%m-%d-%Y",\n datetime.datetime.now().timetuple())\n ice_file = "%s/%s-%s.ice" % (dest_dir, icefile_prefix, timestamp)\n if not encrypt(zip_filename, ice_file, pgp_keys):\n raise aomi.exceptions.GPG("Unable to encrypt zipfile")\n\n return ice_file',
    'def transform(self, sents):\n \n\n def convert(tokens):\n return torch.tensor([self.vocab.stoi[t] for t in tokens], dtype=torch.long)\n\n if self.vocab is None:\n raise Exception(\n "Must run .fit() for .fit_transform() before " "calling .transform()."\n )\n\n seqs = sorted([convert(s) for s in sents], key=lambda x: -len(x))\n X = torch.LongTensor(pad_sequence(seqs, batch_first=True))\n return X',
]

# Compute embeddings; each sentence maps to a 768-dimensional vector
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)

# Pairwise cosine similarities between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
```
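Building on the basic example, the following sketch shows a typical code-search flow: embed a small corpus of functions once, then rank them against a natural-language query with `util.semantic_search`, the standard Sentence Transformers helper for top-k retrieval. The corpus snippets and query here are invented for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# A hypothetical mini-corpus of code snippets to search over
corpus = [
    "def read_json(path):\n    with open(path) as f:\n        return json.load(f)",
    "def sha256_digest(data: bytes) -> str:\n    return hashlib.sha256(data).hexdigest()",
    "def retry(fn, attempts=3):\n    ...",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Embed the query and retrieve the top-2 most similar snippets
query_embedding = model.encode("compute the SHA-256 hash of a byte string", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]

for hit in hits:
    print(f"score={hit['score']:.4f}  snippet={corpus[hit['corpus_id']]!r}")
```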
📚 Documentation
Model Evaluation
Despite its relatively small size (around 150M parameters), this model achieves an impressive 76.89 on the CodeSearchNet benchmark, demonstrating strong performance on code search. Because the model is specialized for code search, it does not support other tasks, and no evaluation scores for other tasks are provided. On CodeSearchNet it outperforms many well-known models, as the comparison table below shows.
| Model Name | CodeSearchNet Score |
|---|---|
| Shuu12121/CodeModernBERT-Owl | 76.89 |
| Salesforce/SFR-Embedding-Code-2B_R | 73.5 |
| CodeSage-large-v2 | 94.26 |
| Salesforce/SFR-Embedding-Code-400M_R | 72.53 |
| CodeSage-large | 90.58 |
| Voyage-Code-002 | 81.79 |
| E5-Mistral | 54.25 |
| E5-Base-v2 | 67.99 |
| OpenAI-Ada-002 | 74.21 |
| BGE-Base-en-v1.5 | 69.6 |
| BGE-M3 | 43.23 |
| UniXcoder | 60.2 |
| GTE-Base-en-v1.5 | 43.35 |
| Contriever | 34.72 |
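For readers who want to run a retrieval-style evaluation on their own data, the sketch below uses Sentence Transformers' `InformationRetrievalEvaluator` on a tiny hand-made query-to-code mapping. The queries, corpus, and IDs are all invented for illustration; this is not the actual CodeSearchNet evaluation setup.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# Hypothetical toy data: doc-style queries, a code corpus, and relevance labels
queries = {
    "q1": "Encrypts the zip file",
    "q2": "Pads a batch of token sequences",
}
corpus = {
    "c1": "def freeze_encrypt(dest_dir, zip_filename, config, opt): ...",
    "c2": "def transform(self, sents): ...",
}
relevant_docs = {"q1": {"c1"}, "q2": {"c2"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="toy-codesearch")
scores = evaluator(model)  # retrieval metrics such as MRR@k, NDCG@k, Recall@k
print(scores)
```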
Model Details
| Property | Details |
|---|---|
| Model Type | Sentence Transformer |
| Base Model | Shuu12121/CodeModernBERT-Owl |
| Maximum Sequence Length | 2048 tokens |
| Output Dimensions | 768 dimensions |
| Similarity Function | Cosine Similarity |
| License | Apache-2.0 |
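As a quick sanity check of the output dimensionality and the similarity function listed above, the sketch below compares `model.similarity` against a manual cosine similarity computed with PyTorch. This is a minimal illustration, not part of the official evaluation.

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

emb = model.encode(["Encrypts the zip file", "def freeze_encrypt(...): ..."], convert_to_tensor=True)
print(emb.shape)  # -> torch.Size([2, 768])

# model.similarity defaults to cosine similarity, so it should match
# a manual dot product of L2-normalized embeddings
normalized = torch.nn.functional.normalize(emb, p=2, dim=1)
manual = normalized @ normalized.T
print(torch.allclose(model.similarity(emb, emb), manual, atol=1e-6))  # -> True
```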
Library Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.50.0
- PyTorch: 2.6.0+cu124
- Accelerate: 1.5.2
- Datasets: 3.4.1
- Tokenizers: 0.21.1
📄 License
This model is licensed under the Apache-2.0 license.
📚 Citation
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```