🚀 SentenceTransformer based on Shuu12121/CodeModernBERT-Owl 🦉
This model is a sentence-transformers model fine-tuned from Shuu12121/CodeModernBERT-Owl, a ModernBERT encoder pre-trained from scratch specifically for code. It is tailored for code search and efficiently computes semantic similarity between code snippets and documentation. A key feature is its maximum sequence length of 2048 tokens, which lets it handle moderately long code snippets and documentation. Despite having only about 150 million parameters, it performs remarkably well on code search tasks.
🚀 Quick Start
This model is a fine-tuned sentence-transformers model that can be set up quickly and used for code search and semantic similarity calculation.
✨ Features
- Code-Specific Design: Built on a code-specialized pre-trained model, it is well suited to code search tasks.
- Long Sequence Handling: With a maximum sequence length of 2048 tokens, it can process moderately long code snippets and documentation (see the sketch after this list).
- High Performance: Despite its relatively small size (about 150 million parameters), it achieves strong results on code search benchmarks.
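As a quick illustration of the long-sequence support, the snippet below checks the model's configured maximum sequence length and encodes a longer-than-average docstring. This is a minimal sketch; the example docstring is invented for illustration.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# The tokenizer truncates anything beyond this limit
print(model.max_seq_length)  # -> 2048

# A hypothetical long documentation string; inputs up to 2048 tokens
# are encoded without truncation
long_doc = "Parses the configuration file and validates every section. " * 200
embedding = model.encode(long_doc)
print(embedding.shape)  # -> (768,)
```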
📦 Installation
To install Sentence Transformers, run:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# One natural-language query and two code snippets
sentences = [
    'Encrypts the zip file',
    'def freeze_encrypt(dest_dir, zip_filename, config, opt):\n \n pgp_keys = grok_keys(config)\n icefile_prefix = "aomi-%s" % \\\n os.path.basename(os.path.dirname(opt.secretfile))\n if opt.icefile_prefix:\n icefile_prefix = opt.icefile_prefix\n\n timestamp = time.strftime("%H%M%S-%m-%d-%Y",\n datetime.datetime.now().timetuple())\n ice_file = "%s/%s-%s.ice" % (dest_dir, icefile_prefix, timestamp)\n if not encrypt(zip_filename, ice_file, pgp_keys):\n raise aomi.exceptions.GPG("Unable to encrypt zipfile")\n\n return ice_file',
    'def transform(self, sents):\n \n\n def convert(tokens):\n return torch.tensor([self.vocab.stoi[t] for t in tokens], dtype=torch.long)\n\n if self.vocab is None:\n raise Exception(\n "Must run .fit() for .fit_transform() before " "calling .transform()."\n )\n\n seqs = sorted([convert(s) for s in sents], key=lambda x: -len(x))\n X = torch.LongTensor(pad_sequence(seqs, batch_first=True))\n return X',
]

# Compute embeddings; each sentence maps to a 768-dimensional vector
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)

# Pairwise cosine similarities between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
```
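Building on the basic example, the following sketch shows a typical code-search flow: embed a small corpus of functions once, then rank them against a natural-language query with `util.semantic_search`, the standard Sentence Transformers helper for top-k retrieval. The corpus snippets and query here are invented for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# A hypothetical mini-corpus of code snippets to search over
corpus = [
    "def read_json(path):\n    with open(path) as f:\n        return json.load(f)",
    "def sha256_digest(data: bytes) -> str:\n    return hashlib.sha256(data).hexdigest()",
    "def retry(fn, attempts=3):\n    ...",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Embed the query and retrieve the top-2 most similar snippets
query_embedding = model.encode("compute the SHA-256 hash of a byte string", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]

for hit in hits:
    print(f"score={hit['score']:.4f}  snippet={corpus[hit['corpus_id']]!r}")
```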
📚 Documentation
Model Evaluation
Despite its relatively small size (around 150M parameters), this model achieves an impressive 76.89 on the CodeSearchNet benchmark, demonstrating strong performance on code search. Because the model is specialized for code search, it does not support other tasks, and no evaluation scores for other tasks are provided. On CodeSearchNet it outperforms many well-known models, as the comparison table below shows.
| Model Name | CodeSearchNet Score |
|---|---|
| Shuu12121/CodeModernBERT-Owl | 76.89 |
| Salesforce/SFR-Embedding-Code-2B_R | 73.5 |
| CodeSage-large-v2 | 94.26 |
| Salesforce/SFR-Embedding-Code-400M_R | 72.53 |
| CodeSage-large | 90.58 |
| Voyage-Code-002 | 81.79 |
| E5-Mistral | 54.25 |
| E5-Base-v2 | 67.99 |
| OpenAI-Ada-002 | 74.21 |
| BGE-Base-en-v1.5 | 69.6 |
| BGE-M3 | 43.23 |
| UniXcoder | 60.2 |
| GTE-Base-en-v1.5 | 43.35 |
| Contriever | 34.72 |
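For readers who want to run a retrieval-style evaluation on their own data, the sketch below uses Sentence Transformers' `InformationRetrievalEvaluator` on a tiny hand-made query-to-code mapping. The queries, corpus, and IDs are all invented for illustration; this is not the actual CodeSearchNet evaluation setup.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# Hypothetical toy data: doc-style queries, a code corpus, and relevance labels
queries = {
    "q1": "Encrypts the zip file",
    "q2": "Pads a batch of token sequences",
}
corpus = {
    "c1": "def freeze_encrypt(dest_dir, zip_filename, config, opt): ...",
    "c2": "def transform(self, sents): ...",
}
relevant_docs = {"q1": {"c1"}, "q2": {"c2"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="toy-codesearch")
scores = evaluator(model)  # retrieval metrics such as MRR@k, NDCG@k, Recall@k
print(scores)
```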
Model Details
| Property | Details |
|---|---|
| Model Type | Sentence Transformer |
| Base Model | Shuu12121/CodeModernBERT-Owl |
| Maximum Sequence Length | 2048 tokens |
| Output Dimensions | 768 dimensions |
| Similarity Function | Cosine Similarity |
| License | Apache-2.0 |
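As a quick sanity check of the output dimensionality and the similarity function listed above, the sketch below compares `model.similarity` against a manual cosine similarity computed with PyTorch. This is a minimal illustration, not part of the official evaluation.

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

emb = model.encode(["Encrypts the zip file", "def freeze_encrypt(...): ..."], convert_to_tensor=True)
print(emb.shape)  # -> torch.Size([2, 768])

# model.similarity defaults to cosine similarity, so it should match
# a manual dot product of L2-normalized embeddings
normalized = torch.nn.functional.normalize(emb, p=2, dim=1)
manual = normalized @ normalized.T
print(torch.allclose(model.similarity(emb, emb), manual, atol=1e-6))  # -> True
```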
Library Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.50.0
- PyTorch: 2.6.0+cu124
- Accelerate: 1.5.2
- Datasets: 3.4.1
- Tokenizers: 0.21.1
📄 License
This model is licensed under the Apache-2.0 license.
📚 Citation
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```