🚀 SentenceTransformer based on Shuu12121/CodeModernBERT-Snake🐍
This model is a sentence-transformers model fine-tuned from Shuu12121/CodeModernBERT-Snake, a ModernBERT model designed for code and pre-trained from scratch by the author. It is tailored for code search and efficiently computes semantic similarity between code snippets and their documentation. A key feature is its maximum sequence length of 8192 tokens, which lets it handle very long code snippets and documents. Despite its relatively small size of about 75 million parameters, it achieves strong results on code search benchmarks.
🚀 Quick Start
This SentenceTransformer model is designed to excel in code search tasks. You can quickly start using it by following the installation and inference steps below.
✨ Features
- Fine-tuned for Code Search: Specifically optimized for calculating semantic similarity in code search scenarios.
- Long Sequence Handling: Supports a maximum sequence length of 8192 tokens, suitable for long code snippets and documentation.
- High Performance with Small Size: Achieves competitive results on the CodeSearchNet benchmark despite having only about 75 million parameters.
📦 Installation
To install Sentence Transformers, run the following command:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Snake")

sentences = [
    'Encrypts the zip file',
    'def freeze_encrypt(dest_dir, zip_filename, config, opt):\n \n pgp_keys = grok_keys(config)\n icefile_prefix = "aomi-%s" % \\\n os.path.basename(os.path.dirname(opt.secretfile))\n if opt.icefile_prefix:\n icefile_prefix = opt.icefile_prefix\n\n timestamp = time.strftime("%H%M%S-%m-%d-%Y",\n datetime.datetime.now().timetuple())\n ice_file = "%s/%s-%s.ice" % (dest_dir, icefile_prefix, timestamp)\n if not encrypt(zip_filename, ice_file, pgp_keys):\n raise aomi.exceptions.GPG("Unable to encrypt zipfile")\n\n return ice_file',
    'def transform(self, sents):\n \n\n def convert(tokens):\n return torch.tensor([self.vocab.stoi[t] for t in tokens], dtype=torch.long)\n\n if self.vocab is None:\n raise Exception(\n "Must run .fit() for .fit_transform() before " "calling .transform()."\n )\n\n seqs = sorted([convert(s) for s in sents], key=lambda x: -len(x))\n X = torch.LongTensor(pad_sequence(seqs, batch_first=True))\n return X',
]

# Encode the sentences into 512-dimensional embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 512)

# Compute the pairwise similarity matrix
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
```
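Once embeddings are computed, code search reduces to ranking snippets by similarity to the query embedding. The sketch below illustrates this retrieval step with small hypothetical vectors standing in for the model's real 512-dimensional outputs (the names and values are illustrative only):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical embeddings standing in for model.encode() outputs.
query_emb = [0.9, 0.1, 0.0]  # e.g. "Encrypts the zip file"
code_embs = {
    "freeze_encrypt": [0.8, 0.2, 0.1],
    "transform":      [0.1, 0.9, 0.3],
}

# Rank candidate snippets by similarity to the query, best first.
ranked = sorted(code_embs, key=lambda name: cosine(query_emb, code_embs[name]),
                reverse=True)
print(ranked)  # ['freeze_encrypt', 'transform']
```

In practice you would precompute and cache the embeddings of your code corpus, then encode each incoming query and rank as above.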
📚 Documentation
Model Evaluation
This model achieved a score of 72.12 on the CodeSearchNet benchmark despite its small size, comparable to Salesforce/SFR-Embedding-Code-400M_R (72.53), a model with 400 million parameters. Since this model focuses on code search, it does not support other tasks, and evaluation scores for other tasks are not provided. The following table compares it with well-known models, showing that it achieves a high score despite its compact size.
| Model Name | CodeSearchNet Score |
|------------|--------------------:|
| Shuu12121/CodeModernBERT-Snake | 72.12 |
| Salesforce/SFR-Embedding-Code-2B_R | 73.5 |
| CodeSage-large-v2 | 94.26 |
| Salesforce/SFR-Embedding-Code-400M_R | 72.53 |
| CodeSage-large | 90.58 |
| Voyage-Code-002 | 81.79 |
| E5-Mistral | 54.25 |
| E5-Base-v2 | 67.99 |
| OpenAI-Ada-002 | 74.21 |
| BGE-Base-en-v1.5 | 69.6 |
| BGE-M3 | 43.23 |
| UniXcoder | 60.2 |
| GTE-Base-en-v1.5 | 43.35 |
| Contriever | 34.72 |
Model Details
| Property | Details |
|----------|---------|
| Model Type | Sentence Transformer |
| Base Model | Shuu12121/CodeModernBERT-Snake |
| Maximum Sequence Length | 8192 tokens |
| Output Dimensions | 512 dimensions |
| Similarity Function | Cosine Similarity |
| License | Apache-2.0 |
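Since the similarity function is cosine similarity, the matrix returned by `model.similarity(embeddings, embeddings)` can be reproduced by L2-normalizing the embedding rows and taking a dot product. A minimal sketch with random stand-in vectors (real embeddings would come from `model.encode`):

```python
import numpy as np

# Random stand-ins for model.encode(sentences): 3 sentences, 512 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 512))

# Cosine similarity matrix: normalize each row to unit length, then dot.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit = embeddings / norms
similarities = unit @ unit.T

print(similarities.shape)  # (3, 3); diagonal entries are 1.0 (self-similarity)
```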
Library Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.50.0
- PyTorch: 2.6.0+cu124
- Accelerate: 1.5.2
- Datasets: 3.4.1
- Tokenizers: 0.21.1
📄 License
This model is released under the Apache-2.0 license.
📚 Citation
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```