# CodeMorph-ModernBERT
CodeMorph-ModernBERT is a pre-trained model designed from scratch for code search and code understanding tasks. It is trained on the `code-search-net/code_search_net` dataset to strengthen its semantic understanding of code. With support for sequences of up to 2048 tokens (compared with the 512-token limit of models such as microsoft/codebert-base), it performs particularly well on Python code search.
## 🚀 Quick Start
You can load this model with the Hugging Face Transformers library. Note that it requires `transformers` version 4.48.0 or later.
### Load the Model
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
```
### Fill-Mask (Code Completion)
```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("def add_numbers(a, b): return a + [MASK]"))
```
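If the tokenizer's mask token is not the literal `[MASK]` string, the pipeline will find nothing to fill. A safer variant (a small sketch, not from the original card) builds the prompt from `tokenizer.mask_token`:

```python
# Check which mask token this tokenizer actually uses before hard-coding "[MASK]".
print(tokenizer.mask_token)
print(fill_mask(f"def add_numbers(a, b): return a + {tokenizer.mask_token}"))
```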
### Obtain Code Embeddings
```python
import torch

def get_embedding(text, model, tokenizer, device="cuda"):  # use "cpu" if no GPU is available
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    # ModernBERT does not use token type IDs, so drop them if the tokenizer emits any
    inputs.pop("token_type_ids", None)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model = model.to(device)  # keep the model on the same device as the inputs
    with torch.no_grad():
        # model.model is the underlying encoder beneath the masked-LM head
        outputs = model.model(**inputs)
    # Use the first token's hidden state as the sequence embedding
    embedding = outputs.last_hidden_state[:, 0, :]
    return embedding

embedding = get_embedding("def my_function(): pass", model, tokenizer)
print(embedding.shape)
```
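Building on `get_embedding`, a minimal code search sketch (illustrative only, assuming query and candidates fit in 256 tokens) ranks candidate functions by cosine similarity to a natural-language query:

```python
import torch.nn.functional as F

query = "add two numbers"
candidates = [
    "def add_numbers(a, b): return a + b",
    "def multiply(a, b): return a * b",
]

query_emb = get_embedding(query, model, tokenizer)  # (1, hidden_size)
cand_embs = torch.cat([get_embedding(c, model, tokenizer) for c in candidates])

# Higher cosine similarity = better match between query and code
scores = F.cosine_similarity(query_emb, cand_embs)
print(candidates[scores.argmax().item()], scores.tolist())
```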
## ✨ Features
- Long Sequence Support: Handles sequences of up to 2048 tokens, making it suitable for long, complex functions (see the sketch after this list).
- High Code Search Performance: Uses a SentencePiece tokenizer trained on six programming languages, achieving notably better search accuracy than the baseline models compared below.
- Trained from Scratch: Pre-trained from scratch on the CodeSearchNet dataset, giving it a deep understanding of programming syntax and comments.
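As a quick illustration of the long-context claim (a sketch reusing the tokenizer loaded in Quick Start; the repetitive function body is purely synthetic):

```python
# Build a function far longer than the 512-token limit of BERT-style models
long_code = "def f(x):\n" + "    x = x + 1\n" * 200
ids = tokenizer(long_code, return_tensors="pt")["input_ids"]
print(ids.shape)  # sequences up to 2048 tokens fit without truncation
```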
## 📦 Installation
The model only requires the Hugging Face Transformers library, version 4.48.0 or later. Install it with:

```bash
pip install "transformers>=4.48.0"
```
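To verify the installed version programmatically (a minimal check; `packaging` ships as a Transformers dependency):

```python
from packaging import version
import transformers

# ModernBERT support requires transformers >= 4.48.0
assert version.parse(transformers.__version__) >= version.parse("4.48.0")
```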
## 💻 Usage Examples
### Basic Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
```
### Advanced Usage
```python
import torch

def get_embedding(text, model, tokenizer, device="cuda"):  # use "cpu" if no GPU is available
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    # ModernBERT does not use token type IDs, so drop them if the tokenizer emits any
    inputs.pop("token_type_ids", None)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model = model.to(device)  # keep the model on the same device as the inputs
    with torch.no_grad():
        # model.model is the underlying encoder beneath the masked-LM head
        outputs = model.model(**inputs)
    # Use the first token's hidden state as the sequence embedding
    embedding = outputs.last_hidden_state[:, 0, :]
    return embedding

embedding = get_embedding("def my_function(): pass", model, tokenizer)
print(embedding.shape)
```
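The helper above takes the hidden state of the first token as the sequence embedding. A common alternative, shown here as a sketch rather than as the card's official method, is attention-mask-aware mean pooling over all tokens:

```python
def get_embedding_mean(text, model, tokenizer, device="cuda"):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    inputs.pop("token_type_ids", None)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model = model.to(device)
    with torch.no_grad():
        hidden = model.model(**inputs).last_hidden_state  # (1, seq_len, hidden)
    # Average only over real tokens, excluding padding
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```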
## 📚 Documentation
### Model Parameters
The model is designed with the following parameters:
| Parameter Name | Value |
|---|---|
| `vocab_size` | 50000 |
| `hidden_size` | 768 |
| `num_hidden_layers` | 12 |
| `num_attention_heads` | 12 |
| `intermediate_size` | 3072 |
| `max_position_embeddings` | 2048 |
| `type_vocab_size` | 2 |
| `hidden_dropout_prob` | 0.1 |
| `attention_probs_dropout_prob` | 0.1 |
| `local_attention_window` | 128 |
| `rope_theta` | 160000 |
| `local_attention_rope_theta` | 10000 |
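These values can be read back from the published configuration with the standard `AutoConfig` API (a quick sanity check, not part of the original card):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
print(config.vocab_size, config.hidden_size, config.max_position_embeddings)
```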
### Dataset
This model is trained on the `code-search-net/code_search_net` dataset, which contains code snippets from multiple programming languages (Python, Java, JavaScript, PHP, Ruby, and Go), making it well suited for code search tasks.
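For reference, the dataset can be browsed through the `datasets` library (a sketch; recent `datasets` versions may require the `trust_remote_code` flag for this loader):

```python
from datasets import load_dataset

# Other language configs: "java", "javascript", "php", "ruby", "go"
ds = load_dataset("code-search-net/code_search_net", "python", trust_remote_code=True)
print(ds["train"][0]["func_documentation_string"][:100])
```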
### Evaluation Results
The model was evaluated on the Python subset of the `code_x_glue_ct_code_to_text` dataset. The main evaluation metrics are MRR (Mean Reciprocal Rank), MAP (Mean Average Precision), and R-Precision.
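The tables below report identical MRR and MAP, which is consistent with each query having exactly one relevant snippet in its candidate pool; in that setting Average Precision reduces to the reciprocal rank. A small illustrative implementation of the metrics (not the card's evaluation script):

```python
def eval_metrics(ranks):
    """ranks: 1-based rank of the single correct snippet for each query."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)  # equals MAP in this setting
    r_precision = sum(r == 1 for r in ranks) / len(ranks)
    return mrr, r_precision

print(eval_metrics([1, 2, 1, 5]))  # (0.675, 0.5)
```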
#### Comparison with Other Models
Here is a comparison between CodeMorph-ModernBERT and other major code search models:
| Model | MRR | MAP | R-Precision |
|---|---|---|---|
| CodeMorph-ModernBERT | 0.8172 | 0.8172 | 0.7501 |
| microsoft/graphcodebert-base | 0.5482 | 0.5482 | 0.4458 |
| microsoft/codebert-base-mlm | 0.5243 | 0.5243 | 0.4378 |
| Salesforce/codet5p-220m-py | 0.7512 | 0.7512 | 0.6617 |
| Salesforce/codet5-large-ntp-py | 0.7846 | 0.7846 | 0.7067 |
| Shuu12121/CodeMorph-BERT | 0.6851 | 0.6851 | 0.5934 |
| Shuu12121/CodeMorph-BERTv2 | 0.6535 | 0.6535 | 0.5543 |
#### Code Search Model Evaluation Results (google/code_x_glue_tc_nl_code_search_adv, Test Split)
The following table summarizes the evaluation results of various code search models on the `google/code_x_glue_tc_nl_code_search_adv` dataset (test split). The candidate pool size for all evaluations was set to 100. Click here for additional experiment code.
| Model | MRR | MAP | R-Precision |
|---|---|---|---|
| Shuu12121/CodeMorph-ModernBERT | 0.6107 | 0.6107 | 0.5038 |
| Salesforce/codet5p-220m-py | 0.5037 | 0.5037 | 0.3805 |
| Salesforce/codet5-large-ntp-py | 0.4872 | 0.4872 | 0.3658 |
| microsoft/graphcodebert-base | 0.3844 | 0.3844 | 0.2764 |
| microsoft/codebert-base-mlm | 0.3766 | 0.3766 | 0.2683 |
| Shuu12121/CodeMorph-BERTv2 | 0.3142 | 0.3142 | 0.2166 |
| Shuu12121/CodeMorph-BERT | 0.2978 | 0.2978 | 0.1992 |
On this benchmark, CodeMorph-ModernBERT achieves higher search accuracy than the CodeBERT- and CodeT5-based baselines.
#### Evaluation Results Across Multiple Languages
CodeMorph-ModernBERT also performs well on code search across multiple languages. The following table summarizes the main evaluation metrics (MRR, MAP, R-Precision) for each language; this experiment was conducted on a sample of 1,000 data points. Click here for the notebook.
| Language | MRR | MAP | R-Precision |
|---|---|---|---|
| Python | 0.8098 | 0.8098 | 0.7520 |
| Java | 0.6437 | 0.6437 | 0.5480 |
| JavaScript | 0.5928 | 0.5928 | 0.4880 |
| PHP | 0.7512 | 0.7512 | 0.6710 |
| Ruby | 0.7188 | 0.7188 | 0.6310 |
| Go | 0.5358 | 0.5358 | 0.4320 |
For comparison, Salesforce/codet5p-220m-bimodal generally achieves higher per-language search accuracy than CodeMorph-ModernBERT:
| Language | MRR | MAP | R-Precision |
|---|---|---|---|
| Python | 0.8322 | 0.8322 | 0.7660 |
| Java | 0.8886 | 0.8886 | 0.8390 |
| JavaScript | 0.7611 | 0.7611 | 0.6710 |
| PHP | 0.8985 | 0.8985 | 0.8530 |
| Ruby | 0.7635 | 0.7635 | 0.6740 |
| Go | 0.8127 | 0.8127 | 0.7260 |
However, on the `google/code_x_glue_tc_nl_code_search_adv` dataset (test split), CodeMorph-ModernBERT outperforms it:
| Model | MRR | MAP | R-Precision |
|---|---|---|---|
| Shuu12121/CodeMorph-ModernBERT | 0.6107 | 0.6107 | 0.5038 |
| Salesforce/codet5p-220m-bimodal | 0.5326 | 0.5326 | 0.4208 |
This suggests that CodeMorph-ModernBERT may have an advantage on more challenging Python code search tasks that demand stronger generalization.
## 📄 License
This model is provided under the Apache-2.0 license.
## Contact Information
If you have any questions about this model, please contact us at the following email address: shun0212114@outlook.jp