# CodeMorph-ModernBERT
CodeMorph-ModernBERT is a pre-trained model designed from scratch for code search and code understanding tasks. It is trained on the `code-search-net/code_search_net` dataset to strengthen its semantic understanding of code. With support for sequences of up to 2048 tokens (compared with the 512-token limit of models such as microsoft/codebert-base), it performs particularly well on Python code search.
## 🚀 Quick Start
You can load this model with the Hugging Face Transformers library. Note that it requires `transformers` version 4.48.0 or later.
### Load the Model
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
```
### Fill-Mask (Code Completion)
```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("def add_numbers(a, b): return a + [MASK]"))
```
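If the tokenizer's mask token is not the literal `[MASK]` string, the pipeline will find nothing to fill. A safer variant (a small sketch, not from the original card) builds the prompt from `tokenizer.mask_token`:

```python
# Check which mask token this tokenizer actually uses before hard-coding "[MASK]".
print(tokenizer.mask_token)
print(fill_mask(f"def add_numbers(a, b): return a + {tokenizer.mask_token}"))
```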
### Obtain Code Embeddings
```python
import torch

def get_embedding(text, model, tokenizer, device="cuda"):  # use "cpu" if no GPU is available
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    # ModernBERT does not use token type IDs, so drop them if the tokenizer emits any
    inputs.pop("token_type_ids", None)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model = model.to(device)  # keep the model on the same device as the inputs
    with torch.no_grad():
        # model.model is the underlying encoder beneath the masked-LM head
        outputs = model.model(**inputs)
    # Use the first token's hidden state as the sequence embedding
    embedding = outputs.last_hidden_state[:, 0, :]
    return embedding

embedding = get_embedding("def my_function(): pass", model, tokenizer)
print(embedding.shape)
```
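Building on `get_embedding`, a minimal code search sketch (illustrative only, assuming query and candidates fit in 256 tokens) ranks candidate functions by cosine similarity to a natural-language query:

```python
import torch.nn.functional as F

query = "add two numbers"
candidates = [
    "def add_numbers(a, b): return a + b",
    "def multiply(a, b): return a * b",
]

query_emb = get_embedding(query, model, tokenizer)  # (1, hidden_size)
cand_embs = torch.cat([get_embedding(c, model, tokenizer) for c in candidates])

# Higher cosine similarity = better match between query and code
scores = F.cosine_similarity(query_emb, cand_embs)
print(candidates[scores.argmax().item()], scores.tolist())
```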
## ✨ Features
- Long Sequence Support: Handles sequences of up to 2048 tokens, making it suitable for long, complex functions (see the sketch after this list).
- High Code Search Performance: Uses a SentencePiece tokenizer trained on six programming languages, achieving notably better search accuracy than the baseline models compared below.
- Trained from Scratch: Pre-trained from scratch on the CodeSearchNet dataset, giving it a deep understanding of programming syntax and comments.
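As a quick illustration of the long-context claim (a sketch reusing the tokenizer loaded in Quick Start; the repetitive function body is purely synthetic):

```python
# Build a function far longer than the 512-token limit of BERT-style models
long_code = "def f(x):\n" + "    x = x + 1\n" * 200
ids = tokenizer(long_code, return_tensors="pt")["input_ids"]
print(ids.shape)  # sequences up to 2048 tokens fit without truncation
```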
## 📦 Installation
The model only requires the Hugging Face Transformers library, version 4.48.0 or later. Install it with:

```bash
pip install "transformers>=4.48.0"
```
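To verify the installed version programmatically (a minimal check; `packaging` ships as a Transformers dependency):

```python
from packaging import version
import transformers

# ModernBERT support requires transformers >= 4.48.0
assert version.parse(transformers.__version__) >= version.parse("4.48.0")
```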
## 💻 Usage Examples
### Basic Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
```
### Advanced Usage
```python
import torch

def get_embedding(text, model, tokenizer, device="cuda"):  # use "cpu" if no GPU is available
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    # ModernBERT does not use token type IDs, so drop them if the tokenizer emits any
    inputs.pop("token_type_ids", None)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model = model.to(device)  # keep the model on the same device as the inputs
    with torch.no_grad():
        # model.model is the underlying encoder beneath the masked-LM head
        outputs = model.model(**inputs)
    # Use the first token's hidden state as the sequence embedding
    embedding = outputs.last_hidden_state[:, 0, :]
    return embedding

embedding = get_embedding("def my_function(): pass", model, tokenizer)
print(embedding.shape)
```
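The helper above takes the hidden state of the first token as the sequence embedding. A common alternative, shown here as a sketch rather than as the card's official method, is attention-mask-aware mean pooling over all tokens:

```python
def get_embedding_mean(text, model, tokenizer, device="cuda"):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    inputs.pop("token_type_ids", None)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model = model.to(device)
    with torch.no_grad():
        hidden = model.model(**inputs).last_hidden_state  # (1, seq_len, hidden)
    # Average only over real tokens, excluding padding
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```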
## 📚 Documentation
### Model Parameters
The model is designed with the following parameters:
| Parameter Name | Value |
|---|---|
| `vocab_size` | 50000 |
| `hidden_size` | 768 |
| `num_hidden_layers` | 12 |
| `num_attention_heads` | 12 |
| `intermediate_size` | 3072 |
| `max_position_embeddings` | 2048 |
| `type_vocab_size` | 2 |
| `hidden_dropout_prob` | 0.1 |
| `attention_probs_dropout_prob` | 0.1 |
| `local_attention_window` | 128 |
| `rope_theta` | 160000 |
| `local_attention_rope_theta` | 10000 |
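These values can be read back from the published configuration with the standard `AutoConfig` API (a quick sanity check, not part of the original card):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
print(config.vocab_size, config.hidden_size, config.max_position_embeddings)
```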
### Dataset
This model is trained on the `code-search-net/code_search_net` dataset, which contains code snippets from multiple programming languages (Python, Java, JavaScript, PHP, Ruby, and Go), making it well suited for code search tasks.
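For reference, the dataset can be browsed through the `datasets` library (a sketch; recent `datasets` versions may require the `trust_remote_code` flag for this loader):

```python
from datasets import load_dataset

# Other language configs: "java", "javascript", "php", "ruby", "go"
ds = load_dataset("code-search-net/code_search_net", "python", trust_remote_code=True)
print(ds["train"][0]["func_documentation_string"][:100])
```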
### Evaluation Results
The model was evaluated on the Python subset of the `code_x_glue_ct_code_to_text` dataset. The main evaluation metrics are MRR (Mean Reciprocal Rank), MAP (Mean Average Precision), and R-Precision.
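The tables below report identical MRR and MAP, which is consistent with each query having exactly one relevant snippet in its candidate pool; in that setting Average Precision reduces to the reciprocal rank. A small illustrative implementation of the metrics (not the card's evaluation script):

```python
def eval_metrics(ranks):
    """ranks: 1-based rank of the single correct snippet for each query."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)  # equals MAP in this setting
    r_precision = sum(r == 1 for r in ranks) / len(ranks)
    return mrr, r_precision

print(eval_metrics([1, 2, 1, 5]))  # (0.675, 0.5)
```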
#### Comparison with Other Models
Here is a comparison between CodeMorph-ModernBERT and other major code search models:
| Model | MRR | MAP | R-Precision |
|---|---|---|---|
| CodeMorph-ModernBERT | 0.8172 | 0.8172 | 0.7501 |
| microsoft/graphcodebert-base | 0.5482 | 0.5482 | 0.4458 |
| microsoft/codebert-base-mlm | 0.5243 | 0.5243 | 0.4378 |
| Salesforce/codet5p-220m-py | 0.7512 | 0.7512 | 0.6617 |
| Salesforce/codet5-large-ntp-py | 0.7846 | 0.7846 | 0.7067 |
| Shuu12121/CodeMorph-BERT | 0.6851 | 0.6851 | 0.5934 |
| Shuu12121/CodeMorph-BERTv2 | 0.6535 | 0.6535 | 0.5543 |
#### Code Search Model Evaluation Results (google/code_x_glue_tc_nl_code_search_adv, Test Split)
The following table summarizes the evaluation results of various code search models on the `google/code_x_glue_tc_nl_code_search_adv` dataset (test split). The candidate pool size for all evaluations was set to 100. Click here for additional experiment code.
| Model | MRR | MAP | R-Precision |
|---|---|---|---|
| Shuu12121/CodeMorph-ModernBERT | 0.6107 | 0.6107 | 0.5038 |
| Salesforce/codet5p-220m-py | 0.5037 | 0.5037 | 0.3805 |
| Salesforce/codet5-large-ntp-py | 0.4872 | 0.4872 | 0.3658 |
| microsoft/graphcodebert-base | 0.3844 | 0.3844 | 0.2764 |
| microsoft/codebert-base-mlm | 0.3766 | 0.3766 | 0.2683 |
| Shuu12121/CodeMorph-BERTv2 | 0.3142 | 0.3142 | 0.2166 |
| Shuu12121/CodeMorph-BERT | 0.2978 | 0.2978 | 0.1992 |
On this benchmark, CodeMorph-ModernBERT achieves higher search accuracy than the CodeBERT- and CodeT5-based baselines.
#### Evaluation Results Across Multiple Languages
CodeMorph-ModernBERT also performs well on code search across multiple languages. The following table summarizes the main evaluation metrics (MRR, MAP, R-Precision) for each language; this experiment was conducted on a sample of 1,000 data points. Click here for the notebook.
| Language | MRR | MAP | R-Precision |
|---|---|---|---|
| Python | 0.8098 | 0.8098 | 0.7520 |
| Java | 0.6437 | 0.6437 | 0.5480 |
| JavaScript | 0.5928 | 0.5928 | 0.4880 |
| PHP | 0.7512 | 0.7512 | 0.6710 |
| Ruby | 0.7188 | 0.7188 | 0.6310 |
| Go | 0.5358 | 0.5358 | 0.4320 |
For comparison, Salesforce/codet5p-220m-bimodal generally achieves higher per-language search accuracy than CodeMorph-ModernBERT:
| Language | MRR | MAP | R-Precision |
|---|---|---|---|
| Python | 0.8322 | 0.8322 | 0.7660 |
| Java | 0.8886 | 0.8886 | 0.8390 |
| JavaScript | 0.7611 | 0.7611 | 0.6710 |
| PHP | 0.8985 | 0.8985 | 0.8530 |
| Ruby | 0.7635 | 0.7635 | 0.6740 |
| Go | 0.8127 | 0.8127 | 0.7260 |
However, on the `google/code_x_glue_tc_nl_code_search_adv` dataset (test split), CodeMorph-ModernBERT outperforms it:
| Model | MRR | MAP | R-Precision |
|---|---|---|---|
| Shuu12121/CodeMorph-ModernBERT | 0.6107 | 0.6107 | 0.5038 |
| Salesforce/codet5p-220m-bimodal | 0.5326 | 0.5326 | 0.4208 |
This suggests that CodeMorph-ModernBERT may have an advantage on more challenging Python code search tasks that demand stronger generalization.
## 📄 License
This model is provided under the Apache-2.0 license.
## Contact Information
If you have any questions about this model, please contact us at the following email address: shun0212114@outlook.jp