---
license: apache-2.0
datasets:
- Shuu12121/rust-codesearch-dataset-open
- Shuu12121/java-codesearch-dataset-open
- code-search-net/code_search_net
- google/code_x_glue_ct_code_to_text
language:
- en
pipeline_tag: sentence-similarity
tags:
- code
- code-search
- retrieval
- sentence-similarity
- bert
- transformers
- deep-learning
- machine-learning
- nlp
- programming
- multi-language
- rust
- python
- java
- javascript
- php
- ruby
- go
---
# 🦉 CodeModernBERT-Owl

## 概要 / Overview

**🦉 CodeModernBERT-Owl: A high-accuracy model for code search & code understanding**
CodeModernBERT-Owl is a model pretrained from scratch for code search and code understanding tasks. Compared to previous versions such as CodeHawks-ModernBERT and CodeMorph-ModernBERT, it adds support for Rust and improves search accuracy for Python, PHP, Java, JavaScript, Go, and Ruby.
## 🛠 主な特徴 / Key Features

- ✅ Supports long sequences of up to 2,048 tokens (vs. the 512-token limit of models such as Microsoft's CodeBERT and GraphCodeBERT)
- ✅ Optimized for code search, code understanding, and code clone detection
- ✅ Fine-tuned on open-source GitHub repositories (Java, Rust)
- ✅ Achieves the highest accuracy in the CodeHawks/CodeMorph series
- ✅ Multi-language support: Python, PHP, Java, JavaScript, Go, Ruby, and Rust
## 📊 モデルパラメータ / Model Parameters

| パラメータ / Parameter | 値 / Value |
|---|---|
| vocab_size | 50,004 |
| hidden_size | 768 |
| num_hidden_layers | 12 |
| num_attention_heads | 12 |
| intermediate_size | 3,072 |
| max_position_embeddings | 2,048 |
| type_vocab_size | 2 |
| hidden_dropout_prob | 0.1 |
| attention_probs_dropout_prob | 0.1 |
| local_attention_window | 128 |
| rope_theta | 160,000 |
| local_attention_rope_theta | 10,000 |
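These values can be cross-checked against the published configuration; a minimal sketch using `AutoConfig`:

```python
from transformers import AutoConfig

# Read the published config and confirm it matches the table above.
config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Owl")
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.max_position_embeddings)  # 2048
```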
## 💻 モデルの使用方法 / How to Use

This model can be loaded with the Hugging Face Transformers library.

⚠️ Requires `transformers >= 4.48.0`.

🔗 Colab Demo (replace the model name with "CodeModernBERT-Owl")

### モデルのロード / Load the Model
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Owl")
```
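If you only need encoder outputs (e.g., for embeddings), the base model can alternatively be loaded without the masked-LM head via `AutoModel`; a minimal sketch:

```python
from transformers import AutoModel

# Loads only the encoder; outputs.last_hidden_state is then available
# directly, without going through the MLM head.
encoder = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Owl")
```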
### コード埋め込みの取得 / Get Code Embeddings

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def get_embedding(text, model, tokenizer, device="cuda"):
    # Tokenize; long inputs are truncated to keep the example light.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    # ModernBERT does not use token_type_ids.
    if "token_type_ids" in inputs:
        inputs.pop("token_type_ids")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        # Call the base encoder (model.model) rather than the MLM head.
        outputs = model.model(**inputs)
    # Use the first ([CLS]) token as the sequence embedding.
    return outputs.last_hidden_state[:, 0, :]

embedding = get_embedding("def my_function(): pass", model, tokenizer, device=device)
print(embedding.shape)  # torch.Size([1, 768])
```
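These embeddings can then be compared with cosine similarity to rank code snippets against a natural-language query. A minimal sketch; the query and candidate snippets below are invented for illustration:

```python
import torch
import torch.nn.functional as F

# Illustrative query and candidates (not from the evaluation data).
query_emb = get_embedding("read a file and return its contents", model, tokenizer, device=device)
candidates = [
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
    "def add(a, b):\n    return a + b",
]
cand_embs = torch.cat([get_embedding(c, model, tokenizer, device=device) for c in candidates])
scores = F.cosine_similarity(query_emb, cand_embs)  # one score per candidate
print(candidates[scores.argmax().item()])
```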
## 🔍 評価結果 / Evaluation Results

### データセット / Dataset

- 📌 Tested on `code_x_glue_ct_code_to_text` with a candidate pool size of 100.
- 📌 Rust-specific evaluations were conducted on `Shuu12121/rust-codesearch-dataset-open`.
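The comparison below uses a 100-candidate pool; a common protocol for this setup is MRR, sketched here under the assumption of one matching snippet per query and cosine-similarity scoring:

```python
import torch
import torch.nn.functional as F

def mrr(query_embs: torch.Tensor, code_embs: torch.Tensor) -> float:
    # query_embs, code_embs: (N, D); row i of each is a true (query, code) pair,
    # and all N code snippets act as the candidate pool (here N = 100).
    sims = F.cosine_similarity(query_embs.unsqueeze(1), code_embs.unsqueeze(0), dim=-1)
    # Rank of the true pair = 1 + number of candidates scored strictly higher.
    ranks = (sims > sims.diagonal().unsqueeze(1)).sum(dim=1) + 1
    return (1.0 / ranks.float()).mean().item()
```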
### 📈 主要な評価指標の比較(同一シード値)/ Key Evaluation Metrics (Same Seed)

| 言語 / Language | CodeModernBERT-Owl | CodeHawks-ModernBERT | Salesforce CodeT5+ | Microsoft CodeBERT | GraphCodeBERT |
|---|---|---|---|---|---|
| Python | 0.8793 | 0.8551 | 0.8266 | 0.5243 | 0.5493 |
| Java | 0.8880 | 0.7971 | 0.8867 | 0.3134 | 0.5879 |
| JavaScript | 0.8423 | 0.7634 | 0.7628 | 0.2694 | 0.5051 |
| PHP | 0.9129 | 0.8578 | 0.9027 | 0.2642 | 0.6225 |
| Ruby | 0.8038 | 0.7469 | 0.7568 | 0.3318 | 0.5876 |
| Go | 0.9386 | 0.9043 | 0.8117 | 0.3262 | 0.4243 |
- ✅ Achieves the highest accuracy in all evaluated languages.
- ✅ Java accuracy improved significantly thanks to additional fine-tuning on GitHub data.
- ✅ Outperforms previous models, especially in PHP and Go.
### 📊 Rust(独自データセット)/ Rust Performance (Custom Dataset)

| 指標 / Metric | CodeModernBERT-Owl |
|---|---|
| MRR | 0.7940 |
| MAP | 0.7940 |
| R-Precision | 0.7173 |
### 📌 K別評価指標 / Evaluation Metrics by K

| K | Recall@K | Precision@K | NDCG@K | F1@K | Success Rate@K | Query Coverage@K |
|---|---|---|---|---|---|---|
| 1 | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 |
| 5 | 0.8913 | 0.7852 | 0.8118 | 0.8132 | 0.8913 | 0.8913 |
| 10 | 0.9333 | 0.7908 | 0.8254 | 0.8230 | 0.9333 | 0.9333 |
| 50 | 0.9887 | 0.7938 | 0.8383 | 0.8288 | 0.9887 | 0.9887 |
| 100 | 1.0000 | 0.7940 | 0.8401 | 0.8291 | 1.0000 | 1.0000 |
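Recall@K, Success Rate@K, and Query Coverage@K coincide in every row, which is what you would expect when each query has exactly one relevant snippet. Under that assumption, a minimal sketch of Recall@K:

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_embs: torch.Tensor, code_embs: torch.Tensor, k: int) -> float:
    # With one relevant snippet per query (row i pairs with row i),
    # Recall@K is the fraction of queries whose true pair lands in the top K.
    sims = F.cosine_similarity(query_embs.unsqueeze(1), code_embs.unsqueeze(0), dim=-1)
    topk = sims.topk(k, dim=1).indices                 # (N, K) candidate indices
    targets = torch.arange(sims.size(0)).unsqueeze(1)  # (N, 1) true indices
    return (topk == targets).any(dim=1).float().mean().item()
```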
## 🔁 別のおすすめモデル / Recommended Alternative Models

1. [CodeSearch-ModernBERT-Owl🦉](https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Owl)
   If you need a model that is more specialized for code search, this model is highly recommended.
2. [CodeModernBERT-Snake🐍](https://huggingface.co/Shuu12121/CodeModernBERT-Snake)
   If you need a pretrained model that supports longer sequences or has a smaller size, this model is ideal.
   - Maximum sequence length: 8,192 tokens
   - Smaller model size: ~75M parameters
3. [CodeSearch-ModernBERT-Snake🐍](https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Snake)
   If you want a model that combines a long sequence length with code-search specialization, this is the best choice.
   - Maximum sequence length: 8,192 tokens
   - High code search performance
## 📝 結論 / Conclusion

- ✅ Top performance in all evaluated languages
- ✅ Rust support successfully added through dataset augmentation
- ✅ Further performance improvements possible with better datasets
## 📜 ライセンス / License

📄 Apache-2.0

## 📧 連絡先 / Contact

📩 For any questions, please contact:
📧 shun0212114@outlook.jp