---
license: apache-2.0
datasets:
- Shuu12121/rust-codesearch-dataset-open
- Shuu12121/java-codesearch-dataset-open
- code-search-net/code_search_net
- google/code_x_glue_ct_code_to_text
language:
- en
pipeline_tag: sentence-similarity
tags:
- code
- code-search
- retrieval
- sentence-similarity
- bert
- transformers
- deep-learning
- machine-learning
- nlp
- programming
- multi-language
- rust
- python
- java
- javascript
- php
- ruby
- go
---
# 🦉 CodeModernBERT-Owl

## 概要 / Overview

**🦉 CodeModernBERT-Owl: A high-accuracy model for code search & code understanding**
CodeModernBERT-Owl is a model pretrained from scratch for code search and code understanding tasks. Compared to previous versions such as CodeHawks-ModernBERT and CodeMorph-ModernBERT, it adds support for Rust and improves search accuracy for Python, PHP, Java, JavaScript, Go, and Ruby.
## 🛠 主な特徴 / Key Features

- ✅ Supports long sequences of up to 2,048 tokens (vs. the 512-token limit of models such as Microsoft's CodeBERT and GraphCodeBERT)
- ✅ Optimized for code search, code understanding, and code clone detection
- ✅ Fine-tuned on open-source GitHub repositories (Java, Rust)
- ✅ Achieves the highest accuracy in the CodeHawks/CodeMorph series
- ✅ Multi-language support: Python, PHP, Java, JavaScript, Go, Ruby, and Rust
## 📊 モデルパラメータ / Model Parameters

| パラメータ / Parameter | 値 / Value |
|---|---|
| vocab_size | 50,004 |
| hidden_size | 768 |
| num_hidden_layers | 12 |
| num_attention_heads | 12 |
| intermediate_size | 3,072 |
| max_position_embeddings | 2,048 |
| type_vocab_size | 2 |
| hidden_dropout_prob | 0.1 |
| attention_probs_dropout_prob | 0.1 |
| local_attention_window | 128 |
| rope_theta | 160,000 |
| local_attention_rope_theta | 10,000 |
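These values can be cross-checked against the published configuration; a minimal sketch using `AutoConfig`:

```python
from transformers import AutoConfig

# Read the published config and confirm it matches the table above.
config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Owl")
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.max_position_embeddings)  # 2048
```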
## 💻 モデルの使用方法 / How to Use

This model can be loaded with the Hugging Face Transformers library.

⚠️ Requires `transformers >= 4.48.0`.

🔗 Colab Demo (replace the model name with "CodeModernBERT-Owl")

### モデルのロード / Load the Model
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Owl")
```
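If you only need encoder outputs (e.g., for embeddings), the base model can alternatively be loaded without the masked-LM head via `AutoModel`; a minimal sketch:

```python
from transformers import AutoModel

# Loads only the encoder; outputs.last_hidden_state is then available
# directly, without going through the MLM head.
encoder = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Owl")
```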
### コード埋め込みの取得 / Get Code Embeddings

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def get_embedding(text, model, tokenizer, device="cuda"):
    # Tokenize; long inputs are truncated to keep the example light.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    # ModernBERT does not use token_type_ids.
    if "token_type_ids" in inputs:
        inputs.pop("token_type_ids")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        # Call the base encoder (model.model) rather than the MLM head.
        outputs = model.model(**inputs)
    # Use the first ([CLS]) token as the sequence embedding.
    return outputs.last_hidden_state[:, 0, :]

embedding = get_embedding("def my_function(): pass", model, tokenizer, device=device)
print(embedding.shape)  # torch.Size([1, 768])
```
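These embeddings can then be compared with cosine similarity to rank code snippets against a natural-language query. A minimal sketch; the query and candidate snippets below are invented for illustration:

```python
import torch
import torch.nn.functional as F

# Illustrative query and candidates (not from the evaluation data).
query_emb = get_embedding("read a file and return its contents", model, tokenizer, device=device)
candidates = [
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
    "def add(a, b):\n    return a + b",
]
cand_embs = torch.cat([get_embedding(c, model, tokenizer, device=device) for c in candidates])
scores = F.cosine_similarity(query_emb, cand_embs)  # one score per candidate
print(candidates[scores.argmax().item()])
```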
## 🔍 評価結果 / Evaluation Results

### データセット / Dataset

- 📌 Tested on `code_x_glue_ct_code_to_text` with a candidate pool size of 100.
- 📌 Rust-specific evaluations were conducted on `Shuu12121/rust-codesearch-dataset-open`.
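The comparison below uses a 100-candidate pool; a common protocol for this setup is MRR, sketched here under the assumption of one matching snippet per query and cosine-similarity scoring:

```python
import torch
import torch.nn.functional as F

def mrr(query_embs: torch.Tensor, code_embs: torch.Tensor) -> float:
    # query_embs, code_embs: (N, D); row i of each is a true (query, code) pair,
    # and all N code snippets act as the candidate pool (here N = 100).
    sims = F.cosine_similarity(query_embs.unsqueeze(1), code_embs.unsqueeze(0), dim=-1)
    # Rank of the true pair = 1 + number of candidates scored strictly higher.
    ranks = (sims > sims.diagonal().unsqueeze(1)).sum(dim=1) + 1
    return (1.0 / ranks.float()).mean().item()
```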
### 📈 主要な評価指標の比較(同一シード値)/ Key Evaluation Metrics (Same Seed)

| 言語 / Language | CodeModernBERT-Owl | CodeHawks-ModernBERT | Salesforce CodeT5+ | Microsoft CodeBERT | GraphCodeBERT |
|---|---|---|---|---|---|
| Python | 0.8793 | 0.8551 | 0.8266 | 0.5243 | 0.5493 |
| Java | 0.8880 | 0.7971 | 0.8867 | 0.3134 | 0.5879 |
| JavaScript | 0.8423 | 0.7634 | 0.7628 | 0.2694 | 0.5051 |
| PHP | 0.9129 | 0.8578 | 0.9027 | 0.2642 | 0.6225 |
| Ruby | 0.8038 | 0.7469 | 0.7568 | 0.3318 | 0.5876 |
| Go | 0.9386 | 0.9043 | 0.8117 | 0.3262 | 0.4243 |
- ✅ Achieves the highest accuracy in all evaluated languages.
- ✅ Java accuracy improved significantly thanks to additional fine-tuning on GitHub data.
- ✅ Outperforms previous models, especially in PHP and Go.
### 📊 Rust(独自データセット)/ Rust Performance (Custom Dataset)

| 指標 / Metric | CodeModernBERT-Owl |
|---|---|
| MRR | 0.7940 |
| MAP | 0.7940 |
| R-Precision | 0.7173 |
### 📌 K別評価指標 / Evaluation Metrics by K

| K | Recall@K | Precision@K | NDCG@K | F1@K | Success Rate@K | Query Coverage@K |
|---|---|---|---|---|---|---|
| 1 | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 |
| 5 | 0.8913 | 0.7852 | 0.8118 | 0.8132 | 0.8913 | 0.8913 |
| 10 | 0.9333 | 0.7908 | 0.8254 | 0.8230 | 0.9333 | 0.9333 |
| 50 | 0.9887 | 0.7938 | 0.8383 | 0.8288 | 0.9887 | 0.9887 |
| 100 | 1.0000 | 0.7940 | 0.8401 | 0.8291 | 1.0000 | 1.0000 |
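Recall@K, Success Rate@K, and Query Coverage@K coincide in every row, which is what you would expect when each query has exactly one relevant snippet. Under that assumption, a minimal sketch of Recall@K:

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_embs: torch.Tensor, code_embs: torch.Tensor, k: int) -> float:
    # With one relevant snippet per query (row i pairs with row i),
    # Recall@K is the fraction of queries whose true pair lands in the top K.
    sims = F.cosine_similarity(query_embs.unsqueeze(1), code_embs.unsqueeze(0), dim=-1)
    topk = sims.topk(k, dim=1).indices                 # (N, K) candidate indices
    targets = torch.arange(sims.size(0)).unsqueeze(1)  # (N, 1) true indices
    return (topk == targets).any(dim=1).float().mean().item()
```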
## 🔁 別のおすすめモデル / Recommended Alternative Models

1. [CodeSearch-ModernBERT-Owl🦉](https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Owl)
   If you need a model that is more specialized for code search, this model is highly recommended.
2. [CodeModernBERT-Snake🐍](https://huggingface.co/Shuu12121/CodeModernBERT-Snake)
   If you need a pretrained model that supports longer sequences or has a smaller size, this model is ideal.
   - Maximum sequence length: 8,192 tokens
   - Smaller model size: ~75M parameters
3. [CodeSearch-ModernBERT-Snake🐍](https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Snake)
   If you want a model that combines a long sequence length with code-search specialization, this is the best choice.
   - Maximum sequence length: 8,192 tokens
   - High code search performance
## 📝 結論 / Conclusion

- ✅ Top performance in all evaluated languages
- ✅ Rust support successfully added through dataset augmentation
- ✅ Further performance improvements possible with better datasets
## 📜 ライセンス / License

📄 Apache-2.0

## 📧 連絡先 / Contact

📩 For any questions, please contact:
📧 shun0212114@outlook.jp