# 🦉 CodeModernBERT-Owl-3.0
CodeModernBERT-Owl-3.0 is the final pre-trained version of the multilingual long-context encoder model in the CodeModernBERT series. It's optimized for downstream code-related tasks like code search, code summarization, bug repair, and representation learning. This model builds on the pretraining checkpoint CodeModernBERT-Owl-3.0-Pre and is further pre-trained to better capture structural patterns and semantics in source code across multiple programming languages.
## 🚀 Quick Start
This section provides a high-level overview of what the model can do and how it can benefit users in code-related tasks.
## ✨ Features
- ✅ 2048-token context window for long code understanding
- ✅ Trained on 11.2M functions across 8 programming languages
- ✅ Fine-tuned for downstream usability
- ✅ Ideal for code search, semantic embedding, summarization, and cloze-style bug repair
- ✅ Multilingual support: Python, JavaScript, Java, TypeScript, PHP, Go, Ruby, Rust, and more
## 🔧 Technical Details
### 🔧 Architecture
- Base: ModernBERT-style encoder
- Hidden size: 768
- Layers: 12
- Attention heads: 12
- Parameters: ~150M
- Pretraining: Masked Language Modeling (MLM)
- Fine-tuning: Domain-specific code tasks
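These hyperparameters can be checked against the published configuration on the Hub. A minimal sketch — the attribute names follow the standard Hugging Face config convention and are an assumption, not taken from this card:

```python
from transformers import AutoConfig

# Load the published configuration from the Hugging Face Hub
config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Owl-3.0")

print(config.model_type)               # expected: "modernbert"
print(config.hidden_size)              # expected: 768
print(config.num_hidden_layers)        # expected: 12
print(config.num_attention_heads)      # expected: 12
print(config.max_position_embeddings)  # expected: 2048
```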
## 💻 Usage Examples
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl-3.0")
model = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Owl-3.0")

code = "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)"
inputs = tokenizer(code, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, masking out padding positions
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

embeddings = mean_pooling(outputs, inputs["attention_mask"])  # shape: (batch, hidden_size)
```
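Building on the helper above, here is a hedged sketch of embedding-based code search; the query string and the second candidate are illustrative, and cosine similarity over normalized embeddings is one common scoring choice:

```python
import torch.nn.functional as F

query = "compute the factorial of a number"
candidates = [code, "def add(a, b):\n    return a + b"]

# Embed query and candidates with the same model and pooling strategy
batch = tokenizer([query] + candidates, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    out = model(**batch)
emb = F.normalize(mean_pooling(out, batch["attention_mask"]), dim=-1)

# Cosine similarity between the query (row 0) and each candidate
scores = emb[0] @ emb[1:].T
print(scores)  # the higher-scoring candidate is the better match
```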
### Advanced Usage
```python
from transformers import pipeline

# Cloze-style prediction of a masked token in code
fill_mask = pipeline(
    "fill-mask",
    model="Shuu12121/CodeModernBERT-Owl-3.0",
    tokenizer="Shuu12121/CodeModernBERT-Owl-3.0",
)
fill_mask("def square(x): return x * <mask>")
```
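The literal mask string must match the tokenizer's mask token. If `<mask>` raises an error, a safer variant (an assumption, not from the original card) builds the prompt from the tokenizer itself:

```python
# Use the tokenizer's own mask token rather than hard-coding it
fill_mask(f"def square(x): return x * {fill_mask.tokenizer.mask_token}")
```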
## 📚 Documentation
### 📊 MRR Comparison by Language (Mean Pooling)
- The experiment was conducted on the CodeSearchNet test split.
- The candidate pool size was fixed at 100 for all evaluations.
- Evaluation method: mean pooling of the model embeddings (a sketch of this protocol follows the table below).
| Language   | CodeModernBERT-Owl-3.0 | CodeT5+ | GraphCodeBERT | CodeBERTa-small | CodeBERT |
|------------|------------------------|---------|---------------|-----------------|----------|
| Python     | 0.8814                 | 0.8048  | 0.3496        | 0.6123          | 0.0927   |
| Java       | 0.8673                 | 0.7853  | 0.3299        | 0.4738          | 0.0816   |
| JavaScript | 0.8805                 | 0.7111  | 0.2581        | 0.3593          | 0.0692   |
| PHP        | 0.8788                 | 0.7893  | 0.2507        | 0.4533          | 0.0623   |
| Ruby       | 0.8805                 | 0.7201  | 0.3186        | 0.4418          | 0.0762   |
| Go         | 0.8782                 | 0.7577  | 0.4453        | 0.5338          | 0.0856   |
✅ **CodeModernBERT-Owl-3.0 (Mean Pooling)** not only achieves the highest MRR across all languages in the CodeSearchNet test split but also demonstrates remarkable cross-language consistency. This balanced performance makes it particularly suitable for multilingual code search and understanding tasks where uniform quality across programming languages is critical.
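For concreteness, here is a minimal sketch of the evaluation protocol described above, under stated assumptions: `query_embs` and `code_embs` are aligned tensors (row *i* of each is a true docstring/function pair) produced with the mean-pooling recipe from Basic Usage, and the 100-candidate pool is simulated by random distractor sampling, which may differ from the exact pool construction used for the table:

```python
import torch
import torch.nn.functional as F

def mrr_fixed_pool(query_embs: torch.Tensor, code_embs: torch.Tensor, pool_size: int = 100) -> float:
    """Mean Reciprocal Rank with a fixed-size candidate pool per query."""
    query_embs = F.normalize(query_embs, dim=-1)
    code_embs = F.normalize(code_embs, dim=-1)
    n = query_embs.size(0)
    reciprocal_ranks = []
    for i in range(n):
        # Pool = the true match (pool position 0) plus pool_size - 1 random distractors
        perm = torch.randperm(n)
        distractors = perm[perm != i][: pool_size - 1]
        pool = torch.cat([torch.tensor([i]), distractors])
        scores = query_embs[i] @ code_embs[pool].T  # cosine similarities
        rank = (scores.argsort(descending=True) == 0).nonzero().item() + 1
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / n
```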
## 📋 Model Information

| Property | Details |
|----------|---------|
| Model Type | modernbert |
| Training Data | code-search-net/code_search_net, Shuu12121/python-treesitter-filtered-v5, Shuu12121/javascript-treesitter-filtered-v5, Shuu12121/java-treesitter-filtered-v5, Shuu12121/typescript-treesitter-filtered-v5, Shuu12121/php-treesitter-filtered-v5, Shuu12121/go-treesitter-filtered-v5, Shuu12121/ruby-treesitter-filtered-v5, Shuu12121/rust-treesitter-filtered-v5 |
| Num Parameters | ~150M |
| Max Sequence Length | 2048 tokens |
| Training Corpus Size | 11,257,713 functions |
## 📦 Training Data
- Size: 11,257,713 function-level code snippets
- Extraction: functions parsed out of source files with Tree-sitter (a sketch of this step follows the list)
- Sources: CodeSearchNet, custom GitHub repositories, and filtered multilingual corpora
- Languages: Python, JavaScript, Java, TypeScript, PHP, Go, Ruby, Rust
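As an illustration of Tree-sitter-based function extraction, here is a hedged sketch for Python using the `tree-sitter` and `tree-sitter-python` packages. This is not the authors' actual pipeline, and the API shown follows py-tree-sitter >= 0.22 (older versions use `parser.set_language` instead):

```python
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

# Build a Python parser (py-tree-sitter >= 0.22 style)
PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def extract_functions(source: str) -> list[str]:
    """Return the source text of every function definition in a Python file."""
    tree = parser.parse(source.encode("utf-8"))
    functions = []
    stack = [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type == "function_definition":
            functions.append(node.text.decode("utf-8"))
        stack.extend(node.children)
    return functions

print(extract_functions("def f(x):\n    return x * 2\n"))
```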
## 🔗 Related Models
## 📄 License
Apache License 2.0
## 🧑‍💻 Author
Developed by Shuu12121
## ⚠️ Important Note
Use mean pooling over the token embeddings (as in the `mean_pooling` helper under Basic Usage) to obtain fixed-length embeddings.
## 💡 Usage Tip
You can further fine-tune this model on your own domain-specific tasks with the Hugging Face `Trainer` or `accelerate`.
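As a hedged illustration, here is a minimal `Trainer` fine-tuning sketch for sequence classification. The dataset, label count, and hyperparameters are placeholders chosen only so the snippet runs; they are not recommendations from the model authors:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl-3.0")
# num_labels=2 is a placeholder for a binary classification task
model = AutoModelForSequenceClassification.from_pretrained(
    "Shuu12121/CodeModernBERT-Owl-3.0", num_labels=2
)

# Placeholder dataset: any dataset with "text" and "label" columns works here
dataset = load_dataset("imdb", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="owl-finetuned",
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```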