# 🦉 CodeModernBERT-Owl-3.0
CodeModernBERT-Owl-3.0 is the final pre-trained version of the multilingual long-context encoder model in the CodeModernBERT series. It's optimized for downstream code-related tasks like code search, code summarization, bug repair, and representation learning. This model builds on the pretraining checkpoint CodeModernBERT-Owl-3.0-Pre and is further pre-trained to better capture structural patterns and semantics in source code across multiple programming languages.
## 🚀 Quick Start
This section provides a high-level overview of what the model can do and how it can benefit users in code-related tasks.
## ✨ Features
- ✅ 2048-token context window for long code understanding
- ✅ Trained on 11.2M functions across 8 programming languages
- ✅ Fine-tuned for downstream usability
- ✅ Ideal for code search, semantic embedding, summarization, and cloze-style bug repair
- ✅ Multilingual support: Python, JavaScript, Java, TypeScript, PHP, Go, Ruby, Rust, and more
## 🔧 Technical Details
### 🔧 Architecture
- Base: ModernBERT-style encoder
- Hidden size: 768
- Layers: 12
- Attention heads: 12
- Parameters: ~150M
- Pretraining: Masked Language Modeling (MLM)
- Fine-tuning: Domain-specific code tasks
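These hyperparameters can be checked against the published configuration on the Hub. A minimal sketch — the attribute names follow the standard Hugging Face config convention and are an assumption, not taken from this card:

```python
from transformers import AutoConfig

# Load the published configuration from the Hugging Face Hub
config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Owl-3.0")

print(config.model_type)               # expected: "modernbert"
print(config.hidden_size)              # expected: 768
print(config.num_hidden_layers)        # expected: 12
print(config.num_attention_heads)      # expected: 12
print(config.max_position_embeddings)  # expected: 2048
```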
## 💻 Usage Examples
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl-3.0")
model = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Owl-3.0")

code = "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)"
inputs = tokenizer(code, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, masking out padding positions
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

embeddings = mean_pooling(outputs, inputs["attention_mask"])  # shape: (batch, hidden_size)
```
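Building on the helper above, here is a hedged sketch of embedding-based code search; the query string and the second candidate are illustrative, and cosine similarity over normalized embeddings is one common scoring choice:

```python
import torch.nn.functional as F

query = "compute the factorial of a number"
candidates = [code, "def add(a, b):\n    return a + b"]

# Embed query and candidates with the same model and pooling strategy
batch = tokenizer([query] + candidates, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    out = model(**batch)
emb = F.normalize(mean_pooling(out, batch["attention_mask"]), dim=-1)

# Cosine similarity between the query (row 0) and each candidate
scores = emb[0] @ emb[1:].T
print(scores)  # the higher-scoring candidate is the better match
```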
### Advanced Usage
```python
from transformers import pipeline

# Cloze-style prediction of a masked token in code
fill_mask = pipeline(
    "fill-mask",
    model="Shuu12121/CodeModernBERT-Owl-3.0",
    tokenizer="Shuu12121/CodeModernBERT-Owl-3.0",
)
fill_mask("def square(x): return x * <mask>")
```
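The literal mask string must match the tokenizer's mask token. If `<mask>` raises an error, a safer variant (an assumption, not from the original card) builds the prompt from the tokenizer itself:

```python
# Use the tokenizer's own mask token rather than hard-coding it
fill_mask(f"def square(x): return x * {fill_mask.tokenizer.mask_token}")
```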
## 📚 Documentation
### 📊 MRR Comparison by Language (Mean Pooling)
- The experiment was conducted on the CodeSearchNet test split.
- The candidate pool size was fixed at 100 for all evaluations.
- Evaluation method: mean pooling of the model embeddings (a sketch of this protocol follows the table below).
| Language   | CodeModernBERT-Owl-3.0 | CodeT5+ | GraphCodeBERT | CodeBERTa-small | CodeBERT |
|------------|------------------------|---------|---------------|-----------------|----------|
| Python     | 0.8814                 | 0.8048  | 0.3496        | 0.6123          | 0.0927   |
| Java       | 0.8673                 | 0.7853  | 0.3299        | 0.4738          | 0.0816   |
| JavaScript | 0.8805                 | 0.7111  | 0.2581        | 0.3593          | 0.0692   |
| PHP        | 0.8788                 | 0.7893  | 0.2507        | 0.4533          | 0.0623   |
| Ruby       | 0.8805                 | 0.7201  | 0.3186        | 0.4418          | 0.0762   |
| Go         | 0.8782                 | 0.7577  | 0.4453        | 0.5338          | 0.0856   |
✅ **CodeModernBERT-Owl-3.0 (Mean Pooling)** not only achieves the highest MRR across all languages in the CodeSearchNet test split but also demonstrates remarkable cross-language consistency. This balanced performance makes it particularly suitable for multilingual code search and understanding tasks where uniform quality across programming languages is critical.
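For concreteness, here is a minimal sketch of the evaluation protocol described above, under stated assumptions: `query_embs` and `code_embs` are aligned tensors (row *i* of each is a true docstring/function pair) produced with the mean-pooling recipe from Basic Usage, and the 100-candidate pool is simulated by random distractor sampling, which may differ from the exact pool construction used for the table:

```python
import torch
import torch.nn.functional as F

def mrr_fixed_pool(query_embs: torch.Tensor, code_embs: torch.Tensor, pool_size: int = 100) -> float:
    """Mean Reciprocal Rank with a fixed-size candidate pool per query."""
    query_embs = F.normalize(query_embs, dim=-1)
    code_embs = F.normalize(code_embs, dim=-1)
    n = query_embs.size(0)
    reciprocal_ranks = []
    for i in range(n):
        # Pool = the true match (pool position 0) plus pool_size - 1 random distractors
        perm = torch.randperm(n)
        distractors = perm[perm != i][: pool_size - 1]
        pool = torch.cat([torch.tensor([i]), distractors])
        scores = query_embs[i] @ code_embs[pool].T  # cosine similarities
        rank = (scores.argsort(descending=True) == 0).nonzero().item() + 1
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / n
```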
## 📋 Model Information

| Property | Details |
|----------|---------|
| Model Type | modernbert |
| Training Data | code-search-net/code_search_net, Shuu12121/python-treesitter-filtered-v5, Shuu12121/javascript-treesitter-filtered-v5, Shuu12121/java-treesitter-filtered-v5, Shuu12121/typescript-treesitter-filtered-v5, Shuu12121/php-treesitter-filtered-v5, Shuu12121/go-treesitter-filtered-v5, Shuu12121/ruby-treesitter-filtered-v5, Shuu12121/rust-treesitter-filtered-v5 |
| Num Parameters | ~150M |
| Max Sequence Length | 2048 tokens |
| Training Corpus Size | 11,257,713 functions |
## 📦 Training Data
- Size: 11,257,713 function-level code snippets
- Extraction: functions parsed out of source files with Tree-sitter (a sketch of this step follows the list)
- Sources: CodeSearchNet, custom GitHub repositories, and filtered multilingual corpora
- Languages: Python, JavaScript, Java, TypeScript, PHP, Go, Ruby, Rust
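As an illustration of Tree-sitter-based function extraction, here is a hedged sketch for Python using the `tree-sitter` and `tree-sitter-python` packages. This is not the authors' actual pipeline, and the API shown follows py-tree-sitter >= 0.22 (older versions use `parser.set_language` instead):

```python
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

# Build a Python parser (py-tree-sitter >= 0.22 style)
PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def extract_functions(source: str) -> list[str]:
    """Return the source text of every function definition in a Python file."""
    tree = parser.parse(source.encode("utf-8"))
    functions = []
    stack = [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type == "function_definition":
            functions.append(node.text.decode("utf-8"))
        stack.extend(node.children)
    return functions

print(extract_functions("def f(x):\n    return x * 2\n"))
```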
## 🔗 Related Models
## 📄 License
Apache License 2.0
## 🧑‍💻 Author
Developed by Shuu12121
## ⚠️ Important Note
Use mean pooling over the token embeddings (as in the `mean_pooling` helper under Basic Usage) to obtain fixed-length embeddings.
## 💡 Usage Tip
You can further fine-tune this model on your own domain-specific tasks with the Hugging Face `Trainer` or `accelerate`.
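As a hedged illustration, here is a minimal `Trainer` fine-tuning sketch for sequence classification. The dataset, label count, and hyperparameters are placeholders chosen only so the snippet runs; they are not recommendations from the model authors:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl-3.0")
# num_labels=2 is a placeholder for a binary classification task
model = AutoModelForSequenceClassification.from_pretrained(
    "Shuu12121/CodeModernBERT-Owl-3.0", num_labels=2
)

# Placeholder dataset: any dataset with "text" and "label" columns works here
dataset = load_dataset("imdb", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="owl-finetuned",
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```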