🚀 ModularStarEncoder-1B Pre-trained model
ModularStarEncoder-1B is an encoder pre-trained on The Stack v2. It is a modular pre-trained encoder with five exit points, allowing users to perform multiple-exit fine-tuning tailored to their downstream tasks. Built on StarCoder-2, its parameter count is reduced from 15B to 1B in bfloat16.
✨ Features
- Architecture: 36 hidden layers, each with 16 attention heads and 4 key-value heads using Grouped Query Attention (GQA). It employs Rotary Positional Encoding (RoPE) with a base period theta = 10^-6, a hidden dimensionality of 1024, and an intermediate size of 12,288 (see the config sketch after this list).
- Attention Mechanism: Replaces causal self-attention with bidirectional self-attention, and uses full attention rather than the sliding-window attention of StarCoder-2, for greater modularity and to avoid receptive-field constraints.
- Input Length: Extended the maximum input length to 2048 tokens, accommodating longer code snippets compared to previous code encoders like StarEncoder.
- Inference Efficiency: Integrated FlashAttention V2 for faster inference.
- Languages: Supports over 600 programming languages.
- Related Paper: [One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings](https://arxiv.org/abs/2503.03008)
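The architecture values above can also be read back from the released configuration. The following is a minimal sketch that assumes the config exposes Starcoder2-style field names (`hidden_size`, `num_hidden_layers`, `max_position_embeddings`, `num_key_value_heads`); the custom remote code shipped with the checkpoint may name them differently.

```python
# Minimal sketch: read the architecture hyperparameters from the released
# config. Field names are assumed to follow Starcoder2-style naming and may
# differ in the custom remote code shipped with the checkpoint.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("andreagurioli1995/ModularStarEncoder", trust_remote_code=True)

print(config.hidden_size)              # expected: 1024
print(config.num_hidden_layers)        # expected: 36
print(config.max_position_embeddings)  # expected: 2048
print(config.num_key_value_heads)      # expected: 4 (GQA)
```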
📦 Installation
ModularStarEncoder is loaded through the Hugging Face `transformers` library with `trust_remote_code=True`; installing `transformers` and PyTorch (e.g. `pip install transformers torch`) should be enough to run the examples below.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModel, AutoTokenizer

# Load the model (custom architecture, hence trust_remote_code) and tokenizer
model = AutoModel.from_pretrained("andreagurioli1995/ModularStarEncoder", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("andreagurioli1995/ModularStarEncoder")

code_snippet = "your code to embed here"

# Wrap the snippet between the separator and CLS tokens, as expected by the model
sentence = f"{tokenizer.sep_token}{code_snippet}{tokenizer.cls_token}"

tokenized_sentence = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=2048)
embedded_sentence = model(**tokenized_sentence)
```
Output Explanation
The output consists of six elements:
- `last_hidden_state`: the representation of the last hidden state from the model.
- `hidden_states`: raw representations from all the hidden states of the model, without pooling, normalization, or projection.
- `loss`: loss value if a ground truth is given (`None` at inference time).
- `prediction_logits`: prediction scores from the masked language modeling head.
- `seq_relationship_scores`: prediction scores of the in-context loss (concatenate multiple samples with the separator token if you want a meaningful score).
- `attentions`: attention scores from the encoder.
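One common way to turn these outputs into a fixed-size code embedding is to take the hidden state at the final position (the CLS token appended in the usage example) and L2-normalize it. The sketch below is such a recipe, not the card's official pooling; it assumes the output object exposes the fields above as attributes, that the input is a single unpadded sequence, and the intermediate layer index is purely illustrative.

```python
# Minimal pooling sketch (not the official recipe): derive fixed-size
# embeddings from `embedded_sentence` produced in the usage example above.
# Assumes a single, unpadded sequence, so the last position is the CLS token.
import torch.nn.functional as F

last_hidden = embedded_sentence.last_hidden_state         # [batch, seq_len, hidden]
embedding = F.normalize(last_hidden[:, -1, :], dim=-1)    # final-layer embedding

# Apply the same pooling to an intermediate hidden state to approximate an
# earlier exit; layer index 18 is illustrative, not one of the documented exits.
early_hidden = embedded_sentence.hidden_states[18]
early_embedding = F.normalize(early_hidden[:, -1, :], dim=-1)

print(embedding.shape, early_embedding.shape)
```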
📚 Documentation
Training Details
We pre-trained ModularStarEncoder with a batch size of 3.99M tokens for 245,000 training steps, processing 1T tokens. Pre-training and fine-tuning were conducted on 512 NVIDIA Ampere (64GB) GPUs using the Leonardo supercomputer, requiring 450,000 GPU working hours.
| Property | Details |
|----------|---------|
| Hidden size | 1024 |
| Max. position embeddings | 2048 |
| Num. of attention heads | 12 |
| Num. of key-value heads | 4 |
| Num. of hidden layers | 36 |
| Attention | GQA |
| Num. of parameters | ≈1B |
| Training tokens | ≈1T |
| Loss function | MLM + In-Context loss |
| Multi-layer loss | yes |
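As a quick sanity check of the ≈1B figure in the table, the parameter count can be computed directly from the loaded model; this is plain PyTorch and simply reuses the `model` object from the usage example above.

```python
# Sanity-check the ≈1B parameter figure from the table above, reusing the
# `model` loaded in the usage example.
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f}B parameters")
```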
📄 License
The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).
📚 Citation
```bibtex
@article{gurioli2025modeltrainallhierarchical,
  title={One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings},
  author={Andrea Gurioli and Federico Pennino and João Monteiro and Maurizio Gabbrielli},
  year={2025},
  eprint={2503.03008},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.03008},
}
```
For the version fine-tuned for code-to-code and text-to-code tasks, see [ModularStarEncoder-finetuned](https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned).