🚀 ModularStarEncoder-1B Pre-trained model
ModularStarEncoder-1B is an encoder pre-trained on The Stack v2. It is a modular pre-trained encoder with five exit points, allowing users to perform multiple-exit fine-tuning tailored to their downstream tasks. Built on StarCoder-2, its parameter count is reduced from 15B to 1B in bfloat16.
✨ Features
- Architecture: 36 hidden layers, each with 16 attention heads and 4 key-value heads using Grouped Query Attention (GQA). It employs Rotary Positional Encoding (RoPE) with a base period theta = 10^-6, a hidden dimensionality of 1024, and an intermediate size of 12,288 (see the config sketch after this list).
- Attention Mechanism: Replaces causal self-attention with bidirectional self-attention, and uses full attention rather than the sliding-window attention of StarCoder-2, for greater modularity and to avoid receptive-field constraints.
- Input Length: Extended the maximum input length to 2048 tokens, accommodating longer code snippets compared to previous code encoders like StarEncoder.
- Inference Efficiency: Integrated FlashAttention V2 for faster inference.
- Languages: Supports over 600 programming languages.
- Related Paper: [One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings](https://arxiv.org/abs/2503.03008)
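The architecture values above can also be read back from the released configuration. The following is a minimal sketch that assumes the config exposes Starcoder2-style field names (`hidden_size`, `num_hidden_layers`, `max_position_embeddings`, `num_key_value_heads`); the custom remote code shipped with the checkpoint may name them differently.

```python
# Minimal sketch: read the architecture hyperparameters from the released
# config. Field names are assumed to follow Starcoder2-style naming and may
# differ in the custom remote code shipped with the checkpoint.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("andreagurioli1995/ModularStarEncoder", trust_remote_code=True)

print(config.hidden_size)              # expected: 1024
print(config.num_hidden_layers)        # expected: 36
print(config.max_position_embeddings)  # expected: 2048
print(config.num_key_value_heads)      # expected: 4 (GQA)
```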
📦 Installation
ModularStarEncoder is loaded through the Hugging Face `transformers` library with `trust_remote_code=True`; installing `transformers` and PyTorch (e.g. `pip install transformers torch`) should be enough to run the examples below.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModel, AutoTokenizer

# Load the model (custom architecture, hence trust_remote_code) and tokenizer
model = AutoModel.from_pretrained("andreagurioli1995/ModularStarEncoder", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("andreagurioli1995/ModularStarEncoder")

code_snippet = "your code to embed here"

# Wrap the snippet between the separator and CLS tokens, as expected by the model
sentence = f"{tokenizer.sep_token}{code_snippet}{tokenizer.cls_token}"

tokenized_sentence = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=2048)
embedded_sentence = model(**tokenized_sentence)
```
Output Explanation
The output consists of six elements:
- `last_hidden_state`: the representation of the last hidden state from the model.
- `hidden_states`: raw representations from all the hidden states of the model, without pooling, normalization, or projection.
- `loss`: loss value if a ground truth is given (`None` at inference time).
- `prediction_logits`: prediction scores from the masked language modeling head.
- `seq_relationship_scores`: prediction scores of the in-context loss (concatenate multiple samples with the separator token if you want a meaningful score).
- `attentions`: attention scores from the encoder.
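One common way to turn these outputs into a fixed-size code embedding is to take the hidden state at the final position (the CLS token appended in the usage example) and L2-normalize it. The sketch below is such a recipe, not the card's official pooling; it assumes the output object exposes the fields above as attributes, that the input is a single unpadded sequence, and the intermediate layer index is purely illustrative.

```python
# Minimal pooling sketch (not the official recipe): derive fixed-size
# embeddings from `embedded_sentence` produced in the usage example above.
# Assumes a single, unpadded sequence, so the last position is the CLS token.
import torch.nn.functional as F

last_hidden = embedded_sentence.last_hidden_state         # [batch, seq_len, hidden]
embedding = F.normalize(last_hidden[:, -1, :], dim=-1)    # final-layer embedding

# Apply the same pooling to an intermediate hidden state to approximate an
# earlier exit; layer index 18 is illustrative, not one of the documented exits.
early_hidden = embedded_sentence.hidden_states[18]
early_embedding = F.normalize(early_hidden[:, -1, :], dim=-1)

print(embedding.shape, early_embedding.shape)
```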
📚 Documentation
Training Details
We pre-trained ModularStarEncoder with a batch size of 3.99M tokens for 245,000 training steps, processing 1T tokens. Pre-training and fine-tuning were conducted on 512 NVIDIA Ampere (64GB) GPUs using the Leonardo supercomputer, requiring 450,000 GPU working hours.
| Property | Details |
|----------|---------|
| Hidden size | 1024 |
| Max. position embeddings | 2048 |
| Num. of attention heads | 12 |
| Num. of key-value heads | 4 |
| Num. of hidden layers | 36 |
| Attention | GQA |
| Num. of parameters | ≈1B |
| Training tokens | ≈1T |
| Loss function | MLM + In-Context loss |
| Multi-layer loss | yes |
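As a quick sanity check of the ≈1B figure in the table, the parameter count can be computed directly from the loaded model; this is plain PyTorch and simply reuses the `model` object from the usage example above.

```python
# Sanity-check the ≈1B parameter figure from the table above, reusing the
# `model` loaded in the usage example.
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f}B parameters")
```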
📄 License
The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).
📚 Citation
```bibtex
@article{gurioli2025modeltrainallhierarchical,
  title={One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings},
  author={Andrea Gurioli and Federico Pennino and João Monteiro and Maurizio Gabbrielli},
  year={2025},
  eprint={2503.03008},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.03008},
}
```
For the version fine-tuned for code-to-code and text-to-code tasks, see [ModularStarEncoder-finetuned](https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned).