CodeBERTa
This is an unofficial reupload of huggingface/CodeBERTa-small-v1 in the SafeTensors format using transformers 4.41.1. The goal of this reupload is to prevent older models that are still relevant baselines from becoming stale as a result of changes in Hugging Face. Additionally, minor corrections, such as the model max length configuration, may be included.
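For reference, the checkpoint loads like any other transformers model. A minimal sketch follows; the repository id below is the original one, so substitute the id of this reupload to load the SafeTensors weights:

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Substitute this reupload's repository id here to get the SafeTensors weights.
repo_id = "huggingface/CodeBERTa-small-v1"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)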
Original model card below:
Quick Start
Masked Language Modeling Prediction
Basic Usage
from transformers import pipeline

# Build a fill-mask pipeline backed by CodeBERTa (model and matching tokenizer).
fill_mask = pipeline(
    "fill-mask",
    model="huggingface/CodeBERTa-small-v1",
    tokenizer="huggingface/CodeBERTa-small-v1"
)

# PHP snippet with a single <mask> token for the model to fill in.
PHP_CODE = """
public static <mask> set(string $key, $value) {
    if (!in_array($key, self::$allowedKeys)) {
        throw new \InvalidArgumentException('Invalid key given');
    }
    self::$storedValues[$key] = $value;
}
""".lstrip()

fill_mask(PHP_CODE)
Results:
' function', 'function', ' void', ' def', ' final'
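Each call returns a list of candidate completions. A minimal sketch of inspecting them, using the standard fill-mask pipeline output keys:

# Each prediction is a dict with "score", "token", "token_str" and "sequence".
for prediction in fill_mask(PHP_CODE):
    print(f'{prediction["token_str"]!r}: {prediction["score"]:.3f}')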
Advanced Usage
# Python snippet with a <mask> token inside a type annotation.
PYTHON_CODE = """
def pipeline(
    task: str,
    model: Optional = None,
    framework: Optional[<mask>] = None,
    **kwargs
) -> Pipeline:
    pass
""".lstrip()
fill_mask(PYTHON_CODE)
Results:
'framework', 'Framework', ' framework', 'None', 'str'
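The fill-mask pipeline also accepts a top_k argument to look further down the candidate list; for example:

# Request the ten highest-scoring completions instead of the default five.
fill_mask(PYTHON_CODE, top_k=10)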
fill_mask("My name is <mask>.")
For the downstream task of programming language identification, see the model card for huggingface/CodeBERTa-language-id.
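As a rough sketch of querying that checkpoint (assuming, as its name suggests, that it exposes a standard sequence-classification head usable through the generic text-classification pipeline):

from transformers import pipeline

# Assumption: CodeBERTa-language-id is a sequence-classification model,
# so the generic text-classification pipeline can load it.
language_id = pipeline(
    "text-classification",
    model="huggingface/CodeBERTa-language-id",
    tokenizer="huggingface/CodeBERTa-language-id"
)

language_id("def add(a, b):\n    return a + b")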
Features
- CodeBERTa is a RoBERTa-like model trained on the CodeSearchNet dataset from GitHub.
- Supported languages: "go", "java", "javascript", "php", "python", "ruby".
- Tokenizer: a byte-level BPE tokenizer trained on the corpus using Hugging Face tokenizers. It encodes the code corpus efficiently, producing sequences 33% to 50% shorter than the same corpus tokenized by gpt2/roberta (see the tokenizer comparison sketch after this list).
- Model: a 6-layer, 84M-parameter, RoBERTa-like Transformer model (same number of layers and heads as DistilBERT), initialized from default settings and trained from scratch on the full corpus (~2M functions) for 5 epochs.
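A minimal sketch of checking the tokenizer claim on a single snippet; the comparison against roberta-base is an illustration, not part of the original setup:

from transformers import AutoTokenizer

code_snippet = "def add(a, b):\n    return a + b"

# CodeBERTa's byte-level BPE tokenizer, trained on the CodeSearchNet corpus.
code_tok = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
# Generic roberta-base tokenizer for comparison.
nl_tok = AutoTokenizer.from_pretrained("roberta-base")

print(len(code_tok.tokenize(code_snippet)), "tokens with CodeBERTa")
print(len(nl_tok.tokenize(code_snippet)), "tokens with roberta-base")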
TensorBoard for this training ⤵️
Documentation
Information Table

| Property | Details |
|----------|---------|
| Model Type | RoBERTa-like Transformer model |
| Training Data | CodeSearchNet dataset |
License
No license information was provided in the original model card.
Technical Details
CodeBERTa is designed to handle code-related tasks effectively. It is trained on a large code corpus, which enables it to capture the syntactic and semantic patterns of programming languages. The byte-level BPE tokenizer encodes code efficiently, and the 6-layer Transformer architecture allows the model to learn complex relationships in the code data.
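As an illustration of using the encoder to obtain code representations (a sketch; the mean-pooling step is a common convention, not something specified in the original card):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
model = AutoModel.from_pretrained("huggingface/CodeBERTa-small-v1")

# Encode a function and mean-pool the last hidden states into one vector.
inputs = tokenizer("def add(a, b):\n    return a + b", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, hidden_size)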
CodeSearchNet Citation
@article{husain_codesearchnet_2019,
  title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
  shorttitle = {{CodeSearchNet} {Challenge}},
  url = {http://arxiv.org/abs/1909.09436},
  urldate = {2020-03-12},
  journal = {arXiv:1909.09436 [cs, stat]},
  author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  month = sep,
  year = {2019},
  note = {arXiv: 1909.09436},
}