CodeBERTa
This is an unofficial reupload of huggingface/CodeBERTa-small-v1 in the SafeTensors format using transformers 4.41.1. The goal of this reupload is to prevent older models that are still relevant baselines from becoming stale as a result of changes in Hugging Face. Additionally, minor corrections, such as the model max length configuration, may be included.
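For reference, the checkpoint loads like any other transformers model. A minimal sketch follows; the repository id below is the original one, so substitute the id of this reupload to load the SafeTensors weights:

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Substitute this reupload's repository id here to get the SafeTensors weights.
repo_id = "huggingface/CodeBERTa-small-v1"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)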
Original model card below:
Quick Start
Masked Language Modeling Prediction
Basic Usage
from transformers import pipeline

# Build a fill-mask pipeline backed by CodeBERTa (model and matching tokenizer).
fill_mask = pipeline(
    "fill-mask",
    model="huggingface/CodeBERTa-small-v1",
    tokenizer="huggingface/CodeBERTa-small-v1"
)

# PHP snippet with a single <mask> token for the model to fill in.
PHP_CODE = """
public static <mask> set(string $key, $value) {
    if (!in_array($key, self::$allowedKeys)) {
        throw new \InvalidArgumentException('Invalid key given');
    }
    self::$storedValues[$key] = $value;
}
""".lstrip()

fill_mask(PHP_CODE)
Results:
' function', 'function', ' void', ' def', ' final'
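Each call returns a list of candidate completions. A minimal sketch of inspecting them, using the standard fill-mask pipeline output keys:

# Each prediction is a dict with "score", "token", "token_str" and "sequence".
for prediction in fill_mask(PHP_CODE):
    print(f'{prediction["token_str"]!r}: {prediction["score"]:.3f}')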
Advanced Usage
# Python snippet with a <mask> token inside a type annotation.
PYTHON_CODE = """
def pipeline(
    task: str,
    model: Optional = None,
    framework: Optional[<mask>] = None,
    **kwargs
) -> Pipeline:
    pass
""".lstrip()
fill_mask(PYTHON_CODE)
Results:
'framework', 'Framework', ' framework', 'None', 'str'
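The fill-mask pipeline also accepts a top_k argument to look further down the candidate list; for example:

# Request the ten highest-scoring completions instead of the default five.
fill_mask(PYTHON_CODE, top_k=10)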
fill_mask("My name is <mask>.")
For the downstream task of programming language identification, see the model card for huggingface/CodeBERTa-language-id.
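As a rough sketch of querying that checkpoint (assuming, as its name suggests, that it exposes a standard sequence-classification head usable through the generic text-classification pipeline):

from transformers import pipeline

# Assumption: CodeBERTa-language-id is a sequence-classification model,
# so the generic text-classification pipeline can load it.
language_id = pipeline(
    "text-classification",
    model="huggingface/CodeBERTa-language-id",
    tokenizer="huggingface/CodeBERTa-language-id"
)

language_id("def add(a, b):\n    return a + b")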
Features
- CodeBERTa is a RoBERTa-like model trained on the CodeSearchNet dataset from GitHub.
- Supported languages: "go", "java", "javascript", "php", "python", "ruby".
- Tokenizer: a byte-level BPE tokenizer trained on the corpus using Hugging Face tokenizers. It encodes the code corpus efficiently, producing sequences 33% to 50% shorter than the same corpus tokenized by gpt2/roberta (see the tokenizer comparison sketch after this list).
- Model: a 6-layer, 84M-parameter, RoBERTa-like Transformer model (same number of layers and heads as DistilBERT), initialized from default settings and trained from scratch on the full corpus (~2M functions) for 5 epochs.
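A minimal sketch of checking the tokenizer claim on a single snippet; the comparison against roberta-base is an illustration, not part of the original setup:

from transformers import AutoTokenizer

code_snippet = "def add(a, b):\n    return a + b"

# CodeBERTa's byte-level BPE tokenizer, trained on the CodeSearchNet corpus.
code_tok = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
# Generic roberta-base tokenizer for comparison.
nl_tok = AutoTokenizer.from_pretrained("roberta-base")

print(len(code_tok.tokenize(code_snippet)), "tokens with CodeBERTa")
print(len(nl_tok.tokenize(code_snippet)), "tokens with roberta-base")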
TensorBoard for this training ⤵️
Documentation
Information Table

| Property | Details |
|----------|---------|
| Model Type | RoBERTa-like Transformer model |
| Training Data | CodeSearchNet dataset |
License
No license information was provided in the original model card.
Technical Details
CodeBERTa is designed to handle code-related tasks effectively. It is trained on a large code corpus, which enables it to capture the syntactic and semantic patterns of programming languages. The byte-level BPE tokenizer encodes code efficiently, and the 6-layer Transformer architecture allows the model to learn complex relationships in the code data.
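As an illustration of using the encoder to obtain code representations (a sketch; the mean-pooling step is a common convention, not something specified in the original card):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
model = AutoModel.from_pretrained("huggingface/CodeBERTa-small-v1")

# Encode a function and mean-pool the last hidden states into one vector.
inputs = tokenizer("def add(a, b):\n    return a + b", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, hidden_size)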
CodeSearchNet Citation
@article{husain_codesearchnet_2019,
  title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
  shorttitle = {{CodeSearchNet} {Challenge}},
  url = {http://arxiv.org/abs/1909.09436},
  urldate = {2020-03-12},
  journal = {arXiv:1909.09436 [cs, stat]},
  author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  month = sep,
  year = {2019},
  note = {arXiv: 1909.09436},
}