🚀 CodeTrans Model for Code Documentation Generation in Python
A pre-trained model for generating code documentation in Python, leveraging the T5 base model architecture.
This model is pre-trained on the Python programming language using the t5-base model architecture. It was first introduced in this repository. It is trained on tokenized Python code functions and performs best when dealing with such tokenized functions.
✨ Features
Model Description
This CodeTrans model is built upon the t5-base model and comes with its own SentencePiece vocabulary model. It underwent multi-task training on 13 supervised tasks in the software development domain and 7 unsupervised datasets. Subsequently, it was fine-tuned for the code documentation generation task for Python functions/methods.
Intended Uses & Limitations
The model can be used to generate descriptions for Python functions or be fine-tuned for other Python code-related tasks. It can handle unparsed and untokenized Python code. However, its performance is expected to be better when the Python code is tokenized.
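As a point of reference, raw source can be pre-tokenized into the space-separated format shown in the usage example below with Python's standard `tokenize` module. The helper below is a minimal sketch, not part of the CodeTrans tooling; the function name and the token-filtering choices are illustrative assumptions.

```python
# Minimal sketch: lex raw Python source into the space-separated token format
# the model performs best on. `space_tokenize` is a hypothetical helper,
# not part of the original CodeTrans pipeline.
import tokenize
from io import BytesIO

def space_tokenize(source: str) -> str:
    """Lex raw Python source into one space-separated token string."""
    tokens = []
    for tok in tokenize.tokenize(BytesIO(source.encode("utf-8")).readline):
        # Drop purely structural tokens (encoding marker, newlines, indentation).
        if tok.type in (tokenize.ENCODING, tokenize.NEWLINE, tokenize.NL,
                        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            continue
        tokens.append(tok.string)
    return " ".join(tokens)

raw = (
    "def e(message, exit_code=None):\n"
    "    print_log(message, YELLOW, BOLD)\n"
    "    if exit_code is not None:\n"
    "        sys.exit(exit_code)\n"
)
print(space_tokenize(raw))
# def e ( message , exit_code = None ) : print_log ( message , YELLOW , BOLD ) if exit_code is not None : sys . exit ( exit_code )
```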
💻 Usage Examples
Basic Usage
Here is how to use this model to generate Python function documentation using the Transformers SummarizationPipeline:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, SummarizationPipeline

pipeline = SummarizationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/code_trans_t5_base_code_documentation_generation_python_multitask_finetune"),
    tokenizer=AutoTokenizer.from_pretrained("SEBIS/code_trans_t5_base_code_documentation_generation_python_multitask_finetune", skip_special_tokens=True),
    device=0,  # first GPU; use device=-1 to run on CPU
)

# A Python function that has already been tokenized into space-separated tokens
tokenized_code = "def e ( message , exit_code = None ) : print_log ( message , YELLOW , BOLD ) if exit_code is not None : sys . exit ( exit_code )"
pipeline([tokenized_code])
```
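The pipeline returns a list of dictionaries; with the standard Transformers SummarizationPipeline, the generated documentation string is found under the `summary_text` key:

```python
# Extract the generated documentation from the SummarizationPipeline output.
result = pipeline([tokenized_code])
print(result[0]["summary_text"])
```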
You can run this example in a Colab notebook.
📚 Documentation
Training Data
The datasets for the supervised training tasks can be downloaded from this Link.
Training Procedure
Multi-task Pretraining
The model was trained on a single TPU Pod V3-8 for a total of half a million steps, using a sequence length of 512 (batch size 4096). It has approximately 220M parameters in total and was trained using the encoder-decoder architecture. The optimizer used was AdaFactor with an inverse square-root learning rate schedule for pre-training.
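As a rough illustration, an inverse square-root schedule of the kind used for T5-style pre-training can be written as below; the 10,000-step warm-up constant is the T5 default and is an assumption here, not a value stated in this card.

```python
# Sketch of an inverse square-root learning rate schedule (T5-style).
# warmup_steps = 10_000 is assumed; the card does not specify it.
def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Constant at 1/sqrt(warmup_steps) during warm-up, then decays as 1/sqrt(step)."""
    return 1.0 / (max(step, warmup_steps) ** 0.5)
```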
Fine-tuning
This model was then fine-tuned on a single TPU Pod V2-8 for a total of 4,000 steps, using a sequence length of 512 (batch size 256) and only the dataset containing Python code.
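For readers who want to approximate this setup with the Hugging Face Trainer rather than the original TPU tooling, a hedged sketch of comparable fine-tuning arguments might look like the following; the output directory and learning rate are illustrative assumptions, and sequence length is handled at tokenization time rather than here.

```python
# Hedged sketch only: the original fine-tuning used TPU tooling, not this API.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="codetrans_t5_base_python_docgen",  # hypothetical output path
    max_steps=4000,                    # total fine-tuning steps quoted above
    per_device_train_batch_size=256,   # batch size quoted above (single device here)
    learning_rate=1e-3,                # assumed; not stated in the card
    predict_with_generate=True,
)
```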
Evaluation Results
For the code documentation tasks, different models achieve the following results on different programming languages (in BLEU score):
Test results:
| Language / Model | Python | Java | Go | PHP | Ruby | JavaScript |
| --- | --- | --- | --- | --- | --- | --- |
| CodeTrans-ST-Small | 17.31 | 16.65 | 16.89 | 23.05 | 9.19 | 13.7 |
| CodeTrans-ST-Base | 16.86 | 17.17 | 17.16 | 22.98 | 8.23 | 13.17 |
| CodeTrans-TF-Small | 19.93 | 19.48 | 18.88 | 25.35 | 13.15 | 17.23 |
| CodeTrans-TF-Base | 20.26 | 20.19 | 19.50 | 25.84 | 14.07 | 18.25 |
| CodeTrans-TF-Large | 20.35 | 20.06 | 19.54 | 26.18 | 14.94 | 18.98 |
| CodeTrans-MT-Small | 19.64 | 19.00 | 19.15 | 24.68 | 14.91 | 15.26 |
| CodeTrans-MT-Base | 20.39 | 21.22 | 19.43 | 26.23 | 15.26 | 16.11 |
| CodeTrans-MT-Large | 20.18 | 21.87 | 19.38 | 26.08 | 15.00 | 16.23 |
| CodeTrans-MT-TF-Small | 19.77 | 20.04 | 19.36 | 25.55 | 13.70 | 17.24 |
| CodeTrans-MT-TF-Base | 19.77 | 21.12 | 18.86 | 25.79 | 14.24 | 18.62 |
| CodeTrans-MT-TF-Large | 18.94 | 21.42 | 18.77 | 26.20 | 14.19 | 18.83 |
| State of the art | 19.06 | 17.65 | 18.07 | 25.16 | 12.16 | 14.90 |
Created by Ahmed Elnaggar | LinkedIn and Wei Ding | LinkedIn