# 🚀 CodeTrans model for source code summarization in Python

This is a pre-trained model for the Python programming language, leveraging the T5 base model architecture. It was initially released in this repository. The model is trained on tokenized Python code functions and performs best with such tokenized functions.
## 🚀 Quick Start

The CodeTrans model is designed to generate descriptions for Python functions and can be fine-tuned for other Python code-related tasks. Here's a simple guide on how to use it:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, SummarizationPipeline

pipeline = SummarizationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/code_trans_t5_base_source_code_summarization_python_multitask"),
    tokenizer=AutoTokenizer.from_pretrained("SEBIS/code_trans_t5_base_source_code_summarization_python_multitask", skip_special_tokens=True),
    device=0  # GPU 0; set device=-1 to run on CPU
)

# The input is a Python function with its tokens separated by spaces.
tokenized_code = '''with open ( CODE_STRING , CODE_STRING ) as in_file : buf = in_file . readlines ( ) with open ( CODE_STRING , CODE_STRING ) as out_file : for line in buf : if line == " ; Include this text " : line = line + " Include below " out_file . write ( line ) '''
pipeline([tokenized_code])
```
You can run this example in a Colab notebook.
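Note that `tokenized_code` above is already space-separated token by token. The exact preprocessing used to build the training data is not described here, so as a rough approximation you could space-separate tokens with Python's standard `tokenize` module. The helper below is a minimal sketch under that assumption (`space_tokenize` is a hypothetical name, not part of the model's tooling):

```python
import io
import tokenize

def space_tokenize(source: str) -> str:
    """Space-separate the tokens of a Python snippet.

    Rough approximation only; this may not match the exact preprocessing
    used to build the CodeTrans training data.
    """
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    tokens = [tok.string
              for tok in tokenize.generate_tokens(io.StringIO(source).readline)
              if tok.type not in skip]
    return " ".join(tokens)

raw_code = "def add(a, b):\n    return a + b\n"
print(space_tokenize(raw_code))   # def add ( a , b ) : return a + b
# The result can then be passed to the pipeline defined above:
# pipeline([space_tokenize(raw_code)])
```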
## ✨ Features

- Based on the `t5-base` model with its own SentencePiece vocabulary model.
- Trained with multi-task learning on 13 supervised tasks in the software development domain and 7 unsupervised datasets.
- Can generate descriptions for Python functions or be fine-tuned on other Python code tasks.
- Works on unparsed and untokenized Python code, but performs better with tokenized code.
## 📚 Documentation

### Model description

This CodeTrans model is built upon the `t5-base` model and comes with its own SentencePiece vocabulary model. It was trained with multi-task learning on 13 supervised tasks in the software development domain and 7 unsupervised datasets.
### Intended uses & limitations

The model can generate descriptions for Python functions or be fine-tuned for other Python code tasks. It can handle unparsed and untokenized Python code, though tokenized code generally yields better performance.
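Fine-tuning details for downstream tasks are not given in this card. Purely as an illustration, the sketch below shows one way to fine-tune the released checkpoint on a code-to-description dataset with the Hugging Face `Seq2SeqTrainer`; the toy dataset, hyperparameters, and output path are assumptions rather than the authors' setup, and it uses `AutoModelForSeq2SeqLM` in place of the deprecated `AutoModelWithLMHead`.

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "SEBIS/code_trans_t5_base_source_code_summarization_python_multitask"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy dataset: pairs of tokenized Python code and target descriptions.
dataset = Dataset.from_dict({
    "code": ["def add ( a , b ) : return a + b"],
    "summary": ["Add two numbers."],
})

def preprocess(batch):
    # text_target requires a recent version of transformers (>= 4.21).
    model_inputs = tokenizer(batch["code"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="codetrans-finetuned",   # hypothetical output path
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=1e-4,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```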
### Training data

The supervised training tasks datasets can be downloaded from Link
### Training procedure

#### Multi-task Pretraining

The model was trained on a single TPU Pod V3-8 for a total of 260,000 steps, using a sequence length of 512 (batch size 4096). It has approximately 220M parameters in total and uses an encoder-decoder architecture. The optimizer used is AdaFactor with an inverse square root learning rate schedule for pre-training.
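The pre-training itself was run on TPUs with the T5 codebase, not with the snippet below. Purely as a rough PyTorch illustration of the optimizer settings described above, the `transformers` library's Adafactor implementation can be configured with relative (inverse-square-root style) step sizes; the exact hyperparameters here are assumptions, not the authors' configuration.

```python
from transformers import AutoModelForSeq2SeqLM
from transformers.optimization import Adafactor, AdafactorSchedule

model = AutoModelForSeq2SeqLM.from_pretrained(
    "SEBIS/code_trans_t5_base_source_code_summarization_python_multitask"
)

# Adafactor with relative step sizes, which decay roughly as the inverse
# square root of the step count (assumed settings, mirroring the schedule
# described above).
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,
)
# Proxy schedule object so a training loop can query the internally
# computed learning rate (e.g. for logging).
lr_scheduler = AdafactorSchedule(optimizer)
```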
### Evaluation results

For the source code summarization task, different models achieve the following results on different programming languages (in BLEU score):
| Language / Model      | Python | SQL   | C#    |
| --------------------- | ------ | ----- | ----- |
| CodeTrans-ST-Small    | 8.45   | 17.55 | 19.74 |
| CodeTrans-ST-Base     | 9.12   | 15.00 | 18.65 |
| CodeTrans-TF-Small    | 10.06  | 17.71 | 20.40 |
| CodeTrans-TF-Base     | 10.94  | 17.66 | 21.12 |
| CodeTrans-TF-Large    | 12.41  | 18.40 | 21.43 |
| CodeTrans-MT-Small    | 13.11  | 19.15 | 22.39 |
| CodeTrans-MT-Base     | 13.37  | 19.24 | 23.20 |
| CodeTrans-MT-Large    | 13.24  | 19.40 | 23.57 |
| CodeTrans-MT-TF-Small | 12.10  | 18.25 | 22.03 |
| CodeTrans-MT-TF-Base  | 10.64  | 16.91 | 21.40 |
| CodeTrans-MT-TF-Large | 12.14  | 19.98 | 21.10 |
| CODE-NN               | --     | 18.40 | 20.50 |
Created by Ahmed Elnaggar | LinkedIn and Wei Ding | LinkedIn