🚀 CodeTrans model for source code summarization in Python
A pre-trained model for Python source code summarization using the T5 large model architecture.
This is a pre-trained model for the programming language Python, utilizing the T5 large model architecture. It was first released in this repository. The model is trained on tokenized Python code functions, and it performs best on tokenized Python functions.
✨ Features
Model description
This CodeTrans model is based on the t5-large
model. It has its own SentencePiece vocabulary model. It was pre-trained with transfer learning on 7 unsupervised datasets in the software development domain and then fine-tuned on the source code summarization task for Python code snippets.
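Because the model ships its own SentencePiece vocabulary, the tokenizer can be inspected directly. A minimal sketch, assuming the transformers and sentencepiece packages are installed; the example snippet is purely illustrative:

```python
from transformers import AutoTokenizer

# Load the model's own SentencePiece tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained(
    "SEBIS/code_trans_t5_large_source_code_summarization_python_transfer_learning_finetune"
)

print(tokenizer.vocab_size)  # size of the SentencePiece vocabulary
print(tokenizer.tokenize("def add ( a , b ) : return a + b"))  # subword pieces for a small snippet
```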
Intended uses & limitations
The model can be used to generate descriptions for Python functions or be fine-tuned for other Python code tasks. It can handle unparsed and untokenized Python code, but its performance is better with tokenized Python code.
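The whitespace-separated token format shown in the usage example below can be produced from raw source code in several ways; one possible sketch using Python's standard tokenize module is given here (this preprocessing is an assumption, not the exact tokenizer used to build the training data):

```python
import tokenize
from io import BytesIO

def tokenize_python(code: str) -> str:
    """Convert raw Python source into whitespace-separated tokens."""
    tokens = []
    for tok in tokenize.tokenize(BytesIO(code.encode("utf-8")).readline):
        # Skip encoding markers, newlines, indentation and end-of-file tokens
        if tok.type in (tokenize.ENCODING, tokenize.NEWLINE, tokenize.NL,
                        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            continue
        tokens.append(tok.string)
    return " ".join(tokens)

raw_code = "def greet(name):\n    return 'Hello ' + name\n"
print(tokenize_python(raw_code))
# def greet ( name ) : return 'Hello ' + name
```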
💻 Usage Examples
Basic Usage
Here is how to use this model to generate Python function documentation using Transformers SummarizationPipeline:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, SummarizationPipeline

pipeline = SummarizationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/code_trans_t5_large_source_code_summarization_python_transfer_learning_finetune"),
    tokenizer=AutoTokenizer.from_pretrained("SEBIS/code_trans_t5_large_source_code_summarization_python_transfer_learning_finetune", skip_special_tokens=True),
    device=0
)

tokenized_code = '''with open ( CODE_STRING , CODE_STRING ) as in_file : buf = in_file . readlines ( ) with open ( CODE_STRING , CODE_STRING ) as out_file : for line in buf : if line == " ; Include this text " : line = line + " Include below " out_file . write ( line ) '''
pipeline([tokenized_code])
```
You can run this example in a Colab notebook.
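If you prefer not to use the pipeline wrapper, the same checkpoint can also be driven through generate() directly. A minimal sketch, assuming a CPU-only setup; the generation settings below are illustrative rather than the ones used for the reported results:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "SEBIS/code_trans_t5_large_source_code_summarization_python_transfer_learning_finetune"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

tokenized_code = "def add ( a , b ) : return a + b"
inputs = tokenizer(tokenized_code, return_tensors="pt")

# Illustrative generation settings; adjust max_length / num_beams as needed
summary_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```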
📚 Documentation
Training data
The datasets for the supervised training tasks can be downloaded from Link
Training procedure
Transfer-learning Pretraining
The model was trained on a single TPU Pod V3-8 for 240,000 steps in total, using a sequence length of 512 (batch size 4096). It has approximately 220M parameters in total and was trained using the encoder-decoder architecture. The optimizer used is AdaFactor with an inverse square root learning rate schedule for pre-training.
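The original training code is not included here, but the stated optimizer choice can be approximated with the Adafactor implementation in transformers. A minimal sketch, assuming Adafactor's built-in relative-step (inverse square root) schedule is what was used; all other hyperparameters are placeholders:

```python
from transformers import AutoModelForSeq2SeqLM
from transformers.optimization import Adafactor, AdafactorSchedule

model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

# relative_step=True enables Adafactor's built-in inverse square root schedule;
# warmup_init smooths the first steps. lr must be None in this mode.
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,
)
lr_schedule = AdafactorSchedule(optimizer)  # exposes the effective learning rate for logging
```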
Fine-tuning
This model was then fine-tuned on a single TPU Pod V2-8 for 100 steps in total, with a sequence length of 512 (batch size 256), using only the dataset containing Python code.
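A rough sketch of what such a short fine-tuning run could look like with the Trainer API; the dataset, column names, and most hyperparameters below are placeholders, and the TPU Pod setup from the original training is not reproduced:

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

checkpoint = "SEBIS/code_trans_t5_large_source_code_summarization_python_transfer_learning_finetune"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(batch):
    # "code" and "summary" are placeholder column names for a Python code/summary dataset
    model_inputs = tokenizer(batch["code"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# train_dataset = your_python_summarization_dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="codetrans-python-summarization",
    max_steps=100,                   # matches the 100 fine-tuning steps stated above
    per_device_train_batch_size=8,   # assumption; the card reports an effective batch size of 256
    learning_rate=1e-4,              # assumption
    logging_steps=10,
)

# trainer = Seq2SeqTrainer(
#     model=model,
#     args=args,
#     train_dataset=train_dataset,
#     data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
# )
# trainer.train()
```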
Evaluation results
For the source code summarization tasks, different models achieve the following results on different programming languages (in BLEU score):
Test results:
| Language / Model | Python | SQL | C# |
| --- | --- | --- | --- |
| CodeTrans-ST-Small | 8.45 | 17.55 | 19.74 |
| CodeTrans-ST-Base | 9.12 | 15.00 | 18.65 |
| CodeTrans-TF-Small | 10.06 | 17.71 | 20.40 |
| CodeTrans-TF-Base | 10.94 | 17.66 | 21.12 |
| CodeTrans-TF-Large | 12.41 | 18.40 | 21.43 |
| CodeTrans-MT-Small | 13.11 | 19.15 | 22.39 |
| CodeTrans-MT-Base | 13.37 | 19.24 | 23.20 |
| CodeTrans-MT-Large | 13.24 | 19.40 | 23.57 |
| CodeTrans-MT-TF-Small | 12.10 | 18.25 | 22.03 |
| CodeTrans-MT-TF-Base | 10.64 | 16.91 | 21.40 |
| CodeTrans-MT-TF-Large | 12.14 | 19.98 | 21.10 |
| CODE-NN | -- | 18.40 | 20.50 |
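The card does not state which BLEU implementation produced these numbers; a generic scoring sketch with the sacrebleu package (an assumption) might look like this, using made-up hypothesis and reference summaries:

```python
import sacrebleu

# Hypothetical generated summaries and their references (illustrative only)
hypotheses = [
    "open a file and append a marker line to selected lines",
    "write filtered lines to an output file",
]
references = [
    "opens a file and appends text after matching lines",
    "writes the filtered lines into another file",
]

# sacrebleu expects a list of hypothesis strings and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```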
Created by Ahmed Elnaggar | LinkedIn and Wei Ding | LinkedIn