# 📝 CodeTrans model for source code summarization in Python

A pre-trained model for summarizing Python source code, built on the `t5-base` model architecture. It was first introduced in this repository. The model is trained on tokenized Python code functions and performs optimally with such input.
## 🚀 Quick Start

The CodeTrans model is designed to generate descriptions for Python functions and can be fine-tuned for other Python code-related tasks. It can handle unparsed and untokenized Python code, but tokenized code yields better performance; a pre-tokenization sketch follows below.
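If your input is raw Python source, you can whitespace-join its lexical tokens to match the format of the example in the Usage section. The exact tokenizer used to prepare the training data is not documented here, so this stdlib-based helper is an illustrative assumption, not the official preprocessing:

```python
# Illustrative only: the model card does not specify the official
# preprocessing, so this stdlib-based tokenizer is an assumption.
import io
import tokenize

def space_tokenize(code: str) -> str:
    """Join the lexical tokens of a Python snippet with single spaces."""
    skip = (tokenize.COMMENT, tokenize.NEWLINE, tokenize.NL,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER)
    toks = tokenize.generate_tokens(io.StringIO(code).readline)
    return " ".join(t.string for t in toks if t.type not in skip)

print(space_tokenize("def add(a, b):\n    return a + b"))
# -> def add ( a , b ) : return a + b
```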
## ✨ Features

- Based on the `t5-base` model with its own SentencePiece vocabulary model.
- Trained using single-task training on a Python source code summarization dataset.
## 📦 Installation

No specific installation steps are provided in the original document.
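As a rough guide, the usage example below relies on the standard Hugging Face stack, so an environment along these lines should work (an assumption, not an official requirement list):

```bash
# Assumed dependencies for the usage example; the original document
# does not pin versions or list requirements.
pip install transformers sentencepiece torch
```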
## 💻 Usage Examples

### Basic Usage

Here is how to use this model to generate Python function documentation using the Transformers `SummarizationPipeline`:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, SummarizationPipeline

pipeline = SummarizationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/code_trans_t5_base_source_code_summarization_python"),
    tokenizer=AutoTokenizer.from_pretrained("SEBIS/code_trans_t5_base_source_code_summarization_python", skip_special_tokens=True),
    device=0
)

tokenized_code = '''with open ( CODE_STRING , CODE_STRING ) as in_file : buf = in_file . readlines ( ) with open ( CODE_STRING , CODE_STRING ) as out_file : for line in buf : if line == " ; Include this text " : line = line + " Include below " out_file . write ( line ) '''
pipeline([tokenized_code])
```
You can run this example in a Colab notebook.
## 📚 Documentation

### Model description

This CodeTrans model is based on the `t5-base` model. It has its own SentencePiece vocabulary model and was trained using single-task training on a Python source code summarization dataset.
### Intended uses & limitations

The model can generate descriptions for Python functions or be fine-tuned for other Python code tasks. It can process unparsed and untokenized Python code, but performance improves with tokenized code. For decoding outside the pipeline helper, see the sketch below.
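For fine-tuning or custom decoding, you can also call the model directly rather than through the pipeline. This is a minimal sketch using the same checkpoint as the Usage section; the input snippet and generation parameters are illustrative assumptions, not values from the original document:

```python
# Minimal direct-generation sketch; max_length and the input snippet
# are illustrative assumptions, not documented settings.
from transformers import AutoTokenizer, AutoModelWithLMHead

checkpoint = "SEBIS/code_trans_t5_base_source_code_summarization_python"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelWithLMHead.from_pretrained(checkpoint)

inputs = tokenizer("def add ( a , b ) : return a + b", return_tensors="pt")
summary_ids = model.generate(inputs["input_ids"], max_length=64)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```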
### Training data

The datasets for the supervised training tasks can be downloaded from Link.
### Evaluation results

For the source code summarization tasks, different models achieve the following results on different programming languages (in BLEU score):

| Property | Details |
|----------|---------|
| Model Type | CodeTrans model for source code summarization in Python |
| Training Data | Can be downloaded from Link |
Test results:

| Language / Model | Python | SQL | C# |
|------------------|--------|-----|----|
| CodeTrans-ST-Small | 8.45 | 17.55 | 19.74 |
| CodeTrans-ST-Base | 9.12 | 15.00 | 18.65 |
| CodeTrans-TF-Small | 10.06 | 17.71 | 20.40 |
| CodeTrans-TF-Base | 10.94 | 17.66 | 21.12 |
| CodeTrans-TF-Large | 12.41 | 18.40 | 21.43 |
| CodeTrans-MT-Small | 13.11 | 19.15 | 22.39 |
| CodeTrans-MT-Base | 13.37 | 19.24 | 23.20 |
| CodeTrans-MT-Large | 13.24 | 19.40 | 23.57 |
| CodeTrans-MT-TF-Small | 12.10 | 18.25 | 22.03 |
| CodeTrans-MT-TF-Base | 10.64 | 16.91 | 21.40 |
| CodeTrans-MT-TF-Large | 12.14 | 19.98 | 21.10 |
| CODE-NN | -- | 18.40 | 20.50 |
## 🔧 Technical Details

The model uses the `t5-base` architecture and has its own SentencePiece vocabulary model. It was trained with single-task training for Python source code summarization.
## 📄 License

No license information is provided in the original document.
Created by Ahmed Elnaggar | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/) and Wei Ding | [LinkedIn](https://www.linkedin.com/in/wei-ding-92561270/)