🚀 CodeTrans model for source code summarization in Python
A pre-trained model for Python source code summarization using the T5 large model architecture.
This is a pre-trained model for the programming language Python, utilizing the T5 large model architecture. It was first released in this repository. The model is trained on tokenized Python code functions, and it performs best on tokenized Python functions.
✨ Features
Model description
This CodeTrans model is based on the t5-large
model. It has its own SentencePiece vocabulary model. It was pre-trained with transfer learning on 7 unsupervised datasets in the software development domain and then fine-tuned on the source code summarization task for Python code snippets.
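Because the model ships its own SentencePiece vocabulary, the tokenizer can be inspected directly. A minimal sketch, assuming the transformers and sentencepiece packages are installed; the example snippet is purely illustrative:

```python
from transformers import AutoTokenizer

# Load the model's own SentencePiece tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained(
    "SEBIS/code_trans_t5_large_source_code_summarization_python_transfer_learning_finetune"
)

print(tokenizer.vocab_size)  # size of the SentencePiece vocabulary
print(tokenizer.tokenize("def add ( a , b ) : return a + b"))  # subword pieces for a small snippet
```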
Intended uses & limitations
The model can be used to generate descriptions for Python functions or be fine-tuned for other Python code tasks. It can handle unparsed and untokenized Python code, but its performance is better with tokenized Python code.
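The whitespace-separated token format shown in the usage example below can be produced from raw source code in several ways; one possible sketch using Python's standard tokenize module is given here (this preprocessing is an assumption, not the exact tokenizer used to build the training data):

```python
import tokenize
from io import BytesIO

def tokenize_python(code: str) -> str:
    """Convert raw Python source into whitespace-separated tokens."""
    tokens = []
    for tok in tokenize.tokenize(BytesIO(code.encode("utf-8")).readline):
        # Skip encoding markers, newlines, indentation and end-of-file tokens
        if tok.type in (tokenize.ENCODING, tokenize.NEWLINE, tokenize.NL,
                        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            continue
        tokens.append(tok.string)
    return " ".join(tokens)

raw_code = "def greet(name):\n    return 'Hello ' + name\n"
print(tokenize_python(raw_code))
# def greet ( name ) : return 'Hello ' + name
```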
💻 Usage Examples
Basic Usage
Here is how to use this model to generate Python function documentation using Transformers SummarizationPipeline:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, SummarizationPipeline

pipeline = SummarizationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/code_trans_t5_large_source_code_summarization_python_transfer_learning_finetune"),
    tokenizer=AutoTokenizer.from_pretrained("SEBIS/code_trans_t5_large_source_code_summarization_python_transfer_learning_finetune", skip_special_tokens=True),
    device=0
)

tokenized_code = '''with open ( CODE_STRING , CODE_STRING ) as in_file : buf = in_file . readlines ( ) with open ( CODE_STRING , CODE_STRING ) as out_file : for line in buf : if line == " ; Include this text " : line = line + " Include below " out_file . write ( line ) '''
pipeline([tokenized_code])
```
You can run this example in a Colab notebook.
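If you prefer not to use the pipeline wrapper, the same checkpoint can also be driven through generate() directly. A minimal sketch, assuming a CPU-only setup; the generation settings below are illustrative rather than the ones used for the reported results:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "SEBIS/code_trans_t5_large_source_code_summarization_python_transfer_learning_finetune"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

tokenized_code = "def add ( a , b ) : return a + b"
inputs = tokenizer(tokenized_code, return_tensors="pt")

# Illustrative generation settings; adjust max_length / num_beams as needed
summary_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```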
📚 Documentation
Training data
The datasets for the supervised training tasks can be downloaded from Link
Training procedure
Transfer-learning Pretraining
The model was trained on a single TPU Pod V3-8 for 240,000 steps in total, using a sequence length of 512 (batch size 4096). It has approximately 220M parameters in total and was trained using the encoder-decoder architecture. The optimizer used is AdaFactor with an inverse square root learning rate schedule for pre-training.
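The original training code is not included here, but the stated optimizer choice can be approximated with the Adafactor implementation in transformers. A minimal sketch, assuming Adafactor's built-in relative-step (inverse square root) schedule is what was used; all other hyperparameters are placeholders:

```python
from transformers import AutoModelForSeq2SeqLM
from transformers.optimization import Adafactor, AdafactorSchedule

model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

# relative_step=True enables Adafactor's built-in inverse square root schedule;
# warmup_init smooths the first steps. lr must be None in this mode.
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,
)
lr_schedule = AdafactorSchedule(optimizer)  # exposes the effective learning rate for logging
```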
Fine-tuning
This model was then fine-tuned on a single TPU Pod V2-8 for 100 steps in total, with a sequence length of 512 (batch size 256), using only the dataset containing Python code.
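A rough sketch of what such a short fine-tuning run could look like with the Trainer API; the dataset, column names, and most hyperparameters below are placeholders, and the TPU Pod setup from the original training is not reproduced:

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

checkpoint = "SEBIS/code_trans_t5_large_source_code_summarization_python_transfer_learning_finetune"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(batch):
    # "code" and "summary" are placeholder column names for a Python code/summary dataset
    model_inputs = tokenizer(batch["code"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# train_dataset = your_python_summarization_dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="codetrans-python-summarization",
    max_steps=100,                   # matches the 100 fine-tuning steps stated above
    per_device_train_batch_size=8,   # assumption; the card reports an effective batch size of 256
    learning_rate=1e-4,              # assumption
    logging_steps=10,
)

# trainer = Seq2SeqTrainer(
#     model=model,
#     args=args,
#     train_dataset=train_dataset,
#     data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
# )
# trainer.train()
```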
Evaluation results
For the source code summarization tasks, different models achieve the following results on different programming languages (in BLEU score):
Test results:
| Language / Model | Python | SQL | C# |
| --- | --- | --- | --- |
| CodeTrans-ST-Small | 8.45 | 17.55 | 19.74 |
| CodeTrans-ST-Base | 9.12 | 15.00 | 18.65 |
| CodeTrans-TF-Small | 10.06 | 17.71 | 20.40 |
| CodeTrans-TF-Base | 10.94 | 17.66 | 21.12 |
| CodeTrans-TF-Large | 12.41 | 18.40 | 21.43 |
| CodeTrans-MT-Small | 13.11 | 19.15 | 22.39 |
| CodeTrans-MT-Base | 13.37 | 19.24 | 23.20 |
| CodeTrans-MT-Large | 13.24 | 19.40 | 23.57 |
| CodeTrans-MT-TF-Small | 12.10 | 18.25 | 22.03 |
| CodeTrans-MT-TF-Base | 10.64 | 16.91 | 21.40 |
| CodeTrans-MT-TF-Large | 12.14 | 19.98 | 21.10 |
| CODE-NN | -- | 18.40 | 20.50 |
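The card does not state which BLEU implementation produced these numbers; a generic scoring sketch with the sacrebleu package (an assumption) might look like this, using made-up hypothesis and reference summaries:

```python
import sacrebleu

# Hypothetical generated summaries and their references (illustrative only)
hypotheses = [
    "open a file and append a marker line to selected lines",
    "write filtered lines to an output file",
]
references = [
    "opens a file and appends text after matching lines",
    "writes the filtered lines into another file",
]

# sacrebleu expects a list of hypothesis strings and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```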
Created by Ahmed Elnaggar | LinkedIn and Wei Ding | LinkedIn