🚀 CodeTrans Model for Code Documentation Generation in Python
A pre-trained model for generating code documentation in Python, leveraging the T5 base model architecture.
This model is pre-trained on the Python programming language using the t5-base model architecture. It was first introduced in this repository. It is trained on tokenized Python code functions and performs best when dealing with such tokenized functions.
✨ Features
Model Description
This CodeTrans model is built upon the t5-base model and comes with its own SentencePiece vocabulary model. It underwent multi-task training on 13 supervised tasks in the software development domain and 7 unsupervised datasets. Subsequently, it was fine-tuned for the code documentation generation task for Python functions/methods.
Intended Uses & Limitations
The model can be used to generate descriptions for Python functions or be fine-tuned for other Python code-related tasks. It can handle unparsed and untokenized Python code. However, its performance is expected to be better when the Python code is tokenized.
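As a point of reference, raw source can be pre-tokenized into the space-separated format shown in the usage example below with Python's standard `tokenize` module. The helper below is a minimal sketch, not part of the CodeTrans tooling; the function name and the token-filtering choices are illustrative assumptions.

```python
# Minimal sketch: lex raw Python source into the space-separated token format
# the model performs best on. `space_tokenize` is a hypothetical helper,
# not part of the original CodeTrans pipeline.
import tokenize
from io import BytesIO

def space_tokenize(source: str) -> str:
    """Lex raw Python source into one space-separated token string."""
    tokens = []
    for tok in tokenize.tokenize(BytesIO(source.encode("utf-8")).readline):
        # Drop purely structural tokens (encoding marker, newlines, indentation).
        if tok.type in (tokenize.ENCODING, tokenize.NEWLINE, tokenize.NL,
                        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            continue
        tokens.append(tok.string)
    return " ".join(tokens)

raw = (
    "def e(message, exit_code=None):\n"
    "    print_log(message, YELLOW, BOLD)\n"
    "    if exit_code is not None:\n"
    "        sys.exit(exit_code)\n"
)
print(space_tokenize(raw))
# def e ( message , exit_code = None ) : print_log ( message , YELLOW , BOLD ) if exit_code is not None : sys . exit ( exit_code )
```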
💻 Usage Examples
Basic Usage
Here is how to use this model to generate Python function documentation using the Transformers SummarizationPipeline:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, SummarizationPipeline

pipeline = SummarizationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/code_trans_t5_base_code_documentation_generation_python_multitask_finetune"),
    tokenizer=AutoTokenizer.from_pretrained("SEBIS/code_trans_t5_base_code_documentation_generation_python_multitask_finetune", skip_special_tokens=True),
    device=0,  # first GPU; use device=-1 to run on CPU
)

# A Python function that has already been tokenized into space-separated tokens
tokenized_code = "def e ( message , exit_code = None ) : print_log ( message , YELLOW , BOLD ) if exit_code is not None : sys . exit ( exit_code )"
pipeline([tokenized_code])
```
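The pipeline returns a list of dictionaries; with the standard Transformers SummarizationPipeline, the generated documentation string is found under the `summary_text` key:

```python
# Extract the generated documentation from the SummarizationPipeline output.
result = pipeline([tokenized_code])
print(result[0]["summary_text"])
```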
You can run this example in a Colab notebook.
📚 Documentation
Training Data
The datasets for the supervised training tasks can be downloaded from this Link.
Training Procedure
Multi-task Pretraining
The model was trained on a single TPU Pod V3-8 for a total of half a million steps, using a sequence length of 512 (batch size 4096). It has approximately 220M parameters in total and was trained using the encoder-decoder architecture. The optimizer used was AdaFactor with an inverse square-root learning rate schedule for pre-training.
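As a rough illustration, an inverse square-root schedule of the kind used for T5-style pre-training can be written as below; the 10,000-step warm-up constant is the T5 default and is an assumption here, not a value stated in this card.

```python
# Sketch of an inverse square-root learning rate schedule (T5-style).
# warmup_steps = 10_000 is assumed; the card does not specify it.
def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Constant at 1/sqrt(warmup_steps) during warm-up, then decays as 1/sqrt(step)."""
    return 1.0 / (max(step, warmup_steps) ** 0.5)
```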
Fine-tuning
This model was then fine-tuned on a single TPU Pod V2-8 for a total of 4,000 steps, using a sequence length of 512 (batch size 256) and only the dataset containing Python code.
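For readers who want to approximate this setup with the Hugging Face Trainer rather than the original TPU tooling, a hedged sketch of comparable fine-tuning arguments might look like the following; the output directory and learning rate are illustrative assumptions, and sequence length is handled at tokenization time rather than here.

```python
# Hedged sketch only: the original fine-tuning used TPU tooling, not this API.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="codetrans_t5_base_python_docgen",  # hypothetical output path
    max_steps=4000,                    # total fine-tuning steps quoted above
    per_device_train_batch_size=256,   # batch size quoted above (single device here)
    learning_rate=1e-3,                # assumed; not stated in the card
    predict_with_generate=True,
)
```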
Evaluation Results
For the code documentation tasks, different models achieve the following results on different programming languages (in BLEU score):
Test results:
| Language / Model | Python | Java | Go | PHP | Ruby | JavaScript |
| --- | --- | --- | --- | --- | --- | --- |
| CodeTrans-ST-Small | 17.31 | 16.65 | 16.89 | 23.05 | 9.19 | 13.7 |
| CodeTrans-ST-Base | 16.86 | 17.17 | 17.16 | 22.98 | 8.23 | 13.17 |
| CodeTrans-TF-Small | 19.93 | 19.48 | 18.88 | 25.35 | 13.15 | 17.23 |
| CodeTrans-TF-Base | 20.26 | 20.19 | 19.50 | 25.84 | 14.07 | 18.25 |
| CodeTrans-TF-Large | 20.35 | 20.06 | 19.54 | 26.18 | 14.94 | 18.98 |
| CodeTrans-MT-Small | 19.64 | 19.00 | 19.15 | 24.68 | 14.91 | 15.26 |
| CodeTrans-MT-Base | 20.39 | 21.22 | 19.43 | 26.23 | 15.26 | 16.11 |
| CodeTrans-MT-Large | 20.18 | 21.87 | 19.38 | 26.08 | 15.00 | 16.23 |
| CodeTrans-MT-TF-Small | 19.77 | 20.04 | 19.36 | 25.55 | 13.70 | 17.24 |
| CodeTrans-MT-TF-Base | 19.77 | 21.12 | 18.86 | 25.79 | 14.24 | 18.62 |
| CodeTrans-MT-TF-Large | 18.94 | 21.42 | 18.77 | 26.20 | 14.19 | 18.83 |
| State of the art | 19.06 | 17.65 | 18.07 | 25.16 | 12.16 | 14.90 |
Created by Ahmed Elnaggar | LinkedIn and Wei Ding | LinkedIn