# 🚀 CodeTrans model for source code summarization in Python

This is a pre-trained model for the Python programming language, leveraging the T5 base model architecture. It was initially released in this repository. The model is trained on tokenized Python code functions and performs best with such tokenized functions.
## 🚀 Quick Start

The CodeTrans model is designed to generate descriptions for Python functions and can be fine-tuned for other Python code-related tasks. Here's a simple guide on how to use it:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, SummarizationPipeline

pipeline = SummarizationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/code_trans_t5_base_source_code_summarization_python_multitask"),
    tokenizer=AutoTokenizer.from_pretrained("SEBIS/code_trans_t5_base_source_code_summarization_python_multitask", skip_special_tokens=True),
    device=0  # GPU 0; set device=-1 to run on CPU
)

# The input is a Python function with its tokens separated by spaces.
tokenized_code = '''with open ( CODE_STRING , CODE_STRING ) as in_file : buf = in_file . readlines ( ) with open ( CODE_STRING , CODE_STRING ) as out_file : for line in buf : if line == " ; Include this text " : line = line + " Include below " out_file . write ( line ) '''
pipeline([tokenized_code])
```
You can run this example in a Colab notebook.
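Note that `tokenized_code` above is already space-separated token by token. The exact preprocessing used to build the training data is not described here, so as a rough approximation you could space-separate tokens with Python's standard `tokenize` module. The helper below is a minimal sketch under that assumption (`space_tokenize` is a hypothetical name, not part of the model's tooling):

```python
import io
import tokenize

def space_tokenize(source: str) -> str:
    """Space-separate the tokens of a Python snippet.

    Rough approximation only; this may not match the exact preprocessing
    used to build the CodeTrans training data.
    """
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    tokens = [tok.string
              for tok in tokenize.generate_tokens(io.StringIO(source).readline)
              if tok.type not in skip]
    return " ".join(tokens)

raw_code = "def add(a, b):\n    return a + b\n"
print(space_tokenize(raw_code))   # def add ( a , b ) : return a + b
# The result can then be passed to the pipeline defined above:
# pipeline([space_tokenize(raw_code)])
```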
## ✨ Features

- Based on the `t5-base` model with its own SentencePiece vocabulary model.
- Trained with multi-task learning on 13 supervised tasks in the software development domain and 7 unsupervised datasets.
- Can generate descriptions for Python functions or be fine-tuned on other Python code tasks.
- Works on unparsed and untokenized Python code, but performs better with tokenized code.
## 📚 Documentation

### Model description

This CodeTrans model is built upon the `t5-base` model and comes with its own SentencePiece vocabulary model. It was trained with multi-task learning on 13 supervised tasks in the software development domain and 7 unsupervised datasets.
### Intended uses & limitations

The model can generate descriptions for Python functions or be fine-tuned for other Python code tasks. It can handle unparsed and untokenized Python code, though tokenized code generally yields better performance.
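Fine-tuning details for downstream tasks are not given in this card. Purely as an illustration, the sketch below shows one way to fine-tune the released checkpoint on a code-to-description dataset with the Hugging Face `Seq2SeqTrainer`; the toy dataset, hyperparameters, and output path are assumptions rather than the authors' setup, and it uses `AutoModelForSeq2SeqLM` in place of the deprecated `AutoModelWithLMHead`.

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "SEBIS/code_trans_t5_base_source_code_summarization_python_multitask"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy dataset: pairs of tokenized Python code and target descriptions.
dataset = Dataset.from_dict({
    "code": ["def add ( a , b ) : return a + b"],
    "summary": ["Add two numbers."],
})

def preprocess(batch):
    # text_target requires a recent version of transformers (>= 4.21).
    model_inputs = tokenizer(batch["code"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="codetrans-finetuned",   # hypothetical output path
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=1e-4,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```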
### Training data

The supervised training tasks datasets can be downloaded from Link
### Training procedure

#### Multi-task Pretraining

The model was trained on a single TPU Pod V3-8 for a total of 260,000 steps, using a sequence length of 512 (batch size 4096). It has approximately 220M parameters in total and uses an encoder-decoder architecture. The optimizer used is AdaFactor with an inverse square root learning rate schedule for pre-training.
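The pre-training itself was run on TPUs with the T5 codebase, not with the snippet below. Purely as a rough PyTorch illustration of the optimizer settings described above, the `transformers` library's Adafactor implementation can be configured with relative (inverse-square-root style) step sizes; the exact hyperparameters here are assumptions, not the authors' configuration.

```python
from transformers import AutoModelForSeq2SeqLM
from transformers.optimization import Adafactor, AdafactorSchedule

model = AutoModelForSeq2SeqLM.from_pretrained(
    "SEBIS/code_trans_t5_base_source_code_summarization_python_multitask"
)

# Adafactor with relative step sizes, which decay roughly as the inverse
# square root of the step count (assumed settings, mirroring the schedule
# described above).
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,
)
# Proxy schedule object so a training loop can query the internally
# computed learning rate (e.g. for logging).
lr_scheduler = AdafactorSchedule(optimizer)
```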
### Evaluation results

For the source code summarization task, different models achieve the following results on different programming languages (in BLEU score):
| Language / Model      | Python | SQL   | C#    |
| --------------------- | ------ | ----- | ----- |
| CodeTrans-ST-Small    | 8.45   | 17.55 | 19.74 |
| CodeTrans-ST-Base     | 9.12   | 15.00 | 18.65 |
| CodeTrans-TF-Small    | 10.06  | 17.71 | 20.40 |
| CodeTrans-TF-Base     | 10.94  | 17.66 | 21.12 |
| CodeTrans-TF-Large    | 12.41  | 18.40 | 21.43 |
| CodeTrans-MT-Small    | 13.11  | 19.15 | 22.39 |
| CodeTrans-MT-Base     | 13.37  | 19.24 | 23.20 |
| CodeTrans-MT-Large    | 13.24  | 19.40 | 23.57 |
| CodeTrans-MT-TF-Small | 12.10  | 18.25 | 22.03 |
| CodeTrans-MT-TF-Base  | 10.64  | 16.91 | 21.40 |
| CodeTrans-MT-TF-Large | 12.14  | 19.98 | 21.10 |
| CODE-NN               | --     | 18.40 | 20.50 |
Created by Ahmed Elnaggar | LinkedIn and Wei Ding | LinkedIn