🚀 CodeTrans Model for Source Code Summarization in C#
A pre-trained model for source code summarization in C#, using the t5-large model architecture.
This is a pre-trained model on the programming language C# that uses the t5-large model architecture. It was first released in this repository. The model is trained on tokenized C# code functions and works best on tokenized C# functions.
✨ Features
Model description
This CodeTrans model is based on the t5-large model. It has its own SentencePiece vocabulary model. It was trained with multi-task training on 13 supervised tasks in the software development domain and 7 unsupervised datasets.
Intended uses & limitations
The model can be used to generate descriptions for C# functions or be fine-tuned for other C# code tasks. It can handle unparsed and untokenized C# code, but it performs better when the C# code is tokenized, as illustrated by the sketch below.
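The original card does not describe the tokenizer used to prepare the training data, so the helper below is only an illustrative sketch: it space-separates identifiers, numbers, and punctuation so that the input resembles the tokenized example shown in the usage section. The function name rough_tokenize_csharp and the regex are assumptions for illustration, not part of the original pipeline.
import re

def rough_tokenize_csharp(code: str) -> str:
    # Illustrative only: split identifiers, numbers, and single
    # punctuation/operator characters, then re-join them with spaces.
    tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*|\d+|\S", code)
    return " ".join(tokens)

print(rough_tokenize_csharp("public static int Add(int a, int b) { return a + b; }"))
# public static int Add ( int a , int b ) { return a + b ; }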
💻 Usage Examples
Basic Usage
Here is how to use this model to generate C# function documentation with the Transformers SummarizationPipeline:
from transformers import AutoTokenizer, AutoModelWithLMHead, SummarizationPipeline

pipeline = SummarizationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/code_trans_t5_large_source_code_summarization_csharp_multitask"),
    tokenizer=AutoTokenizer.from_pretrained("SEBIS/code_trans_t5_large_source_code_summarization_csharp_multitask", skip_special_tokens=True),
    device=0  # first GPU; use device=-1 to run on CPU
)

tokenized_code = "public static DateTime ParseUnixDateTime ( double unixTime ) { var dt = new DateTime ( CODE_INTEGER , CODE_INTEGER , CODE_INTEGER , CODE_INTEGER , CODE_INTEGER , CODE_INTEGER , CODE_INTEGER , System . DateTimeKind . Utc ) ; dt = dt . AddSeconds ( unixTimeStamp ) . ToLocalTime ( ) ; return dt ; }"
pipeline([tokenized_code])  # returns a list of summaries, e.g. [{"summary_text": "..."}]
You can run this example in the Colab notebook.
📚 Documentation
Training data
The datasets for the supervised training tasks can be downloaded from Link
Training procedure
Multi-task Pretraining
The model was trained on a single TPU Pod V3-8 for a total of 120,000 steps, using a sequence length of 512 (batch size 4096). It has approximately 220M parameters in total and uses the encoder-decoder architecture. Pre-training used the AdaFactor optimizer with an inverse square root learning rate schedule.
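The exact pre-training code is not part of this card. As a rough, non-authoritative sketch, this is one way to set up a comparable Adafactor optimizer with its built-in inverse-square-root style schedule when fine-tuning the checkpoint with the transformers library; the arguments shown are library defaults for this mode, not the confirmed pre-training hyperparameters.
from transformers import AutoModelWithLMHead, Adafactor
from transformers.optimization import AdafactorSchedule

model = AutoModelWithLMHead.from_pretrained(
    "SEBIS/code_trans_t5_large_source_code_summarization_csharp_multitask"
)

# With relative_step=True and warmup_init=True, Adafactor uses its
# internal inverse-square-root style learning rate, so lr stays None.
optimizer = Adafactor(
    model.parameters(),
    lr=None,
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
)
lr_scheduler = AdafactorSchedule(optimizer)  # exposes the current relative-step learning rate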
Evaluation results
For the source code summarization tasks, different models achieve the following results on different programming languages (in BLEU score):
Test results:
| Language / Model | Python | SQL | C# |
| --- | --- | --- | --- |
| CodeTrans-ST-Small | 8.45 | 17.55 | 19.74 |
| CodeTrans-ST-Base | 9.12 | 15.00 | 18.65 |
| CodeTrans-TF-Small | 10.06 | 17.71 | 20.40 |
| CodeTrans-TF-Base | 10.94 | 17.66 | 21.12 |
| CodeTrans-TF-Large | 12.41 | 18.40 | 21.43 |
| CodeTrans-MT-Small | 13.11 | 19.15 | 22.39 |
| CodeTrans-MT-Base | 13.37 | 19.24 | 23.20 |
| CodeTrans-MT-Large | 13.24 | 19.40 | 23.57 |
| CodeTrans-MT-TF-Small | 12.10 | 18.25 | 22.03 |
| CodeTrans-MT-TF-Base | 10.64 | 16.91 | 21.40 |
| CodeTrans-MT-TF-Large | 12.14 | 19.98 | 21.10 |
| CODE-NN | -- | 18.40 | 20.50 |
Created by Ahmed Elnaggar | LinkedIn and Wei Ding | LinkedIn