codet5-large-ntp-py: Open-Source Model for Python Code Understanding and Generation

CodeT5-large-ntp-py

Developed by Salesforce

CodeT5-large-ntp-py is a large encoder-decoder model pretrained with the Next Token Prediction (NTP) objective on Python code, targeting code understanding and generation tasks.

Large Language Model

Transformers

Open Source License: BSD-3-Clause #Python code generation #Multi-language pre-training #Identifier-aware

Downloads 217

Release Time: 7/6/2022

Model Overview

CodeT5 is an identifier-aware unified pre-trained encoder-decoder model designed for code understanding and generation tasks. This checkpoint is the large-size model pretrained with the NTP (Next Token Prediction) objective on Python code.

Model Features

Multi-stage pre-training

The model was trained in two phases, first with Masked Span Prediction (MSP) and then with Next Token Prediction (NTP), to strengthen both code understanding and code generation.

Large-scale parameters

A large-scale model with 770M parameters capable of handling complex code generation tasks

Python language specialization

The final NTP training phase used Python code only (the GCPY dataset), so the checkpoint is specialized for Python code generation.
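To put the 770M parameter count listed above into practical terms, the following sketch estimates how much memory the weights alone occupy in a few numeric formats. It is plain arithmetic, not a measurement: activations, generation caches, and framework overhead are excluded, and the helper name `weight_memory_gib` is made up for this example.

```python
# Rough weight-memory estimate for a 770M-parameter model.
# Counts the weights only; activations and runtime overhead are NOT included.

PARAM_COUNT = 770_000_000  # parameter count quoted in this model card

# Bytes needed to store one parameter in common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(param_count: int, dtype: str) -> float:
    """Return the size of the weights in GiB for the given dtype."""
    return param_count * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(PARAM_COUNT, dtype):.2f} GiB")
```

Actual memory use when running the model will be higher than these figures.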

Model Capabilities

Code auto-completion

Code generation

Code understanding

Function-level code generation

Use Cases

Software development assistance

Code auto-completion

Automatically generates complete functions or methods based on partial code snippets

Evaluated on the APPS benchmark (see Table 5 of the CodeRL paper)

Educational purposes

Helps programming learners understand code structure and generate example code

🚀 CodeT5 (large-size model pretrained with NTP objective on Python)

CodeT5 is an encoder-decoder language model family for code, offering powerful capabilities for code understanding and generation.

🚀 Quick Start

This model can be loaded with the T5ForConditionalGeneration class from the transformers library:

from transformers import AutoTokenizer, T5ForConditionalGeneration
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large-ntp-py")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large-ntp-py")
text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate a single sequence
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
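The snippet above produces a single completion. When several candidates are wanted for the same prompt, sampling can be enabled through the standard `generate` arguments. The following is a minimal sketch, assuming the `model` and `tokenizer` objects from the Quick Start snippet; the helper name `sample_completions` and the default sampling values are illustrative choices, not settings taken from the model card or the papers.

```python
from typing import List


def sample_completions(model, tokenizer, prompt: str,
                       num_samples: int = 4,
                       max_length: int = 128,
                       temperature: float = 0.8,
                       top_p: float = 0.95) -> List[str]:
    """Sample several candidate completions for one prompt.

    `model` and `tokenizer` are expected to be the objects created in the
    Quick Start snippet. The default values here are assumptions.
    """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated_ids = model.generate(
        input_ids,
        do_sample=True,                    # sample instead of greedy decoding
        temperature=temperature,           # values below 1.0 sharpen the distribution
        top_p=top_p,                       # nucleus sampling cutoff
        num_return_sequences=num_samples,  # one row of ids per candidate
        max_length=max_length,
    )
    # Decode every candidate in one call.
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
```

For example, `sample_completions(model, tokenizer, "def hello_world():")` returns a list of four strings. Because sampling is random, the candidates differ between runs.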

✨ Features

CodeT5 is a family of encoder-decoder language models for code from the paper: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi.

The checkpoint included in this repository is denoted as CodeT5-large-ntp-py (770M). It was introduced in the paper: CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi.

📚 Documentation

Model description

Training data

CodeT5-large-ntp-py was pretrained on CodeSearchNet data in six programming languages (Ruby/JavaScript/Go/Python/Java/PHP) and on GCPY (the Python split of GitHub Code) data. See Section 4.1 of the CodeRL paper for more details.

Training procedure

CodeT5-large-ntp-py was first pretrained using the Masked Span Prediction (MSP) objective on CodeSearchNet for 150 epochs and on GCPY for 10 epochs, followed by another 10 epochs on GCPY using the Next Token Prediction (NTP) objective. See Section 4.1 of the CodeRL paper for more details.
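The difference between the two objectives can be seen on a toy token list. This is a minimal illustration, not the authors' training code: the `<extra_id_N>` sentinel names follow the T5 convention, and the helper names `msp_mask` and `ntp_split`, the hand-picked spans, and the 10% to 90% pivot range are assumptions made for this example.

```python
import random
from typing import List, Tuple


def msp_mask(tokens: List[str],
             spans: List[Tuple[int, int]]) -> Tuple[List[str], List[str]]:
    """Masked Span Prediction: replace each [start, end) span with a sentinel.

    `spans` must be sorted and non-overlapping. Returns (encoder input,
    decoder target); the target lists each sentinel followed by the tokens
    it hides.
    """
    source, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"          # T5-style sentinel (assumed naming)
        source.extend(tokens[cursor:start])   # keep tokens before the span
        source.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])      # hidden tokens go to the target
        cursor = end
    source.extend(tokens[cursor:])            # keep the tail after the last span
    return source, target


def ntp_split(tokens: List[str], rng: random.Random,
              low: float = 0.1, high: float = 0.9) -> Tuple[List[str], List[str]]:
    """Next Token Prediction: cut the sequence at a random pivot.

    Tokens before the pivot are the encoder input; the remainder is what the
    decoder must generate. Expects at least two tokens. The pivot range is an
    assumption for this sketch.
    """
    n = len(tokens)
    lo = max(1, int(n * low))
    hi = min(n - 1, int(n * high))
    pivot = rng.randint(lo, max(lo, hi))
    return tokens[:pivot], tokens[pivot:]


tokens = "def add ( a , b ) : return a + b".split()

src, tgt = msp_mask(tokens, [(1, 2), (8, 11)])
print("MSP input :", " ".join(src))
print("MSP target:", " ".join(tgt))

prefix, suffix = ntp_split(tokens, random.Random(0))
print("NTP input :", " ".join(prefix))
print("NTP target:", " ".join(suffix))
```

MSP asks the decoder to fill in scattered gaps, while NTP asks it to continue a prefix to the end, which is closer to how the model is used for completion.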

Evaluation results

We evaluated this checkpoint on the APPS benchmark. See Table 5 of the CodeRL paper for more details.
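This page does not say which metric the table reports. Code generation benchmarks that run unit tests, such as APPS, are often scored with pass@k, so the sketch below shows the widely used unbiased pass@k estimator as background. It is an assumption that this metric applies here, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k).

    n: number of samples generated for the problem
    c: number of those samples that pass all unit tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples were generated and 3 of them passed.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The benchmark score is the mean of this value over all problems.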

🔧 Technical Details

The model was pretrained with the MSP objective on CodeSearchNet and GCPY, then with the NTP objective on GCPY, as described in the Training data and Training procedure sections. Evaluation was conducted on the APPS benchmark.

📄 License

This model is released under the BSD 3-Clause license.

⚠️ Important Note

This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.

📖 BibTeX entry and citation info

@inproceedings{CodeT52021,
  author    = {Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi},
  title     = {CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  booktitle = {EMNLP},
  pages     = {8696--8708},
  publisher = {Association for Computational Linguistics},
  year      = {2021}
}

@article{CodeRL2022,
  author    = {Hung Le and Yue Wang and Akhilesh Deepak Gotmare and Silvio Savarese and Steven C. H. Hoi},
  title     = {CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
  journal   = {arXiv preprint},
  volume    = {abs/2207.01780},
  year      = {2022}
}
