🚀 CodeT5+ 770M
CodeT5+ is a new family of open code large language models. It handles a wide range of code-related tasks through diverse pretraining objectives and a flexible encoder-decoder architecture design.
🚀 Quick Start
This model can be easily loaded using the `T5ForConditionalGeneration` class and employs the same tokenizer as the original CodeT5.
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Salesforce/codet5p-770m"
device = "cuda"  # use "cpu" if no GPU is available

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

# The <extra_id_0> sentinel marks the span for the model to fill in
inputs = tokenizer.encode("def print_hello_world():<extra_id_0>", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
✨ Features
- Flexible Architecture: CodeT5+ has an encoder-decoder architecture that can operate in different modes (i.e. encoder-only, decoder-only, and encoder-decoder) to support a wide range of code understanding and generation tasks (see the encoder-only sketch after this list).
- Diverse Pretraining Tasks: Compared to the original CodeT5 family, CodeT5+ is pretrained with a diverse set of pretraining tasks including span denoising, causal language modeling, contrastive learning, and text-code matching to learn rich representations from both unimodal code data and bimodal code-text data.
- Efficient Scaling: It employs a simple yet effective compute-efficient pretraining method that initializes the model components with frozen off-the-shelf LLMs such as CodeGen to efficiently scale up the model (i.e. 2B, 6B, 16B), and adopts a "shallow encoder and deep decoder" architecture.
- Instruction-Tuned: It is instruction-tuned to align with natural language instructions (see our InstructCodeT5+ 16B) following Code Alpaca.
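The flexible-architecture bullet above mentions an encoder-only mode. As a minimal sketch (not an official recipe), the snippet below runs only the encoder of this checkpoint via the `.encoder` attribute of `T5ForConditionalGeneration` and mean-pools the token states into a single code embedding; the mean-pooling choice is an illustrative assumption.

```python
import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Salesforce/codet5p-770m"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

# Encoder-only mode: run just the encoder to obtain contextual token representations
code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt").to(device)
with torch.no_grad():
    encoder_outputs = model.encoder(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
    )

# Mean-pool over non-padding tokens to get one embedding per input (illustrative pooling)
token_embeddings = encoder_outputs.last_hidden_state  # (1, seq_len, hidden_size)
mask = inputs.attention_mask.unsqueeze(-1)            # (1, seq_len, 1)
code_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)
print(code_embedding.shape)  # (1, hidden_size)
```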
📦 Installation
No separate installation step is needed beyond the transformers library (e.g. `pip install transformers`). The model is then loaded through the `T5ForConditionalGeneration` class, as shown in the Quick Start section above.
💻 Usage Examples
Basic Usage
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Salesforce/codet5p-770m"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def print_hello_world():<extra_id_0>", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Advanced Usage
The original model card does not provide a dedicated advanced usage example. For more complex tasks, refer to the official transformers documentation and adjust the generation parameters to your needs; an illustrative sketch is given below.
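The following sketch is not from the original card; it only illustrates how standard transformers generation parameters (beam search, nucleus sampling) can be applied to this checkpoint. The prompt and parameter values are arbitrary examples.

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Salesforce/codet5p-770m"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

prompt = "def bubble_sort(arr):<extra_id_0>"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Beam search: deterministic decoding that keeps the top-scoring beams
beam_outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    num_beams=5,
    num_return_sequences=3,
    early_stopping=True,
)

# Nucleus sampling: stochastic decoding for more diverse completions
sample_outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
)

for i, out in enumerate(beam_outputs):
    print(f"--- beam candidate {i} ---")
    print(tokenizer.decode(out, skip_special_tokens=True))
```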
📚 Documentation
Pretraining data
This checkpoint is trained on the stricter permissive subset of the deduplicated version of the [github-code dataset](https://huggingface.co/datasets/codeparrot/github-code). The data is preprocessed by retaining only permissively licensed code ("mit", "apache-2", "bsd-3-clause", "bsd-2-clause", "cc0-1.0", "unlicense", "isc"). Supported languages (9 in total) are as follows: c, c++, c-sharp, go, java, javascript, php, python, ruby.
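For illustration only (this is not the authors' preprocessing pipeline), the sketch below streams the github-code dataset and keeps examples whose license appears in the permissive list above. The `license`, `language`, and `path` column names follow the dataset card, and the exact license identifiers in the dataset may differ from the strings quoted here.

```python
from datasets import load_dataset

# Permissive licenses listed in this model card; exact identifiers in the
# dataset may differ slightly (check the dataset card).
PERMISSIVE = {"mit", "apache-2", "bsd-3-clause", "bsd-2-clause", "cc0-1.0", "unlicense", "isc"}

# Stream the dataset to avoid downloading all of it
ds = load_dataset("codeparrot/github-code", split="train", streaming=True)

permissive_code = ds.filter(lambda ex: ex["license"] in PERMISSIVE)

for example in permissive_code.take(3):
    print(example["language"], example["license"], example["path"])
```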
Training procedure
This checkpoint is trained on unimodal code data in the first-stage pretraining, which uses a diverse set of pretraining tasks including span denoising and two variants of causal language modeling. Please refer to the paper for more details.
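For intuition, the sketch below shows how a T5-style span-denoising example pairs a corrupted input with a sentinel-delimited target, matching the `<extra_id_0>` convention in the Quick Start prompt. It is a simplified illustration, not the actual pretraining data pipeline; the single-span masking and whitespace tokenization are assumptions.

```python
import random

def span_denoise(tokens, span_len=3, seed=0):
    """Mask one contiguous span with a T5 sentinel and return (input, target) strings.
    Simplified, single-span illustration of span denoising."""
    random.seed(seed)
    start = random.randrange(0, max(1, len(tokens) - span_len))
    masked = tokens[:start] + ["<extra_id_0>"] + tokens[start + span_len:]
    target = ["<extra_id_0>"] + tokens[start:start + span_len] + ["<extra_id_1>"]
    return " ".join(masked), " ".join(target)

code = "def greet ( name ) : print ( 'Hello' , name )".split()
source, target = span_denoise(code)
print("input :", source)   # corrupted code with <extra_id_0> in place of a span
print("target:", target)   # the masked span, delimited by sentinel tokens
```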
Evaluation results
CodeT5+ models have been comprehensively evaluated on a wide range of code understanding and generation tasks in various settings: zero-shot, finetuning, and instruction-tuning. Specifically, CodeT5+ yields substantial performance gains on many downstream tasks compared to their SoTA baselines, e.g., 8 text-to-code retrieval tasks (+3.2 avg. MRR), 2 line-level code completion tasks (+2.1 avg. Exact Match), and 2 retrieval-augmented code generation tasks (+5.8 avg. BLEU-4). In 2 math programming tasks on MathQA-Python and GSM8K-Python, CodeT5+ models of below billion-parameter sizes significantly outperform many LLMs of up to 137B parameters. Particularly, in the zero-shot text-to-code generation task on the HumanEval benchmark, InstructCodeT5+ 16B sets new SoTA results of 35.0% pass@1 and 54.5% pass@10 against other open code LLMs, even surpassing the closed-source OpenAI code-cushman-001 model. Please refer to the paper for more details.
🔧 Technical Details
CodeT5+ is introduced in the paper: CodeT5+: Open Code Large Language Models for Code Understanding and Generation by [Yue Wang](https://yuewang-cuhk.github.io/)*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)*, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, Steven C.H. Hoi (* indicates equal contribution).
📄 License
This project is licensed under the BSD-3-Clause license.
BibTeX entry and citation info
```bibtex
@article{wang2023codet5plus,
  title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
  author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
  journal={arXiv preprint},
  year={2023}
}
```
Ethical Considerations
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.