codet5-large-ntp-py: an open-source model for Python code understanding and generation

Codet5 Large Ntp Py

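Developed by Salesforce

CodeT5-large-ntp-py is a large encoder-decoder model pretrained on Python code with the next-token prediction (NTP) objective, aimed at code understanding and generation tasks.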

Large language model

Transformers

License: BSD-3-Clause #Python code generation #Multi-language pretraining #Identifier-aware

Downloads: 217

Released: 7/6/2022

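Model overview

CodeT5 is an identifier-aware, unified pretrained encoder-decoder model designed for code understanding and generation tasks. This release is the large model, further pretrained on Python code with the NTP (next-token prediction) objective.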

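Model features

Multi-stage pretraining

The model was trained in two stages, first with masked span prediction (MSP) and then with next-token prediction (NTP), to strengthen both code understanding and code generation.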
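To make the MSP objective more concrete, here is a toy sketch of T5-style span masking on whitespace-split tokens. It assumes sentinel tokens of the form `<extra_id_N>` and hand-picked spans; the helper name `mask_spans` is hypothetical, and the real pretraining pipeline (subword tokenization, random span sampling, a closing sentinel in the target) is not reproduced here.

```python
def mask_spans(tokens, spans):
    """Toy span masking: replace each (start, end) span with a sentinel.

    `spans` must be sorted, non-overlapping, half-open index pairs.
    Returns (encoder_input, decoder_target) as token lists.
    """
    source, target = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"  # assumed sentinel format
        source.extend(tokens[cursor:start])     # keep the unmasked tokens
        source.append(sentinel)                 # the span collapses to one sentinel
        target.append(sentinel)                 # the target lists each sentinel ...
        target.extend(tokens[start:end])        # ... followed by the hidden tokens
        cursor = end
    source.extend(tokens[cursor:])
    return source, target


tokens = "def add ( a , b ) : return a + b".split()
source, target = mask_spans(tokens, [(1, 2), (9, 12)])
print(" ".join(source))  # def <extra_id_0> ( a , b ) : return <extra_id_1>
print(" ".join(target))  # <extra_id_0> add <extra_id_1> a + b
```

The encoder sees the corrupted sequence and the decoder is trained to emit only the hidden spans, which is what distinguishes this stage from the later NTP stage.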

Large parameter count

With 770M parameters, the model is intended to handle more complex code generation tasks.
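As a rough guide to what 770M parameters means in practice, the following back-of-the-envelope sketch estimates the memory needed for the weights alone. It assumes the rounded figure of 770 million parameters and standard byte widths per dtype; activations, optimizer state, and framework overhead are not included, and `weight_memory_gb` is a hypothetical helper.

```python
# Rounded parameter count taken from the model description (assumption: exactly 770M).
NUM_PARAMS = 770_000_000

# Bytes used to store one parameter in common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.2f} GB")
```

Under these assumptions the weights take roughly 3 GB in float32 and about half that in a 16-bit format.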

Python focus

The model received additional pretraining on Python code and is geared toward Python code generation tasks.

Model capabilities

Code autocompletion

Code generation

Code understanding

Function-level code generation

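Use cases

Software development assistance

Code autocompletion

Generates a complete function or method from a partial code snippet

Evaluated on the APPS benchmark

Educational use

Helps programming learners understand code structure and generate example code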

🚀 CodeT5 (large-size model pretrained on Python with the NTP objective)

CodeT5 is a family of encoder-decoder language models for code, built for code understanding and generation tasks.

🚀 Quick start

The model can be loaded with the T5ForConditionalGeneration class:

from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the tokenizer and the 770M checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large-ntp-py")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large-ntp-py")

# Encode a partial function definition as the prompt
text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Generate a single sequence of at most 128 tokens and decode it
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
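Beyond a single greedy sequence, it is often useful to sample several candidates and trim each one to the function body. The sketch below shows one way to do that. The sampling values (`top_p`, `temperature`, `n`) are illustrative assumptions rather than recommended settings, and both helper names are hypothetical. The truncation rule (stop at the first non-blank line that starts at column 0) is a simple heuristic, not part of the model or the library.

```python
def truncate_to_function_body(completion: str) -> str:
    """Keep lines up to, but not including, the first unindented non-blank line.

    The first line is always kept, since it may continue the `def` line.
    """
    kept = []
    for i, line in enumerate(completion.splitlines()):
        if i > 0 and line.strip() and not line[0].isspace():
            break  # an unindented statement means the function body has ended
        kept.append(line)
    return "\n".join(kept).rstrip()


def sample_completions(prompt: str, n: int = 3,
                       model_name: str = "Salesforce/codet5-large-ntp-py") -> list:
    """Sample `n` completions for `prompt` and trim each to the function body."""
    # Imported lazily so the pure-Python helper above works without transformers.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated_ids = model.generate(
        input_ids,
        do_sample=True,          # sample instead of greedy decoding
        top_p=0.95,              # illustrative nucleus-sampling threshold
        temperature=0.6,         # illustrative temperature
        max_length=128,
        num_return_sequences=n,  # number of candidates to return
    )
    return [
        truncate_to_function_body(tokenizer.decode(ids, skip_special_tokens=True))
        for ids in generated_ids
    ]


# Example (downloads the checkpoint on first use):
# for candidate in sample_completions("def hello_world():"):
#     print(candidate)
```

Sampling several candidates trades extra compute for diversity, which helps when candidates are later filtered, for example by running tests against them.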

✨ Key features

CodeT5 is a family of encoder-decoder language models introduced in the paper CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. The checkpoint in this repository is CodeT5-large-ntp-py (770M), introduced in the paper CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning.

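📚 Documentation

Training data

CodeT5-large-ntp-py was pretrained on CodeSearchNet data in six programming languages (Ruby/JavaScript/Go/Python/Java/PHP) and on GCPY (the Python split of Github Code). See Section 4.1 of the CodeRL paper for details.

Training procedure

CodeT5-large-ntp-py was first pretrained with the masked span prediction (MSP) objective, for 150 epochs on CodeSearchNet and 10 epochs on GCPY. It was then pretrained with the next-token prediction (NTP) objective on GCPY for another 10 epochs. See Section 4.1 of the CodeRL paper for details.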
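For an encoder-decoder model, the NTP stage needs each training sample split into an encoder input and a decoder target. The sketch below illustrates one way to build such a pair: pick a random pivot, feed the prefix to the encoder, and train the decoder to produce the rest. The pivot bounds of 10% and 90% and the helper name `ntp_split` are assumptions made for this illustration; check Section 4.1 of the CodeRL paper for the exact procedure.

```python
import random


def ntp_split(tokens, rng, low=0.1, high=0.9):
    """Split `tokens` at a random pivot into (encoder_input, decoder_target).

    The pivot is drawn uniformly from roughly the [low, high] fraction of the
    sequence, and both sides are guaranteed to be non-empty.
    """
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens to split")
    lo = max(1, int(n * low))
    hi = max(lo, min(n - 1, int(n * high)))
    pivot = rng.randint(lo, hi)  # inclusive on both ends
    return tokens[:pivot], tokens[pivot:]


rng = random.Random(0)  # seeded only to make the illustration repeatable
tokens = "def add ( a , b ) : return a + b".split()
prefix, suffix = ntp_split(tokens, rng)
print("encoder input :", " ".join(prefix))
print("decoder target:", " ".join(suffix))
```

This mirrors how the checkpoint is used at inference time: the prompt goes to the encoder and the decoder continues the code.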

Evaluation results

We evaluated this checkpoint on the APPS benchmark. See Table 5 of the CodeRL paper for details.
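Program-synthesis benchmarks such as APPS are commonly scored with pass@k, the probability that at least one of k sampled programs passes all tests. The sketch below implements the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether Table 5 uses exactly this estimator is an assumption here, and the numbers in the test calls are made up for illustration, not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # 1 minus the probability that a random size-k subset has no correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up example: 10 samples for one problem, 1 of them correct.
print(pass_at_k(n=10, c=1, k=1))
print(pass_at_k(n=10, c=1, k=5))
```

The benchmark-level score is the average of this per-problem estimate over all problems.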

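Ethical considerations

This release is for research purposes only, in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend that users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and apply best practices when selecting use cases, particularly in high-risk scenarios where errors or misuse could significantly affect people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.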

📄 License

This project is released under the BSD 3-Clause license.

📚 Citation

@inproceedings{CodeT52021,
  author    = {Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi},
  title     = {CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  booktitle = {EMNLP},
  pages     = {8696--8708},
  publisher = {Association for Computational Linguistics},
  year      = {2021}
}

@article{CodeRL2022,
  author    = {Hung Le and Yue Wang and Akhilesh Deepak Gotmare and Silvio Savarese and Steven C. H. Hoi},
  title     = {CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
  journal   = {arXiv preprint},
  volume    = {abs/2207.01780},
  year      = {2022}
}