codet5-large-ntp-py: a free, open-source model for Python code understanding and generation

Codet5 Large Ntp Py

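Developed by Salesforce

CodeT5 is a large encoder-decoder model pretrained on Python code with the NTP (next-token prediction) objective, focused on code understanding and generation tasks.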

Large language model

Transformers

License: BSD-3-Clause #Python code generation #Multi-language pretraining #Identifier-aware

Downloads: 217

Released: 7/6/2022

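Model overview

CodeT5 is an identifier-aware, unified pretrained encoder-decoder model designed for code understanding and generation tasks. This version is the large model, further pretrained on Python code with the NTP (next-token prediction) objective.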

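Model features

Multi-stage pretraining

The model is trained in two stages, MSP (masked span prediction) followed by NTP (next-token prediction), covering both code understanding and code generation.

Large parameter count

A large model with 770M parameters, intended for more complex code generation tasks.

Python focus

The NTP stage is run on Python code only, and the checkpoint is evaluated on Python code generation.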
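The 770M parameter count gives a rough lower bound on the memory needed just to hold the weights. The sketch below does that arithmetic. `weight_memory_gb` is a hypothetical helper introduced here, 770M is treated as a round figure, and the estimate ignores activations, generation caches, and framework overhead.

```python
# Bytes used per parameter for common tensor dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params: int, dtype: str = "float32") -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# 770M is the approximate parameter count stated for this checkpoint.
for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_memory_gb(770_000_000, dtype):.2f} GB")
```

Under these assumptions the weights take about 3.08 GB in float32 and about 1.54 GB in float16.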

Model capabilities

Code completion

Code generation

Code understanding

Function-level code generation

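Use cases

Software development assistance

Code completion

Generates a complete function or method from a partial code snippet

Evaluated on the APPS benchmark (see Table 5 of the CodeRL paper)

Education

Helps programming learners understand code structure and generate example code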
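For the function-completion use case, a generated continuation can run past the end of the function and into unrelated top-level code. A minimal post-processing sketch follows. `truncate_to_function_body` is a hypothetical helper, and it assumes the completion is the text that follows a `def ...:` signature line and that the body is indented.

```python
def truncate_to_function_body(completion: str) -> str:
    """Keep lines up to (not including) the first new top-level statement."""
    kept = []
    for line in completion.splitlines():
        starts_at_column_zero = bool(line.strip()) and not line[0].isspace()
        # The very first line may be the remainder of the signature line,
        # so only later column-zero lines are treated as the end of the body.
        if starts_at_column_zero and kept:
            break
        kept.append(line)
    # Drop blank lines left dangling at the end of the body.
    while kept and not kept[-1].strip():
        kept.pop()
    return "\n".join(kept)


raw = "\n    print('hi')\n    return 1\n\nprint(hello_world())\n"
print("def hello_world():" + truncate_to_function_body(raw))
```

This is a heuristic: it does not parse Python, so multi-line strings or decorators starting at column zero inside the completion would cut the output early.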

🚀 CodeT5 (large-size model pretrained with the NTP objective on Python)

CodeT5 is a family of encoder-decoder language models for code, aimed at code understanding and generation tasks.

🚀 Quick start

The model can be loaded with the T5ForConditionalGeneration class:

from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the tokenizer and the 770M checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large-ntp-py")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large-ntp-py")

# Encode a Python prompt for the model to continue
text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate a single sequence
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
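The quick-start snippet returns a single sequence. For code generation it is often useful to draw several candidates and choose among them, so the sketch below wraps sampling in a helper. `sample_completions` and `demo` are hypothetical names introduced here, and the `temperature` and `top_p` defaults are illustrative assumptions, not values recommended for this checkpoint. `do_sample`, `temperature`, `top_p`, and `num_return_sequences` are standard `generate` arguments in Transformers.

```python
def sample_completions(model, tokenizer, prompt, num_samples=4,
                       max_length=128, temperature=0.8, top_p=0.95):
    """Return `num_samples` sampled continuations of `prompt` as strings."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated_ids = model.generate(
        input_ids,
        do_sample=True,                    # sample instead of deterministic decoding
        temperature=temperature,           # illustrative default (assumption)
        top_p=top_p,                       # nucleus sampling cutoff (assumption)
        num_return_sequences=num_samples,  # candidates drawn per prompt
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)


def demo():
    """Download the checkpoint and print a few candidates (needs network access)."""
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    name = "Salesforce/codet5-large-ntp-py"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = T5ForConditionalGeneration.from_pretrained(name)
    for i, code in enumerate(sample_completions(model, tokenizer, "def hello_world():")):
        print(f"--- candidate {i} ---")
        print(code)
```

Calling `demo()` loads the full checkpoint, so it is left uninvoked here. Because sampling is random, the candidates differ from run to run.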

✨ Key features

CodeT5 is a family of encoder-decoder language models from the paper CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. The checkpoint in this repository is CodeT5-large-ntp-py (770M), introduced by the paper CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning.

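📚 Detailed documentation

Training data

CodeT5-large-ntp-py was pretrained on CodeSearchNet data in six programming languages (Ruby/JavaScript/Go/Python/Java/PHP) and on GCPY (the Python split of Github Code). See Section 4.1 of the CodeRL paper for more details.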

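Training procedure

CodeT5-large-ntp-py was first pretrained with the masked span prediction (MSP) objective, for 150 epochs on CodeSearchNet and 10 epochs on GCPY. It was then pretrained for another 10 epochs on GCPY with the next token prediction (NTP) objective. See Section 4.1 of the CodeRL paper for more details.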
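To make the two objectives concrete, the sketch below builds one training pair of each kind from a toy token list. It is a simplified illustration, not the authors' training code. The `<extra_id_N>` sentinel format follows the T5 convention and is an assumption here, `msp_example` and `ntp_example` are hypothetical names, and the NTP split assumes a pivot-based scheme in which the text before a random pivot goes to the encoder and the remainder is the decoder target.

```python
import random


def msp_example(tokens, spans):
    """Masked span prediction: replace each (start, end) span with a sentinel.

    `spans` must be sorted and non-overlapping. Returns (source, target).
    """
    source, target, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"     # T5-style sentinel (assumption)
        source.extend(tokens[cursor:start])
        source.append(sentinel)          # the span is hidden in the encoder input
        target.append(sentinel)
        target.extend(tokens[start:end])  # the decoder must recover the span
        cursor = end
    source.extend(tokens[cursor:])
    return source, target


def ntp_example(tokens, rng):
    """Next token prediction: split at a random pivot (needs >= 2 tokens)."""
    pivot = rng.randint(1, len(tokens) - 1)  # both sides stay non-empty
    return tokens[:pivot], tokens[pivot:]


tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
print(msp_example(tokens, [(1, 2), (8, 10)]))
print(ntp_example(tokens, random.Random(0)))
```

With MSP the decoder only reconstructs the hidden spans, whereas with NTP it generates the whole remainder left to right, which is closer to how the model is used for completion.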

Evaluation results

We evaluated this checkpoint on the APPS benchmark. See Table 5 of the CodeRL paper for more details.
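Results on code generation benchmarks such as APPS are commonly reported as pass@k: the probability that at least one of k sampled programs passes all tests. The sketch below implements the unbiased estimator introduced with the Codex evaluation (Chen et al., 2021), 1 - C(n-c, k) / C(n, k), where n programs are sampled and c of them pass. Whether the reported numbers use exactly this estimator is an assumption to check against the paper; `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 minus the probability that a random size-k subset has no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=5, k=1))
print(pass_at_k(n=4, c=1, k=2))
```

Both calls print 0.5: with half of ten samples correct, a single draw passes half the time, and with one correct sample out of four, half of the six possible pairs contain it.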

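Ethical considerations

This release is for research purposes only, in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend that users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and follow best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly affect people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.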

📄 License

This project is released under the BSD 3-Clause License.

📚 Citation

@inproceedings{CodeT52021,
  author    = {Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi},
  title     = {CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  booktitle = {EMNLP},
  pages     = {8696--8708},
  publisher = {Association for Computational Linguistics},
  year      = {2021}
}

@article{CodeRL2022,
  author    = {Hung Le and Yue Wang and Akhilesh Deepak Gotmare and Silvio Savarese and Steven C. H. Hoi},
  title     = {CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
  journal   = {arXiv preprint},
  volume    = {abs/2207.01780},
  year      = {2022}
}