🚀 StarCoder GPTeacher-Codegen Fine-Tuned Model
This is a Transformer-based text generation model. It was fine-tuned on top of a pretrained model so that it can generate code from a given instruction, which makes it practical for code generation tasks.
🚀 Quick Start
This model is a fine-tune of bigcode/starcoder on the teknium1/GPTeacher codegen dataset (GPT-4 code instructions).
✨ Key Features
- Multi-language support: the base StarCoder model has 15.5B parameters and was trained on 80+ programming languages from The Stack (v1.2), with opt-out requests excluded.
- Advanced techniques: the model uses Multi Query Attention, an 8192-token context window, and was trained on 1 trillion tokens with a Fill-in-the-Middle objective (see the infilling sketch after this list).
- Resources:
    - Repository: [bigcode/Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
    - Project website: [bigcode-project.org](https://www.bigcode-project.org)
    - Paper: [💫StarCoder: May the source be with you!](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view)
    - Contact: [contact@bigcode-project.org](mailto:contact@bigcode-project.org)
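The Fill-in-the-Middle objective above describes how the base model was pretrained. As a minimal infilling sketch against the base bigcode/starcoder checkpoint (assuming its tokenizer exposes the `<fim_prefix>`/`<fim_suffix>`/`<fim_middle>` special tokens; this is not a usage pattern documented for the instruction-tuned model in this card):

```python
# Minimal Fill-in-the-Middle sketch for the base bigcode/starcoder model.
# The <fim_*> special tokens are assumed to be in the base tokenizer; this is
# illustrative only and not part of the instruction-tuned checkpoint's usage.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

prefix = "def print_hello():\n    "
suffix = "\n    return None\n"
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(fim_prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```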
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "GeorgiaTechResearchInstitute/starcoder-gpteacher-code-instruct"
device = "cuda"

# Alpaca-style prompt template with an instruction and an additional input field.
input_prompt = ("Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
                "### Instruction:\n"
                "{instruction}\n\n"
                "### Input:\n"
                "{input}\n\n"
                "### Response:")

prompt = "Please explain the following program."
extra_input = "send(to, from, count) register short *to, *from; register count; { register n = (count + 7) / 8; switch (count % 8) { case 0: do { *to = *from++; case 7: *to = *from++; case 6: *to = *from++; case 5: *to = *from++; case 4: *to = *from++; case 3: *to = *from++; case 2: *to = *from++; case 1: *to = *from++; } while (--n > 0); } }"
prompt = input_prompt.format_map({"instruction": prompt, "input": extra_input})

# Load the tokenizer and the fine-tuned model in half precision on the GPU.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True, torch_dtype=torch.float16).to(device)

inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
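The call above echoes the full prompt followed by the completion. As a minimal follow-on sketch (the generation settings here are illustrative assumptions, not values recommended in this card), you can cap the number of new tokens and decode only the generated response:

```python
# Continue from the snippet above: bound the output length and strip the prompt.
# max_new_tokens=256 and greedy decoding are illustrative choices, not prescribed values.
outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
```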
Example Output
Example output details
Full prompt:
```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What does this program do and how does it work?

### Input:
send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}

### Response:
```
Response:
This program is designed to efficiently copy bytes from one memory location to another. It copies 'count' bytes from the location pointed to by 'from' to the location pointed to by 'to', and works correctly regardless of address alignment. The function 'foo' is a wrapper function that initializes the 'to' and 'from' pointers, checks for proper memory bounds, and performs byte alignment if necessary.
To use this program, you must declare two pointer variables, 'to' and 'from', and then call the 'foo' function with the following parameters:
foo(to, from, count)
Attribution and Other Requirements
The pretraining dataset was filtered to permissively licensed content only. Nevertheless, the model can generate source code verbatim from that dataset, and the code's license may require attribution and/or compliance with other specific terms, which must be honored. The BigCode project provides a search index that lets you search the pretraining data to identify where generated code came from and apply the proper attribution.
📚 Documentation
Intended Use
The base model was trained on GitHub code and then fine-tuned to follow instructions. Prompts such as "Write a function that computes the square root" should work reasonably well. The original repository suggests using the [Tech Assistant prompt](https://huggingface.co/datasets/bigcode/ta-prompt) for few-shot prompting so the model behaves like a technical assistant. This fine-tuned model uses the [Alpaca prompt](https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py).
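For instructions that need no additional input, a common Alpaca-style variant omits the ### Input block; the template below is a sketch based on the standard Alpaca format rather than something stated in this card:

```python
# Instruction-only Alpaca-style template (an assumption based on the standard
# Alpaca format; this card itself only shows the instruction + input variant).
no_input_prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)
prompt = no_input_prompt.format_map({"instruction": "Write a function that computes the square root."})
```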
Limitations
The model was trained on source code in 80+ programming languages. The source is predominantly in English, though other languages are present as well. The model can therefore generate code snippets given some context, but the generated code is not guaranteed to work as intended: it can be inefficient and can contain bugs or vulnerabilities. See the [original paper](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) for an in-depth discussion of the model's limitations. The fine-tuning makes the model more responsive to direct user input, but this is an early attempt at instruction-tuning the StarCoder model, and the results may not reflect its full potential.
Training
Model Parameters

| Attribute | Details |
|---|---|
| Architecture | GPT-2 model with Multi Query Attention and a Fill-in-the-Middle objective |
| Pretraining steps | 250k |
| Pretraining tokens | 1 trillion |
| Precision | bfloat16 |
| Fine-tuning instruction-response pairs | 4.5k |
| Fine-tuning context length | 1024 |
| Fine-tuning epochs | 3 |
| Fine-tuning learning rate | 2e-5 |
| Fine-tuning optimization | FSDP |
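As a rough sketch of how the fine-tuning rows above could map onto Hugging Face `TrainingArguments` (the actual training script, batch size, and FSDP wrapping details are not given in this card, so everything beyond the tabled values is an assumption):

```python
# Hypothetical mapping of the reported fine-tuning settings onto TrainingArguments.
# Only epochs, learning rate, precision, and the use of FSDP come from the table;
# every other argument (output_dir, sharding policy, etc.) is an assumption.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./starcoder-gpteacher-code-instruct",  # hypothetical path
    num_train_epochs=3,            # fine-tuning epochs from the table
    learning_rate=2e-5,            # fine-tuning learning rate from the table
    bf16=True,                     # bfloat16 precision listed in the table
    fsdp="full_shard auto_wrap",   # FSDP is the reported method; the exact policy is assumed
)
```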
Hardware

| Attribute | Details |
|---|---|
| GPUs | 8 × Tesla A100 |
| Training time | 5 hours |
📄 License
The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement). Because this model was also fine-tuned on outputs of OpenAI's GPT-4, it is additionally subject to [OpenAI's Terms of Use](https://openai.com/policies/terms-of-use).
Citation
The Hugging Face repository for the base model can be found here.
```bibtex
@article{li2023starcoder,
title={StarCoder: may the source be with you!},
author={Raymond Li and Loubna Ben Allal and Yangtian Zi and Niklas Muennighoff and Denis Kocetkov and Chenghao Mou and Marc Marone and Christopher Akiki and Jia Li and Jenny Chim and Qian Liu and Evgenii Zheltonozhskii and Terry Yue Zhuo and Thomas Wang and Olivier Dehaene and Mishig Davaadorj and Joel Lamy-Poirier and João Monteiro and Oleh Shliazhko and Nicolas Gontier and Nicholas Meade and Armel Zebaze and Ming-Ho Yee and Logesh Kumar Umapathi and Jian Zhu and Benjamin Lipkin and Muhtasham Oblokulov and Zhiruo Wang and Rudra Murthy and Jason Stillerman and Siva Sankalp Patel and Dmitry Abulkhanov and Marco Zocca and Manan Dey and Zhihan Zhang and Nour Fahmy and Urvashi Bhattacharyya and Wenhao Yu and Swayam Singh and Sasha Luccioni and Paulo Villegas and Maxim Kunakov and Fedor Zhdanov and Manuel Romero and Tony Lee and Nadav Timor and Jennifer Ding and Claire Schlesinger and Hailey Schoelkopf and Jan Ebert and Tri Dao and Mayank Mishra and Alex Gu and Jennifer Robinson and Carolyn Jane Anderson and Brendan Dolan-Gavitt and Danish Contractor and Siva Reddy and Daniel Fried and Dzmitry Bahdanau and Yacine Jernite and Carlos Muñoz Ferrandis and Sean Hughes and Thomas Wolf and Arjun Guha and Leandro von Werra and Harm de Vries},
year={2023},
eprint={2305.06161},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Open LLM Leaderboard Evaluation Results
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_GeorgiaTechResearchInstitute__starcoder-gpteacher-code-instruct).

| Metric | Value |
|---|---|
| Average | 32.57 |
| ARC (25-shot) | 32.68 |
| HellaSwag (10-shot) | 47.6 |
| MMLU (5-shot) | 28.63 |
| TruthfulQA (0-shot) | 40.41 |
| Winogrande (5-shot) | 55.56 |
| GSM8K (5-shot) | 0.0 |
| DROP (3-shot) | 23.11 |



