🚀 StarCoder GPTeacher-Codegen Fine-Tuned Model
This is a Transformer-based text-generation model, fine-tuned from a pretrained checkpoint so that it can generate code from a given instruction, which makes it practical for code-generation use cases.
🚀 Quick Start
This model was fine-tuned from bigcode/starcoder on the teknium1/GPTeacher codegen dataset (GPT-4 code-instruction fine-tuning).
✨ Key Features
- Multi-language support: the base StarCoder model has 15.5B parameters and was trained on 80+ programming languages from The Stack (v1.2), with opt-out requests excluded.
- Modern training techniques: the model uses Multi Query Attention, an 8192-token context window, and was trained on 1 trillion tokens with the Fill-in-the-Middle objective (a minimal FIM prompt sketch follows this list).
- Resources:
  - Repository: [bigcode/Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
  - Project website: [bigcode-project.org](https://www.bigcode-project.org)
  - Paper: [💫StarCoder: May the source be with you!](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view)
  - Contact: [contact@bigcode-project.org](mailto:contact@bigcode-project.org)
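As a minimal sketch of the Fill-in-the-Middle objective mentioned above: the base bigcode/starcoder tokenizer defines the special tokens `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>`, and an infilling prompt is typically assembled as shown below. Verify the token names against the tokenizer before relying on this; the fine-tuned checkpoint described in this card is primarily meant for instruction prompts rather than FIM.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: a Fill-in-the-Middle (FIM) prompt for the *base* StarCoder model.
# The special-token names below come from the bigcode/starcoder tokenizer.
checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prefix = "def fibonacci(n):\n    "   # code before the hole
suffix = "\n    return result\n"     # code after the hole
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(fim_prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))  # the generated middle follows <fim_middle>
```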
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "GeorgiaTechResearchInstitute/starcoder-gpteacher-code-instruct"
device = "cuda"  # the model also runs on CPU, but generation will be slow

# Alpaca-style instruction/input/response template used for fine-tuning
input_prompt = ("Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
                "### Instruction:\n"
                "{instruction}\n\n"
                "### Input:\n"
                "{input}\n\n"
                "### Response:")

prompt = "Please explain the following program."
extra_input = "send(to, from, count) register short *to, *from; register count; { register n = (count + 7) / 8; switch (count % 8) { case 0: do { *to = *from++; case 7: *to = *from++; case 6: *to = *from++; case 5: *to = *from++; case 4: *to = *from++; case 3: *to = *from++; case 2: *to = *from++; case 1: *to = *from++; } while (--n > 0); } }"
prompt = input_prompt.format_map({"instruction": prompt, "input": extra_input})

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True, torch_dtype=torch.float16).to(device)

inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
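The call above uses the default generation settings. In practice you will usually want to bound the output length and control sampling; the snippet below is a sketch using standard `generate` keyword arguments, with illustrative values rather than settings recommended by the model authors.

```python
# Sketch: explicit decoding settings (values are illustrative assumptions)
outputs = model.generate(
    inputs,
    max_new_tokens=256,                   # cap the length of the response
    do_sample=True,                       # sample instead of greedy decoding
    temperature=0.2,                      # low temperature keeps code focused
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```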
Example Output
Full prompt:
```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What does this program do and how does it work?

### Input:
send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}

### Response:
```
Response:
This program is designed to efficiently copy bytes from one memory location to another. It copies 'count' bytes from the location pointed to by 'from' to the location pointed to by 'to', and works correctly regardless of address alignment. The function 'foo' is a wrapper function that initializes the 'to' and 'from' pointers, checks for proper memory bounds, and performs byte alignment if necessary.
To use this program, you must declare two pointer variables, 'to' and 'from', and then call the 'foo' function with the following parameters:
foo(to, from, count)
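The C snippet in the input is the classic "Duff's device" loop-unrolling idiom. As a plain-Python sketch of what the routine accomplishes (not part of the original card, and with a hypothetical `to_register.write()` standing in for the fixed destination that the C code never increments):

```python
# Sketch: the C routine writes `count` successive values read from `from`
# to a single destination (`to` is never incremented in the C code, which
# historically pointed at a memory-mapped output register).
def send(to_register, from_buffer, count):
    for i in range(count):  # edge case: the C version mishandles count == 0
        to_register.write(from_buffer[i])
```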
Attribution and Other Requirements
The pretraining dataset was filtered for permissively licensed content only. Nevertheless, the model can produce source code verbatim from the dataset, and that code's license may require attribution and/or compliance with other specific terms, which must be honored. The BigCode project provides a search index that lets you search the pretraining data to identify where generated code came from and apply the proper attribution.
📚 Documentation
Intended Use
The base model was trained on GitHub code and then fine-tuned to follow instructions. Prompts such as "Write a function that computes the square root" should work reasonably well. The original repository suggests few-shot prompting with the [Tech Assistant prompt](https://huggingface.co/datasets/bigcode/ta-prompt) to make the base model behave like a technical assistant. This fine-tuned model uses the [Alpaca prompts](https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py).
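The Basic Usage section above shows the Alpaca template with an `### Input:` block. For instruction-only prompts, the standard Alpaca template drops that block; whether this fine-tune was trained on that variant is not stated in the card, so treat the sketch below (which reuses `tokenizer`, `model`, and `device` from Basic Usage) as an assumption to verify.

```python
# Sketch: instruction-only Alpaca template (assumed; the card only
# demonstrates the instruction + input variant).
NO_INPUT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)

prompt = NO_INPUT_TEMPLATE.format(instruction="Write a function that computes the square root.")
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```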
Limitations
The model was trained on source code in 80+ programming languages. The source is predominantly in English, though other languages are present as well. The model can therefore generate code snippets given some context, but the generated code is not guaranteed to work as intended; it may be inefficient or contain bugs or vulnerabilities. See the [original paper](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) for an in-depth discussion of the model's limitations. The fine-tuning makes the model more responsive to direct user input, but this is an early attempt at instruction-tuning the StarCoder model and the results may not reflect the model's full potential.
Training
Model
| Attribute | Details |
| --- | --- |
| Architecture | GPT-2 model with Multi Query Attention and a Fill-in-the-Middle objective |
| Pretraining steps | 250k |
| Pretraining tokens | 1 trillion |
| Precision | bfloat16 |
| Fine-tuning instruction-response pairs | 4.5k |
| Fine-tuning context length | 1024 |
| Fine-tuning epochs | 3 |
| Fine-tuning learning rate | 2e-5 |
| Fine-tuning optimizations | FSDP |
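The fine-tuning hyperparameters above can be approximated with the Hugging Face `Trainer`; the configuration below is a rough sketch, where the output directory and batch size are illustrative assumptions rather than values reported in this card, and the 1024-token context length would be enforced when tokenizing the dataset.

```python
from transformers import TrainingArguments

# Sketch mirroring the fine-tuning table above (assumptions noted inline).
training_args = TrainingArguments(
    output_dir="starcoder-gpteacher-finetune",  # assumed name
    num_train_epochs=3,                         # from the table
    learning_rate=2e-5,                         # from the table
    bf16=True,                                  # bfloat16 precision
    per_device_train_batch_size=1,              # assumed
    fsdp="full_shard",                          # FSDP, per the table
)
```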
Hardware
| Attribute | Details |
| --- | --- |
| GPUs | 8 × Tesla A100 |
| Training time | 5 hours |
📄 License
The model is licensed under the BigCode OpenRAIL-M v1 license agreement; the full agreement is available [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement). Because the model was also fine-tuned on outputs from OpenAI's GPT-4, it is additionally subject to [OpenAI's Terms of Use](https://openai.com/policies/terms-of-use).
Citation
The Hugging Face repository for the base model can be found here.
@article{li2023starcoder,
title={StarCoder: may the source be with you!},
author={Raymond Li and Loubna Ben Allal and Yangtian Zi and Niklas Muennighoff and Denis Kocetkov and Chenghao Mou and Marc Marone and Christopher Akiki and Jia Li and Jenny Chim and Qian Liu and Evgenii Zheltonozhskii and Terry Yue Zhuo and Thomas Wang and Olivier Dehaene and Mishig Davaadorj and Joel Lamy-Poirier and João Monteiro and Oleh Shliazhko and Nicolas Gontier and Nicholas Meade and Armel Zebaze and Ming-Ho Yee and Logesh Kumar Umapathi and Jian Zhu and Benjamin Lipkin and Muhtasham Oblokulov and Zhiruo Wang and Rudra Murthy and Jason Stillerman and Siva Sankalp Patel and Dmitry Abulkhanov and Marco Zocca and Manan Dey and Zhihan Zhang and Nour Fahmy and Urvashi Bhattacharyya and Wenhao Yu and Swayam Singh and Sasha Luccioni and Paulo Villegas and Maxim Kunakov and Fedor Zhdanov and Manuel Romero and Tony Lee and Nadav Timor and Jennifer Ding and Claire Schlesinger and Hailey Schoelkopf and Jan Ebert and Tri Dao and Mayank Mishra and Alex Gu and Jennifer Robinson and Carolyn Jane Anderson and Brendan Dolan-Gavitt and Danish Contractor and Siva Reddy and Daniel Fried and Dzmitry Bahdanau and Yacine Jernite and Carlos Muñoz Ferrandis and Sean Hughes and Thomas Wolf and Arjun Guha and Leandro von Werra and Harm de Vries},
year={2023},
eprint={2305.06161},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Open LLM Leaderboard Evaluation Results
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_GeorgiaTechResearchInstitute__starcoder-gpteacher-code-instruct).
| Metric | Value |
| --- | --- |
| Average | 32.57 |
| ARC (25-shot) | 32.68 |
| HellaSwag (10-shot) | 47.6 |
| MMLU (5-shot) | 28.63 |
| TruthfulQA (0-shot) | 40.41 |
| Winogrande (5-shot) | 55.56 |
| GSM8K (5-shot) | 0.0 |
| DROP (3-shot) | 23.11 |
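As a quick sanity check (not part of the original card), the reported average is simply the arithmetic mean of the seven benchmark scores above:

```python
# The "Average" row equals the mean of the seven benchmark scores.
scores = [32.68, 47.6, 28.63, 40.41, 55.56, 0.0, 23.11]
print(round(sum(scores) / len(scores), 2))  # 32.57
```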



