StarCoder2-7B: Open-Source Code Generation Model with 17-Language Support and a Long Context Window

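Developed by BigCode

StarCoder2-7B is a 7-billion-parameter code generation model trained on 17 programming languages, with a context window of 16,384 tokens.

Large language model

Transformers

License: other open-source license (OpenRAIL) #code generation #17 programming languages #trained on 3.5+ trillion tokens

Downloads: 58.21k

Released: 2/20/2024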

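Model overview

The model focuses on code generation. It was trained on GitHub code and other selected data sources, and it is suited to generating code snippets rather than following instructions.

Model features

Long-context support

A 16,384-token context window with sliding-window attention over 4,096 tokens

Efficient training

Trained on 3.5+ trillion tokens with the Fill-in-the-Middle objective

Multi-language support

Code generation in 17 programming languages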
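
The Fill-in-the-Middle objective means the model can be asked to fill a gap between a known prefix and a known suffix. The sketch below shows how such a prompt might be assembled. It assumes the tokenizer defines the sentinel tokens `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` (as the original StarCoder does), so check the tokenizer's special tokens before relying on it. `build_fim_prompt` is a hypothetical helper, not part of any library.

```python
# Sentinel strings are an assumption; verify them against the tokenizer's
# special tokens before use.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange the pieces in prefix-suffix-middle order.

    The model is then expected to generate the missing middle part
    after the final sentinel.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(1, 2))\n",
)
print(prompt)
```

The resulting string would be tokenized and passed to `model.generate` like any other prompt; the text generated after `<fim_middle>` is the proposed infill.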

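Model capabilities

Code autocompletion

Function generation

Code snippet generation

Use cases

Software development

Function generation

Generates a function implementation from its signature

Reaches 35.4% pass@1 on HumanEval

Code completion

Provides code completion suggestions inside an IDE

Reaches an edit similarity of 72.07 on RepoBench-v1.1

Education

Programming learning aid

Provides code examples and solutions for learners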
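
For readers unfamiliar with the pass@1 figure quoted above, here is a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval-style benchmarks. Whether the reported 35.4% was computed with exactly this procedure is not stated on this page, and the per-problem counts below are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed the unit tests
    k: budget of attempts being scored
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    # 1 - probability that a random k-subset contains only failing samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem results: (samples drawn, samples passing).
results = [(10, 3), (10, 0), (10, 10)]
scores = [pass_at_k(n, c, k=1) for n, c in results]
benchmark_pass_at_1 = sum(scores) / len(scores)
print(f"pass@1 = {benchmark_pass_at_1:.3f}")
```

With k=1 the estimator reduces to the fraction of passing samples per problem, averaged over all problems.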

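🚀 StarCoder2

StarCoder2-7B is a 7-billion-parameter model trained on code in 17 programming languages. It can be used for tasks such as code generation.

🚀 Quick start

Install dependencies

First, make sure to install transformers from source: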

pip install git+https://github.com/huggingface/transformers.git

Run the model

Running the model on CPU/GPU/multi-GPU

Using full precision

# pip install git+https://github.com/huggingface/transformers.git # TODO: merge PR to main
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-7b"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 29232.57 MB

Using torch.bfloat16

# pip install accelerate
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/starcoder2-7b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# for fp16 use `torch_dtype=torch.float16` instead
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 14616.29 MB

Using quantized versions through `bitsandbytes`

Using 8-bit precision (int8)

# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# to use 4bit use `load_in_4bit=True` instead
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

checkpoint = "bigcode/starcoder2-7b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

# load_in_8bit
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 7670.52 MB
# load_in_4bit
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 4197.64 MB
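
The memory footprints reported above follow roughly from the parameter count multiplied by the bytes stored per weight. The sketch below derives an implied parameter count from the full-precision figure and estimates the other precisions from it. The bytes-per-weight table and the helper `estimate_footprint_mb` are illustrative assumptions, not part of transformers.

```python
# Approximate storage per weight for each precision (an idealized assumption).
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}

# Footprint reported for the full-precision run, in MB (1 MB = 1e6 bytes).
FP32_FOOTPRINT_MB = 29232.57

# Implied parameter count: total bytes divided by 4 bytes per float32 weight.
n_params = FP32_FOOTPRINT_MB * 1e6 / BYTES_PER_PARAM["float32"]


def estimate_footprint_mb(n_params: float, dtype: str) -> float:
    """Idealized footprint if every weight were stored in `dtype`."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


print(f"implied parameters: {n_params / 1e9:.2f}B")
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {estimate_footprint_mb(n_params, dtype):10.2f} MB")
```

The bfloat16 estimate matches the reported 14616.29 MB. The reported int8 and int4 figures (7670.52 MB and 4197.64 MB) are somewhat higher than the idealized estimates, presumably because not every tensor is quantized.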

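✨ Key features

Multi-language support: trained on code in 17 programming languages.
Advanced architecture: uses grouped-query attention, sliding-window attention, and the Fill-in-the-Middle objective.
Long-context handling: supports a context window of 16,384 tokens.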
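
To make the sliding-window attention idea concrete, here is a small sketch of a causal attention mask restricted to a fixed window. It uses a toy sequence length and one possible convention, in which the window counts the current token. The exact convention used inside the model is not specified on this page, and `sliding_window_causal_mask` is a hypothetical helper.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Build a mask where mask[i][j] is True if query i may attend to key j.

    A position attends only to itself and earlier positions (causal),
    and only to the most recent `window` positions (sliding window).
    """
    return [
        [j <= i and i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy sizes for readability; the model itself is described as using a
# 4,096-token window inside a 16,384-token context.
mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so the per-token attention cost stays bounded as the sequence grows.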

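📚 Documentation

Intended use

The model was trained on GitHub code as well as other selected data sources such as Arxiv and Wikipedia. As a result, it is not an instruction model, and commands like "Write a function that computes the square root." do not work well.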
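
Because the model continues code rather than following commands, one common workaround with base code models is to phrase the task as the beginning of a source file. This is offered as an assumption, not as a recommendation from the model authors. The sketch below contrasts the two prompt styles; `completion_prompt` is a hypothetical helper.

```python
def completion_prompt(signature: str, docstring: str) -> str:
    """Express a task as code context: a signature followed by a docstring."""
    return f'{signature}\n    """{docstring}"""\n'


# Instruction-style prompt, which the page says does not work well.
instruction_style = "Write a function that computes the square root."

# Completion-style prompt: the model is left to continue the function body.
completion_style = completion_prompt(
    signature="def square_root(x: float) -> float:",
    docstring="Return the non-negative square root of x.",
)
print(completion_style)
```

The completion-style string can be passed to the tokenizer in place of `"def print_hello_world():"` in the examples above.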

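Generation

A fine-tuning script is available in StarCoder2's GitHub repository.

Attribution and other requirements

The model's pretraining dataset was filtered to keep only permissively licensed code and code with no license. Nevertheless, the model can generate source code verbatim from the dataset. The code's license might require attribution and/or other specific requirements, which must be respected. We provide a search index that lets you search the pretraining data to identify where generated code came from and apply the proper attribution to your code.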

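🔧 Technical details

Model

Architecture: Transformer decoder with grouped-query and sliding-window attention and the Fill-in-the-Middle objective
Pretraining steps: 1 million
Pretraining tokens: 3.5+ trillion
Precision: bfloat16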
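
Grouped-query attention lets several query heads share one key/value head, which shrinks the key/value cache. The sketch below shows only the head-to-group mapping. The head counts are illustrative and are not taken from the model's actual configuration, and `kv_head_for_query_head` is a hypothetical helper.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the key/value head shared by query head `q_head`."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    if not 0 <= q_head < n_q_heads:
        raise ValueError("q_head out of range")
    group_size = n_q_heads // n_kv_heads  # query heads per key/value head
    return q_head // group_size


# Illustrative sizes only, not the model's real head counts.
N_Q_HEADS = 8
N_KV_HEADS = 2
mapping = [
    kv_head_for_query_head(h, N_Q_HEADS, N_KV_HEADS) for h in range(N_Q_HEADS)
]
print(mapping)
```

With these toy sizes, four consecutive query heads read from each key/value head, so the key/value cache is a quarter of what full multi-head attention would need.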

Hardware

GPUs: 432 H100

Software

Framework: nanotron
Neural networks: PyTorch

📄 License

The model is licensed under the BigCode OpenRAIL-M v1 license agreement. See the license text for the full agreement.

📚 Citation

@misc{lozhkov2024starcoder,
      title={StarCoder 2 and The Stack v2: The Next Generation}, 
      author={Anton Lozhkov and Raymond Li and Loubna Ben Allal and Federico Cassano and Joel Lamy-Poirier and Nouamane Tazi and Ao Tang and Dmytro Pykhtar and Jiawei Liu and Yuxiang Wei and Tianyang Liu and Max Tian and Denis Kocetkov and Arthur Zucker and Younes Belkada and Zijian Wang and Qian Liu and Dmitry Abulkhanov and Indraneil Paul and Zhuang Li and Wen-Ding Li and Megan Risdal and Jia Li and Jian Zhu and Terry Yue Zhuo and Evgenii Zheltonozhskii and Nii Osae Osae Dade and Wenhao Yu and Lucas Krauß and Naman Jain and Yixuan Su and Xuanli He and Manan Dey and Edoardo Abati and Yekun Chai and Niklas Muennighoff and Xiangru Tang and Muhtasham Oblokulov and Christopher Akiki and Marc Marone and Chenghao Mou and Mayank Mishra and Alex Gu and Binyuan Hui and Tri Dao and Armel Zebaze and Olivier Dehaene and Nicolas Patry and Canwen Xu and Julian McAuley and Han Hu and Torsten Scholak and Sebastien Paquet and Jennifer Robinson and Carolyn Jane Anderson and Nicolas Chapados and Mostofa Patwary and Nima Tajbakhsh and Yacine Jernite and Carlos Muñoz Ferrandis and Lingming Zhang and Sean Hughes and Thomas Wolf and Arjun Guha and Leandro von Werra and Harm de Vries},
      year={2024},
      eprint={2402.19173},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}

Model metrics

Property	Details
Model type	StarCoder2-7B
Training data	The Stack v2
Evaluation datasets	CruxEval-I, DS-1000, GSM8K (PAL), HumanEval+, HumanEval, RepoBench-v1.1
Evaluation metrics	pass@1, accuracy, edit-similarity
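
As a rough illustration of the edit-similarity metric listed above, the sketch below scores a predicted line against a reference on a 0 to 100 scale. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in. The benchmark's own metric definition is not given on this page and may differ, so scores from this sketch are not comparable to the reported 72.07.

```python
from difflib import SequenceMatcher


def edit_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 100] (stand-in metric, see above)."""
    return 100.0 * SequenceMatcher(None, prediction, reference).ratio()


reference = "return a + b"
exact = edit_similarity("return a + b", reference)
close = edit_similarity("return a - b", reference)
print(f"exact match: {exact:.2f}")
print(f"one character off: {close:.2f}")
```

A benchmark-level score would typically be the mean of such per-example values over the whole evaluation set.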