StarCoder2-7B: An Open-Source Code Generation Model with Support for 17 Languages and a Long Context Window

Developed by bigcode

StarCoder2-7B is a 7-billion-parameter code generation model trained on 17 programming languages, with a context window of 16,384 tokens.

Large Language Model

Transformers

License: OpenRAIL (open source) #ProgrammingCodeGeneration #17ProgrammingLanguages #3.5TrillionTrainingTokens

Release Time : 2/20/2024

Model Overview

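This model is specialized for code generation. It was trained on GitHub code and other selected data sources, so it is well suited to generating code snippets but not to instruction-following tasks.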

Model Features

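Long context support

Supports a 16,384-token context window with sliding window attention over 4,096 tokens

Efficient training

Trained on more than 3.5 trillion tokens with the Fill-in-the-Middle objective

Multi-language support

Supports code generation in 17 programming languages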

Model Capabilities

Code autocompletion

Function generation

Code snippet generation

Use Cases

Software development

Function generation

Automatically generates a function implementation from its signature

Achieves 35.4% pass@1 on the HumanEval dataset
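The page does not say how the pass@1 figure was computed. As a minimal sketch, assuming the commonly used unbiased pass@k estimator (draw n samples per problem, count the c that pass the unit tests), the metric can be written as follows. The function names `pass_at_k` and `mean_pass_at_k` are illustrative, not part of any StarCoder2 tooling.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number that passed, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    # math.comb returns 0 when k > n - c, which yields pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results for three problems: (samples drawn, samples passing).
example = [(10, 3), (10, 0), (10, 10)]
print(f"pass@1 = {mean_pass_at_k(example, k=1):.3f}")
```

For k = 1 the estimator reduces to the fraction of samples that pass, averaged over problems.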

Code completion

Provides intelligent code completion in IDEs

Achieves an edit similarity of 72.07 on RepoBench-v1.1
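Edit similarity scores a predicted line against the reference line on a 0 to 100 scale. The benchmark's exact implementation is not given here, so the sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in. Treat the function `edit_similarity` and its scoring as an assumption, not the benchmark's own code.

```python
from difflib import SequenceMatcher


def edit_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 100]; 100 means identical strings.

    Stand-in metric: SequenceMatcher.ratio() is 2*M/T, where M is the number
    of matched characters and T is the combined length of both strings.
    """
    return 100.0 * SequenceMatcher(None, prediction, reference).ratio()


# Hypothetical completion compared against the reference next line.
print(edit_similarity("return a + b", "return a - b"))
```

A benchmark score would be the mean of this value over all test cases.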

Education

Programming learning support

Provides learners with code examples and solutions

🚀 StarCoder2

StarCoder2-7B is a code generation model covering 17 programming languages. It was trained on more than 3.5 trillion tokens and reaches 35.4% pass@1 on HumanEval.

🚀 Quick Start

This section covers basic usage of StarCoder2: install the required library, load the model, and generate code.

Installation

Install the transformers library from source.

pip install git+https://github.com/huggingface/transformers.git

Running the model

The following code runs the model on either CPU or GPU.

Running in full precision

# pip install git+https://github.com/huggingface/transformers.git # TODO: merge PR to main
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-7b"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 29232.57 MB

Running with `torch.bfloat16`

# pip install accelerate
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/starcoder2-7b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# for fp16 use `torch_dtype=torch.float16` instead
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)

# device_map="auto" chooses the placement, so send the inputs to the model's own device
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(model.device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 14616.29 MB

Quantized versions with `bitsandbytes`

# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# to use 4bit use `load_in_4bit=True` instead
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

checkpoint = "bigcode/starcoder2-7b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

# load_in_8bit
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 7670.52 MB
# load_in_4bit
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 4197.64 MB
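The four reported footprints can be compared by converting each into an effective number of bits per parameter, taking the full-precision run as a 32-bit baseline. This is a back-of-the-envelope sketch: it assumes the full-precision footprint is dominated by 4-byte weights, and the helper name `effective_bits_per_param` is illustrative.

```python
# Footprints in MB as reported in the runs above.
REPORTED_MB = {
    "float32": 29232.57,
    "bfloat16": 14616.29,
    "8-bit": 7670.52,
    "4-bit": 4197.64,
}


def effective_bits_per_param(reported_mb, baseline="float32", baseline_bits=32):
    """Scale each footprint against the baseline to get bits per parameter."""
    base = reported_mb[baseline]
    return {name: baseline_bits * mb / base for name, mb in reported_mb.items()}


bits = effective_bits_per_param(REPORTED_MB)
for name, value in bits.items():
    print(f"{name:>8}: {value:.2f} bits per parameter")

# Rough parameter count implied by the float32 footprint (4 bytes each).
approx_params = REPORTED_MB["float32"] * 1e6 / 4
print(f"approx. parameters: {approx_params / 1e9:.2f}B")
```

The 8-bit and 4-bit runs come out somewhat above 8 and 4 bits per parameter, which would be consistent with some tensors being kept in higher precision, though the page does not state the reason.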

✨ Key Features

Multi-language support: covers 17 programming languages.
Large-scale training: trained on more than 3.5 trillion tokens.
Advanced architecture: uses Grouped Query Attention, Sliding Window Attention, and the Fill-in-the-Middle objective.
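A model trained with the Fill-in-the-Middle objective can complete a gap between a known prefix and suffix when the prompt is laid out with sentinel tokens. The sketch below assumes the sentinel strings `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` used by the StarCoder family. This page does not list them, so check them against the tokenizer's special tokens before relying on them. `build_fim_prompt` is a hypothetical helper.

```python
# Assumed sentinel tokens; verify against the tokenizer's special tokens.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out prefix and suffix so the model generates the missing middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt(
    prefix="def print_hello_world():\n    ",
    suffix="\n\nprint_hello_world()\n",
)
print(prompt)
```

The resulting string would be encoded and passed to `model.generate` in the same way as the plain prompt in the quick start. The text generated after the final sentinel is the proposed middle.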

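📚 Documentation

Model Overview

StarCoder2-7B is a 7-billion-parameter model trained on source code in 17 programming languages from The Stack v2. It uses Grouped Query Attention and a 16,384-token context window with sliding window attention over 4,096 tokens, and it was trained on more than 3.5 trillion tokens with the Fill-in-the-Middle objective.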
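To illustrate how a sliding window interacts with causal attention, the sketch below builds a boolean mask in which each position attends only to itself and a fixed number of preceding positions. It uses tiny sizes in place of 16,384 and 4,096. The exact window boundary convention in the model's implementation is not stated here, so the `i - j < window` rule is an assumption.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Causal: j <= i. Sliding window (assumed convention): i - j < window.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy sizes standing in for a 16,384-token context and a 4,096-token window.
mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so per-token attention cost stays bounded as the sequence grows, while information from earlier tokens can still propagate through stacked layers.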

Project website: bigcode-project.org
Paper: arXiv:2402.19173
Contact: contact@bigcode-project.org
Languages: 17 programming languages

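Intended use

The model was trained on GitHub code as well as additional data sources such as Arxiv and Wikipedia. It is therefore not an instruction model, and commands like "Write a function that computes the square root." do not work well.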
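Because the model continues code rather than following commands, a request is usually better expressed as the beginning of a source file. A minimal sketch, assuming a Python target and a hypothetical helper `make_completion_prompt`, turns a task description into a signature plus docstring for the model to complete:

```python
def make_completion_prompt(name: str, params: list[str], doc: str) -> str:
    """Render a function header and docstring that the model can continue."""
    signature = f"def {name}({', '.join(params)}):"
    return f'{signature}\n    """{doc}"""\n'


# Instead of the command "Write a function that computes the square root."
prompt = make_completion_prompt(
    name="sqrt",
    params=["x"],
    doc="Return the square root of x.",
)
print(prompt)
```

The prompt ends right after the docstring, so the most natural continuation is the function body.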

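Attribution and other requirements

The pretraining dataset was filtered to keep only permissively licensed code and code with no license. Nevertheless, the model can generate source code verbatim from the dataset. The code's license might require attribution or impose other specific requirements, which must be respected. We provide a search index over the pretraining data so that you can identify where generated code came from and apply the proper attribution to your code.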

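🔧 Technical Details

Model

Architecture: Transformer decoder with grouped-query and sliding window attention, trained with the Fill-in-the-Middle objective
Pretraining steps: 1 million
Pretraining tokens: more than 3.5 trillion
Precision: bfloat16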
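In grouped-query attention, several query heads share one key/value head, which shrinks the key/value cache by the group size. The sketch below shows only the head-to-group mapping. The head counts are hypothetical toy values, not the model's actual configuration, and `kv_head_for_query_head` is an illustrative name.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the index of the key/value head shared by query head q_head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


# Hypothetical toy configuration (not taken from the model's config).
NUM_Q_HEADS = 8
NUM_KV_HEADS = 2

mapping = [
    kv_head_for_query_head(h, NUM_Q_HEADS, NUM_KV_HEADS) for h in range(NUM_Q_HEADS)
]
print(mapping)
print(f"KV cache reduction vs. one KV head per query head: {NUM_Q_HEADS // NUM_KV_HEADS}x")
```

With these toy values, query heads 0 to 3 read from key/value head 0 and query heads 4 to 7 from key/value head 1.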

Hardware

GPUs: 432 H100

Software

Framework: nanotron
Neural networks: PyTorch

📄 License

The model is provided under the BigCode OpenRAIL-M v1 license agreement. See the full agreement for the complete terms.

📚 Citation

@misc{lozhkov2024starcoder,
      title={StarCoder 2 and The Stack v2: The Next Generation}, 
      author={Anton Lozhkov and Raymond Li and Loubna Ben Allal and Federico Cassano and Joel Lamy-Poirier and Nouamane Tazi and Ao Tang and Dmytro Pykhtar and Jiawei Liu and Yuxiang Wei and Tianyang Liu and Max Tian and Denis Kocetkov and Arthur Zucker and Younes Belkada and Zijian Wang and Qian Liu and Dmitry Abulkhanov and Indraneil Paul and Zhuang Li and Wen-Ding Li and Megan Risdal and Jia Li and Jian Zhu and Terry Yue Zhuo and Evgenii Zheltonozhskii and Nii Osae Osae Dade and Wenhao Yu and Lucas Krauß and Naman Jain and Yixuan Su and Xuanli He and Manan Dey and Edoardo Abati and Yekun Chai and Niklas Muennighoff and Xiangru Tang and Muhtasham Oblokulov and Christopher Akiki and Marc Marone and Chenghao Mou and Mayank Mishra and Alex Gu and Binyuan Hui and Tri Dao and Armel Zebaze and Olivier Dehaene and Nicolas Patry and Canwen Xu and Julian McAuley and Han Hu and Torsten Scholak and Sebastien Paquet and Jennifer Robinson and Carolyn Jane Anderson and Nicolas Chapados and Mostofa Patwary and Nima Tajbakhsh and Yacine Jernite and Carlos Muñoz Ferrandis and Lingming Zhang and Sean Hughes and Thomas Wolf and Arjun Guha and Leandro von Werra and Harm de Vries},
      year={2024},
      eprint={2402.19173},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}

Information Table

Attribute	Details
Model type	StarCoder2-7B is a Transformer decoder model for code generation.
Training data	Source code in 17 programming languages from The Stack v2.