StarCoder2-3B open-source code generation model: supports 17 programming languages and long-context code generation

StarCoder2-3B

Developed by BigCode

StarCoder2-3B is a 3-billion-parameter code generation model trained on 17 programming languages, with a context window of 16,384 tokens.

Released: 11/29/2023

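Model overview

StarCoder2-3B is a model specialized for code generation. It was trained on The Stack v2 dataset and supports code completion and generation tasks across multiple programming languages.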

Model features

Large context window

Supports a context window of 16,384 tokens, which makes it suitable for processing long code snippets.
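As a rough illustration of working within the 16,384-token window, the sketch below trims a list of token IDs so that the prompt plus the tokens still to be generated fit inside the window. The helper name `fit_to_context` and the policy of keeping the most recent tokens are assumptions made for illustration; they are not part of the model's API.

```python
CONTEXT_WINDOW = 16_384  # context size stated for StarCoder2-3B


def fit_to_context(token_ids, max_new_tokens, context_window=CONTEXT_WINDOW):
    """Keep the most recent tokens so prompt + generation fit in the window.

    Hypothetical helper: dropping the oldest tokens is only one possible policy.
    """
    if not 0 < max_new_tokens < context_window:
        raise ValueError("max_new_tokens must be between 1 and context_window - 1")
    budget = context_window - max_new_tokens  # tokens left for the prompt
    return list(token_ids[-budget:])
```

For a 20,000-token input and `max_new_tokens=256`, this keeps the last 16,128 tokens. Whether the start or the end of a file matters more depends on the task, so the truncation side is a design choice rather than a rule.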

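Multi-language support

Supports 17 programming languages, covering a broad range of development needs.

Efficient training

Trained on more than 3 trillion tokens using the Fill-in-the-Middle objective.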
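A Fill-in-the-Middle objective trains the model to produce a missing span given the code before and after it. The sketch below shows one way such a prompt could be assembled. It assumes the tokenizer defines the sentinel tokens `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` as in the StarCoder family, and that a prefix-suffix-middle ordering is expected; check both against the tokenizer's special tokens before relying on this. `build_fim_prompt` is a hypothetical helper name.

```python
# Sentinel strings assumed from the StarCoder family; verify them against
# the tokenizer's special tokens before use.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange the code before and after a gap in prefix-suffix-middle order."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# The model would be asked to generate the body that belongs in the gap.
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n    return result\n",
)
```

The resulting string would be tokenized and passed to `model.generate` like any other prompt; the text generated after `<fim_middle>` is the proposed infill.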

Model capabilities

Code completion

Code generation

Code understanding

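Use cases

Software development

Function auto-completion

Completes a full function implementation from the beginning of the function.

Achieves 31.7% pass@1 on the HumanEval dataset.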
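For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: draw n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k samples passes. The sketch below implements that estimator for a single problem. It is an assumption that the figure above was computed this way, since this page does not describe the evaluation procedure.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimate reduces to the fraction of passing samples.
example = pass_at_k(n=20, c=5, k=1)
```

A benchmark score is the mean of this per-problem value over all problems.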

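Algorithm implementation

Generates algorithm implementation code from a problem description.

Achieves 25.0% pass@1 on the DS-1000 dataset.

Education

Programming learning assistance

Provides learners with code examples and solutions.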

🚀 StarCoder2

StarCoder2-3B is a 3-billion-parameter model trained on 17 programming languages from The Stack v2. It is useful for tasks such as code generation.

🚀 Quick start

Installation

First, install transformers from source.

pip install git+https://github.com/huggingface/transformers.git

Running the model

Running on CPU/GPU/multi-GPU

Using full precision

# pip install git+https://github.com/huggingface/transformers.git # TODO: merge PR to main
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 12624.81 MB

Using torch.bfloat16

# pip install accelerate
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# for fp16 use `torch_dtype=torch.float16` instead
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 6312.41 MB

Quantized versions with `bitsandbytes`

Using 8-bit precision (int8)

# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# to use 4bit use `load_in_4bit=True` instead
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

# load_in_8bit
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 3434.07 MB
# load_in_4bit
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 1994.90 MB
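The reported footprints can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count times the bytes per weight. The sketch below assumes that "full precision" means float32 (4 bytes per weight), which is consistent with the bfloat16 figure being half the full-precision one. `footprint_mb` is a hypothetical helper, not a transformers function.

```python
def footprint_mb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in MB, using 1 MB = 1e6 bytes."""
    return num_params * bytes_per_param / 1e6


# Parameter count implied by the reported full-precision footprint,
# assuming 4 bytes per weight (float32).
params = 12624.81e6 / 4  # roughly 3.16e9

bf16_mb = footprint_mb(params, 2)    # close to the reported 6312.41 MB
int8_mb = footprint_mb(params, 1)    # about 3156 MB; reported 3434.07 MB
int4_mb = footprint_mb(params, 0.5)  # about 1578 MB; reported 1994.90 MB
```

The quantized footprints reported above are larger than these naive estimates. A plausible explanation, not confirmed here, is that some modules stay in higher precision and that quantization stores extra scaling data.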

✨ Key features

The StarCoder2-3B model is trained on 17 programming languages.
It uses Grouped Query Attention.
It has a context window of 16,384 tokens with sliding window attention of 4,096 tokens.
It is trained on more than 3 trillion tokens using the Fill-in-the-Middle objective.
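To make the sliding-window idea concrete, the sketch below builds a boolean attention mask for a tiny sequence. It assumes a causal mask in which the window includes the current token, so position i can see positions i - window + 1 through i. The exact boundary convention in the model's implementation may differ, and `sliding_window_causal_mask` is a hypothetical helper.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed convention: causal, and the window counts the current token,
    so position i sees positions i - window + 1 .. i.
    """
    return [
        [j <= i and i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Tiny stand-in sizes; the features above state 16,384 context and 4,096 window.
mask = sliding_window_causal_mask(seq_len=6, window=3)
```

Each row has at most `window` true entries, so per-token attention cost is bounded by the window size rather than the full sequence length. Information from earlier tokens can still propagate indirectly through stacked layers.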

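📚 Documentation

Model summary

StarCoder2-3B is a 3-billion-parameter model trained on 17 programming languages from The Stack v2. Data covered by opt-out requests was excluded.

Project website: bigcode-project.org
Paper: arXiv:2402.19173
Point of contact: contact@bigcode-project.org
Languages: 17 programming languages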

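Intended use

The model was trained on GitHub code as well as additional selected data sources such as Arxiv and Wikipedia. As such, it is not an instruction model, and commands like "Write a function that computes the square root." do not work well.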
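Because the model continues code rather than following instructions, a request is usually better expressed as the start of the code you want. The sketch below turns the square-root request into a function header and docstring for the model to continue. `completion_prompt` is a hypothetical helper, and there is no guarantee that this phrasing yields a correct completion in any particular case.

```python
def completion_prompt(name: str, params: str, docstring: str) -> str:
    """Render a function header plus docstring for the model to continue."""
    return f'def {name}({params}):\n    """{docstring}"""\n'


# Instead of the instruction "Write a function that computes the square root.":
prompt = completion_prompt(
    "square_root", "x: float", "Return the square root of x."
)
```

The resulting string can be passed to `tokenizer.encode` in place of `"def print_hello_world():"` in the quick start examples.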

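Attribution and other requirements

The pretraining dataset for this model was filtered to include only permissively licensed and unlicensed code. Nevertheless, the model can generate source code verbatim from the dataset. The code's license may require attribution or impose other specific requirements, which must be respected. We provide a search index that lets you search the pretraining data to identify where generated code came from and to apply proper attribution to your code.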

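Limitations

The model was trained on source code from 17 programming languages. The predominant natural language in the source is English, although other languages are also present. As such, the model can generate code snippets given some context, but the generated code is not guaranteed to work as intended. It can be inefficient and may contain bugs or exploits. See the paper for a detailed discussion of the model's limitations.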

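Training

Model

Architecture: Transformer decoder with grouped-query attention and sliding-window attention, trained with a Fill-in-the-Middle objective
Pretraining steps: 1.2 million
Pretraining tokens: more than 3 trillion
Precision: bfloat16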
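In grouped-query attention, several query heads share one key/value head, which shrinks the key/value cache. The sketch below shows the head mapping under the assumption that query heads are grouped contiguously. The head counts are illustrative only and are not taken from this model's configuration; `kv_head_for_query_head` is a hypothetical helper.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the key/value head shared by a given query head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    if not 0 <= q_head < num_q_heads:
        raise ValueError("q_head out of range")
    group_size = num_q_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


# Illustrative sizes only, not the model's actual configuration.
mapping = [kv_head_for_query_head(h, num_q_heads=8, num_kv_heads=2) for h in range(8)]
```

With 8 query heads and 2 key/value heads, the first four query heads read from key/value head 0 and the last four from head 1, so the cache holds 2 heads instead of 8.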

Hardware

GPUs: 160 A100

Software

Framework: TODO
Neural networks: PyTorch

📄 License

The model is licensed under the BigCode OpenRAIL-M v1 license agreement. See the license agreement for the full terms.

📚 Citation

@misc{lozhkov2024starcoder,
      title={StarCoder 2 and The Stack v2: The Next Generation}, 
      author={Anton Lozhkov and Raymond Li and Loubna Ben Allal and Federico Cassano and Joel Lamy-Poirier and Nouamane Tazi and Ao Tang and Dmytro Pykhtar and Jiawei Liu and Yuxiang Wei and Tianyang Liu and Max Tian and Denis Kocetkov and Arthur Zucker and Younes Belkada and Zijian Wang and Qian Liu and Dmitry Abulkhanov and Indraneil Paul and Zhuang Li and Wen-Ding Li and Megan Risdal and Jia Li and Jian Zhu and Terry Yue Zhuo and Evgenii Zheltonozhskii and Nii Osae Osae Dade and Wenhao Yu and Lucas Krauß and Naman Jain and Yixuan Su and Xuanli He and Manan Dey and Edoardo Abati and Yekun Chai and Niklas Muennighoff and Xiangru Tang and Muhtasham Oblokulov and Christopher Akiki and Marc Marone and Chenghao Mou and Mayank Mishra and Alex Gu and Binyuan Hui and Tri Dao and Armel Zebaze and Olivier Dehaene and Nicolas Patry and Canwen Xu and Julian McAuley and Han Hu and Torsten Scholak and Sebastien Paquet and Jennifer Robinson and Carolyn Jane Anderson and Nicolas Chapados and Mostofa Patwary and Nima Tajbakhsh and Yacine Jernite and Carlos Muñoz Ferrandis and Lingming Zhang and Sean Hughes and Thomas Wolf and Arjun Guha and Leandro von Werra and Harm de Vries},
      year={2024},
      eprint={2402.19173},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}

Information table

Attribute	Details
Pipeline tag	Text generation
Inference	Enabled
Dataset	bigcode/the-stack-v2-train
License	bigcode-openrail-m
Library name	transformers
Tags	code
Model name	starcoder2-3b
Task type	Text generation
Evaluation datasets	CruxEval-I, DS-1000, GSM8K (PAL), HumanEval+, HumanEval, RepoBench-v1.1
Evaluation metrics	pass@1, accuracy, edit-similarity
Evaluation values (in dataset order)	32.7, 25.0, 27.7, 27.4, 31.7, 71.19