starcoder-gpteacher-code-instruct: an open-source model, free to deploy, for code generation and code explanation


Starcoder Gpteacher Code Instruct

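Developed by GeorgiaTechResearchInstitute

A StarCoder model fine-tuned on the GPTeacher code-generation dataset to improve code generation and code explanation

Large language model

Transformers

Open-source license: OpenRAIL #Code instruction fine-tuning #Multilingual code generation #8192-token long context

Downloads: 122

Release date: 5/5/2023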

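Model overview

This model is based on the 15.5-billion-parameter StarCoder model and fine-tuned on code-instruction data generated by GPT-4. It specializes in code generation and code explanation and supports more than 80 programming languages.

Model features

Large context window

Supports an 8192-token context window, which suits long code fragments.

Multilingual support

The training data covers more than 80 programming languages, giving the model broad language coverage.

Instruction fine-tuning

Fine-tuned on GPT-4-generated code-instruction data, so it follows user instructions more readily than the base model.

Model capabilities

Code generation

Code explanation

Programming problem solving

Code completion

Use cases

Code development assistance

Function generation

Generates a function that implements a specific feature from a natural-language description.

Can produce function implementations that match the request.

Code explanation

Explains the logic and behavior of complex code fragments.

Aims to provide clear and accurate explanations of code.

Programming education

Programming learning support

Helps learners understand programming concepts and code implementations.

Provides easy-to-follow explanations and examples.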

🚀 StarCoder GPTeacher-Codegen Fine-Tuned

This model is bigcode/starcoder fine-tuned on the teknium1/GPTeacher code-generation dataset (GPT-4 code-instruction fine-tuning).

📚 Model Details

The base StarCoder model is a 15.5-billion-parameter model trained on more than 80 programming languages from The Stack (v1.2), with data covered by opt-out requests excluded. It uses Multi Query Attention and an 8192-token context window, and was trained on 1 trillion tokens with the Fill-in-the-Middle objective.

Repository: bigcode/Megatron-LM
Project website: bigcode-project.org
Paper: 💫StarCoder: May the source be with you!
Point of contact: contact@bigcode-project.org
Languages: 80+ programming languages
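
Because the context window is fixed at 8192 tokens, the prompt and the generated tokens have to share that budget. The following is a minimal sketch of one way to enforce this on a list of token IDs. The function name `fit_to_context` and the left-truncation policy are assumptions made for illustration and are not part of the model card.

```python
from typing import List

CONTEXT_WINDOW = 8192  # context length stated for the base StarCoder model


def fit_to_context(token_ids: List[int], max_new_tokens: int,
                   context_window: int = CONTEXT_WINDOW) -> List[int]:
    """Trim the prompt so that prompt + generated tokens fit in the window."""
    if not 0 < max_new_tokens < context_window:
        raise ValueError("max_new_tokens must be between 1 and context_window - 1")
    budget = context_window - max_new_tokens
    # Assumed policy: drop the oldest tokens and keep the most recent ones.
    if len(token_ids) > budget:
        return token_ids[-budget:]
    return list(token_ids)
```

Left truncation would also cut the instruction header of a long prompt, so in practice it may be preferable to shorten only the input section before formatting the prompt.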

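✨ Key Features

Intended use

The base model was trained on GitHub code and then fine-tuned to follow instructions. Prompts such as "Write a function that computes the square root." should work reasonably well. The original repository recommends the Tech Assistant prompt to few-shot prompt the base model into behaving as a technical assistant. This fine-tuned model uses Alpaca prompts instead.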
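
As a rough illustration of the Alpaca prompt format, the sketch below builds a prompt with or without an input section. The template with an input section is the one used in this model card. The no-input variant is the standard Alpaca wording and is an assumption here, since the card does not show it. The helper name `build_prompt` is hypothetical.

```python
from typing import Optional

# Template with an "Input" section, as used in this model card
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)

# Assumption: standard Alpaca no-input variant, not shown in the model card
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)


def build_prompt(instruction: str, context: Optional[str] = None) -> str:
    """Format an Alpaca-style prompt; `context` fills the optional Input section."""
    if context:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=context)
    return PROMPT_NO_INPUT.format(instruction=instruction)
```

For example, `build_prompt("Write a function that computes the square root.")` yields a prompt that ends with `### Response:` and has no input section.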

Generation

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "GeorgiaTechResearchInstitute/starcoder-gpteacher-code-instruct"
device = "cuda"  # assumes a CUDA-capable GPU is available

# Alpaca-style template with an "Input" section for extra context
input_prompt = ("Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
                "### Instruction:\n"
                "{instruction}\n\n"
                "### Input:\n"
                "{input}\n\n"
                "### Response:")

prompt = "Please explain the following program."
# A multi-line string literal needs triple quotes in Python
extra_input = """send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}"""
prompt = input_prompt.format_map({"instruction": prompt, "input": extra_input})

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True, torch_dtype=torch.float16).to(device)

inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
# Without an explicit limit, generate() falls back to a short default length.
# 256 new tokens is an illustrative choice, not a value from the model card.
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
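
The decoded output of a causal language model contains the prompt followed by the generated text. A small post-processing step can isolate the answer. The sketch below is one possible approach and is not part of the model card: the helper name `extract_response` is hypothetical, and it assumes the fine-tuned model keeps StarCoder's `<|endoftext|>` end-of-text token.

```python
RESPONSE_MARKER = "### Response:"
EOS_TOKEN = "<|endoftext|>"  # StarCoder's end-of-text token (assumed unchanged here)


def extract_response(decoded: str) -> str:
    """Return the text after the first response marker, up to the EOS token."""
    # If the marker is absent, fall back to the whole decoded string.
    head, marker, tail = decoded.partition(RESPONSE_MARKER)
    text = tail if marker else head
    return text.split(EOS_TOKEN, 1)[0].strip()
```

With the script above, `extract_response(tokenizer.decode(outputs[0]))` would return only the model's explanation.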

Example output

Full prompt:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
What does this program do and how does it work?
### Input:
send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
        case 0: *to = *from++;
        case 7: *to = *from++;
        case 6: *to = *from++;
        case 5: *to = *from++;
        case 4: *to = *from++;
        case 3: *to = *from++;
        case 2: *to = *from++;
        case 1: *to = *from++;
    }
    while (--n > 0) {
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
    }
}
### Response:

Response:

This program is designed to efficiently copy bytes from one memory location to another. It copies 'count' bytes from the location pointed to by 'from' to the location pointed to by 'to', and works correctly regardless of address alignment. The function 'foo' is a wrapper function that initializes the 'to' and 'from' pointers, checks for proper memory bounds, and performs byte alignment if necessary.

To use this program, you must declare two pointer variables, 'to' and 'from', and then call the 'foo' function with the following parameters:

foo(to, from, count)

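Attribution and other requirements

The pretraining dataset for this model was filtered to permissive licenses only. Nevertheless, the model can reproduce source code from the dataset verbatim. The license of that code may require attribution or impose other specific requirements, which must be respected. The BigCode project provides a search index that lets you search the pretraining data, identify where generated code came from, and apply the proper attribution to your code.

🔧 Technical Limitations

The model was trained on source code in more than 80 programming languages. The predominant natural language in the source is English, although other languages are also present. As a result, the model can generate code snippets when given some context, but the generated code is not guaranteed to work as intended. It may be inefficient and may contain bugs or vulnerabilities. See the original paper for a detailed discussion of the model's limitations. Fine-tuning makes the model more responsive to direct user input. However, this is an early attempt at instruction fine-tuning StarCoder models, and the results may not reflect the model's full potential.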
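
Since generated code is not guaranteed to work, a cheap first check is worth running before any generated snippet is used. The sketch below covers Python output only and merely confirms that a snippet parses. It says nothing about correctness, efficiency, or security. The helper name `python_syntax_error` is hypothetical and not part of the model card.

```python
import ast
from typing import Optional


def python_syntax_error(generated_code: str) -> Optional[str]:
    """Return None if the snippet parses as Python, else a short error message."""
    try:
        ast.parse(generated_code)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None
```

A snippet that passes this check still needs review and testing, ideally in an isolated environment.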

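📦 Training

Model

Attribute	Details
Architecture	GPT-2 model with Multi Query Attention and the Fill-in-the-Middle objective
Pretraining steps	250k
Pretraining tokens	1 trillion
Precision	bfloat16
Fine-tuning instruction-response pairs	4.5k
Fine-tuning context length	1024
Fine-tuning epochs	3
Fine-tuning learning rate	2e-5
Fine-tuning optimization	FSDP

Hardware

Attribute	Details
GPUs	8 Tesla A100
Training time	5 hours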
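
The figures in the two tables allow a back-of-the-envelope estimate of the fine-tuning scale. The sketch below reads "4.5k" as 4,500 pairs and assumes each pair occupies at most one 1024-token sequence, so the token count is an upper bound rather than a reported number.

```python
NUM_GPUS = 8
TRAINING_HOURS = 5
NUM_PAIRS = 4_500       # "4.5k" instruction-response pairs, read as 4,500
NUM_EPOCHS = 3
CONTEXT_LENGTH = 1024   # fine-tuning context length

gpu_hours = NUM_GPUS * TRAINING_HOURS
examples_seen = NUM_PAIRS * NUM_EPOCHS
# Upper bound: assumes every example fills the full fine-tuning context
max_tokens_seen = examples_seen * CONTEXT_LENGTH

print(f"GPU-hours: {gpu_hours}")
print(f"Examples seen: {examples_seen}")
print(f"Fine-tuning tokens (upper bound): {max_tokens_seen:,}")
```

Under these assumptions, fine-tuning saw at most about 14 million tokens, several orders of magnitude fewer than the 1 trillion pretraining tokens.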

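📄 License

This model is licensed under the BigCode OpenRAIL-M v1 license agreement. The full agreement can be found here. Because the model was fine-tuned on outputs from OpenAI's GPT-4, it is also subject to [OpenAI's terms of use](https://openai.com/policies/terms-of-use).

Citation

The base model's Hugging Face repository is bigcode/starcoder.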

@article{li2023starcoder,
      title={StarCoder: may the source be with you!}, 
      author={Raymond Li and Loubna Ben Allal and Yangtian Zi and Niklas Muennighoff and Denis Kocetkov and Chenghao Mou and Marc Marone and Christopher Akiki and Jia Li and Jenny Chim and Qian Liu and Evgenii Zheltonozhskii and Terry Yue Zhuo and Thomas Wang and Olivier Dehaene and Mishig Davaadorj and Joel Lamy-Poirier and João Monteiro and Oleh Shliazhko and Nicolas Gontier and Nicholas Meade and Armel Zebaze and Ming-Ho Yee and Logesh Kumar Umapathi and Jian Zhu and Benjamin Lipkin and Muhtasham Oblokulov and Zhiruo Wang and Rudra Murthy and Jason Stillerman and Siva Sankalp Patel and Dmitry Abulkhanov and Marco Zocca and Manan Dey and Zhihan Zhang and Nour Fahmy and Urvashi Bhattacharyya and Wenhao Yu and Swayam Singh and Sasha Luccioni and Paulo Villegas and Maxim Kunakov and Fedor Zhdanov and Manuel Romero and Tony Lee and Nadav Timor and Jennifer Ding and Claire Schlesinger and Hailey Schoelkopf and Jan Ebert and Tri Dao and Mayank Mishra and Alex Gu and Jennifer Robinson and Carolyn Jane Anderson and Brendan Dolan-Gavitt and Danish Contractor and Siva Reddy and Daniel Fried and Dzmitry Bahdanau and Yacine Jernite and Carlos Muñoz Ferrandis and Sean Hughes and Thomas Wolf and Arjun Guha and Leandro von Werra and Harm de Vries},
      year={2023},
      eprint={2305.06161},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Open LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_GeorgiaTechResearchInstitute__starcoder-gpteacher-code-instruct).

Metric	Value
Average	32.57
ARC (25-shot)	32.68
HellaSwag (10-shot)	47.6
MMLU (5-shot)	28.63
TruthfulQA (0-shot)	40.41
Winogrande (5-shot)	55.56
GSM8K (5-shot)	0.0
DROP (3-shot)	23.11
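
The reported average appears to be the unweighted mean of the seven benchmark scores. The short check below reproduces it from the table, under that assumption.

```python
# Benchmark scores copied from the table above
scores = {
    "ARC (25-shot)": 32.68,
    "HellaSwag (10-shot)": 47.6,
    "MMLU (5-shot)": 28.63,
    "TruthfulQA (0-shot)": 40.41,
    "Winogrande (5-shot)": 55.56,
    "GSM8K (5-shot)": 0.0,
    "DROP (3-shot)": 23.11,
}

# Assumption: the leaderboard average is an unweighted mean of all benchmarks
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 32.57
```

The GSM8K score of 0.0 pulls the mean down noticeably. Excluding it would change the result, so it has to be included to match the reported figure.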