Open-source replit-code-v1-3b model: free to deploy for efficient code completion

Replit

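Developed by lentan

replit-code-v1-3b is a 2.7B-parameter causal language model focused on code completion, developed by Replit, Inc.

Large language model

Transformers

Other: #multilingual-code-completion #large-parameter-model #code-generation

Downloads: 60

Released: 5/6/2023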

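Model overview

The model is trained on a subset of the Stack Dedup v1.2 dataset, covers 20 programming languages, and is intended mainly for code generation and code completion.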

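Model features

Multilingual support

Covers 20 languages, including mainstream ones such as Python, Java, and JavaScript.

Efficient training

Uses Flash Attention and ALiBi positional embeddings for fast training and inference.

Optimized tokenizer

A custom SentencePiece Unigram tokenizer with a 32,768-token vocabulary optimized for code.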

Model capabilities

Code completion

Code generation

Multilingual support

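Use cases

Developer tools

IDE plugin

Integrated into a development environment to provide real-time code completion.

Improves development efficiency and reduces coding errors.

Code generation

Generates code snippets from natural-language descriptions.

Speeds up prototyping and reduces time spent on manual coding.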

🚀 replit-code-v1-3b

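replit-code-v1-3b is a 2.7-billion-parameter causal language model focused on code completion. It is trained on a subset of the Stack Dedup v1.2 dataset and gives developers code generation support across 20 languages.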

🚀 Quick Start

First, install the latest versions of the following dependencies:

einops
sentencepiece
torch
transformers

Then load the model as follows:

from transformers import AutoModelForCausalLM

# load model
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

To use the optimized Triton implementation of FlashAttention on a GPU that supports BF16 precision, first install the following dependencies:

flash-attn==0.2.8
triton==2.0.0.dev20221202

Then move the model to bfloat16 and use it as follows:

from transformers import AutoModelForCausalLM
import torch

# load model
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True, attn_impl='triton')
model.to(device='cuda:0', dtype=torch.bfloat16)

# forward pass
x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
x = x.to(device='cuda:0')
y = model(x)

Note that trust_remote_code=True must be passed to from_pretrained because ReplitLM is not a class in the Transformers library.

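✨ Key Features

Multilingual support: the training mixture covers 20 languages. In descending order of training tokens, they are: Markdown, Java, JavaScript, Python, TypeScript, PHP, SQL, JSX, reStructuredText, Rust, C, CSS, Go, C++, HTML, Vue, Ruby, Jupyter Notebook, R, Shell.
Large-scale training: the model is trained on a subset of the Stack Dedup v1.2 dataset. The training dataset contains 175B tokens and was repeated over 3 epochs, so replit-code-v1-3b was trained on 525B tokens in total (about 195 tokens per parameter).
Modern LLM techniques: Flash Attention for fast training and inference, ALiBi positional embeddings to support variable context length at inference time, the LionW optimizer, and others.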

💻 Usage Examples

Basic usage

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt')
y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)

# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)

Advanced usage

# This example shows how to use the optimized FlashAttention implementation on a GPU that supports BF16 precision
from transformers import AutoModelForCausalLM
import torch

# load model
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True, attn_impl='triton')
model.to(device='cuda:0', dtype=torch.bfloat16)

x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
x = x.to(device='cuda:0')
y = model(x)

print(y)

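📚 Detailed Documentation

Tokenizer

We trained a custom SentencePiece Unigram tokenizer optimized for code, with a vocabulary of 32,768 tokens.

Using this tokenizer requires the sentencepiece library. It can be used as follows: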

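from transformers import AutoModelForCausalLM, AutoTokenizer

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

# load model (required by the generate call below)
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

# single input encoding + generation
x = tokenizer.encode('def hello():\n  print("hello world")\n', return_tensors='pt')
y = model.generate(x)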

# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)

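Notes:

trust_remote_code=True must be passed to from_pretrained because ReplitLM is not a class in the Transformers library.
clean_up_tokenization_spaces=False prevents whitespace from being removed from the output, since removing it would affect the syntactic correctness of the generated code.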
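To illustrate why whitespace cleanup is risky for code, here is a minimal sketch. The function `simplified_cleanup` is a hypothetical stand-in that mimics a prose-oriented cleanup (collapsing the space before common punctuation); it is an assumption made for illustration, not the tokenizer's actual implementation.

```python
def simplified_cleanup(text: str) -> str:
    """Hypothetical prose-style cleanup: drop the space before punctuation.

    Harmless for English sentences, but it rewrites source code too.
    """
    for spaced, tight in ((" .", "."), (" ,", ","), (" !", "!"), (" ?", "?")):
        text = text.replace(spaced, tight)
    return text


snippet = 'print("a , b")'
cleaned = simplified_cleanup(snippet)

print(snippet)  # program text as generated
print(cleaned)  # the string literal inside the program has been altered
```

The cleaned program still parses, but it now prints a different string, which is the kind of silent change that decoding with clean_up_tokenization_spaces=False is meant to avoid.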

Code generation

You can generate code with the transformers library as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt')
y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)

# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)

Experiment with different decoding methods and parameters to get the best results for your use case.
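As a rough guide to what the sampling parameters used above (temperature, top_k, top_p) do, the sketch below applies them to a toy next-token distribution in plain Python. The function `filtered_distribution` and the logits are made up for illustration; this is a simplified model of the filtering, not the code path that model.generate runs.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return next-token probabilities after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token ids from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top-k keeps only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)

    # top-p keeps the smallest prefix whose renormalised cumulative
    # probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Renormalise over the surviving tokens; all others get probability 0.
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
unfiltered = filtered_distribution(logits)
conservative = filtered_distribution(logits, temperature=0.2, top_k=4, top_p=0.95)
print(unfiltered)
print(conservative)
```

With the low temperature and top_p of the example call, nearly all probability mass lands on the most likely token, which is why those settings tend to give near-deterministic completions. Raising temperature or top_p trades that stability for variety.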

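Post-processing

As with all code generation models, post-processing of the generated code is important. The following post-processing steps are particularly recommended:

Stop generation when the EOS token is encountered.
Remove trailing whitespace.
Set max_tokens to a reasonable value for your completion use case.
When max_tokens is larger than the expected length of the generated code, truncate the generation at stop words such as return, def, "```", and "\n\n\n" to avoid emitting incomplete code.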
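The truncation and whitespace steps above can be sketched in a few lines of plain Python. The helper name `postprocess`, the sample completion, and the particular stop sequences passed to it are assumptions chosen for illustration; which stop words are appropriate depends on the completion use case, and the function is meant to run on the newly generated text only, not on the prompt.

```python
def postprocess(completion, stop_sequences):
    """Cut a completion at the earliest stop sequence, then strip trailing whitespace."""
    cut = len(completion)
    for stop in stop_sequences:
        index = completion.find(stop)
        if index != -1:
            cut = min(cut, index)
    return completion[:cut].rstrip()


# Made-up completion that runs past the function body it was asked for.
raw = (
    "    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)   \n"
    "\n\n"
    "def unrelated():\n"
    "    pass\n"
)

clean = postprocess(raw, stop_sequences=["\n\n\n", "```", "\ndef "])
print(repr(clean))
```

Stopping at the EOS token and bounding the length are handled at generation time (for example through eos_token_id and max_length in the generate call shown earlier), so this helper only covers the steps that run after decoding.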

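🔧 Technical Details

The replit-code-v1-3b model was trained on the MosaicML platform with 256 x A100-40GB GPUs, using their latest LLM examples repo.

The model is trained on a subset of the Stack Dedup v1.2 dataset, with 20 languages in the training mixture. The training dataset contains 175B tokens and was repeated over 3 epochs, so the model was trained on 525B tokens in total (about 195 tokens per parameter).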
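The token budget quoted above can be checked with a short calculation. The figures are the ones stated in this document (175B dataset tokens, 3 epochs, 2.7B parameters); the variable names are only for illustration.

```python
dataset_tokens = 175_000_000_000  # tokens in the training dataset
epochs = 3                        # passes over the dataset
parameters = 2_700_000_000        # model size quoted as 2.7B

total_tokens = dataset_tokens * epochs
tokens_per_parameter = total_tokens / parameters

print(f"{total_tokens / 1e9:.0f}B tokens seen during training")
print(f"{tokens_per_parameter:.1f} tokens per parameter")
```

With the rounded 2.7B parameter count the ratio comes out slightly above 194, in line with the "about 195" quoted above; the small gap presumably reflects rounding of the parameter count.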

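The model uses a range of modern LLM techniques, such as Flash Attention for fast training and inference, ALiBi positional embeddings to support variable context length at inference time, and the LionW optimizer.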
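To show why ALiBi allows a variable context length, here is a minimal sketch of the standard ALiBi formulation: each attention head adds a penalty to its attention scores that grows linearly with the distance between query and key, so there are no learned position embeddings tied to a fixed length. The head count of 8 is an assumption for illustration (this document does not state the model's head count), and the code is not taken from the model's implementation.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8h/num_heads) for h = 1..num_heads (power-of-two head counts)."""
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(seq_len, slope):
    """Causal bias added to attention scores for one head.

    Query position i attending to key position j <= i gets -slope * (i - j);
    future positions are masked with -inf.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)          # assumed head count, for illustration only
bias = alibi_bias(4, slopes[0])   # works for any sequence length
for row in bias:
    print(row)
```

Because the bias depends only on the distance i - j, the same function produces a valid matrix for any sequence length, including lengths not seen during training.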

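📄 License

The model checkpoint and vocabulary file are licensed under the Creative Commons license (CC BY-SA 4.0). Under the license, you must give credit to Replit, provide a link to the license, and indicate whether changes were made. You may do so in any reasonable manner, but not in any way that suggests Replit endorses you or your use.

Model information table

Property	Details
Model type	Causal language model focused on code completion
Training data	Subset of the Stack Dedup v1.2 dataset covering 20 languages; 175B tokens repeated over 3 epochs, for 525B training tokens in total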