replit-code-v1-3b
replit-code-v1-3b is a 2.7B Causal Language Model designed for Code Completion. It is trained on a subset of the Stack Dedup v1.2 dataset and offers high-quality code generation capabilities.
Test it on our Demo Space!
Quick Start
First, install the latest versions of the following dependencies:
einops
sentencepiece
torch
transformers
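For example, you can install them all at once with pip (the package names are exactly those listed above):
pip install -U einops sentencepiece torch transformers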
Then, you can load the model:
from transformers import AutoModelForCausalLM
# load model
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
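If you want to run the model on a GPU or in half precision, you can pass standard from_pretrained options. The snippet below is a sketch; the dtype and device choices are assumptions to adapt to your hardware:
import torch
from transformers import AutoModelForCausalLM
# load weights in bfloat16 to roughly halve memory use (assumes your GPU supports bf16)
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True, torch_dtype=torch.bfloat16)
# move the model to the GPU and switch to inference mode
model.to('cuda')
model.eval()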
Features
- Multilingual Support: Trained on 20 different languages, including Markdown, Java, JavaScript, etc.
- Large-Scale Training: Trained on 525B tokens, with 175B tokens repeated over 3 epochs.
- Advanced Techniques: Utilizes state-of-the-art techniques like Flash Attention, AliBi positional embeddings, and the LionW optimizer.
Installation
Install Basic Dependencies
einops
sentencepiece
torch
transformers
Install Dependencies for Optimized Triton Implementation
flash-attn==0.2.8
triton==2.0.0.dev20221202
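After installing these, you would typically request the Triton attention path when loading the model. The snippet below is a hedged sketch that assumes the model exposes an MPT-style attn_config dictionary on its config; check the model's own configuration for the exact field names:
from transformers import AutoConfig, AutoModelForCausalLM
# load the config and request the Triton attention implementation (assumed field name)
config = AutoConfig.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
# load the model with the modified config
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', config=config, trust_remote_code=True)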
Usage Examples
Basic Usage
import torch
from transformers import AutoModelForCausalLM
# load model
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
# forward pass over a dummy batch of token ids
x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
y = model(x)
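The forward pass returns standard causal-language-model outputs; for example, you can inspect the logits (this assumes the output object exposes a logits attribute, as Hugging Face causal LM models typically do):
print(y.logits.shape)  # (batch_size, sequence_length, vocab_size)
next_token_id = y.logits[0, -1].argmax().item()  # greedy prediction for the next token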
Advanced Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt')
y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)
Documentation
Model Description
replit-code-v1-3b is a 2.7B Causal Language Model focused on Code Completion. It has been trained on a subset of the Stack Dedup v1.2 dataset.
The training mixture includes 20 different languages, listed in descending order of number of tokens:
Markdown, Java, JavaScript, Python, TypeScript, PHP, SQL, JSX, reStructuredText, Rust, C, CSS, Go, C++, HTML, Vue, Ruby, Jupyter Notebook, R, Shell
In total, the training dataset contains 175B tokens, repeated over 3 epochs, so replit-code-v1-3b has been trained on 525B tokens (~195 tokens per parameter).
The model was trained on the MosaicML platform with 256 x A100-40GB GPUs, using their latest LLM examples repo.
Intended Use
Replit intends this model to be used as a foundational model for application-specific fine-tuning, with no strict limitations on commercial use.
Limitations
The pre-training dataset may contain offensive or inappropriate content even after data cleansing. Such content may appear in the model's generated text. Users should exercise caution when using it in production systems and avoid using it for applications that may cause harm.
Tokenizer
We trained a custom SentencePiece Unigram tokenizer with a 32768-token vocabulary optimized for code. Using it requires the sentencepiece library.
from transformers import AutoModelForCausalLM, AutoTokenizer
# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
# single input encoding + generation
x = tokenizer.encode('def hello():\n print("hello world")\n', return_tensors='pt')
y = model.generate(x)
# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)
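As a quick sanity check, you can also inspect how the code-optimized vocabulary tokenizes a snippet (the expected vocabulary size of 32768 comes from the description above):
source = 'def add(a, b):\n    return a + b\n'
ids = tokenizer.encode(source)
print(ids)  # token ids from the code-optimized vocabulary
print(tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))  # should reproduce the source
print(tokenizer.vocab_size)  # expected: 32768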
Generation
You can generate code using the transformers library. Experiment with different decoding methods and parameters for the best results.
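For instance, you might compare greedy decoding with nucleus sampling. The snippet assumes model, tokenizer, and x are set up as in the usage examples above, and the parameter values are illustrative starting points rather than recommendations:
# greedy decoding: deterministic, often sufficient for short completions
y_greedy = model.generate(x, max_length=100, do_sample=False, eos_token_id=tokenizer.eos_token_id)
# nucleus sampling: more diverse completions; tune top_p and temperature for your use case
y_sampled = model.generate(x, max_length=100, do_sample=True, top_p=0.95, temperature=0.2, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(y_greedy[0], skip_special_tokens=True, clean_up_tokenization_spaces=False))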
Post Processing
Post-processing of the generated code is crucial. Recommended steps include:
- Stop generation when the EOS token is encountered.
- Remove trailing whitespaces.
- Set max_tokens based on your use case.
- Truncate generation at stop words to avoid incomplete code (a sketch is shown after this list).
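A minimal sketch of such a post-processing helper might look like this; the stop words below are hypothetical examples rather than values prescribed by Replit, and generated_code is assumed to come from the examples above:
def postprocess(code: str, stop_words=('\n\n\n', '\nclass ', '\ndef ')) -> str:
    """Truncate generated code at the first stop word and strip trailing whitespace."""
    cut = len(code)
    for stop in stop_words:
        idx = code.find(stop)
        if idx != -1:
            cut = min(cut, idx)  # keep only the text before the earliest stop word
    return code[:cut].rstrip()  # drop trailing whitespace
print(postprocess(generated_code))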
Technical Details
The model has been trained on the MosaicML platform with 256 x A100-40GB GPUs. It leverages techniques like Flash Attention for fast training and inference, AliBi positional embeddings to support variable context length at inference time, and the LionW optimizer.
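As background, ALiBi replaces learned positional embeddings with a per-head linear bias added to the attention scores, which is what allows the context length to vary at inference time. The following is an illustrative sketch of that bias, not the model's actual implementation, and it assumes the number of heads is a power of two:
import torch
def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # head-specific slopes form a geometric sequence: 2^(-8/n), 2^(-16/n), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # relative distance j - i between key position j and query position i (<= 0 for past tokens)
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).clamp(max=0).float()
    # bias of shape (num_heads, seq_len, seq_len); more negative for more distant past tokens
    return slopes[:, None, None] * rel
# the bias is simply added to the attention scores before the softmax (future positions are masked anyway)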
License
The model checkpoint and vocabulary file are licensed under the Creative Commons license (CC BY-SA 4.0). Under the license, you must credit Replit, provide a link to the license, and indicate if changes were made.
Model Information
| Property | Details |
|---|---|
| Model Name | replit-code-v1-3b |
| Model Type | 2.7B Causal Language Model |
| Training Data | Subset of Stack Dedup v1.2 dataset, 525B tokens in total |
| Training Platform | MosaicML with 256 x A100-40GB GPUs |
| Evaluation Dataset | HumanEval |
| pass@1 | 0.219 |
| Model Hash | 5bc28ce32c6f9aec935ead7b60ea1c46 |
Important Note
The pre-training dataset may have contained offensive or inappropriate content even after applying data cleansing filters, and such content may be reflected in model-generated text. We recommend that users exercise reasonable caution when using the model in production systems. Do not use it for any applications that may cause harm or distress to individuals or groups.
Usage Tip
Experiment with different decoding methods and parameters to get the best results for your use case. Also, perform post-processing on the generated code as recommended above to ensure its quality.

