# CodeParrot 🦜

CodeParrot 🦜 is a GPT-2 model with 1.5B parameters, trained specifically to generate Python code. After the initial training and the release of v1.0, we trained the model further and released v1.1. Details are provided below.
## Quick Start

### Features
- CodeParrot is designed to generate Python code, leveraging the GPT-2 architecture.
- It has been trained in multiple steps, with an updated version (v1.1) showing improved performance on code-generation benchmarks.
### Installation
Installation mainly amounts to installing the `transformers` library, which is used to load the model and tokenizer. There are two common ways to use CodeParrot:
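Either way, the `transformers` library (and a backend such as PyTorch) needs to be installed first. A typical setup, assuming a pip-based environment:

```shell
# Install the model-loading library and a PyTorch backend
pip install transformers torch
```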
### Usage Examples

#### Basic Usage
You can load the CodeParrot model and tokenizer directly with `transformers`:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot")
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot")

inputs = tokenizer("def hello_world():", return_tensors="pt")
# Generate a completion (a bare forward pass only returns logits)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```

Note that `AutoModelWithLMHead` is deprecated; `AutoModelForCausalLM` is the current class for causal language models like this one.
#### Advanced Usage
Or you can use a `pipeline`:
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="codeparrot/codeparrot")
outputs = pipe("def hello_world():")
```
## Technical Details

### Training
The model was trained on the cleaned CodeParrot 🦜 dataset in two steps. After the initial training (v1.0), the model was trained for a further 30k steps, resulting in v1.1. The training settings are shown in the following table:
| Property | v1.0 | v1.1 |
|---|---|---|
| Batch size | 512 | 512 |
| Context size | 1024 | 1024 |
| Training steps | 50,000 | 30,000 |
| Gradient accumulation | 16 | 16 |
| Gradient checkpointing | True | True |
| Learning rate | 2e-4 | 5e-5 |
| Weight decay | 0.1 | 0.1 |
| Warmup steps | 750 | 750 |
| Schedule | Cosine | Cosine |
Training was executed on 16 x A100 (40GB) GPUs; in total the model saw roughly 26 billion tokens during v1.0 training and a further 15 billion during v1.1.
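These token counts can be sanity-checked from the table above: each optimizer step processes batch size × context size tokens. A quick back-of-envelope calculation, assuming the batch size of 512 is the effective (post-accumulation) batch size:

```python
batch_size = 512       # effective batch size (v1.0 and v1.1)
context_size = 1024    # tokens per sequence

tokens_per_step = batch_size * context_size   # 524,288 tokens per step
v1_0_tokens = tokens_per_step * 50_000        # v1.0: 50k steps
v1_1_tokens = tokens_per_step * 30_000        # v1.1: 30k additional steps

print(f"v1.0: {v1_0_tokens / 1e9:.1f}B tokens")  # ~26.2B
print(f"v1.1: {v1_1_tokens / 1e9:.1f}B tokens")  # ~15.7B
```

This matches the roughly 26 + 15 billion tokens stated above.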
### Performance
We evaluated the model on OpenAI's HumanEval benchmark, which consists of programming challenges. The performance metrics are as follows:
| Metric | v1.0 | v1.1 |
|---|---|---|
| pass@1 | 3.58% | 3.99% |
| pass@10 | 8.03% | 8.69% |
| pass@100 | 14.96% | 17.88% |
The pass@k metric indicates the probability that at least one out of k generations passes the tests.
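For reference, pass@k is usually computed with the unbiased estimator from the HumanEval paper: given n generations per problem of which c pass the tests, the estimate is 1 - C(n-c, k) / C(n, k). A minimal sketch (the function name and argument names are illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are correct, passes the tests."""
    if n - c < k:
        return 1.0  # every possible k-subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 generations of which 1 is correct, `pass_at_k(2, 1, 1)` gives 0.5, as expected. Per-problem estimates are then averaged over the benchmark.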
## Documentation
Here are some useful resources related to CodeParrot: