StarCoderPlus Open-Source Code Generation Model - Supports Multi-Language Programming and Natural Language Processing Tasks

StarCoderPlus

Developed by bigcode

StarCoderPlus is a powerful code generation model developed under the BigCode project, supporting multiple programming languages and natural language processing tasks.

Large Language Model

Transformers

Other #Code Generation #Multilingual Programming #Machine Learning Inference

Release Time: 5/8/2023

Model Overview

StarCoderPlus is a versatile large language model focused on code generation and text comprehension tasks, suitable for programming assistance and multilingual text processing.

Model Features

Powerful Code Generation Capability

Capable of generating high-quality code snippets based on prompts, supporting multiple programming languages.

Multilingual Support

Handles natural language tasks in addition to programming languages. The model was trained mainly on English text, so results in other languages may be limited.

High-performance Inference

Evaluated on multiple benchmarks, such as HumanEval and MMLU; on HumanEval it reaches 26.7% pass@1.

Model Capabilities

Code Generation

Text Understanding

Multilingual Processing

Common-sense Reasoning

Abstract Reasoning

Use Cases

Programming Assistance

Code Completion

Automatically generates complete code implementations based on function signatures or comments.

Achieved 26.7% pass@1 on the HumanEval benchmark.
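The pass@1 figure above is the share of problems solved by a single sampled completion. As a minimal sketch, the commonly used unbiased pass@k estimator can be written as below, assuming n completions are sampled per problem and c of them pass the unit tests. The function names and the example counts are illustrative only and are not taken from the StarCoderPlus evaluation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem.

    n: completions sampled, c: completions that pass the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the samples contains no passing completion.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts, chosen only to show the arithmetic.
example = [(10, 3), (10, 0), (10, 10)]
print(round(mean_pass_at_k(example, k=1), 4))
```

For k = 1 the estimator reduces to c / n per problem, so the benchmark score is simply the mean fraction of passing samples.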

Education

Machine Learning Concept Explanation

Explains complex machine learning concepts in simple language, such as gradient descent.

🚀 StarCoderPlus

StarCoderPlus is a language model for text generation in English and over 80 programming languages, trained on diverse datasets. It is not itself an instruction model; an instruction-tuned version is available to try at StarChat-Beta.

🚀 Quick Start

Play with the instruction-tuned StarCoderPlus at StarChat-Beta.

✨ Features

Multi-language Support: Trained on English and 80+ programming languages.
Advanced Architecture: Uses Multi Query Attention, a context window of 8192 tokens, and was trained with the Fill-in-the-Middle objective.
Fine-tuned Model: Fine-tuned on a mix of high-quality datasets for better performance.
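The 8192-token context window limits the prompt and the generated tokens together. The sketch below shows one way to trim an over-long tokenized prompt so that room is left for generation. It works on a plain list of token ids, and the helper name and the keep-the-tail policy are assumptions of this example, not part of the model's tooling.

```python
CONTEXT_WINDOW = 8192  # context length stated in the feature list above


def fit_prompt(token_ids, max_new_tokens, context_window=CONTEXT_WINDOW):
    """Trim token_ids from the left so prompt + generation fits the window."""
    if not 0 < max_new_tokens < context_window:
        raise ValueError("max_new_tokens must be between 1 and context_window - 1")
    budget = context_window - max_new_tokens
    # Keep the most recent tokens: for code completion, the text right
    # before the cursor usually matters most (an assumption of this sketch).
    return list(token_ids[-budget:])


prompt = list(range(10_000))  # stand-in for real token ids
trimmed = fit_prompt(prompt, max_new_tokens=256)
print(len(trimmed))  # 7936
```

Prompts that already fit are returned unchanged, so the helper can be applied unconditionally before calling the model.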

📦 Installation

The original README gives no dedicated installation steps. The usage examples require the transformers library (pip install -q transformers) and PyTorch.

💻 Usage Examples

Basic Usage

# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoderplus"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
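The basic example loads the weights at the library's default precision and generates only a short default-length continuation. The sketch below first estimates how much memory the weights of a 15.5B-parameter model need, then loads the checkpoint in bfloat16 with an explicit generation length. `torch_dtype` and `max_new_tokens` are standard transformers arguments; the helper names are illustrative, and the code assumes PyTorch, transformers, and enough accelerator memory are available.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


N_PARAMS = 15.5e9  # parameter count stated in the model summary
print(f"float32:  {weight_memory_gib(N_PARAMS, 4):.1f} GiB")
print(f"bfloat16: {weight_memory_gib(N_PARAMS, 2):.1f} GiB")


def generate_completion(prompt: str, max_new_tokens: int = 64,
                        checkpoint: str = "bigcode/starcoderplus",
                        device: str = "cuda") -> str:
    """Load the checkpoint in bfloat16 and complete `prompt` (illustrative)."""
    # Imported lazily so the estimate above runs without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint, torch_dtype=torch.bfloat16
    ).to(device)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # avoids a missing-pad warning
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

The helper reloads the model on every call to keep the sketch short; in real use the tokenizer and model would be loaded once and reused. The estimate covers weights only and ignores activations and the attention cache.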

Advanced Usage

Fill-in-the-Middle

Fill-in-the-middle uses special tokens to identify the prefix, middle, and suffix parts of the input and output:

input_text = "<fim_prefix>def print_hello_world():\n    <fim_suffix>\n    print('Hello world!')<fim_middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
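Because the decoded output contains the whole prompt as well as the generated middle, some string handling is usually needed afterwards. The sketch below builds the prompt from a prefix and suffix and pulls the infill out of a decoded string. The three FIM markers come from the example above; the helper names are illustrative, and the end-of-text marker is an assumption that should be checked against the tokenizer.

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"
EOS = "<|endoftext|>"  # assumed end-of-text marker; verify with the tokenizer


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix and suffix in the prefix/suffix/middle order shown above."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def extract_middle(decoded: str) -> str:
    """Return the text generated after <fim_middle>, up to EOS if present."""
    _, marker, tail = decoded.partition(FIM_MIDDLE)
    if not marker:
        raise ValueError("decoded text contains no <fim_middle> marker")
    return tail.split(EOS, 1)[0]


prefix = "def print_hello_world():\n    "
suffix = "\n    print('Hello world!')"
prompt = build_fim_prompt(prefix, suffix)

# Stand-in for tokenizer.decode(outputs[0]); not real model output.
fake_decoded = prompt + "print('Starting')" + EOS
middle = extract_middle(fake_decoded)
print(prefix + middle + suffix)
```

Splicing `prefix + middle + suffix` yields the completed source file, which is what an editor integration would typically write back.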

📚 Documentation

Model Summary

StarCoderPlus is a fine-tuned version of StarCoderBase on a mix of:

The English web dataset RefinedWeb (1x)
StarCoderData dataset from The Stack (v1.2) (1x)
A Wikipedia dataset that has been upsampled 5 times (5x)
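To make the effect of the multipliers concrete, the sketch below reads them as repetition factors and computes each source's share of the resulting mixture. The relative sizes are hypothetical placeholders chosen only to show the arithmetic; the real dataset sizes are not given here.

```python
def mixture_shares(sizes, repeats):
    """Share of each source after repeating it repeats[name] times."""
    effective = {name: sizes[name] * repeats[name] for name in sizes}
    total = sum(effective.values())
    return {name: value / total for name, value in effective.items()}


# Repetition factors from the list above.
repeats = {"RefinedWeb": 1, "StarCoderData": 1, "Wikipedia": 5}
# Hypothetical relative sizes (not real token counts).
sizes = {"RefinedWeb": 100, "StarCoderData": 100, "Wikipedia": 2}

shares = mixture_shares(sizes, repeats)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Upsampling lets a small, high-quality source contribute a larger share of training tokens than its raw size would give it.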

StarCoderPlus is a 15.5B parameter language model trained on English and 80+ programming languages. The model uses Multi Query Attention, a context window of 8192 tokens, and was trained using the Fill-in-the-Middle objective on 1.6 trillion tokens.

Repository: bigcode/Megatron-LM
Project Website: bigcode-project.org
Point of Contact: contact@bigcode-project.org
Languages: English & 80+ Programming languages

Intended use

The model was trained on English and GitHub code. As such, it is not an instruction model, and commands like "Write a function that computes the square root." do not work well. However, the instruction-tuned version in StarChat makes a capable assistant.
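Since the model continues text rather than following commands, a common workaround for base code models is to phrase a request as the start of a file for the model to complete. The sketch below turns a task description into a function signature plus docstring; the helper name is illustrative, and whether this improves results for a given task is an assumption to verify.

```python
def as_completion_prompt(signature: str, description: str) -> str:
    """Turn a task description into a code prefix the model can continue."""
    return f'{signature}\n    """{description}"""\n'


prompt = as_completion_prompt(
    "def square_root(x: float) -> float:",
    "Return the square root of x.",
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the instruction-style sentence.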

Feel free to share your generations in the Community tab!

Attribution & Other Requirements

The code portion of the model's training data was filtered to include permissive licenses only. Nevertheless, the model can generate source code verbatim from the dataset. The code's license might require attribution and/or other specific requirements that must be respected. We provide a search index that lets you search through the pretraining data to identify where generated code came from and apply the proper attribution to your code.

🔧 Technical Details

Training

StarCoderPlus is a version of StarCoderBase, which was pre-trained on 1T code tokens, fine-tuned on a further 600B English and code tokens. Below are the fine-tuning details:

Model

Architecture: GPT-2 model with multi-query attention and Fill-in-the-Middle objective
Finetuning steps: 150k
Finetuning tokens: 600B
Precision: bfloat16
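The figures above imply a batch size that the text does not state directly. The sketch below derives it, along with the total token count, using only numbers given in this document; the sequences-per-step value additionally assumes every sequence fills the 8192-token context window.

```python
FINETUNE_TOKENS = 600e9   # fine-tuning tokens, from the list above
FINETUNE_STEPS = 150_000  # fine-tuning steps, from the list above
PRETRAIN_TOKENS = 1e12    # StarCoderBase pre-training tokens, from the text
CONTEXT_WINDOW = 8192     # context length stated in the model summary

tokens_per_step = FINETUNE_TOKENS / FINETUNE_STEPS
# Assumes every sequence is packed to the full context length.
sequences_per_step = tokens_per_step / CONTEXT_WINDOW
total_tokens = PRETRAIN_TOKENS + FINETUNE_TOKENS

print(f"tokens per step:    {tokens_per_step:,.0f}")
print(f"sequences per step: {sequences_per_step:.1f}")
print(f"total tokens seen:  {total_tokens:.1e}")
```

The total of 1.6 trillion tokens matches the figure quoted in the model summary, which counts pre-training and fine-tuning together.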

Hardware

GPUs: 512 Tesla A100
Training time: 14 days
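For a sense of scale, the hardware figures can be converted into GPU-hours and an average throughput. This is a rough sketch that assumes the 14 days refer to the 600B-token fine-tuning run and that all 512 GPUs ran for the whole period; neither assumption is confirmed by the text.

```python
N_GPUS = 512             # from the hardware list above
TRAINING_DAYS = 14       # from the hardware list above
FINETUNE_TOKENS = 600e9  # fine-tuning tokens, from the training details

gpu_hours = N_GPUS * TRAINING_DAYS * 24
wall_seconds = TRAINING_DAYS * 24 * 3600
tokens_per_second = FINETUNE_TOKENS / wall_seconds
tokens_per_gpu_second = tokens_per_second / N_GPUS

print(f"GPU-hours:           {gpu_hours:,}")
print(f"tokens/s (cluster):  {tokens_per_second:,.0f}")
print(f"tokens/s (per GPU):  {tokens_per_gpu_second:,.0f}")
```

These are averages over wall-clock time, so they include any checkpointing or restart overhead that occurred during the run.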

Software

Orchestration: Megatron-LM
Neural networks: PyTorch
BF16 if applicable: apex

📄 License

The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement here.

Additional Information

Property	Details
Model Type	Text-generation
Training Data	bigcode/the-stack-dedup, tiiuae/falcon-refinedweb
Metrics	code_eval, mmlu, arc, hellaswag, truthfulqa
Library Name	transformers
Tags	code

⚠️ Important Note

The model has been trained on a mixture of English text from the web and GitHub code. It may therefore perform poorly on non-English text, and it can carry the stereotypes and biases commonly encountered online. Additionally, the generated code should be used with caution, as it may contain errors, inefficiencies, or potential vulnerabilities. For a more comprehensive understanding of the base model's code limitations, please refer to the StarCoder paper.

💡 Usage Tip

The model is not an instruction model. For better performance with instructions, use the instruction-tuned version in StarChat.
