🚀 OctoCoder
OctoCoder is an instruction-tuned model that can handle tasks across many programming languages. It was created by fine-tuning StarCoder on CommitPackFT and OASST, with the goal of providing high-quality code generation and instruction-following capabilities.
🚀 Quick Start
Intended use
The model follows instructions provided in the input. You should always preface your input with "Question: " and finish it with "Answer:", for example: "Question: Please write a function in Python that performs bubble sort.\n\nAnswer:"
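For convenience, the prompt format can be wrapped in a small helper. The function below is purely illustrative and not part of the released code:

def build_prompt(instruction: str) -> str:
    # Wrap an instruction in the "Question: ... Answer:" format OctoCoder expects
    return f"Question: {instruction}\n\nAnswer:"

prompt = build_prompt("Please write a function in Python that performs bubble sort.")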
Feel free to share your generations in the Community tab!
Generation
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/octocoder"
device = "cuda"  # use "cpu" if no GPU is available

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Prompts must follow the "Question: ... Answer:" format described above
inputs = tokenizer.encode("Question: Please write a function in Python that performs bubble sort.\n\nAnswer:", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
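Depending on the default generation settings, the answer may be cut off. Generation parameters can be passed explicitly; a minimal sketch, with illustrative values:

# Generate up to 256 new tokens greedily; adjust to taste
outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))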
✨ Features
- Multilingual Support: Capable of handling over 80 programming languages.
- Instruction-Following: Can accurately follow instructions provided in the input for code generation.
📦 Installation
Although the original model card does not list installation steps, you need the transformers library to run the generation example above:
pip install -q transformers
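The snippet above also relies on PyTorch, which is assumed to be installed; if it is not:

pip install -q torch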
📚 Documentation
Model Summary
| Property | Details |
|----------|---------|
| Model Type | OctoCoder is an instruction-tuned model with 15.5B parameters, created by fine-tuning StarCoder on CommitPackFT & OASST as described in the OctoPack paper. |
| Training Data | CommitPack (4TB of GitHub commits across 350 programming languages), CommitPackFT (filtered version of CommitPack for high-quality commit messages that resemble instructions), and OASST |
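The instruction-tuning data can be inspected directly from the Hugging Face Hub. A minimal sketch, assuming the datasets library and the bigcode/commitpackft dataset ID with one configuration per programming language:

from datasets import load_dataset

# Load the Python configuration of CommitPackFT
commitpackft_py = load_dataset("bigcode/commitpackft", "python", split="train")
print(commitpackft_py[0])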
OctoPack🐙🎒
| Data/Model/Evaluation | Name | Details |
|---|---|---|
| Data | CommitPack | 4TB of GitHub commits across 350 programming languages |
| Data | CommitPackFT | Filtered version of CommitPack for high-quality commit messages that resemble instructions |
| Model | OctoCoder | StarCoder (16B parameters) instruction tuned on CommitPackFT + OASST |
| Model | OctoGeeX | CodeGeeX2 (6B parameters) instruction tuned on CommitPackFT + OASST |
| Evaluation | HumanEvalPack | Extension of OpenAI's HumanEval to cover 3 scenarios across 6 languages |
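The HumanEvalPack benchmark is also hosted on the Hub. A minimal sketch, assuming the bigcode/humanevalpack dataset with per-language configurations and a test split:

from datasets import load_dataset

# Each language (e.g. "python", "js", "java", "go", "cpp", "rust") is a separate configuration
humanevalpack_py = load_dataset("bigcode/humanevalpack", "python", split="test")
print(humanevalpack_py[0])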
Training
Model
- Architecture: GPT-2 model with multi-query attention and Fill-in-the-Middle objective
- Steps: 250k for pretraining & 30 for instruction tuning
- Tokens: 1 trillion for pretraining & 2M for instruction tuning
- Precision: bfloat16 (see the loading sketch below)
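Since the model was trained in bfloat16, it can also be loaded in that precision to roughly halve memory use. A minimal sketch; torch_dtype and device_map="auto" are standard transformers arguments, not taken from the original card:

import torch
from transformers import AutoModelForCausalLM

# Load OctoCoder in bfloat16 and let the weights be placed across available devices
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/octocoder",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)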
Hardware
- Pretraining:
- GPUs: 512 Tesla A100
- Training time: 24 days
- Instruction tuning:
- GPUs: 8 Tesla A100
- Training time: 4 hours
Software
- Orchestration: [Megatron-LM/Transformers](https://github.com/bigcode-project/octopack#training)
- Neural networks: PyTorch
Results
| Task | Dataset | Metric | Value | Verified |
|---|---|---|---|---|
| Text Generation | bigcode/humanevalpack (HumanEvalSynthesize Python) | pass@1 | 46.2 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalSynthesize JavaScript) | pass@1 | 39.2 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalSynthesize Java) | pass@1 | 38.2 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalSynthesize Go) | pass@1 | 30.4 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalSynthesize C++) | pass@1 | 35.6 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalSynthesize Rust) | pass@1 | 23.4 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalSynthesize Average) | pass@1 | 35.5 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalFix Python) | pass@1 | 30.4 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalFix JavaScript) | pass@1 | 28.4 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalFix Java) | pass@1 | 30.6 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalFix Go) | pass@1 | 30.2 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalFix C++) | pass@1 | 26.1 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalFix Rust) | pass@1 | 16.5 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalFix Average) | pass@1 | 27.0 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalExplain Python) | pass@1 | 35.1 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalExplain JavaScript) | pass@1 | 24.5 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalExplain Java) | pass@1 | 27.3 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalExplain Go) | pass@1 | 21.1 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalExplain C++) | pass@1 | 24.1 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalExplain Rust) | pass@1 | 14.8 | false |
| Text Generation | bigcode/humanevalpack (HumanEvalExplain Average) | pass@1 | 24.5 | false |
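All values above are pass@1 scores. pass@k is usually estimated with the unbiased estimator from the HumanEval paper, 1 - C(n-c, k)/C(n, k) averaged over problems, where n samples are generated per problem and c of them pass the unit tests. A minimal sketch of that estimator (not code from the OctoPack repository):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator of pass@k: probability that at least one of k samples,
    # drawn without replacement from n generations of which c are correct, passes.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 generations, 9 correct -> estimated pass@1 of 0.45
print(pass_at_k(n=20, c=9, k=1))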
📄 License
The model is released under the bigcode-openrail-m license.
📖 Citation
@article{muennighoff2023octopack,
title={OctoPack: Instruction Tuning Code Large Language Models},
author={Niklas Muennighoff and Qian Liu and Armel Zebaze and Qinkai Zheng and Binyuan Hui and Terry Yue Zhuo and Swayam Singh and Xiangru Tang and Leandro von Werra and Shayne Longpre},
journal={arXiv preprint arXiv:2308.07124},
year={2023}
}