NT-Java-1.1B: A Specialized Java Code Model
The NT-Java-1.1B is an open-source specialized code model designed for Java programming. It extends pre-training on StarCoderBase-1B, offering efficient solutions for Java coding tasks.
Quick Start
Sample inference code
Generation
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "infosys/NT-Java-1.1B"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("public class HelloWorld {\n public static void main(String[] args) {", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
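By default, generate() may stop after only a handful of new tokens; passing max_new_tokens gives a longer completion. A minimal variant of the call above (the value 128 and the pad_token_id choice are illustrative assumptions, not requirements):

# Assumes `model`, `tokenizer`, and `inputs` from the snippet above.
# max_new_tokens bounds how much Java code is appended after the prompt;
# setting pad_token_id explicitly silences the open-ended-generation warning.
outputs = model.generate(inputs, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))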
Fill-in-the-middle
Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix parts of the input and output:
input_text = "<fim_prefix>public class PalindromeChecker {\n public static boolean isPalindrome(String str) {\n <fim_suffix>return true;\n }\n<fim_middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
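The decoded output still contains the FIM control tokens, so the infilled code usually needs to be extracted. A minimal sketch, assuming the tokenizer exposes its end-of-text token via eos_token:

# Everything generated after <fim_middle> (and before end-of-text) is the infilled code.
decoded = tokenizer.decode(outputs[0])
middle = decoded.split("<fim_middle>")[-1]
if tokenizer.eos_token and tokenizer.eos_token in middle:
    middle = middle.split(tokenizer.eos_token)[0]
print(middle)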
Quantized Versions through bitsandbytes
- Using 8-bit precision (int8)
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
# to use 4-bit precision, pass `load_in_4bit=True` instead
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
checkpoint = "infosys/NT-Java-1.1B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config)
inputs = tokenizer.encode("public class HelloWorld {\n public static void main(String[] args) {", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
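The comment in the snippet above points to 4-bit loading for even lower memory use; a minimal sketch of that variant (only the quantization config changes, the tokenizer and generation code stay the same):

# 4-bit quantization via bitsandbytes; reuse the tokenizer and generation code above.
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config)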
Features
- Specialized for Java: Tailored for Java programming tasks, offering high-quality code generation and completion.
- Small and Deployable: As a small language model (SLM), it can be deployed on consumer-grade PCs.
- Good Performance: Outperforms comparably sized open-source code models in Java programming tasks.
Installation
You only need to install the required libraries. For the sample inference code, run:
pip install -q transformers
To use the quantized versions through bitsandbytes, run:
pip install bitsandbytes accelerate
Documentation
Model Summary
The Narrow Transformer (NT) model NT-Java-1.1B is an open-source specialized code model. It is a decoder-only transformer with Multi-Query Attention and a context length of 8192 tokens. The model was trained with the Java subset of the StarCoderData dataset, which is ~22B tokens.
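A quick way to confirm these architectural details locally is to inspect the checkpoint's configuration. A minimal sketch; the attribute names (n_positions, multi_query, n_layer) assume the GPTBigCode-style config used by StarCoder-family checkpoints, so treat them as assumptions:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("infosys/NT-Java-1.1B")
# Expected, per this card: 8192-token context and Multi-Query Attention.
print("context length:", getattr(config, "n_positions", None))
print("multi-query attention:", getattr(config, "multi_query", None))
print("layers:", getattr(config, "n_layer", None))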
Intended Uses
Large code models require specialized hardware like GPUs for inference. The NT-Java-1.1B, as an SLM, can be deployed on consumer-grade PCs; a minimal CPU-only sketch follows the list below. It is suitable for:
- Use in memory/compute-constrained environments.
- Use in latency-sensitive scenarios.
- Code generation and completion tasks in Java.
- FIM (code infilling) tasks in Java.
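As a rough illustration of the consumer-grade-PC scenario, the model can run entirely on CPU. A minimal sketch (the prompt, dtype, and max_new_tokens are illustrative choices, not recommendations from this card):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "infosys/NT-Java-1.1B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# ~1.1B parameters fit in a few GB of RAM; float32 is the safe default on CPU.
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float32)

prompt = "public static int factorial(int n) {"
inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))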
Training
Model
- Architecture: GPT-2 model with Multi-Query Attention and a Fill-in-the-Middle objective.
- Training steps: 120K
- Context length: 8K tokens
- Pretraining tokens: 22 billion
- Precision: bfloat16
Hardware
- GPUs: 6 NVIDIA A100 80GB
- Training time: 10 days
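As a back-of-envelope check derived only from the figures above (not an official training configuration):

# Card-derived estimates; actual batch sizes and schedules are not published here.
tokens, steps, ctx = 22e9, 120_000, 8192
print(tokens / steps)        # ~183K tokens processed per optimizer step
print(tokens / steps / ctx)  # ~22 full-length 8K sequences per step
print(6 * 10 * 24)           # ~1,440 A100 GPU-hours in total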
Software
- Orchestration: [Megatron-LM](https://github.com/Infosys/Megatron-LM)
- Neural networks: PyTorch
Attribution & Other Requirements
The pretraining dataset for the model was curated to include only data with permissive licenses. However, the model can generate source code verbatim from the dataset. BigCode provides a search index to help users trace the origins of generated code and comply with licensing requirements.
Limitations
The NT-Java-1.1B model has been trained on publicly available datasets and offers no safety guarantees. Its outputs are unpredictable, and the generated code may be inefficient, contain bugs, or have security vulnerabilities. Users and developers should conduct extensive safety testing and implement filtering mechanisms.
Technical Details
The model is based on the GPT-2 architecture with Multi-Query Attention and a Fill-in-the-Middle objective. It has a context length of 8192 tokens and was trained on 22 billion tokens from the Java subset of the StarCoderData dataset. Training ran for 120K steps in bfloat16 precision on 6 NVIDIA A100 80GB GPUs over 10 days, using [Megatron-LM](https://github.com/Infosys/Megatron-LM) for orchestration and PyTorch for the neural network implementation.
License
The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).
Citation
@article{rathinasamy2024narrow,
title={Narrow Transformer: Starcoder-Based Java-LM For Desktop},
author={Kamalkumar Rathinasamy and Balaji A J and Rajab Ali Mondal and Ankush Kumar and Harshini K and Gagan Gayari and Sreenivasa Raghavan Karumboor Seshadri and Swayam Singh},
journal={arXiv preprint arXiv:2407.03941},
year={2024}
}
Model Information Table
| Property | Details |
| --- | --- |
| Model Type | NarrowTransformer |
| Training Data | bigcode/starcoderdata |
| Metrics | code_eval |
| Library Name | transformers |
| License | bigcode-openrail-m |
| Duplicated From | bigcode-data/starcoderbase-1b |
| Results | Task: text-generation, Dataset: nuprl/MultiPL-E (MultiPL-HumanEval (Java)), pass@1: 20.2 |
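For context on the pass@1 figure above: code_eval-style benchmarks typically report the unbiased pass@k estimator from the HumanEval paper. A minimal sketch of that formula (illustrative only; not the exact harness used to produce this number):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k: n samples generated per problem, c of them correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem with 4 correct gives an estimated pass@1 of 0.2.
print(pass_at_k(n=20, c=4, k=1))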
Important Note
The pretraining dataset for the model was curated to include only data with permissive licenses. However, the model can generate source code verbatim from the dataset. The licenses of such code may require attribution and adherence to other specific conditions. Use the search index provided by BigCode to trace the origins of generated code and comply with licensing requirements.
Usage Tip
The NT-Java-1.1B model is trained on publicly available datasets and has no safety guarantees. The generated code may be inefficient, contain bugs, or have security vulnerabilities. Conduct extensive safety testing and implement filtering mechanisms according to your specific needs.