Refact-1.6B FIM GGUF
🚀 Refact-1.6B
This is the Refact-1.6B model, which offers high-performance code generation and chat capabilities. After fine-tuning on generated data, it outperforms many other models on code-related tasks and also performs well in chat scenarios.
🚀 Quick Start
Code Generation
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "smallcloudai/Refact-1_6B-fim"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)
# Fill-in-the-middle prompt: the model generates the <fim_middle> span
# that fits between the prefix and the suffix.
prompt = '<fim_prefix>def print_hello_world():\n    """<fim_suffix>\n    print("Hello world!")<fim_middle>'
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=100, temperature=0.2, do_sample=True)
print("-"*80)
print(tokenizer.decode(outputs[0]))
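The decoded output contains the echoed prompt and FIM special tokens as well as the completion. If you only want the infilled span, a minimal post-processing sketch (assuming the model terminates the middle span with an <|endoftext|> token) is:
raw = tokenizer.decode(outputs[0])
# Keep only the generated middle: drop the echoed prompt and the terminator.
middle = raw.split("<fim_middle>")[-1].split("<|endoftext|>")[0]
print(middle)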
Chat Usage
# Refact chat prompt format
prompt_template = "<empty_output>SYSTEM {system}\n" \
                  "<empty_output>USER {query}\n" \
                  "<empty_output>ASSISTANT"
prompt = prompt_template.format(system="You are a programming assistant",
                                query="How do I sort a list in Python?")
✨ Features
- High-performance Code Generation: after fine-tuning, it beats many other models, such as Replit 3B and Stability Code 3B, on code-generation tasks.
- Chat Capability: it can be used as a chat model and performs well compared with chat-specialized models.
- Multi-language Support: although trained mainly on English text, it has exposure to multiple languages through code comments.
- Fill-in-the-Middle (FIM): supports FIM, which is useful for code completion in specific scenarios (see the Quick Start example).
📦 Installation
The original document provides no dedicated installation command. To use the model, install the necessary libraries as shown in the Quick Start code:
pip install -q transformers
💻 Usage Examples
Basic Usage
The basic example is identical to the code-generation snippet in the Quick Start section above.
Advanced Usage
The advanced example is identical to the chat snippet in the Quick Start section above.
📚 Documentation
Model Architecture
As described in more detail in the blog post, we used:
- ALiBi-based attention (an illustrative sketch follows this list)
- LayerNorm instead of RMSNorm
- Multi-query attention
We also used LiON, Flash Attention, and early dropout.
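For readers unfamiliar with ALiBi, here is an illustrative sketch, not the model's actual implementation, of how ALiBi replaces positional embeddings with a per-head linear bias added to the attention logits:
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Per-head slopes form a geometric sequence, as in the ALiBi paper
    # (sketch assumes n_heads is a power of two).
    start = 2 ** (-8.0 / n_heads)
    slopes = torch.tensor([start ** (i + 1) for i in range(n_heads)])
    # distance[i][j] = j - i, clamped so future positions contribute 0
    # (they are removed by the causal mask anyway).
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)
    # Shape (n_heads, seq_len, seq_len); added to the attention scores.
    return slopes[:, None, None] * distance[None, :, :]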
Pretraining
For the base model, we used our own dataset, which contains only code with permissive licenses, plus open text datasets. Filtering is the key to this model's success:
- We used only text in English.
- We used only topics related to computer science.
- We applied heavy deduplication (a minimal sketch follows this list).
The text-to-code proportion was 50:50, and the model was trained for 1.2T tokens.
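The card does not describe the deduplication method; purely as an illustration, exact-match document deduplication could look like this (hash-based approach is an assumption, not the actual pipeline):
import hashlib

def dedup(documents):
    # Drop exact duplicates by hashing whitespace-normalized text.
    # Illustrative only; the real filtering pipeline is not described.
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(" ".join(doc.split()).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique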
Finetuning
We tested our hypothesis that chat data should boost base-model performance on FIM and regular left-to-right code completion. We found that mixing in just 15% of open code instruction-following datasets, filtered for quality, improves almost all metrics. Additionally, to improve FIM, we observed common failure modes and prepared a synthetic dataset based on The Stack dedup v1.1 to address them (a sketch of FIM example construction follows).
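The exact recipe for the synthetic FIM data is not given. As an assumption-labeled sketch, a FIM training example is commonly built by splitting a file into prefix/middle/suffix and rearranging the pieces with the special tokens shown in Quick Start:
import random

def make_fim_example(code: str) -> str:
    # Illustrative sketch of the common prefix-suffix-middle (PSM) layout;
    # uniform span selection is an assumption, not the card's recipe.
    i, j = sorted(random.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"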
🔧 Technical Details
Ultra-Low-Bit Quantization with IQ-DynamicGate (1-2 bit)
Our latest quantization method introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on Llama-3-8B. This approach uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.
Benchmark Context
All tests were conducted on Llama-3-8B-Instruct using:
- A standard perplexity evaluation pipeline (a minimal sketch follows this list)
- A 2048-token context window
- The same prompt set across all quantizations
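The exact evaluation harness is not specified. A minimal perplexity sketch over non-overlapping 2048-token chunks (an illustrative stand-in, not the benchmark code; model and tokenizer are Hugging Face objects as loaded in Quick Start):
import torch

def perplexity(model, tokenizer, text, ctx=2048):
    # Mean of per-chunk LM losses, exponentiated; chunks are non-overlapping.
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    losses = []
    for start in range(0, len(ids) - 1, ctx):
        chunk = ids[start:start + ctx].unsqueeze(0)
        with torch.no_grad():
            out = model(chunk, labels=chunk)  # labels are shifted internally
        losses.append(out.loss)
    return torch.exp(torch.stack(losses).mean()).item()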
Method
- Dynamic Precision Allocation (a layer-assignment sketch follows this list):
  - First/last 25% of layers → IQ4_XS (selected layers)
  - Middle 50% → IQ2_XXS/IQ3_S (increases efficiency)
- Critical Component Protection:
  - Embeddings/output layers use Q5_K
  - Reduces error propagation by 38% vs. standard 1-2 bit quantization
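As an illustration of the allocation rule above (the quant-type names come from llama.cpp, but this helper function is hypothetical):
def quant_type_for_layer(i: int, n_layers: int) -> str:
    # Hypothetical helper; thresholds taken from the description above.
    if i < n_layers * 0.25 or i >= n_layers * 0.75:
        return "IQ4_XS"   # first/last 25% of layers keep higher precision
    return "IQ2_XXS"      # middle 50% uses ultra-low-bit types (or IQ3_S)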
Quantization Performance Comparison (Llama-3-8B)

| Quantization | Standard PPL | DynamicGate PPL | Δ PPL | Std Size | DG Size | Δ Size | Std Speed | DG Speed |
|---|---|---|---|---|---|---|---|---|
| IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
| IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
| IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
| IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
| IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |
Key:
- PPL = perplexity (lower is better)
- Δ PPL = percentage change from standard to DynamicGate (worked example below)
- Speed = inference time (CPU AVX2, 2048-token context)
- Size differences reflect mixed-quantization overhead
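As a worked check of the Δ PPL column, take the IQ2_XXS row: (9.84 - 11.30) / 11.30 × 100 ≈ -12.9%, which matches the reported value.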
Choosing the Right Model Format
Selecting the correct model format depends on your hardware capabilities and memory constraints.
BF16 (Brain Float 16) – Use if BF16 acceleration is available
- A 16-bit floating-point format designed for faster computation while retaining good precision.
- Provides a dynamic range similar to FP32 but with lower memory usage.
- Recommended if your hardware supports BF16 acceleration (check your device's specs; a detection sketch follows the F16 notes).
- Ideal for high-performance inference with a reduced memory footprint compared to FP32.
F16 (Float 16) – More widely supported than BF16
- A 16-bit floating-point format with high precision but a smaller range of values than BF16.
- Works on most devices with FP16 acceleration support (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16 but generally sufficient for inference.
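A small sketch of checking for BF16 support before picking a dtype (uses the standard PyTorch API; the fallback chain itself is an assumption, not guidance from this card):
import torch
from transformers import AutoModelForCausalLM

checkpoint = "smallcloudai/Refact-1_6B-fim"
# Prefer BF16 where the GPU supports it, else FP16 on CUDA, else FP32 on CPU.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
elif torch.cuda.is_available():
    dtype = torch.float16
else:
    dtype = torch.float32
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=dtype,
                                             trust_remote_code=True)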
Quantized Models (Q4_K, Q6_K, Q8_0, etc.) – For CPU & low-VRAM inference
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- Lower-bit models (Q4_K) → best for minimal memory usage; may have lower precision.
- Higher-bit models (Q6_K, Q8_0) → better accuracy; require more memory.
Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)
These models are optimized for extreme memory efficiency, making them ideal for low-power devices or large-scale deployments where memory is a critical constraint.
- IQ3_XS: Ultra-low-bit quantization (3-bit) with extreme memory efficiency.
  - Use case: Best for ultra-low-memory devices where even Q4_K is too large.
  - Trade-off: Lower accuracy compared to higher-bit quantizations.
- IQ3_S: Small block size for maximum memory efficiency.
  - Use case: Best for low-memory devices where IQ3_XS is too aggressive.
- IQ3_M: Medium block size for better accuracy than IQ3_S.
  - Use case: Suitable for low-memory devices where IQ3_S is too limiting.
- Q4_K: 4-bit quantization with block-wise optimization for better accuracy.
  - Use case: Best for low-memory devices where Q6_K is too large.
- Q4_0: Pure 4-bit quantization, optimized for ARM devices.
  - Use case: Best for ARM-based devices or low-memory environments.
Summary Table: Model Format Selection

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|---|---|---|---|---|
| BF16 | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| F16 | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
| Q4_K | Medium-Low | Low | CPU or low-VRAM devices | Best for memory-constrained environments |
| Q6_K | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| Q8_0 | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| IQ3_XS | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency, lower accuracy |
| Q4_0 | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
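Once you have chosen a format, a minimal sketch of loading a GGUF file with llama-cpp-python (the filename below is a hypothetical example; check this repository's file list for the actual quant filenames):
from llama_cpp import Llama

# Hypothetical filename; substitute the quant file you downloaded.
llm = Llama(model_path="Refact-1_6B-fim.Q4_K.gguf", n_ctx=4096)
prompt = '<fim_prefix>def add(a, b):\n    """<fim_suffix>\n    return a + b<fim_middle>'
out = llm(prompt, max_tokens=64, temperature=0.2)
print(out["choices"][0]["text"])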
📄 License
The model is licensed under the BigScience OpenRAIL-M v1 license agreement.
Model Stats

| Property | Details |
|---|---|
| Model type | LLaMA-like model with multi-query attention |
| Objectives | Fill-in-the-Middle, Chat |
| Context size | 4096 tokens |
| Pretraining tokens | 1.2T |
| Finetuning tokens | 40B |
| GPUs | 64 x NVidia A5000 |
| Precision | bfloat16 |
| Training time | 28 days |
⚠️ Important Note
The Refact-1.6B model was trained on text in English. Its performance on non-English languages is lower.
💡 Usage Tip
If you want to test the AI-Powered Network Monitor Assistant, click the chat icon (bottom right on any page), choose an AI assistant type (TurboLLM, FreeLLM, or TestLLM), and try some example AI commands such as "Give me info on my website's SSL certificate", "Check if my server is using quantum-safe encryption for communication", or "Run a quick Nmap vulnerability test".