# OpenCodeReasoning-Nemotron-32B-IOI GGUF Models
The OpenCodeReasoning-Nemotron-32B-IOI GGUF models are large language models (LLMs) derived from Qwen2.5-32B-Instruct. They are designed for code generation with reasoning capabilities and support a context length of 32K tokens. These models offer various quantization options to suit different hardware and memory requirements, enabling efficient deployment in diverse scenarios.
## 🚀 Quick Start
### Prerequisites
- Install the `transformers` library:

  ```bash
  pip install transformers
  ```
- Ensure your hardware supports the desired model format (e.g., BF16, F16, or one of the quantized formats); a quick capability check is sketched below.
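If you are unsure whether your GPU can run the BF16 weights, a quick check with PyTorch looks like the following (a minimal sketch; it assumes a CUDA build of PyTorch and only reports support, it does not load the model):

```python
import torch

# Minimal capability check (assumes a CUDA build of PyTorch is installed).
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA device detected; consider the quantized GGUF formats for CPU-only inference.")
```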
### Running Inference
Below are examples of running inference on coding problems: one targeting the IOI benchmark (C++) and one for general Python problems.
#### IOI Benchmark (C++)
````python
import transformers
import torch

model_id = "nvidia/OpenCodeReasoning-Nemotron-32B-IOI"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

prompt = """You are a helpful and harmless assistant. You should think step-by-step before responding to the instruction below.

Please use c++ programming language only.

You must use ```cpp for just the final solution code block with the following format:
```cpp
// Your code here
```

{user}
"""

messages = [
    {"role": "user", "content": prompt.format(user="Write a program to calculate the sum of the first $N$ fibonacci numbers")},
]

outputs = pipeline(messages, max_new_tokens=32768)
print(outputs[0]["generated_text"][-1]["content"])
````
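The model reasons step by step before emitting the final ```cpp block, so the raw response usually needs post-processing. A minimal sketch for pulling out the last fenced C++ block (the helper name and regex are illustrative, not part of the model's API) is:

```python
import re

def extract_final_cpp(response: str):
    # Illustrative helper: grab the last ```cpp fenced block from the response.
    blocks = re.findall(r"```cpp\s*(.*?)```", response, re.DOTALL)
    return blocks[-1].strip() if blocks else None

# Continues the example above: `outputs` comes from the pipeline call.
solution = extract_final_cpp(outputs[0]["generated_text"][-1]["content"])
print(solution)
```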
#### Python Programs
````python
import transformers
import torch

model_id = "nvidia/OpenCodeReasoning-Nemotron-32B"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

prompt = """You are a helpful and harmless assistant. You should think step-by-step before responding to the instruction below.

Please use python programming language only.

You must use ```python for just the final solution code block with the following format:
```python
# Your code here
```

{user}
"""

messages = [
    {"role": "user", "content": prompt.format(user="Write a program to calculate the sum of the first $N$ fibonacci numbers")},
]

outputs = pipeline(messages, max_new_tokens=32768)
print(outputs[0]["generated_text"][-1]["content"])
````
## ✨ Features
### Ultra-Low-Bit Quantization with IQ-DynamicGate (1-2 bit)
Our latest quantization method introduces precision-adaptive bit allocation for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on Llama-3-8B. It uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.
### Multiple Model Formats
- **BF16 (Brain Float 16)**: A 16-bit floating-point format designed for faster computation while retaining good precision. Ideal for high-performance inference with reduced memory footprint compared to FP32.
- **F16 (Float 16)**: A 16-bit floating-point format with high precision and wide device support. Suitable for GPU inference when BF16 isn't available.
- **Quantized Models (Q4_K, Q6_K, Q8_0, IQ3_XS, etc.)**: Offer a range of precision and memory usage trade-offs, making them suitable for various hardware and memory constraints.
### AI Network Monitoring Capabilities
The models are being tested for AI network monitoring tasks, including function calling against live network services, automated Nmap scans, and quantum-readiness checks.
## 📦 Installation
The model can be installed using the `transformers` library. Refer to the "Quick Start" section for installation and usage instructions.
## 💻 Usage Examples
### Running Inference on Coding Problems
See the code examples in the "Quick Start" section for running inference on coding problems for the IOI Benchmark and Python programs.
## 📚 Documentation
### Model Generation Details
This model was generated using [llama.cpp](https://github.com/ggerganov/llama.cpp) at commit [`92ecdcc0`](https://github.com/ggerganov/llama.cpp/commit/92ecdcc06a4c405a415bcaa0cb772bc560aa23b1).
### Ultra-Low-Bit Quantization
#### Benchmark Context
All tests were conducted on Llama-3-8B-Instruct using a standard perplexity evaluation pipeline, a 2048-token context window, and the same prompt set across all quantizations.
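For context, a sliding-window perplexity measurement of this kind can be sketched as follows (a simplified illustration with a fixed 2048-token window; the model id and corpus file are placeholders, not the exact benchmark pipeline used for the table below):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Simplified perplexity sketch over fixed 2048-token chunks.
# Model id and corpus file are placeholders, not the exact benchmark setup.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

text = open("eval_corpus.txt").read()                 # placeholder prompt set
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

window, nlls = 2048, []
for start in range(0, ids.size(1) - window, window):
    chunk = ids[:, start : start + window]
    with torch.no_grad():
        out = model(chunk, labels=chunk)              # mean cross-entropy over the window
    nlls.append(out.loss)

print("Perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```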
#### Method
- **Dynamic Precision Allocation** (an illustrative sketch follows below):
  - First/last 25% of layers → IQ4_XS (selected layers)
  - Middle 50% → IQ2_XXS/IQ3_S (for memory efficiency)
- **Critical Component Protection**:
  - Embeddings and output layers use Q5_K
  - Reduces error propagation by 38% vs. standard 1-2 bit quantization
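The allocation policy above boils down to a mapping from layer position to quantization type. The sketch below is purely illustrative of that policy; the real selection logic (the IQ-DynamicGate implementation) lives in the quantization tooling, not in this snippet:

```python
# Illustrative restatement of the precision-allocation policy described above.
# The actual IQ-DynamicGate logic is implemented in the quantization tooling.
def assign_quant_type(layer_idx: int, n_layers: int, is_embedding_or_output: bool = False) -> str:
    if is_embedding_or_output:
        return "Q5_K"          # critical component protection
    pos = layer_idx / n_layers
    if pos < 0.25 or pos >= 0.75:
        return "IQ4_XS"        # first/last 25% of layers keep higher precision
    return "IQ2_XXS"           # middle 50% uses the most aggressive quantization

print([assign_quant_type(i, 64) for i in (0, 16, 32, 48, 63)])
```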
#### Quantization Performance Comparison (Llama-3-8B)
| Quantization | Standard PPL | DynamicGate PPL | Δ PPL | Std Size | DG Size | Δ Size | Std Speed | DG Speed |
|--------------|--------------|------------------|-------|----------|---------|--------|-----------|----------|
| IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
| IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
| IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
| IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
| IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |
### Choosing the Right Model Format
Selecting the right model format depends on your hardware capabilities and memory constraints; a rough selection helper is sketched after the table below.
| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
| **Q4_K** | Medium Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
| **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency where some accuracy loss is acceptable |
| **Q4_0** | Low | Low | ARM or low-memory devices | ARM and low-memory inference, where llama.cpp can apply device-specific optimizations |
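As a rough rule of thumb, the choice can be automated. The helper below is a hypothetical sketch; the memory thresholds are coarse estimates for a 32B model, not measured file sizes:

```python
# Hypothetical helper; thresholds are coarse estimates for a 32B model.
def pick_model_format(mem_gb: float, has_bf16_hw: bool = False) -> str:
    if has_bf16_hw and mem_gb >= 70:
        return "BF16"      # full-precision GGUF, highest fidelity
    if mem_gb >= 40:
        return "Q8_0"      # best accuracy among the quantized files
    if mem_gb >= 30:
        return "Q6_K"
    if mem_gb >= 22:
        return "Q4_K"
    return "IQ3_XS"        # ultra-low-memory fallback

print(pick_model_format(24))   # -> "Q4_K"
```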
### Included Files & Details
- `OpenCodeReasoning-Nemotron-32B-IOI-bf16.gguf`: Model weights preserved in BF16. Use this if you want to requantize the model into a different format.
- `OpenCodeReasoning-Nemotron-32B-IOI-f16.gguf`: Model weights stored in F16. Use if your device supports FP16, especially if BF16 is not available.
- `OpenCodeReasoning-Nemotron-32B-IOI-bf16-q8_0.gguf`: Output & embeddings remain in BF16. All other layers quantized to Q8_0.
- `OpenCodeReasoning-Nemotron-32B-IOI-f16-q8_0.gguf`: Output & embeddings remain in F16. All other layers quantized to Q8_0.
- `OpenCodeReasoning-Nemotron-32B-IOI-q4_k.gguf`: Output & embeddings quantized to Q8_0. All other layers quantized to Q4_K. Good for CPU inference with limited memory.
- `OpenCodeReasoning-Nemotron-32B-IOI-q4_k_s.gguf`: Smallest Q4_K variant, using less memory at the cost of accuracy. Best for very low-memory setups.
- `OpenCodeReasoning-Nemotron-32B-IOI-q6_k.gguf`: Output & embeddings quantized to Q8_0. All other layers quantized to Q6_K.
- `OpenCodeReasoning-Nemotron-32B-IOI-q8_0.gguf`: Fully Q8 quantized model for better accuracy. Requires more memory but offers higher precision.
- `OpenCodeReasoning-Nemotron-32B-IOI-iq3_xs.gguf`: IQ3_XS quantization, optimized for extreme memory efficiency. Best for ultra-low-memory devices.
- `OpenCodeReasoning-Nemotron-32B-IOI-iq3_m.gguf`: IQ3_M quantization, offering a medium block size for better accuracy. Suitable for low-memory devices.
- `OpenCodeReasoning-Nemotron-32B-IOI-q4_0.gguf`: Pure Q4_0 quantization, optimized for ARM devices. Best for ARM-based devices or low-memory environments.
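As an example of using one of the files above, the snippet below downloads a quantized variant and runs it with `llama-cpp-python` (a sketch only: the repository id is a placeholder, and options such as `n_gpu_layers` depend on how `llama-cpp-python` was built):

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama  # pip install huggingface_hub llama-cpp-python

# Placeholder repo id; point this at the repository hosting the GGUF files.
gguf_path = hf_hub_download(
    repo_id="your-namespace/OpenCodeReasoning-Nemotron-32B-IOI-GGUF",
    filename="OpenCodeReasoning-Nemotron-32B-IOI-q4_k.gguf",
)

llm = Llama(model_path=gguf_path, n_ctx=32768, n_gpu_layers=-1)  # -1 offloads all layers if a GPU build is used
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a C++ program that prints the first N Fibonacci numbers."}],
    max_tokens=2048,
)
print(result["choices"][0]["message"]["content"])
```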
### Testing the AI Network Monitor Assistant
You can help test the AI-Powered Network Monitor Assistant with quantum-ready security checks using the following link: [Free Network Monitor](https://readyforquantum.com/dashboard/?assistant=open&utm_source=huggingface&utm_medium=referral&utm_campaign=huggingface_repo_readme)
Choose an AI assistant type:
- `TurboLLM` (GPT-4o-mini)
- `HugLLM` (Hugging Face open-source)
- `TestLLM` (Experimental CPU-only)
Example commands to test:
1. `"Give me info on my websites SSL certificate"`
2. `"Check if my server is using quantum safe encyption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (what ever you want)"`
## 🔧 Technical Details
### Model Architecture
- Architecture Type: Dense decoder-only Transformer model
- Network Architecture: Qwen2.5-32B-Instruct

This model was developed based on Qwen2.5-32B-Instruct and has 32B model parameters.
### Input and Output
- **Input**:
- Input Type(s): Text
- Input Format(s): String
- Input Parameters: One-Dimensional (1D)
- Other Properties Related to Input: Context length up to 32,768 tokens
- **Output**:
- Output Type(s): Text
- Output Format: String
- Output Parameters: One-Dimensional (1D)
- Other Properties Related to Output: Context length up to 32,768 tokens
### Software Integration
- Runtime Engine: NeMo 2.3.0
- Recommended Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Hopper
- Preferred/Supported Operating System(s): Linux
### Model Version(s)
1.0 (4/25/2025)

- OpenCodeReasoning-Nemotron-7B
- OpenCodeReasoning-Nemotron-14B
- OpenCodeReasoning-Nemotron-32B
- OpenCodeReasoning-Nemotron-32B-IOI
### Training and Evaluation Datasets
- **Training Dataset**: The training corpus for OpenCodeReasoning-Nemotron-32B is the [OpenCodeReasoning](https://huggingface.co/datasets/nvidia/OpenCodeReasoning) dataset, which is composed of competitive programming questions and DeepSeek-R1-generated responses.
- **Evaluation Dataset**: The datasets listed in the next section were used to evaluate OpenCodeReasoning-Nemotron-32B.
### Inference
- **Engine**: vLLM (a minimal usage sketch follows below)
- **Test Hardware**: NVIDIA H100-80GB
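Since vLLM is the listed inference engine, a minimal offline-inference sketch looks like this (parameter values such as `tensor_parallel_size` and the sampling settings are illustrative and should be tuned to your hardware):

```python
from vllm import LLM, SamplingParams

# Illustrative vLLM setup; tune tensor_parallel_size and max_model_len to your GPUs.
llm = LLM(
    model="nvidia/OpenCodeReasoning-Nemotron-32B-IOI",
    max_model_len=32768,
    tensor_parallel_size=2,
)
sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)
outputs = llm.generate(
    ["Write a C++ program that prints the first N Fibonacci numbers."],
    sampling,
)
print(outputs[0].outputs[0].text)
```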
## 📄 License
The use of this model is governed by [Apache 2.0](https://huggingface.co/nvidia/OpenCode-Nemotron-2-7B/blob/main/LICENSE).
## Citation
If you find the data useful, please cite:
```bibtex
@article{ahmad2025opencodereasoning,
  title={OpenCodeReasoning: Advancing Data Distillation for Competitive Coding},
  author={Wasi Uddin Ahmad and Sean Narenthiran and Somshubra Majumdar and Aleksander Ficek and Siddhartha Jain and Jocelyn Huang and Vahid Noroozi and Boris Ginsburg},
  year={2025},
  eprint={2504.01943},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.01943},
}
```

