🚀 Apriel-Nemotron-15b-Thinker GGUF Models
The Apriel-Nemotron-15b-Thinker GGUF models offer advanced text generation capabilities. Built on a well-trained architecture, they handle a variety of reasoning tasks efficiently and provide high-quality outputs for different applications.
📚 Documentation
✨ Features
- High Performance: Achieves competitive results against similarly sized state-of-the-art models while having a smaller memory footprint.
- Versatile Training: Trained through a three-stage pipeline (CPT, SFT, and GRPO) to enhance reasoning and instruction-following capabilities.
- Quantization Experiment: Uses a new quantization approach to improve precision at lower bit depths.
📦 Installation
To use this model, install the transformers library:
pip install transformers
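The usage examples below load the model with device_map="auto", which typically requires the accelerate package alongside PyTorch. Installing them as shown here is an assumption about a common setup, not a requirement stated by this card:
pip install torch accelerate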
💻 Usage Examples
Basic Usage
Here is a code snippet demonstrating the model's usage with the transformers library's generate function:
import re
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "ServiceNow-AI/Apriel-Nemotron-15b-Thinker"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
# prepare the model input
prompt = "Positive real numbers $x$ and $y$ satisfy $y^3=x^2$ and $(y - x)^2=4y^2$. What is $x + y$?\nMark your solution with \\boxed"
messages = [
{"role": "user", "content": prompt}
]
tools = []
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
tools=tools
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=65536
)
output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
# parsing the response
response = re.findall(r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", output, re.DOTALL)[0].strip()
print("output:", output)
print("response:", response)
Chat Template
from transformers import AutoTokenizer
model_name = "ServiceNow-AI/Apriel-Nemotron-15b-Thinker"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# prepare the model input
custom_system_prompt = "Answer like a pirate."
prompt = "You are an expert assistant in the implementation of customer experience management aspect of retail applications \n \nYou will be using Python as the programming language. \n \nYou will utilize a factory design pattern for the implementation and following the dependency inversion principle \n \nYou will modify the implementation based on user requirements. \n \nUpon user request, you will add, update, and remove the features & enhancements in the implementation provided by you. \n \nYou will ask whether the user wants to refactor the provided code or needs a sample implementation for reference. Upon user confirmation, I will proceed accordingly. \n \n**Guidelines:** \n 1. **User Requirements:** \n - You have to ask users about their requirements, clarify the user expectations, and suggest the best possible solution by providing examples of Python code snippets. \n - Ask users about which type of reports they need to assess the AI model's performance, accuracy, and reliability. \n - After providing the solution, you have to ask the user about the trial of the solution and modify the solution based on the user feedback. \n \n 2. **Libraries/Frameworks:** \n - You will be utilizing Python as a programming language. \n - You will be using Flask framework for REST APIS implementation \n \n 3. **Communication Gesture:** \n - Your conversation with the user should be interactive, supportive, courageous, and professional. \n - You have to break down the complex concepts into sub - concepts and try to explain them to the user. \n - You have to ask the user for the required parameters. If the user refuses to provide in 2 attempts, politely exit the conversation. \n - You have to provide your supported parameters to the user, if the user refuses to accept them then you have to put an apology note and exit the conversation. \n - You have to track the conversation about unasked questions by the user. If some/one of the questions remain then you have to remind the user about these questions and proceed to answer them based on the user's confirmation \n \n 4. **Implementation:** \n - Your code/implementations should be reliable, scaleable, modular, and reusable. \n - You will be providing unit tests for the implementation upon user request. \n - You will be following MVC architecture for the applications \n - Your implementations must be well - commented and readable \n \n \n- Today's date is 23rd August 2024. \n- The default sender email is sender - assistant@email.com.\nHi, I am conducting research on retail customer feedback systems and I need assistance with designing and implementing them. Could you kindly provide me with a list of general customer feedback system modules?"
messages = [
{"role": "user", "content": custom_system_prompt + "\n\n" + prompt}
]
# example tools
tools = [{"type": "function", "function": {"name": "getRetailFeedbackModules", "description": "Returns the list of modules usually present in the retail industry", "parameters": {"type": "object", "properties": {"page": {"type": "integer", "description": "The current page number.", "default": 1}, "page_size": {"type": "integer", "description": "The number of items per page.", "default": 3}}}}}, {"type": "function", "function": {"name": "verifyImplementation", "description": "Returns the list of modules usually present in the retail industry", "parameters": {"type": "object", "properties": {"coding_language": {"type": "string", "description": "The supported languages for verification of implementation.", "default": "python", "enum": ["python", "java", "php"]}, "code": {"type": "string", "description": "The code which needs verification"}, "design_pattern": {"type": "string", "description": "The design pattern to verify in the implementation", "enum": ["factory", "strategy", "singleton"]}, "verify_best_practices": {"type": "boolean", "description": "The verification of the coding style based on the language selected", "default": true}}}}}]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
tools=tools
)
model_inputs = tokenizer([text], return_tensors="pt")
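The snippet above only builds the tokenized prompt. To run the tool-calling example end to end you would load the model and generate exactly as in the basic usage example; the continuation below is a sketch under that assumption, and the max_new_tokens value is an illustrative choice rather than a recommendation from this card:

import re
from transformers import AutoModelForCausalLM

# load the model as in the basic usage example
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
model_inputs = model_inputs.to(model.device)

# generate a completion for the tool-calling prompt
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096
)
output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

# extract the final response, which may include a call to one of the declared tools
matches = re.findall(r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", output, re.DOTALL)
print(matches[0].strip() if matches else output)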
Usage Guidelines
- Use the model's default chat template, which already includes a system prompt. We recommend adding all other instructions within the user message.
- We recommend setting temperature to 0.6 (a configuration sketch follows this list).
- We ensure the model starts with Here are my reasoning steps:\n during all our evaluations. This is implemented in the default chat template.
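As a hedged illustration of the temperature recommendation, the sampling settings can be collected in a transformers GenerationConfig and reused across calls. The max_new_tokens value is carried over from the basic usage example; the rest is an assumption about a reasonable setup, not an official configuration:

from transformers import GenerationConfig

# sampling settings following the usage guidelines above
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.6,
    max_new_tokens=65536,
)

# reuse the model and model_inputs from the examples above
generated_ids = model.generate(**model_inputs, generation_config=generation_config)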
🔧 Technical Details
Model Generation Details
This model was generated using llama.cpp at commit 1f63e75f.
Quantization Beyond the IMatrix
The author has been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. Standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, the --tensor-type option in llama.cpp is used to manually "bump" important layers to higher precision. You can see the implementation here: Layer bumping with llama.cpp. While this does increase model file size, it significantly improves precision for a given quantization level.
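For illustration only, a llama.cpp quantization command that bumps selected tensors to higher precision might look like the sketch below; the file names, the chosen tensor patterns, and the --tensor-type pattern=type syntax are assumptions about a typical workflow, not the author's published recipe:
llama-quantize --imatrix imatrix.dat --tensor-type attn_v=q8_0 --tensor-type ffn_down=q6_k Apriel-Nemotron-15b-Thinker-f16.gguf Apriel-Nemotron-15b-Thinker-q4_k_m.gguf q4_k_m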
Training Details
- Mid training / Continual Pre-training: The model is trained on 100+ billion tokens of carefully curated examples drawn from mathematical reasoning, coding challenges, scientific discourse, and logical puzzles to strengthen foundational reasoning capabilities.
- Supervised Fine-Tuning (SFT): The model is fine-tuned on 200,000 high-quality demonstrations covering mathematical and scientific problem-solving, coding tasks, generic instruction-following scenarios, API/function invocation use cases, and more.
- Reinforcement Learning: GRPO (with some minor modifications to the objective) is applied to address weaknesses in instruction following and coding tasks. It yields significant improvements on benchmarks such as IFEval, MultiChallenge, Enterprise RAG, MBPP, and BFCL, while preserving scores on competition-level math exams like AIME and AMC. GRPO also yields modest gains on GPQA and MixEval. Throughout training, intermediate snapshots from both the SFT and GRPO stages are periodically merged, improving generalization and mitigating catastrophic forgetting.
Evaluation
Evaluations were conducted using [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness) and evalchemy.
Benchmarks that are indicative of enterprise capability

Academic reasoning benchmarks

Token efficiency comparison (lower is better)

Intended Use
The Apriel family of models is designed for a variety of general-purpose instruction tasks, including code assistance and generation, logical reasoning and multi-step tasks, question answering and information retrieval, function calling, complex instruction following, and agent use cases. These models are not intended for use in safety-critical applications without human oversight or in scenarios requiring guaranteed factual accuracy.
Limitations
- Factual accuracy: May produce incorrect, misleading, or outdated content. Outputs should be verified before use in critical contexts.
- Bias: May reflect societal, cultural, or systemic biases present in training data.
- Ethics: Do not use the model to produce harmful, unlawful, or unethical content.
- Language: Strongest performance is in English. Output quality may degrade in underrepresented languages.
- Critical use: Not suitable for medical, legal, financial, or other high-risk applications without safeguards.
Security and Responsible Use
Security Responsibilities: Deployers and users are strongly encouraged to align their security practices with established frameworks and regulatory guidelines such as the EU AI Act and the NIST AI Risk Management Framework (RMF).
Guidelines for Deployers:
- Regularly conduct robustness assessments to identify and mitigate adversarial inputs.
- Implement validation and filtering processes to prevent harmful or biased outputs.
- Continuously perform data privacy checks to guard against unintended data leaks.
- Document and communicate the model's limitations, intended usage, and known security risks to all end users.
- Schedule periodic security reviews and updates to address emerging threats and vulnerabilities.
Guidelines for Users:
- Follow established security policies and usage guidelines provided by deployers.
- Protect and manage sensitive information when interacting with the model.
- Report anomalies, suspicious behavior, or unsafe outputs to deployers or developers.
- Maintain human oversight and apply judgment to mitigate potential security or ethical risks during interactions.
Disclaimer: Users accept responsibility for securely deploying, managing, and using this open-source LLM. The model is provided "as-is," without explicit or implied warranty regarding security or fitness for any specific application or environment.
Software
- Training stack: [Fast-LLM](https://github.com/ServiceNow/Fast-LLM)
📄 License
MIT
Acknowledgments
We thank the researchers at Nvidia for sharing detailed insights and data from their work in building reasoners! This greatly accelerated our research, and we recognize it in our model naming convention.
Citation
@misc{Apriel-nemotron-15b-thinker,
  author = {Slam labs team},
  title = {Apriel Nemotron 15b Thinker},
  howpublished = {https://huggingface.co/ServiceNow-AI/Apriel-Nemotron-15b-Thinker},
  publisher = {SLAM - ServiceNow Language Models Lab},
  year = {2025}
}
Test the AI Network Monitoring
The author is pushing the limits of small open-source models for AI network monitoring. You can help test the models with the following details:
What to Test
- Function calling against live network services
- How small can a model go while still handling automated Nmap security scans, quantum-readiness checks, and network monitoring tasks?
Available Assistants
- TestLLM: Current experimental model (llama.cpp on 2 CPU threads in a Hugging Face Docker space). It offers zero-configuration setup and 30s+ load times (slow inference, but no API costs), with no token limit since the cost is low. Help is wanted with edge-device AI collaboration.
- TurboLLM: Uses gpt-4.1-mini. It performs well but has token usage limits because of OpenAI's per-token charges. It can create custom cmd processors to run .net code on Quantum Network Monitor Agents, perform real-time network diagnostics and monitoring, run security audits, and carry out penetration testing (Nmap/Metasploit).
- HugLLM: Runs on the Hugging Face Inference API and performs well using the latest models hosted on Novita.
Example commands you could test
"Give me info on my websites SSL certificate"
"Check if my server is using quantum safe encyption for communication"
"Run a comprehensive security audit on my server"
"Create a cmd processor to .. (what ever you want)"
Note you need to install a Quantum Network Monitor Agent to run the .net code from. This is a very flexible and powerful feature. Use with caution!
Final Word
The author funds the servers used to create these model files, runs the Quantum Network Monitor service, and pays for inference from Novita and OpenAI out of their own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. If you appreciate the work, please consider buying the author a coffee. The author is also open to job opportunities or sponsorship.

