Nemotron-Mini-4B-Instruct Open-Source Model - Optimize Role-Playing, Retrieval, and Function Call Response Generation

Nemotron Mini 4B Instruct

Developed by NVIDIA

Nemotron-Mini-4B-Instruct is a response generation model developed by NVIDIA, optimized for role-playing, retrieval-augmented generation, and function calling. It is fine-tuned based on Minitron-4B-Base and supports a context length of 4096 tokens.

Large Language Model

Transformers

Language: English
Open Source License: Other
Tags: #Role-playing Optimization #On-device Deployment #Tool Calling Integration

Release Time: 9/10/2024

Model Overview

Nemotron-Mini-4B-Instruct is a compact language model optimized through distillation, pruning, and quantization for speed and on-device deployment. It is tuned specifically for English role-playing, RAG Q&A, and function calling.

Model Features

Efficient Deployment

Optimized through distillation, pruning, and quantization techniques, suitable for on-device deployment.

Multi-functional Optimization

Specifically optimized for role-playing, RAG Q&A, and function calling scenarios.

Long Context Support

Supports a context length of 4096 tokens.

Business-friendly

Available for commercial use under the NVIDIA Community Model License.

Model Capabilities

Role-playing Dialogue

Retrieval-Augmented Generation

Function Calling

English Text Generation

Use Cases

Game Development

Game Character AI

Integrated into video games to provide intelligent dialogue for NPCs.

Refer to the NVIDIA ACE demo video.

Smart Assistants

Personalized Chatbot

Create dialogue assistants with specific role styles.

Supports a variety of personas, such as a pirate speaking style.

Enterprise Applications

RAG Q&A System

Build a Q&A system based on retrieval-augmented generation.
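As a rough illustration of the RAG Q&A use case, the sketch below retrieves passages with a naive word-overlap score and places them in the system message of a chat `messages` list. The function names `retrieve` and `build_rag_messages`, the scoring method, and the system-message wording are all assumptions made for illustration; they do not come from the model card, and a real system would likely use an embedding-based retriever.

```python
from typing import Dict, List


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents sharing the most lowercase words with the query.

    Hypothetical helper: word overlap stands in for a real retriever.
    """
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_rag_messages(query: str, documents: List[str], k: int = 2) -> List[Dict[str, str]]:
    """Build a chat `messages` list whose system turn carries the retrieved passages."""
    passages = retrieve(query, documents, k)
    context = "\n".join(f"- {passage}" for passage in passages)
    return [
        # The instruction wording below is an assumption, not a documented prompt.
        {"role": "system", "content": "Answer using only the context below.\n" + context},
        {"role": "user", "content": query},
    ]


docs = [
    "The context length is 4096 tokens.",
    "Pirates sail ships.",
    "The license is the NVIDIA Community Model License.",
]
rag_messages = build_rag_messages("What is the context length?", docs, k=1)
print(rag_messages[0]["content"])
```

The resulting list has the same shape as the `messages` argument used in the usage examples further down, so it could be passed to the tokenizer's chat template in the same way.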

🚀 Nemotron-Mini-4B-Instruct

Nemotron-Mini-4B-Instruct is a model designed for generating responses in roleplaying, retrieval-augmented generation, and function calling scenarios. It is a small language model optimized for speed and on-device deployment, and it is ready for commercial use.

🚀 Quick Start

You can try this model on build.nvidia.com. For more details about its use in NVIDIA ACE, refer to this blog post and this demo video. You can download the model checkpoint for NVIDIA AI Inference Manager (AIM) SDK from here.

✨ Features

Optimized for Multiple Scenarios: Ideal for roleplaying, retrieval-augmented generation, and function calling.
Small and Fast: A small language model (SLM) optimized through distillation, pruning, and quantization for speed and on-device deployment.
Fine-Tuned: A fine-tuned version of nvidia/Minitron-4B-Base, which was pruned and distilled from Nemotron-4 15B using NVIDIA's LLM compression technique.
English-Optimized: Optimized for English in roleplay, RAG QA, and function calling.
Long Context Support: Supports a context length of 4,096 tokens.
Commercial Use Ready: The model is ready for commercial use.

📦 Installation

No specific installation steps are provided in the original README. However, you can use the following Python code to load the model:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

# Use the prompt template
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
# Render the messages with the model's chat template and tokenize them
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

# Generate up to 128 new tokens, then decode the full sequence (prompt plus reply)
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
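Because the model supports a context length of 4,096 tokens, the prompt and the generated tokens together should stay within that limit. The helper below is a minimal sketch of that arithmetic; the name `max_new_tokens_allowed` is hypothetical, and the only figure taken from this card is the 4,096-token context length.

```python
CONTEXT_LENGTH = 4096  # context length stated in this model card


def max_new_tokens_allowed(
    prompt_tokens: int,
    requested: int,
    context_length: int = CONTEXT_LENGTH,
) -> int:
    """Clamp a requested generation length so prompt + output fits the context.

    Hypothetical helper: raises if the prompt alone already fills the context.
    """
    if prompt_tokens >= context_length:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, "
            f"which leaves no room in a {context_length}-token context"
        )
    return min(requested, context_length - prompt_tokens)


# A short prompt leaves the requested budget untouched.
print(max_new_tokens_allowed(prompt_tokens=100, requested=128))
# A long prompt is clamped to the remaining space.
print(max_new_tokens_allowed(prompt_tokens=4000, requested=128))
```

In the Basic Usage example, the prompt length would be `tokenized_chat.shape[-1]`, assuming `apply_chat_template` returned a tensor as it does in that snippet, and the result could be passed as `max_new_tokens` to `model.generate`.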

Advanced Usage

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="nvidia/Nemotron-Mini-4B-Instruct")
pipe.tokenizer = tokenizer  # You need to assign tokenizer manually
print(pipe(messages))
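The pipeline call returns a nested structure rather than a plain string. The sketch below pulls out the final assistant reply, assuming the output shape used by recent `transformers` chat pipelines (a list with one dict whose `generated_text` is either a string or the message list with the assistant turn appended). That shape is an assumption about the installed library version, and `last_assistant_reply` is a hypothetical helper name.

```python
from typing import Any, List, Optional


def last_assistant_reply(pipeline_output: List[dict]) -> Optional[str]:
    """Extract the newest assistant message from a text-generation pipeline result.

    Assumes pipeline_output[0]["generated_text"] is either a plain string
    or a list of {"role": ..., "content": ...} messages.
    """
    generated: Any = pipeline_output[0]["generated_text"]
    if isinstance(generated, str):
        return generated
    for message in reversed(generated):
        if message.get("role") == "assistant":
            return message["content"]
    return None


# Stand-in for a real pipeline result, so the helper can be tried offline.
mock_output = [
    {
        "generated_text": [
            {"role": "user", "content": "Who are you?"},
            {"role": "assistant", "content": "PLACEHOLDER_REPLY"},
        ]
    }
]
print(last_assistant_reply(mock_output))
```

With a real pipeline, the call would be `last_assistant_reply(pipe(messages))`.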

📚 Documentation

Model Overview

Nemotron-Mini-4B-Instruct is a fine-tuned model, trained between February 2024 and August 2024. It is a small language model optimized for speed and on-device deployment, and it is ready for commercial use.

Prompt Format

We recommend using the following prompt template, which was used to fine-tune the model. The model may not perform optimally without it.

Single Turn

<extra_id_0>System
{system prompt}

<extra_id_1>User
{prompt}
<extra_id_1>Assistant\n
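For cases where the prompt is assembled by hand instead of through `tokenizer.apply_chat_template`, the single-turn template can be rendered with a small function. This is a sketch: `build_single_turn_prompt` is a hypothetical name, and the exact whitespace is read off the template as printed above, so it should be checked against the tokenizer's own chat template before relying on it.

```python
def build_single_turn_prompt(system_prompt: str, user_prompt: str) -> str:
    """Render the single-turn template as one string.

    Line breaks follow the template as printed in this card: a blank line
    after the system prompt and a trailing newline after "Assistant".
    """
    return (
        "<extra_id_0>System\n"
        f"{system_prompt}\n"
        "\n"
        "<extra_id_1>User\n"
        f"{user_prompt}\n"
        "<extra_id_1>Assistant\n"
    )


prompt = build_single_turn_prompt(
    "You are a friendly chatbot who always responds in the style of a pirate",
    "Who are you?",
)
print(prompt)
```

The string ends right after the `Assistant` marker and its newline, which is where the model is expected to start writing its reply.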

Tool use

<extra_id_0>System
{system prompt}

<tool> ... </tool>
<context> ... </context>

<extra_id_1>User
{prompt}
<extra_id_1>Assistant
<toolcall> ... </toolcall>
<extra_id_1>Tool
{tool response}
<extra_id_1>Assistant\n
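The tool-use template implies a two-step loop: prompt the model, read any `<toolcall>` span it emits, run the tool, then append the tool response and prompt again. The sketch below shows only the string handling for that loop. The template elides what goes inside `<tool>`, `<context>`, and `<toolcall>`, so those contents are treated here as opaque strings; the helper names and the placeholder values are hypothetical, and the spacing inside the tags is an assumption.

```python
import re
from typing import List

# Non-greedy match so several <toolcall> spans in one reply are kept apart.
TOOLCALL_RE = re.compile(r"<toolcall>(.*?)</toolcall>", re.DOTALL)


def build_tool_prompt(system_prompt: str, tool_block: str,
                      context_block: str, user_prompt: str) -> str:
    """Render the first half of the tool-use template, up to the Assistant turn."""
    return (
        "<extra_id_0>System\n"
        f"{system_prompt}\n"
        "\n"
        f"<tool> {tool_block} </tool>\n"
        f"<context> {context_block} </context>\n"
        "\n"
        "<extra_id_1>User\n"
        f"{user_prompt}\n"
        "<extra_id_1>Assistant\n"
    )


def extract_toolcalls(generated: str) -> List[str]:
    """Return the raw text of every <toolcall> span in the model output."""
    return [span.strip() for span in TOOLCALL_RE.findall(generated)]


def append_tool_response(prompt: str, assistant_turn: str, tool_response: str) -> str:
    """Extend the prompt with the assistant's tool call and the tool's reply."""
    return (
        f"{prompt}{assistant_turn.strip()}\n"
        "<extra_id_1>Tool\n"
        f"{tool_response}\n"
        "<extra_id_1>Assistant\n"
    )


first_prompt = build_tool_prompt("SYSTEM_PROMPT", "TOOL_SPEC", "CONTEXT", "USER_PROMPT")
assistant_turn = "<toolcall> CALL_PAYLOAD </toolcall>"  # stand-in for model output
calls = extract_toolcalls(assistant_turn)
second_prompt = append_tool_response(first_prompt, assistant_turn, "TOOL_RESULT")
print(calls)
print(second_prompt)
```

How the tool specification and the call payload are serialized is not stated in this card, so parsing `calls[0]` into a function name and arguments is left out on purpose.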

Model Architecture

Property	Details
Model Type	Transformer Decoder (auto-regressive language model)
Network Architecture	Nemotron-4
Embedding Size	3072
Attention Heads	32
MLP Intermediate Dimension	9216
Attention Mechanism	Grouped-Query Attention (GQA)
Position Embeddings	Rotary Position Embeddings (RoPE)

AI Safety Efforts

The Nemotron-Mini-4B-Instruct model underwent AI safety evaluation including adversarial testing via three distinct methods:

Garak, an automated LLM vulnerability scanner that probes for common weaknesses, including prompt injection and data leakage.
AEGIS, a content safety evaluation dataset and LLM-based content safety classifier model that adheres to a broad taxonomy of 13 categories of critical risks in human-LLM interactions.
Human Content Red Teaming, which relies on human interaction with the model and human evaluation of its responses.

Limitations

The model was trained on data originally crawled from the internet that contains toxic language and societal biases. As a result, the model may amplify those biases and return toxic responses, especially when given toxic prompts. It may also generate inaccurate answers, omit key information, include irrelevant or redundant text, or produce socially unacceptable or undesirable text, even if the prompt itself contains nothing explicitly offensive. These issues could be exacerbated if the recommended prompt template is not used.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and has established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++. Please report security vulnerabilities or NVIDIA AI Concerns here.

📄 License

NVIDIA Community Model License
