🚀 DeepHermes 3 - Llama-3 3B Preview
DeepHermes 3 - Llama-3 3B Preview is a cutting-edge LLM from Nous Research. It unifies reasoning and normal response modes in a single model, and improves annotation, judgement, and function calling.
✨ Features
Model Description
DeepHermes 3 Preview is the latest version of the Hermes series of LLMs by Nous Research. It's one of the first models globally to integrate Reasoning (long chains of thought for better answer accuracy) and normal LLM response modes. It also enhances LLM annotation, judgement, and function calling.
It's a hybrid reasoning model, combining both "intuitive", traditional mode responses and long chain of thought reasoning responses, switchable via a system prompt.
Hermes 3, the predecessor of DeepHermes 3, is a generalist language model with numerous improvements over Hermes 2, such as advanced agentic capabilities, better role-playing, reasoning, multi-turn conversation, long-context coherence, and overall enhancements.
The Hermes series of models aims to align LLMs with users, providing powerful steering capabilities and control to end - users.
This is a preview Hermes model with early reasoning capabilities, distilled from R1 across a variety of tasks that benefit from reasoning and objectivity. Some quirks may surface, so please share any interesting findings or issues you discover!
Benchmarks
- Reasoning benchmarks: with Reasoning ON and OFF.
- Comparison to Llama-3.2-3B-Instruct.
Prompt Format
DeepHermes 3 uses the Llama-Chat format as the prompt format, enabling a more unified and structured system for multi-turn chat dialogues with the LLM. System prompts offer steerability and new ways to interact with the LLM, guiding the model's rules, roles, and stylistic choices.
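For illustration, here is a sketch of a short dialogue rendered in the Llama-3 chat format (the exact template ships with the model's tokenizer and is applied automatically by apply_chat_template in the examples below):

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are Hermes, an AI assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello, who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>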
Deep Thinking Mode
DeepHermes 3 Preview can activate long chain-of-thought reasoning with the following system prompt:
You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import flash_attn  # ensures flash-attn is installed for the flash_attention_2 backend

tokenizer = AutoTokenizer.from_pretrained("NousResearch/DeepHermes-3-Llama-3-3B-Preview")
model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/DeepHermes-3-Llama-3-3B-Preview",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# The deep-thinking system prompt switches the model into long chain-of-thought mode.
messages = [
    {
        "role": "system",
        "content": "You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem."
    },
    {
        "role": "user",
        "content": "What is y if y=2*2-4+(3*2)"
    }
]

input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors='pt').to("cuda")
generated_ids = model.generate(input_ids, max_new_tokens=2500, temperature=0.8, repetition_penalty=1.1, do_sample=True, eos_token_id=tokenizer.eos_token_id)
print(f"Generated Tokens: {generated_ids.shape[-1]}")
response = tokenizer.decode(generated_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(f"Response: {response}")
Note: For difficult problems, DeepHermes can think using up to 13,000 tokens, so you may need to set max_new_tokens much larger than 2500.
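Since the reasoning trace is emitted inside <think> tags, you may want to strip it and keep only the final answer. A minimal sketch, assuming response holds the decoded output from the example above:

import re

# Drop the <think>...</think> block (if present), leaving only the final answer.
final_answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
print(final_answer)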
Advanced Usage - Standard "Intuitive" Response Mode
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import flash_attn  # ensures flash-attn is installed for the flash_attention_2 backend

tokenizer = AutoTokenizer.from_pretrained("NousResearch/DeepHermes-3-Llama-3-3B-Preview")
model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/DeepHermes-3-Llama-3-3B-Preview",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# A plain assistant system prompt keeps the model in standard "intuitive" response mode.
messages = [
    {
        "role": "system",
        "content": "You are Hermes, an AI assistant"
    },
    {
        "role": "user",
        "content": "What are the most interesting things to do in Paris?"
    }
]

input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors='pt').to("cuda")
generated_ids = model.generate(input_ids, max_new_tokens=2500, temperature=0.8, repetition_penalty=1.1, do_sample=True, eos_token_id=tokenizer.eos_token_id)
print(f"Generated Tokens: {generated_ids.shape[-1]}")
response = tokenizer.decode(generated_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(f"Response: {response}")
vLLM Inference
After running pip install vllm, you can serve the model with vLLM from your terminal:
vllm serve NousResearch/DeepHermes-3-Llama-3-3B-Preview
You can then call the model over an OpenAI-compatible API using the OpenAI client library, just like calling OpenAI's API.
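A minimal sketch of such a call, assuming vLLM's default endpoint at http://localhost:8000/v1 and the openai Python package (the api_key value is a placeholder; vLLM does not require one by default):

from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="NousResearch/DeepHermes-3-Llama-3-3B-Preview",
    messages=[
        {"role": "system", "content": "You are Hermes, an AI assistant"},
        {"role": "user", "content": "What are the most interesting things to do in Paris?"},
    ],
    max_tokens=1024,
)
print(completion.choices[0].message.content)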
Prompt Format for Function Calling
The model was trained on specific system prompts and structures for function calling. Use the system role with the message below, followed by the function signatures as JSON.
<|start_header_id|>system<|end_header_id|>
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools: <tools> {"type": "function", "function": {"name": "get_stock_fundamentals", "description": "get_stock_fundamentals(symbol: str) -> dict - Get fundamental data for a given stock symbol using yfinance API.\n\n Args:\n symbol (str): The stock symbol.\n\n Returns:\n dict: A dictionary containing fundamental data.\n Keys:\n - 'symbol': The stock symbol.\n - 'company_name': The long name of the company.\n - 'sector': The sector to which the company belongs.\n - 'industry': The industry to which the company belongs.\n - 'market_cap': The market capitalization of the company.\n - 'pe_ratio': The forward price-to-earnings ratio.\n - 'pb_ratio': The price-to-book ratio.\n - 'dividend_yield': The dividend yield.\n - 'eps': The trailing earnings per share.\n - 'beta': The beta value of the stock.\n - '52_week_high': The 52-week high price of the stock.\n - '52_week_low': The 52-week low price of the stock.", "parameters": {"type": "object", "properties": {"symbol": {"type": "string"}}, "required": ["symbol"]}}} </tools> Use the following pydantic model json schema for each tool call you will make: {"properties": {"arguments": {"title": "Arguments", "type": "object"}, "name": {"title": "Name", "type": "string"}}, "required": ["arguments", "name"], "title": "FunctionCall", "type": "object"} For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{"arguments": <args-dict>, "name": <function-name>}
</tool_call><|eot_id|><|start_header_id|>user<|end_header_id|>
To complete a function call, append a user prompt after the system prompt. Once the model generates a tool call, parse it, call the API, and pass the returned values back to the model in a new message with the role tool.
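A hedged sketch of that round trip (the regex parser and the get_stock_fundamentals call are illustrative; the reference implementation lives in the repository linked under "Inference Code for Function Calling" below):

import json
import re

def extract_tool_calls(text):
    # Pull each JSON object out of <tool_call>...</tool_call> tags in the model output.
    matches = re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    return [json.loads(m) for m in matches]

# After the model emits a tool call:
# for call in extract_tool_calls(model_output):
#     result = get_stock_fundamentals(**call["arguments"])  # your actual API call
#     messages.append({"role": "tool", "content": json.dumps(result)})
# Then generate again so the model can compose its answer from the tool results.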
Prompt Format for JSON Mode / Structured Outputs
The model was trained on a specific system prompt for Structured Outputs, instructing it to respond with only a JSON object conforming to a given JSON schema.
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant that answers in JSON. Here's the json schema you must adhere to:\n<schema>\n{schema}\n</schema><|eot_id|>
Replace {schema} with the JSON schema you want the model to follow, then send a normal user prompt; the model will respond with JSON only.
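A minimal sketch of building that system prompt from a pydantic model (the Character class here is an illustrative assumption, not part of the model card):

import json
from pydantic import BaseModel

class Character(BaseModel):
    name: str
    species: str
    age: int

# Serialize the pydantic schema and splice it into the JSON-mode system prompt.
schema = json.dumps(Character.model_json_schema())
system_prompt = (
    "You are a helpful assistant that answers in JSON. "
    f"Here's the json schema you must adhere to:\n<schema>\n{schema}\n</schema>"
)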
Inference Code for Function Calling
All code for utilizing, parsing, and building function calling templates is available on our GitHub: https://github.com/NousResearch/Hermes-Function-Calling
Quantized Versions
GGUF Quants: https://huggingface.co/NousResearch/DeepHermes-3-Llama-3-3B-Preview-GGUF
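A hedged sketch of running a GGUF quant locally with llama-cpp-python (the filename pattern is an assumption; check the quant repo's file list for the exact names):

from llama_cpp import Llama

# Download and load a quantized GGUF file directly from the Hugging Face repo.
llm = Llama.from_pretrained(
    repo_id="NousResearch/DeepHermes-3-Llama-3-3B-Preview-GGUF",
    filename="*Q4_K_M.gguf",  # assumed quant level; pick one that exists in the repo
    n_ctx=8192,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])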
How to cite
@misc{deephermes3preview,
  title={DeepHermes 3 Preview},
  author={Teknium and Roger Jin and Chen Guang and Jai Suphavadeeprasit and Jeffrey Quesnelle},
  year={2025}
}
📄 License
This model is released under the Llama 3 license.
📋 Information Table
| Property | Details |
|---|---|
| Base Model | meta-llama/Llama-3.2-3B |
| Library Name | transformers |
| Tags | Llama-3, instruct, finetune, chatml, gpt4, synthetic data, distillation, function calling, json mode, axolotl, roleplaying, chat, reasoning, r1, vllm |
| Model Name | DeepHermes-3-Llama-3-3B-Preview |

