Hymba-1.5B-Instruct
Hymba-1.5B-Instruct is a 1.5B parameter model fine-tuned from Hymba-1.5B-Base on a combination of open-source instruction datasets and internally collected synthetic datasets. It can handle complex tasks such as math reasoning, function calling, and role-playing, and is ready for commercial use.
Quick Start
Environment Setup
Since Hymba-1.5B-Instruct uses FlexAttention, which requires PyTorch 2.5 and other related dependencies, there are two ways to set up the environment:
- [Local install] Install related packages using the provided setup.sh (supports CUDA 12.1/12.4):
wget --header="Authorization: Bearer YOUR_HF_TOKEN" https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/setup.sh
bash setup.sh
- [Docker] A Docker image with all Hymba's dependencies installed is available. Download the Docker image and start a container with the following commands:
docker pull ghcr.io/tilmto/hymba:v1
docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
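Either way, it is worth verifying that the environment actually provides the FlexAttention kernels introduced in PyTorch 2.5. A quick sanity check (this snippet is an illustrative addition, not part of the official setup):
import torch
# FlexAttention ships with PyTorch >= 2.5; this import fails on older versions
from torch.nn.attention.flex_attention import flex_attention  # noqa: F401

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")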
Chat with Hymba-1.5B-Instruct
After setting up the environment, use the following script to chat with the model:
from transformers import AutoModelForCausalLM, AutoTokenizer, StopStringCriteria, StoppingCriteriaList
import torch
# Load the tokenizer and model
repo_name = "nvidia/Hymba-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)
# Chat with Hymba
prompt = input()
messages = [
    {"role": "system", "content": "You are a helpful assistant."}
]
messages.append({"role": "user", "content": prompt})
# Apply chat template
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to('cuda')
stopping_criteria = StoppingCriteriaList([StopStringCriteria(tokenizer=tokenizer, stop_strings="</s>")])
outputs = model.generate(
    tokenized_chat,
    max_new_tokens=256,
    do_sample=False,  # greedy decoding; the temperature below has no effect unless do_sample=True
    temperature=0.7,
    use_cache=True,
    stopping_criteria=stopping_criteria
)
input_length = tokenized_chat.shape[1]
response = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
print(f"Model response: {response}")
The prompt template used by Hymba-1.5B-Instruct is shown below. It has been integrated into the tokenizer and can be applied using tokenizer.apply_chat_template:
<extra_id_0>System
{system prompt}
<extra_id_1>User
<tool> ... </tool>
<context> ... </context>
{prompt}
<extra_id_1>Assistant
<toolcall> ... </toolcall>
<extra_id_1>Tool
{tool response}
<extra_id_1>Assistant\n
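To see exactly how this template is rendered for a given conversation, apply_chat_template can also be called with tokenize=False; for example:
# Render the chat template as plain text to inspect the prompt format
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]
rendered = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(rendered)  # shows the <extra_id_0>System / <extra_id_1>User / <extra_id_1>Assistant structure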
⨠Features
- Task-handling capabilities: Hymba-1.5B-Instruct can perform complex tasks such as math reasoning, function calling, and role-playing.
- Commercial readiness: This model is ready for commercial use.
- Performance advantage: It outperforms popular small language models and achieves the highest average performance across the evaluated tasks.
Documentation
Model Overview
Hymba-1.5B-Instruct is a 1.5B parameter model fine-tuned from Hymba-1.5B-Base using a combination of open-source instruction datasets and internally collected synthetic datasets. It was trained with supervised fine-tuning and direct preference optimization.
Model Developer: NVIDIA
Model Dates: Hymba-1.5B-Instruct was trained between September 4, 2024 and November 10, 2024.
License: This model is released under the NVIDIA Open Model License Agreement.
Model Architecture
Hymba-1.5B-Instruct has a model embedding size of 1600, 25 attention heads, an MLP intermediate dimension of 5504, and 32 layers in total, with 16 SSM states and 3 full attention layers; the remaining layers use sliding window attention. Unlike a standard Transformer, each attention layer in Hymba is a hybrid combination of standard attention heads and Mamba heads operating in parallel. Additionally, it uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).
Features of this architecture:
- Fuse attention heads and SSM heads within the same layer, offering parallel and complementary processing of the same inputs (see the conceptual sketch after this list).
- Introduce meta tokens that are prepended to the input sequences and interact with all subsequent tokens, thus storing important information and alleviating the burden of "forced-to-attend" in attention.
- Integrate with cross-layer KV sharing and global-local attention to further boost memory and computation efficiency.
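The parallel attention/SSM fusion can be illustrated with a simplified, conceptual PyTorch sketch. This is not the actual Hymba implementation; the module choices (MultiheadAttention and GRU as stand-ins for the attention and Mamba branches), the normalization, and the fusion are illustrative assumptions only:
import torch
import torch.nn as nn

class ToyHybridHeadBlock(nn.Module):
    """Conceptual sketch: attention and SSM-like branches process the same
    input in parallel and their normalized outputs are fused.
    (Illustrative only; not the actual Hymba architecture.)"""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # stand-in for attention heads
        self.ssm = nn.GRU(dim, dim, batch_first=True)                        # stand-in for Mamba/SSM heads
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ssm = nn.LayerNorm(dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)   # attention branch
        ssm_out, _ = self.ssm(x)           # sequential/state-space branch
        fused = self.norm_attn(attn_out) + self.norm_ssm(ssm_out)  # fuse the parallel branches
        return self.out_proj(fused)

# Example: batch of 2 sequences, length 8, hidden size 64
block = ToyHybridHeadBlock(dim=64, num_heads=4)
print(block(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])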
Performance Highlights
Hymba-1.5B-Instruct outperforms popular small language models and achieves the highest average performance across the evaluated tasks.
Finetuning Hymba
LMFlow is a complete pipeline for fine-tuning large language models. The following steps provide an example of how to fine-tune the Hymba-1.5B-Base model using LMFlow:
- Using Docker
docker pull ghcr.io/tilmto/hymba:v1
docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
- Install LMFlow
git clone https://github.com/OptimalScale/LMFlow.git
cd LMFlow
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
pip install -e .
- Fine-tune the model using the following command:
cd LMFlow
bash ./scripts/run_finetune_hymba.sh
With LMFlow, you can also fine-tune the model on your custom dataset; you only need to convert your dataset into the LMFlow data format (see the sketch below). In addition to full finetuning, you can also fine-tune Hymba efficiently with DoRA, LoRA, LISA, Flash Attention, and other acceleration techniques. For more details, please refer to the LMFlow for Hymba documentation.
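As an illustration, a small instruction dataset could be written in LMFlow's JSON format roughly as follows; the "text2text" schema, file name, and example records are assumptions for this sketch and should be checked against the LMFlow documentation:
import json

# Hypothetical example of a tiny dataset in LMFlow's "text2text" JSON format
# (verify the exact schema against the LMFlow documentation before use)
dataset = {
    "type": "text2text",
    "instances": [
        {"input": "What is the capital of France?", "output": "The capital of France is Paris."},
        {"input": "Compute 12 * 7.", "output": "12 * 7 = 84."},
    ],
}

with open("my_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)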
Limitations
The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. It may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, and it may produce socially unacceptable or undesirable text even if the prompt itself does not include anything explicitly offensive.
Testing suggests that this model is susceptible to jailbreak attacks. If using this model in a RAG or agentic setting, we recommend strong output validation controls to ensure that security and safety risks from user-controlled model outputs are consistent with the intended use cases.
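As one illustrative (and by no means sufficient) example of such a control, generated text can be screened before it is acted on in an agentic pipeline. The <toolcall> tags follow the prompt template above; the allowlist and the specific checks are assumptions made for this sketch:
import re

ALLOWED_TOOLS = {"search", "calculator"}  # hypothetical allowlist for an agentic pipeline

def validate_response(text: str) -> bool:
    """Minimal screening of a model response before acting on it (illustrative only)."""
    # Reject responses that try to emit raw prompt-template control tokens
    if "<extra_id_" in text:
        return False
    # If the model emitted tool calls, only allow known tools
    for call in re.findall(r"<toolcall>(.*?)</toolcall>", text, flags=re.DOTALL):
        if not any(tool in call for tool in ALLOWED_TOOLS):
            return False
    return True

print(validate_response('<toolcall>{"name": "search", "arguments": {}}</toolcall>'))  # True
print(validate_response('<toolcall>{"name": "delete_files"}</toolcall>'))             # False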
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
Citation
@misc{dong2024hymbahybridheadarchitecturesmall,
title={Hymba: A Hybrid-head Architecture for Small Language Models},
author={Xin Dong and Yonggan Fu and Shizhe Diao and Wonmin Byeon and Zijia Chen and Ameya Sunil Mahabaleshwarkar and Shih-Yang Liu and Matthijs Van Keirsbilck and Min-Hung Chen and Yoshi Suhara and Yingyan Lin and Jan Kautz and Pavlo Molchanov},
year={2024},
eprint={2411.13676},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.13676},
}
Technical Details
- Model embedding size: 1600
- Attention heads: 25
- MLP intermediate dimension: 5504
- Total layers: 32
- SSM states: 16
- Full attention layers: 3
- Other attention layers: Sliding window attention
- Attention combination: Hybrid combination of standard attention heads and Mamba heads in parallel
- Other techniques: Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE)
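These values can be read directly from the model's configuration object; a brief sketch (the exact field names in the printed output depend on Hymba's custom configuration class):
from transformers import AutoConfig

# Load and print the configuration shipped with the checkpoint
config = AutoConfig.from_pretrained("nvidia/Hymba-1.5B-Instruct", trust_remote_code=True)
print(config)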
License
This model is released under the NVIDIA Open Model License Agreement.
Usage Tip
When using this model, especially in a RAG or agentic setting, use strong output validation controls to ensure the security and safety of the model outputs.

