Minitron-8B-Base Open-Source Large Language Model - Low-Cost Training, Efficiently Empowering Intelligent Interaction

Minitron 8B Base

Developed by NVIDIA

Minitron-8B-Base is a large language model obtained by pruning Nemotron-4 15B and then applying continued training with distillation. Compared to training from scratch, this approach requires up to 40x fewer training tokens per model and yields compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B).

Large Language Model

Transformers

English

Open Source License: Other

#Pruning and Distillation Optimization #Multi-domain Text Generation #GQA Efficient Inference

Downloads 5,725

Release Time: 7/19/2024

Model Overview

Minitron-8B-Base is an efficient large language model derived from the Nemotron-4 15B model through pruning and distillation techniques, primarily used for text generation tasks.

Model Features

Efficient Training

Requires up to 40x fewer training tokens per model than training from scratch, with compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B).

High Performance

Demonstrates up to a 16% improvement in MMLU scores compared to training from scratch, with performance comparable to community models such as Mistral 7B, Gemma 7B, and Llama-3 8B.

Advanced Architecture

Incorporates advanced techniques such as Grouped Query Attention (GQA) and Rotary Position Embedding (RoPE).
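
As a rough illustration of what RoPE does, the sketch below rotates each pair of vector components by an angle proportional to the token position, so that the attention score between a query and a key depends only on their relative offset. The frequency base of 10000 and the interleaved pairing are assumptions taken from the common formulation of RoPE, not details confirmed for this model, and `rope_rotate` is a hypothetical helper name.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to one even-length vector."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE rotates pairs, so the dimension must be even"
    out = []
    for i in range(0, dim, 2):
        # Each pair (x, y) is rotated by an angle that grows with position
        # and uses a lower frequency for later pairs.
        theta = position * base ** (-i / dim)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (illustrative values only).
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair is only rotated, the vector norm is unchanged; the position information lives entirely in the angle between query and key.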

Model Capabilities

Text Generation

Language Understanding

Code Generation

Use Cases

Natural Language Processing

Text Completion

Generates coherent subsequent text based on given prompts.

Produces fluent and semantically coherent text.

Question Answering

Answers user-provided questions.

Provides accurate and relevant answers.

Code Generation

Code Completion

Generates subsequent code based on given code snippets.

Produces functionally correct code snippets.

🚀 Minitron-8B-Base

Minitron-8B-Base is a large language model derived from pruning Nemotron-4 15B, offering significant computational savings and comparable performance to other community models.

🚀 Quick Start

Support for this model will be added in the upcoming transformers release. In the meantime, you can install the library from source:

pip install git+https://github.com/huggingface/transformers

The following is an example of how to load the Minitron-8B model and use it for text generation:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
model_path = "nvidia/Minitron-8B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_path)

device = "cuda"
dtype = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=dtype, device_map=device)

# Prepare the input text
prompt = "To be or not to be,"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

# Generate the output
output_ids = model.generate(input_ids, max_length=50, num_return_sequences=1)

# Decode and print the output
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
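
One detail of the example above is easy to miss: in transformers, `max_length` bounds the total sequence length, prompt included, so the number of newly generated tokens shrinks as the prompt grows. The sketch below makes that arithmetic explicit. `new_token_budget` is a hypothetical helper and the prompt length of 7 tokens is an assumed value (the real count depends on the tokenizer).

```python
def new_token_budget(prompt_len: int, max_length: int) -> int:
    """Tokens left for generation when max_length counts the prompt too."""
    return max(0, max_length - prompt_len)

# Assumed prompt length; in practice use input_ids.shape[-1].
prompt_len = 7
print(new_token_budget(prompt_len, max_length=50))

# A prompt longer than max_length leaves no room for new tokens.
print(new_token_budget(60, max_length=50))

# To bound only the generated part, generate() also accepts max_new_tokens:
#   output_ids = model.generate(input_ids, max_new_tokens=50)
```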

✨ Features

Efficient Training: Deriving the Minitron 8B and 4B models from the base 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch, resulting in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B).
Good Performance: Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature.


📚 Documentation

Model Overview

Minitron-8B-Base is a large language model (LLM) obtained by pruning Nemotron-4 15B. Specifically, we prune the model embedding size, the number of attention heads, and the MLP intermediate dimension. After pruning, we perform continued training with distillation on 94 billion tokens to obtain the final model. For this, we use the continued pre-training data corpus used for Nemotron-4 15B.
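
The two steps described above can be sketched in a few lines. This is a generic illustration, not the authors' implementation: it assumes width pruning keeps the units with the highest importance scores, and that distillation minimizes the KL divergence between teacher and student output distributions. The importance scores, the toy matrices, and all function names are hypothetical.

```python
import math

def top_k_indices(importance, keep):
    """Indices of the `keep` highest-importance units, in original order."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:keep])

def prune_columns(weight, kept):
    """Keep only the selected columns of a row-major weight matrix."""
    return [[row[j] for j in kept] for row in weight]

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits):
    """KL(teacher || student) over one token's vocabulary distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Step 1: prune a toy 2x4 weight matrix to its 2 most important columns.
importance = [0.1, 0.9, 0.4, 0.7]  # hypothetical per-unit scores
weight = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
kept = top_k_indices(importance, keep=2)
pruned = prune_columns(weight, kept)
print(kept, pruned)

# Step 2: the student is trained to match the teacher's distribution.
loss = distillation_kl([2.0, 1.0, 0.1], [1.5, 1.2, 0.3])
print(loss)
```

The loss is zero when the student reproduces the teacher's logits exactly and positive otherwise, which is what makes it usable as a training signal for the pruned model.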

This model is for research and development only.

Model Developer: NVIDIA

Model Dates: Minitron-8B-Base was trained between February 2024 and June 2024.

Model Architecture

Property	Details
Model Type	Transformer Decoder (auto-regressive language model)
Network Architecture	Nemotron-4
Input Type	Text
Input Format	String
Input Parameters	None
Other Properties Related to Input	None
Output Type	Text
Output Format	String
Output Parameters	None
Other Properties Related to Output	None
Model Embedding Size	4096
Attention Heads	48
MLP Intermediate Dimension	16384
Attention Mechanism	Grouped-Query Attention (GQA)
Position Embeddings	Rotary Position Embeddings (RoPE)
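
To give a feel for why GQA matters at inference time, the sketch below compares the key/value cache size of standard multi-head attention with that of GQA. Only the 48 attention heads come from the table above; the number of key/value heads, the layer count, the head dimension, and the sequence length are assumptions chosen for illustration and should be checked against the model configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading factor of 2 covers keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

NUM_QUERY_HEADS = 48  # from the architecture table
NUM_KV_HEADS = 8      # assumption: not stated in the table
NUM_LAYERS = 32       # assumption
HEAD_DIM = 128        # assumption
SEQ_LEN = 4096        # assumption

# Multi-head attention caches one K/V pair per query head;
# GQA shares each K/V head across a group of query heads.
mha = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
gqa = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"MHA cache: {mha / 2**20:.0f} MiB")
print(f"GQA cache: {gqa / 2**20:.0f} MiB")
print(f"Reduction: {mha // gqa}x")
```

Under these assumptions the reduction equals the ratio of query heads to key/value heads, independent of sequence length or precision.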

Dataset & Training

Property	Details
Data Collection Method	Hybrid
Labeling Method	Not Applicable
Data Properties	The training corpus consists of English and multilingual text, as well as code. Sources cover various document types like webpages, dialogue, articles, etc. The corpus spans domains including legal, math, science, finance, etc. A small portion of question-answering and alignment style data is added in the continued training set.
Data Freshness	The pretraining data has a cutoff of June 2023.

Evaluation Results

5-shot performance. Language understanding evaluated using Massive Multitask Language Understanding (MMLU):

Average
64.5

Zero-shot performance. Evaluated using select datasets from the LM Evaluation Harness with additions:

HellaSwag	Winogrande	GSM8K	ARC-C	XLSum
81.6	80.3	54.2	49.2	31.1

Code generation performance. Evaluated using HumanEval:

pass@1, 0-shot
31.6
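
For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: draw n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k samples passes. The sketch below shows that estimator. How many samples were drawn for the score above is not stated here, so the numbers in the example are purely illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 10 samples for one problem, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

For k = 1 the estimator reduces to the fraction of correct samples, and the benchmark score is the mean of this value over all problems.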

Please refer to our paper for the full set of results.

Inference

Property	Details
Engine	TensorRT-LLM
Test Hardware	NVIDIA A100
DType	Float16/BFloat16
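
The data type has a direct effect on how much accelerator memory the weights occupy. The back-of-the-envelope sketch below assumes a nominal 8 billion parameters (the exact count is not given here) and ignores activations, the KV cache, and framework overhead.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30

NUM_PARAMS = 8e9  # assumption: nominal "8B", not an exact parameter count

for name, size in [("float32", 4), ("float16/bfloat16", 2)]:
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, size):.1f} GiB")
```

Half-precision weights need roughly half the memory of float32, which is one reason the 16-bit types are the ones listed for inference.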

Limitations

The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable output even if the prompt itself does not include anything explicitly offensive.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns here.

Citation

If you find our work helpful, please consider citing our paper:

@article{minitron2024,
      title={Compact Language Models via Pruning and Knowledge Distillation}, 
      author={Saurav Muralidharan and Sharath Turuvekere Sreenivas and Raviraj Joshi and Marcin Chochowski and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Jan Kautz and Pavlo Molchanov},
      journal={arXiv preprint arXiv:2407.14679},
      year={2024},
      url={https://arxiv.org/abs/2407.14679}, 
}

📄 License

Minitron-8B-Base is released under the NVIDIA Open Model License Agreement.
