Llama 3.1 8B Instruct GGUF
🚀 Llama-3.1-8B-Instruct GGUF Models
This project offers a series of Llama-3.1-8B-Instruct GGUF models with ultra-low-bit quantization techniques. These models are designed to balance memory efficiency and accuracy, making them suitable for various deployment scenarios, including memory-constrained environments and edge devices.
✨ Features
- Ultra-Low-Bit Quantization: Our latest quantization method introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on Llama-3-8B.
- Dynamic Precision Allocation: The method uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.
- Critical Component Protection: Embeddings and output layers use Q5_K to reduce error propagation by 38% compared to standard 1-2 bit quantization.
- Multiple Model Formats: Available in various formats, including BF16, F16, and quantized models (Q4_K, Q6_K, Q8_0, etc.), to meet different hardware and memory requirements.
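For the GGUF files, one common way to run a quantized variant locally is through the llama-cpp-python bindings for llama.cpp. The sketch below is illustrative only, not code from this repository: the file name is hypothetical, and the parameters are just a reasonable starting point.

```python
# Minimal sketch of loading a quantized GGUF file with llama-cpp-python.
# The file name below is hypothetical -- substitute whichever quantization
# (Q4_K, Q6_K, Q8_0, IQ2_XXS, ...) fits your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # hypothetical file name
    n_ctx=4096,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers to GPU if available; use 0 for CPU-only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Who are you?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```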
📦 Installation
This README does not provide specific installation steps. If you need to install these models, please refer to the official documentation or the relevant repositories for detailed instructions.
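As a rough, unofficial starting point, the usage examples below rely on the transformers library, with llama-cpp-python and the Hugging Face CLI as optional extras; installing them typically looks like this:

```bash
# Hedged example only -- check each project's documentation for supported versions.
pip install --upgrade transformers torch      # for the transformers-based examples
pip install llama-cpp-python                  # optional: for running the GGUF files locally
pip install -U "huggingface_hub[cli]"         # optional: for downloading checkpoints
```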
💻 Usage Examples
Basic Usage
```python
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
```
Advanced Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the tokenizer and model used for tool calling (bfloat16, automatic device placement).
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# First, define a tool
def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location in the specified units, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

# Next, create a chat and apply the chat template
messages = [
    {"role": "system", "content": "You are a bot that responds to weather queries."},
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"},
]
inputs = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True)

# If the model generates a tool call, append it and the tool's result to the chat:
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
# After that, you can generate() again to let the model use the tool result in the chat.
```
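The final comment above refers to a second generation pass. As an illustrative sketch (not part of the original example), that step could look roughly like this, reusing the model and tokenizer objects from the block above:

```python
# Re-apply the chat template with the tool result included, then generate the final answer.
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_temperature],
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```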
📚 Documentation
Model Information
The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B, and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks.
Property | Details |
---|---|
Model Developer | Meta |
Model Architecture | Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. |
Supported Languages | English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai |
Model Release Date | July 23, 2024 |
Status | This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback. |
License | A custom commercial license, the Llama 3.1 Community License, is available at: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE |
Intended Use
- Intended Use Cases: Llama 3.1 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.1 model collection also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 3.1 Community License allows for these use cases.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.1 Community License. Use in languages beyond those explicitly referenced as supported in this model card.
How to Use
This repository contains two versions of Meta-Llama-3.1-8B-Instruct, for use with transformers and with the original `llama` codebase.
Use with transformers
Starting with `transformers >= 4.43.0`, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function. Make sure to update your transformers installation via `pip install --upgrade transformers`.
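As a brief sketch of the Auto-classes route mentioned above (assuming bfloat16 weights and automatic device placement, as in the pipeline example earlier):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```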
Use with llama
Please follow the instructions in the repository.
To download the original checkpoints, see the example command below leveraging `huggingface-cli`:

```bash
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --include "original/*" --local-dir Meta-Llama-3.1-8B-Instruct
```
Hardware and Software
- Training Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on production infrastructure.
- Training utilized a cumulative 39.3M GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model, and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency.
Model | Training Time (GPU hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |
---|---|---|---|---|
Llama 3.1 8B | 1.46M | 700 | 420 | 0 |
Llama 3.1 70B | 7.0M | 700 | 2,040 | 0 |
Llama 3.1 405B | 30.84M | 700 | 8,930 | 0 |
Total | 39.3M | - | 11,390 | 0 |
The methodology used to determine training energy use and greenhouse gas emissions can be found here. Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others.
Training Data
- Overview: Llama 3.1 was pretrained on ~15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples.
- Data Freshness: The pretraining data has a cutoff of December 2023.
Benchmark scores
In this section, we report the results for Llama 3.1 models on standard automatic benchmarks. For all the evaluations, we use our internal evaluations library.
Base pretrained models
Category | Benchmark | # Shots | Metric | Llama 3 8B | Llama 3.1 8B | Llama 3 70B | Llama 3.1 70B | Llama 3.1 405B |
---|---|---|---|---|---|---|---|---|
General | MMLU | 5 | macro_avg/acc_char | 66.7 | 66.7 | 79.5 | 79.3 | 85.2 |
General | MMLU-Pro (CoT) | 5 | macro_avg/acc_char | 36.2 | 37.1 | 55.0 | 53.8 | 61.6 |
General | AGIEval English | 3 - 5 | average/acc_char | 47.1 | 47.8 | 63.0 | 64.6 | 71.6 |
General | CommonSenseQA | 7 | acc_char | 72.6 | 75.0 | 83.8 | 84.1 | 85.8 |
General | Winogrande | 5 | acc_char | - | 60.5 | - | 83.3 | 86.7 |
General | BIG-Bench Hard (CoT) | 3 | average/em | 61.1 | 64.2 | 81.3 | 81.6 | 85.9 |
General | ARC-Challenge | 25 | acc_char | 79.4 | 79.7 | 93.1 | 92.9 | 96.1 |
Knowledge reasoning | TriviaQA-Wiki | 5 | em | 78.5 | 77.6 | 89.7 | 89.8 | 91.8 |
Reading comprehension | SQuAD | 1 | em | 76.4 | 77.0 | 85.6 | 81.8 | 89.3 |
Reading comprehension | QuAC (F1) | 1 | f1 | 44.4 | 44.9 | 51.1 | 51.1 | 53.6 |
Reading comprehension | BoolQ | 0 | acc_char | 75.7 | 75.0 | 79.0 | 79.4 | 80.0 |
Reading comprehension | DROP (F1) | 3 | f1 | 58.4 | 59.5 | 79.7 | 79.6 | 84.8 |
Instruction tuned models
Category | Benchmark | # Shots | Metric | Llama 3 8B Instruct | Llama 3.1 8B Instruct | Llama 3 70B Instruct | Llama 3.1 70B Instruct | Llama 3.1 405B Instruct |
---|---|---|---|---|---|---|---|---|
General | MMLU | 5 | macro_avg/acc | 68.5 | 69.4 | 82.0 | 83.6 | 87.3 |
General | MMLU (CoT) | 0 | macro_avg/acc | 65.3 | 73.0 | 80.9 | 86.0 | 88.6 |
General | MMLU-Pro (CoT) | 5 | micro_avg/acc_char | 45.5 | 48.3 | 63.4 | 66.4 | 73.3 |
General | IFEval | - | - | 76.8 | 80.4 | 82.9 | 87.5 | 88.6 |
Reasoning | ARC-C | 0 | acc | 82.4 | 83.4 | 94.4 | 94.8 | 96.9 |
Reasoning | GPQA | 0 | em | 34.6 | 30.4 | 39.5 | 46.7 | 50.7 |
Code | HumanEval | 0 | pass@1 | 60.4 | 72.6 | 81.7 | 80.5 | 89.0 |
Code | MBPP ++ base version | 0 | pass@1 | 70.6 | 72.8 | 82.5 | 86.0 | 88.6 |
Code | Multipl-E HumanEval | 0 | pass@1 | - | 50.8 | - | 65.5 | 75.2 |
Code | Multipl-E MBPP | 0 | pass@1 | - | 52.4 | - | 62.0 | 65.7 |
Math | GSM-8K (CoT) | 8 | em_maj1@1 | 80.6 | 84.5 | 93.0 | 95.1 | 96.8 |
Math | MATH (CoT) | 0 | final_em | 29.1 | 51.9 | 51.0 | 68.0 | 73.8 |
Tool Use | API-Bank | 0 | acc | 48.3 | 82.6 | 85.1 | 90.0 | 92.0 |
Tool Use | BFCL | 0 | acc | 60.3 | 76.1 | 83.0 | 84.8 | 88.5 |
Tool Use | Gorilla Benchmark API Bench | 0 | acc | 1.7 | 8.2 | 14.7 | 29.7 | 35.3 |
Tool Use | Nexus (0-shot) | 0 | macro_avg/acc | 18.1 | 38.5 | 47.8 | 56.7 | 58.7 |
Multilingual | Multilingual MGSM (CoT) | 0 | em | - | 68.9 | - | 86.9 | 91.6 |
Multilingual benchmarks
Category | Benchmark | Language | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B |
---|---|---|---|---|---|
General | MMLU (5-shot, macro_avg/acc) | Portuguese | 62.12 | 80.13 | 84.95 |
General | MMLU (5-shot, macro_avg/acc) | Spanish | 62.45 | 80.05 | 85.08 |
General | MMLU (5-shot, macro_avg/acc) | Italian | 61.63 | 80.4 | 85.04 |
General | MMLU (5-shot, macro_avg/acc) | German | 60.59 | 79.27 | 84.36 |
General | MMLU (5-shot, macro_avg/acc) | French | 62.34 | 79.82 | 84.66 |
General | MMLU (5-shot, macro_avg/acc) | Hindi | 50.88 | 74.52 | 80.31 |
General | MMLU (5-shot, macro_avg/acc) | Thai | 50.32 | 72.95 | 78.21 |
🔧 Technical Details
Ultra-Low-Bit Quantization
Our latest quantization method introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on Llama-3-8B. This approach uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.
Dynamic Precision Allocation
- First/Last 25% of layers → IQ4_XS (selected layers)
- Middle 50% → IQ2_XXS/IQ3_S (increase efficiency)
Critical Component Protection
- Embeddings/output layers use Q5_K
- Reduces error propagation by 38% vs standard 1-2 bit quantization
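To make the layer-wise allocation concrete, here is a toy Python sketch of the policy described above. It is illustrative only (my own naming, not code from this project): real GGUF quantization is performed by llama.cpp tooling, not by a function like this.

```python
# Illustrative only: a toy policy mirroring the layer-wise scheme described above.
def choose_quant_type(layer_idx: int, n_layers: int, is_embedding_or_output: bool) -> str:
    if is_embedding_or_output:
        return "Q5_K"        # protect critical components to limit error propagation
    position = layer_idx / max(n_layers - 1, 1)
    if position < 0.25 or position > 0.75:
        return "IQ4_XS"      # first/last 25% of layers keep higher precision
    return "IQ2_XXS"         # middle 50% uses ultra-low-bit types (e.g. IQ2_XXS/IQ3_S)

# Example: print the plan for a 32-layer model such as Llama-3.1-8B.
for i in range(32):
    print(i, choose_quant_type(i, 32, is_embedding_or_output=False))
```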
Quantization Performance Comparison (Llama-3-8B)
Quantization | Standard PPL | DynamicGate PPL | Δ PPL | Std Size | DG Size | Δ Size | Std Speed | DG Speed |
---|---|---|---|---|---|---|---|---|
IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |
Key Improvements
- 🔥 IQ1_M shows a massive 43.9% perplexity reduction (27.46 → 15.41)
- 🚀 IQ2_S cuts perplexity by 36.9% while adding only 0.2GB
- ⚡ IQ1_S maintains 39.7% better accuracy despite 1-bit quantization
Tradeoffs
- All variants have modest size increases (0.1-0.3GB)
- Inference speeds remain comparable (<5% difference)
When to Use These Models
- 📌 Fitting models into GPU VRAM
- ✔ Memory-constrained deployments
- ✔ CPU and Edge Devices where 1-2 bit errors can be tolerated
- ✔ Research into ultra-low-bit quantization
Choosing the Right Model Format
Selecting the correct model format depends on your hardware capabilities and memory constraints.
Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
---|---|---|---|---|
BF16 | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
F16 | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
Q4_K | Medium Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
Q6_K | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
Q8_0 | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
IQ3_XS | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency at the cost of accuracy |
Q4_0 | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
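Once you have chosen a quantization, the GGUF file can be run directly with llama.cpp. A hedged example follows: the file name is hypothetical, and the flags are just a common starting point rather than recommended settings.

```bash
# Hypothetical file name -- substitute the quantization you downloaded.
./llama-cli -m Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -p "Who are you?" \
    -n 256 \
    --temp 0.7
```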
📄 License
A custom commercial license, the Llama 3.1 Community License, is available at: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE

