🚀 Model Card for alokabhishek/Meta-Llama-3-8B-Instruct-bnb-8bit
This repository houses an 8-bit quantized version (using bitsandbytes) of Meta's Meta-Llama-3-8B-Instruct, offering a more memory-efficient option for text generation tasks.
🚀 Quick Start
Use the following Python code to start working with the model:
```python
import transformers
import torch

model_id = "alokabhishek/Meta-Llama-3-8B-Instruct-bnb-8bit"

# device_map="auto" places the model on the available GPU(s)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

prompt_instruction = "You are a virtual assistant with advanced expertise in a broad spectrum of topics, equipped to utilize high-level critical thinking, cognitive skills, creativity, and innovation. Your goal is to deliver the most straightforward and accurate answer possible for each question, ensuring high-quality and useful responses for the user."
user_prompt = "Why is Hulk always angry?"

chat_messages = [
    {"role": "system", "content": prompt_instruction},
    {"role": "user", "content": user_prompt},
]

# Render the messages with the Llama 3 chat template
prompt = pipeline.tokenizer.apply_chat_template(
    chat_messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Llama 3 ends assistant turns with <|eot_id|>, so stop on it as well as on EOS
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

output = pipeline(
    prompt,
    do_sample=True,
    max_new_tokens=1024,
    temperature=1,
    top_k=50,
    top_p=1,
    num_return_sequences=1,
    pad_token_id=pipeline.tokenizer.pad_token_id,
    eos_token_id=terminators,
)

# Strip the prompt and print only the generated continuation
print(output[0]["generated_text"][len(prompt):])
```
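To sanity-check the memory savings from 8-bit quantization, you can inspect the loaded model's footprint; `get_memory_footprint` is a standard transformers utility (this check is an addition for illustration, not part of the original card):

```python
# Approximate memory used by the model weights, in GB; the 8-bit 8B model
# should come in well under the ~16 GB an fp16 copy would need
print(f"{pipeline.model.get_memory_footprint() / 1e9:.2f} GB")
```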
✨ Features
- 8-bit Quantization: Utilizes bitsandbytes for efficient 8-bit quantization, reducing memory usage and potentially speeding up inference.
- Text Generation: Specialized for text generation tasks, suitable for various natural language processing applications.
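Because serialized bitsandbytes checkpoints carry their quantization config, this repo should also load directly through AutoModelForCausalLM without an explicit BitsAndBytesConfig. A minimal sketch, assuming this checkpoint was saved with its quantization config embedded (as serialized bitsandbytes models are):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "alokabhishek/Meta-Llama-3-8B-Instruct-bnb-8bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The 8-bit quantization config is read from the checkpoint itself, so no
# quantization_config argument is needed here (assumed behavior for
# serialized bitsandbytes checkpoints)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```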
📦 Installation
No specific installation steps are provided in the original README. If you want to use the model, you need to have the transformers
library installed. You can install it using pip install transformers
.
💻 Usage Examples
Basic Usage
Transformers pipeline
```python
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])
```
Transformers AutoModelForCausalLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```
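Continuing from the block above, you can also stream tokens to stdout as they are generated instead of waiting for the full completion. A short sketch using transformers' TextStreamer (an addition for illustration, not part of the original card):

```python
from transformers import TextStreamer

# skip_prompt=True prints only the newly generated tokens
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    streamer=streamer,
)
```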
Advanced Usage
To download the original checkpoints, you can use the following command:
```bash
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "original/*" --local-dir Meta-Llama-3-8B-Instruct
```
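The same download can be scripted from Python with huggingface_hub's snapshot_download (an equivalent alternative, assuming huggingface_hub is installed and you have accepted the model's license on the Hub):

```python
from huggingface_hub import snapshot_download

# Fetch only the original/* checkpoint files into a local directory
snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    allow_patterns=["original/*"],
    local_dir="Meta-Llama-3-8B-Instruct",
)
```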
📚 Documentation
Model Details
- Model creator: Meta
- Original model: Meta-Llama-3-8B-Instruct
About 8-bit quantization using bitsandbytes
- QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314)
- Hugging Face blog post on 8-bit quantization with bitsandbytes: "A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes"
- bitsandbytes GitHub repository
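As a rough illustration of how a repository like this one can be produced, the sketch below loads the base model with LLM.int8() quantization and saves the 8-bit weights. This is an assumed workflow, not the author's documented steps, and 8-bit serialization requires reasonably recent transformers and bitsandbytes releases:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# LLM.int8() keeps outlier activation dimensions in higher precision;
# llm_int8_threshold controls what counts as an outlier (6.0 is the paper default)
bnb_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Save the quantized weights so they can be re-uploaded as a standalone repo
model.save_pretrained("Meta-Llama-3-8B-Instruct-bnb-8bit")
```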
Meta Llama 3 Original Model Card
Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B and 70B sizes. The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks. Further, in developing these models, Meta took great care to optimize helpfulness and safety.
Property | Details |
---|---|
Model developers | Meta |
Variations | Llama 3 comes in two sizes — 8B and 70B parameters — in pre-trained and instruction tuned variants. |
Input | Models input text only. |
Output | Models generate text and code only. |
Model Architecture | Llama 3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. |
Training Data | A new mix of publicly available online data. |
Params | 8B and 70B |
Context length | 8k |
GQA | Yes |
Token count | 15T+ |
Knowledge cutoff | March 2023 (8B); December 2023 (70B) |
Model Release Date | April 18, 2024. |
Status | This is a static model trained on an offline dataset. Future versions of the tuned models will be released as Meta improves model safety with community feedback. |
License | A custom commercial license is available at: https://llama.meta.com/llama3/license |
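With an 8k-token context window, it can be worth checking prompt length before generation. A small sketch (an illustrative addition, reusing pipeline and chat_messages from the Quick Start above):

```python
MAX_CONTEXT = 8192  # Llama 3 context length

# apply_chat_template with tokenize=True returns the prompt's token ids
token_ids = pipeline.tokenizer.apply_chat_template(
    chat_messages, add_generation_prompt=True, tokenize=True
)
budget = MAX_CONTEXT - len(token_ids)
print(f"prompt uses {len(token_ids)} tokens; {budget} left for generation")
```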
Intended Use
- Intended Use Cases: Llama 3 is intended for commercial and research use in English. Instruction tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3 Community License. Use in languages other than English.
- Note: Developers may fine-tune Llama 3 models for languages beyond English provided they comply with the Llama 3 Community License and the Acceptable Use Policy.
Hardware and Software
- Training Factors: Meta used custom training libraries, Meta's Research SuperCluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute.
- Carbon Footprint: Pretraining utilized a cumulative 7.7M GPU hours of computation on hardware of type H100 - 80GB (TDP of 700W). Estimated total emissions were 2290 tCO2eq, 100% of which were offset by Meta’s sustainability program.
Property | Details |
---|---|
Time (GPU hours) - Llama 3 8B | 1.3M |
Time (GPU hours) - Llama 3 70B | 6.4M |
Time (GPU hours) - Total | 7.7M |
Power Consumption (W) | 700 |
Carbon Emitted (tCO2eq) - Llama 3 8B | 390 |
Carbon Emitted (tCO2eq) - Llama 3 70B | 1900 |
Carbon Emitted (tCO2eq) - Total | 2290 |
Training Data
- Overview: Llama 3 was pretrained on over 15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 10M human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.
- Data Freshness: The pretraining data has a cutoff of March 2023 for the 8B and December 2023 for the 70B models respectively.
Benchmarks
This section reports results for the Llama 3 models on standard automatic benchmarks, all produced with Meta's internal evaluations library; details on the methodology are available in Meta's Llama 3 GitHub repository.
Base pretrained models
Category | Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
---|---|---|---|---|---|---|
General | MMLU (5-shot) | 66.6 | 45.7 | 53.8 | 79.5 | 69.7 |
General | AGIEval English (3-5 shot) | 45.9 | 28.8 | 38.7 | 63.0 | 54.8 |
General | CommonSenseQA (7-shot) | 72.6 | 57.6 | 67.6 | 83.8 | 78.7 |
General | Winogrande (5-shot) | 76.1 | 73.3 | 75.4 | 83.1 | 81.8 |
General | BIG-Bench Hard (3-shot, CoT) | 61.1 | 38.1 | 47.0 | 81.3 | 65.7 |
General | ARC-Challenge (25-shot) | 78.6 | 53.7 | 67.6 | 93.0 | 85.3 |
Knowledge reasoning | TriviaQA-Wiki (5-shot) | 78.5 | 72.1 | 79.6 | 89.7 | 87.5 |
Reading comprehension | SQuAD (1-shot) | 76.4 | 72.2 | 72.1 | 85.6 | 82.6 |
Reading comprehension | QuAC (1-shot, F1) | 44.4 | 39.6 | 44.9 | 51.1 | 49.4 |
Reading comprehension | BoolQ (0-shot) | 75.7 | 65.5 | 66.9 | 79.0 | 73.1 |
Reading comprehension | DROP (3-shot, F1) | 58.4 | 37.9 | 49.8 | 79.7 | 70.2 |
Instruction tuned models
Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
---|---|---|---|---|---|
MMLU (5-shot) | 68.4 | 34.1 | 47.8 | 82.0 | 52.9 |
GPQA (0-shot) | 34.2 | 21.7 | 22.3 | 39.5 | 21.0 |
HumanEval (0-shot) | 62.2 | 7.9 | 14.0 | 81.7 | 25.6 |
GSM-8K (8-shot, CoT) | 79.6 | 25.7 | 77.4 | 93.0 | 57.5 |
MATH (4-shot, CoT) | 30.0 | 3.8 | 6.7 | 50.4 | 11.6 |
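These figures come from Meta's internal harness and will not reproduce exactly elsewhere. For a comparable open-source measurement you could use EleutherAI's lm-evaluation-harness; the sketch below is an illustration under the assumption of lm-eval v0.4+ (`pip install lm-eval`), not the methodology Meta used:

```python
import lm_eval

# Run 5-shot MMLU against the 8-bit repo via the Hugging Face backend
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=alokabhishek/Meta-Llama-3-8B-Instruct-bnb-8bit",
    tasks=["mmlu"],
    num_fewshot=5,
)
print(results["results"])
```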
Responsibility & Safety
Meta believes that an open approach to AI leads to better, safer products, faster innovation, and a bigger overall market. Meta is committed to Responsible AI development and took a series of steps to limit misuse and harm and support the open source community.
As part of the Llama 3 release, Meta updated its Responsible Use Guide to outline the steps and best practices for developers to implement model and system level safety for their application. Meta also provides a set of resources including Meta Llama Guard 2 and Code Shield safeguards.
⚠️ Important Note
Foundation models are widely capable technologies that are built to be used for a diverse range of applications. They are not designed to meet every developer preference on safety levels for all use cases, out-of-the-box, as those by their nature will differ across different applications.
💡 Usage Tip
Developers should exercise discretion about how to weigh the benefits of alignment and helpfulness for their specific use case and audience. They should be mindful of residual risks when using Llama models and leverage additional safety tools as needed to reach the right safety bar for their use case.
Llama 3-Instruct
- Safety: For the instruction tuned model, Meta conducted extensive red teaming exercises, performed adversarial evaluations, and implemented safety mitigation techniques to lower residual risks. As with any large language model, residual risks will likely remain, and Meta recommends that developers assess these risks in the context of their use case.
- Refusals: Meta placed great emphasis on reducing false refusals of benign prompts. Llama 3 is significantly less likely than Llama 2 to falsely refuse to answer prompts; Meta built internal benchmarks and developed mitigations to limit false refusals, making Llama 3 Meta's most helpful model to date.
Responsible release
Meta followed a rigorous process that requires it to take extra measures against misuse and critical risks before making its release decision.
- Misuse: If you access or use Llama 3, you agree to the Acceptable Use Policy. The most recent copy of this policy can be found at https://llama.meta.com/llama3/use-policy/.
- Critical risks: Meta conducted a two-fold assessment of the safety of the model in areas such as CBRNE, Cyber Security, and Child Safety.
Community
Generative AI safety requires expertise and tooling, and Meta believes in the strength of the open community to accelerate its progress. Meta is an active member of open consortiums, including the AI Alliance, Partnership on AI, and MLCommons, actively contributing to the development of AI safety standards.
📄 License
The license for this model is `other`, with the license name `llama3`. See the LICENSE file in this repository for the full terms.

