🚀 Model Card for alokabhishek/Meta-Llama-3-8B-Instruct-bnb-8bit
This repository houses an 8-bit quantized version (using bitsandbytes) of Meta's Meta-Llama-3-8B-Instruct, offering a more memory-efficient option for text generation tasks.
🚀 Quick Start
Use the following Python code to start working with the model:
```python
import transformers
import torch

model_id = "alokabhishek/Meta-Llama-3-8B-Instruct-bnb-8bit"

# device_map="auto" places the model on the available GPU(s)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

prompt_instruction = "You are a virtual assistant with advanced expertise in a broad spectrum of topics, equipped to utilize high-level critical thinking, cognitive skills, creativity, and innovation. Your goal is to deliver the most straightforward and accurate answer possible for each question, ensuring high-quality and useful responses for the user."
user_prompt = "Why is Hulk always angry?"

chat_messages = [
    {"role": "system", "content": prompt_instruction},
    {"role": "user", "content": user_prompt},
]

# Render the messages with the Llama 3 chat template
prompt = pipeline.tokenizer.apply_chat_template(
    chat_messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Llama 3 ends assistant turns with <|eot_id|>, so stop on it as well as on EOS
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

output = pipeline(
    prompt,
    do_sample=True,
    max_new_tokens=1024,
    temperature=1,
    top_k=50,
    top_p=1,
    num_return_sequences=1,
    pad_token_id=pipeline.tokenizer.pad_token_id,
    eos_token_id=terminators,
)

# Strip the prompt and print only the generated continuation
print(output[0]["generated_text"][len(prompt):])
```
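To sanity-check the memory savings from 8-bit quantization, you can inspect the loaded model's footprint; `get_memory_footprint` is a standard transformers utility (this check is an addition for illustration, not part of the original card):

```python
# Approximate memory used by the model weights, in GB; the 8-bit 8B model
# should come in well under the ~16 GB an fp16 copy would need
print(f"{pipeline.model.get_memory_footprint() / 1e9:.2f} GB")
```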
✨ Features
- 8-bit Quantization: Utilizes bitsandbytes for efficient 8-bit quantization, reducing memory usage and potentially speeding up inference.
- Text Generation: Specialized for text generation tasks, suitable for various natural language processing applications.
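Because serialized bitsandbytes checkpoints carry their quantization config, this repo should also load directly through AutoModelForCausalLM without an explicit BitsAndBytesConfig. A minimal sketch, assuming this checkpoint was saved with its quantization config embedded (as serialized bitsandbytes models are):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "alokabhishek/Meta-Llama-3-8B-Instruct-bnb-8bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The 8-bit quantization config is read from the checkpoint itself, so no
# quantization_config argument is needed here (assumed behavior for
# serialized bitsandbytes checkpoints)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```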
📦 Installation
No specific installation steps are provided in the original README. If you want to use the model, you need to have the transformers
library installed. You can install it using pip install transformers
.
💻 Usage Examples
Basic Usage
Transformers pipeline
```python
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])
```
Transformers AutoModelForCausalLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```
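Continuing from the block above, you can also stream tokens to stdout as they are generated instead of waiting for the full completion. A short sketch using transformers' TextStreamer (an addition for illustration, not part of the original card):

```python
from transformers import TextStreamer

# skip_prompt=True prints only the newly generated tokens
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    streamer=streamer,
)
```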
Advanced Usage
To download the original checkpoints, you can use the following command:
```bash
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "original/*" --local-dir Meta-Llama-3-8B-Instruct
```
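The same download can be scripted from Python with huggingface_hub's snapshot_download (an equivalent alternative, assuming huggingface_hub is installed and you have accepted the model's license on the Hub):

```python
from huggingface_hub import snapshot_download

# Fetch only the original/* checkpoint files into a local directory
snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    allow_patterns=["original/*"],
    local_dir="Meta-Llama-3-8B-Instruct",
)
```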
📚 Documentation
Model Details
- Model creator: Meta
- Original model: Meta-Llama-3-8B-Instruct
About 8-bit quantization using bitsandbytes
- QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314)
- Hugging Face blog post on 8-bit quantization with bitsandbytes: "A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes"
- bitsandbytes GitHub repository
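As a rough illustration of how a repository like this one can be produced, the sketch below loads the base model with LLM.int8() quantization and saves the 8-bit weights. This is an assumed workflow, not the author's documented steps, and 8-bit serialization requires reasonably recent transformers and bitsandbytes releases:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# LLM.int8() keeps outlier activation dimensions in higher precision;
# llm_int8_threshold controls what counts as an outlier (6.0 is the paper default)
bnb_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Save the quantized weights so they can be re-uploaded as a standalone repo
model.save_pretrained("Meta-Llama-3-8B-Instruct-bnb-8bit")
```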
Meta Llama 3 Original Model Card
Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B and 70B sizes. The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks. Further, in developing these models, Meta took great care to optimize helpfulness and safety.
Property | Details |
---|---|
Model developers | Meta |
Variations | Llama 3 comes in two sizes — 8B and 70B parameters — in pre-trained and instruction tuned variants. |
Input | Models input text only. |
Output | Models generate text and code only. |
Model Architecture | Llama 3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. |
Training Data | A new mix of publicly available online data. |
Params | 8B and 70B |
Context length | 8k |
GQA | Yes |
Token count | 15T+ |
Knowledge cutoff | March 2023 (8B); December 2023 (70B) |
Model Release Date | April 18, 2024. |
Status | This is a static model trained on an offline dataset. Future versions of the tuned models will be released as Meta improves model safety with community feedback. |
License | A custom commercial license is available at: https://llama.meta.com/llama3/license |
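With an 8k-token context window, it can be worth checking prompt length before generation. A small sketch (an illustrative addition, reusing pipeline and chat_messages from the Quick Start above):

```python
MAX_CONTEXT = 8192  # Llama 3 context length

# apply_chat_template with tokenize=True returns the prompt's token ids
token_ids = pipeline.tokenizer.apply_chat_template(
    chat_messages, add_generation_prompt=True, tokenize=True
)
budget = MAX_CONTEXT - len(token_ids)
print(f"prompt uses {len(token_ids)} tokens; {budget} left for generation")
```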
Intended Use
- Intended Use Cases: Llama 3 is intended for commercial and research use in English. Instruction tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3 Community License. Use in languages other than English.
- Note: Developers may fine-tune Llama 3 models for languages beyond English provided they comply with the Llama 3 Community License and the Acceptable Use Policy.
Hardware and Software
- Training Factors: Meta used custom training libraries, Meta's Research SuperCluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute.
- Carbon Footprint: Pretraining utilized a cumulative 7.7M GPU hours of computation on hardware of type H100 - 80GB (TDP of 700W). Estimated total emissions were 2290 tCO2eq, 100% of which were offset by Meta’s sustainability program.
Property | Details |
---|---|
Time (GPU hours) - Llama 3 8B | 1.3M |
Time (GPU hours) - Llama 3 70B | 6.4M |
Time (GPU hours) - Total | 7.7M |
Power Consumption (W) | 700 |
Carbon Emitted (tCO2eq) - Llama 3 8B | 390 |
Carbon Emitted (tCO2eq) - Llama 3 70B | 1900 |
Carbon Emitted (tCO2eq) - Total | 2290 |
Training Data
- Overview: Llama 3 was pretrained on over 15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 10M human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.
- Data Freshness: The pretraining data has a cutoff of March 2023 for the 8B and December 2023 for the 70B models respectively.
Benchmarks
This section reports results for the Llama 3 models on standard automatic benchmarks, all produced with Meta's internal evaluations library; details on the methodology are available in Meta's Llama 3 GitHub repository.
Base pretrained models
Category | Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
---|---|---|---|---|---|---|
General | MMLU (5-shot) | 66.6 | 45.7 | 53.8 | 79.5 | 69.7 |
General | AGIEval English (3-5 shot) | 45.9 | 28.8 | 38.7 | 63.0 | 54.8 |
General | CommonSenseQA (7-shot) | 72.6 | 57.6 | 67.6 | 83.8 | 78.7 |
General | Winogrande (5-shot) | 76.1 | 73.3 | 75.4 | 83.1 | 81.8 |
General | BIG-Bench Hard (3-shot, CoT) | 61.1 | 38.1 | 47.0 | 81.3 | 65.7 |
General | ARC-Challenge (25-shot) | 78.6 | 53.7 | 67.6 | 93.0 | 85.3 |
Knowledge reasoning | TriviaQA-Wiki (5-shot) | 78.5 | 72.1 | 79.6 | 89.7 | 87.5 |
Reading comprehension | SQuAD (1-shot) | 76.4 | 72.2 | 72.1 | 85.6 | 82.6 |
Reading comprehension | QuAC (1-shot, F1) | 44.4 | 39.6 | 44.9 | 51.1 | 49.4 |
Reading comprehension | BoolQ (0-shot) | 75.7 | 65.5 | 66.9 | 79.0 | 73.1 |
Reading comprehension | DROP (3-shot, F1) | 58.4 | 37.9 | 49.8 | 79.7 | 70.2 |
Instruction tuned models
Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
---|---|---|---|---|---|
MMLU (5-shot) | 68.4 | 34.1 | 47.8 | 82.0 | 52.9 |
GPQA (0-shot) | 34.2 | 21.7 | 22.3 | 39.5 | 21.0 |
HumanEval (0-shot) | 62.2 | 7.9 | 14.0 | 81.7 | 25.6 |
GSM-8K (8-shot, CoT) | 79.6 | 25.7 | 77.4 | 93.0 | 57.5 |
MATH (4-shot, CoT) | 30.0 | 3.8 | 6.7 | 50.4 | 11.6 |
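These figures come from Meta's internal harness and will not reproduce exactly elsewhere. For a comparable open-source measurement you could use EleutherAI's lm-evaluation-harness; the sketch below is an illustration under the assumption of lm-eval v0.4+ (`pip install lm-eval`), not the methodology Meta used:

```python
import lm_eval

# Run 5-shot MMLU against the 8-bit repo via the Hugging Face backend
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=alokabhishek/Meta-Llama-3-8B-Instruct-bnb-8bit",
    tasks=["mmlu"],
    num_fewshot=5,
)
print(results["results"])
```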
Responsibility & Safety
Meta believes that an open approach to AI leads to better, safer products, faster innovation, and a bigger overall market. Meta is committed to Responsible AI development and took a series of steps to limit misuse and harm and support the open source community.
As part of the Llama 3 release, Meta updated its Responsible Use Guide to outline the steps and best practices for developers to implement model and system level safety for their application. Meta also provides a set of resources including Meta Llama Guard 2 and Code Shield safeguards.
⚠️ Important Note
Foundation models are widely capable technologies that are built to be used for a diverse range of applications. They are not designed to meet every developer preference on safety levels for all use cases, out-of-the-box, as those by their nature will differ across different applications.
💡 Usage Tip
Developers should exercise discretion about how to weigh the benefits of alignment and helpfulness for their specific use case and audience. They should be mindful of residual risks when using Llama models and leverage additional safety tools as needed to reach the right safety bar for their use case.
Llama 3-Instruct
- Safety: For the instruction tuned model, Meta conducted extensive red teaming exercises, performed adversarial evaluations, and implemented safety mitigation techniques to lower residual risks. As with any large language model, residual risks will likely remain, and Meta recommends that developers assess these risks in the context of their use case.
- Refusals: Meta placed great emphasis on reducing false refusals of benign prompts. Llama 3 is significantly less likely than Llama 2 to falsely refuse to answer prompts; Meta built internal benchmarks and developed mitigations to limit false refusals, making Llama 3 Meta's most helpful model to date.
Responsible release
Meta followed a rigorous process that requires it to take extra measures against misuse and critical risks before making its release decision.
- Misuse: If you access or use Llama 3, you agree to the Acceptable Use Policy. The most recent copy of this policy can be found at https://llama.meta.com/llama3/use-policy/.
- Critical risks: Meta conducted a two-fold assessment of the safety of the model in areas such as CBRNE, Cyber Security, and Child Safety.
Community
Generative AI safety requires expertise and tooling, and Meta believes in the strength of the open community to accelerate its progress. Meta is an active member of open consortiums, including the AI Alliance, Partnership on AI, and MLCommons, actively contributing to the development of AI safety standards.
📄 License
The license for this model is `other`, with the license name `llama3`. See the LICENSE file in this repository for the full terms.

