Llama3-8B-1.58 Models
The Llama3-8B-1.58 models are large language models fine-tuned on the BitNet 1.58b architecture. They use extreme quantization and come close to Llama3 8B on some benchmarks.
Quick Start
You can easily load and test our model in Transformers. Just follow the steps below:
- Start by installing the Transformers build that can load BitNet models (pull request 33410):
pip install git+https://github.com/huggingface/transformers.git@refs/pull/33410/head
- Then load the model and generate an answer:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the 1.58-bit checkpoint in bfloat16 on the GPU.
model = AutoModelForCausalLM.from_pretrained("HF1BitLLM/Llama3-8B-1.58-100B-tokens", device_map="cuda", torch_dtype=torch.bfloat16)
# The tokenizer comes from the Llama-3-8B-Instruct base model.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
input_text = "Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:"
input_ids = tokenizer.encode(input_text, return_tensors="pt").cuda()
# max_new_tokens bounds only the answer; max_length would also count the prompt tokens.
output = model.generate(input_ids, max_new_tokens=10, do_sample=False)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
⨠Features
The Llama3-8B-1.58 models start from the base model Llama-3-8B-Instruct and are fine-tuned on the BitNet 1.58b architecture. For a deeper dive into the methods and results, see our blog post.
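To make "1.58-bit" concrete, here is a minimal sketch of ternary weight quantization. It assumes the absmean scheme described for BitNet b1.58 (scale by the mean absolute weight, round, clip to -1, 0, or 1); this README does not confirm that the released checkpoints use exactly this procedure. The function name `absmean_ternary` and the sample weights are hypothetical.

```python
import math


def absmean_ternary(weights, eps=1e-5):
    """Quantize a list of floats to {-1, 0, 1} using an absmean scale.

    Assumption: this mirrors the absmean scheme described for BitNet b1.58;
    it is an illustration, not the code used to train these models.
    """
    # The scale is the mean absolute value, floored at eps to avoid division by zero.
    scale = max(sum(abs(w) for w in weights) / len(weights), eps)
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


weights = [0.9, -0.05, 0.4, -1.2]  # hypothetical weights
ternary, scale = absmean_ternary(weights)
print(ternary, scale)

# Three possible values carry log2(3) bits of information, hence "1.58-bit".
print(math.log2(3))
```

Multiplying the ternary values by `scale` gives a coarse approximation of the original weights.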
Documentation
Model Details
Model Sources
Training Details
Training Data
The model was trained on a subset of FineWeb-edu.
Training Process
- Starting Point: Best-performing checkpoint from the 10 billion token runs with a linear lambda scheduler
- Training Duration: Fine-tuned for an additional 45,000 steps, reaching a total of 100 billion tokens
- Dataset: FineWeb-edu dataset
- Batch Size: 2 million tokens per step, so the 45,000 additional steps cover 90 billion tokens (45,000 steps * 2 million tokens); together with the initial 10 billion tokens, the total is 100 billion
- Learning Rate Experiments: Tested various learning rates to find the optimal setting. In these experiments, the best-performing peak learning rate was 1e-5
- Performance: Close to Llama3 8B on some metrics, but behind Llama3 8B in overall average performance
- Evaluation: Metrics included perplexity, MMLU scores, and other standard benchmarks
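The token budget in the list above can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones stated in the list.

```python
TOKENS_PER_STEP = 2_000_000      # batch size: 2 million tokens per step
EXTRA_STEPS = 45_000             # additional fine-tuning steps
INITIAL_TOKENS = 10_000_000_000  # tokens seen by the starting checkpoint

# Tokens covered by the extended run alone.
extended_tokens = EXTRA_STEPS * TOKENS_PER_STEP
# Tokens seen in total, including the initial 10 billion token run.
total_tokens = INITIAL_TOKENS + extended_tokens

print(f"extended run: {extended_tokens / 1e9:.0f}B tokens")
print(f"total:        {total_tokens / 1e9:.0f}B tokens")
```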
These extended training runs on 100 billion tokens pushed the boundaries of highly quantized models, bringing performance closer to half-precision models like Llama3.
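The starting checkpoint above came from runs with a linear lambda scheduler, which this README does not define. The sketch below assumes it means a coefficient that ramps linearly from 0 to 1 and blends each full-precision value with its quantized version, so quantization is introduced gradually. The names `linear_lambda`, `blend`, and `RAMP_STEPS`, and the ramp length itself, are hypothetical.

```python
def linear_lambda(step: int, ramp_steps: int) -> float:
    """Ramp linearly from 0 to 1 over ramp_steps, then stay at 1."""
    return min(step / ramp_steps, 1.0)


def blend(value: float, quantized: float, lam: float) -> float:
    """Interpolate between the full-precision value (lam=0) and its quantized version (lam=1)."""
    return value + lam * (quantized - value)


RAMP_STEPS = 1_000  # hypothetical ramp length, not taken from this README

for step in (0, 250, 500, 1_000, 5_000):
    lam = linear_lambda(step, RAMP_STEPS)
    # 0.37 stands in for a full-precision value whose quantized version is 0.0.
    print(step, lam, blend(0.37, 0.0, lam))
```

In an actual training loop the difference term would normally be excluded from the gradient (a straight-through estimator); that detail is omitted here to keep the sketch free of framework dependencies.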
Evaluation
The evaluation of the models is done on the nanotron checkpoints using LightEval:

Citation
@misc{,
title={1.58-Bit LLM: A New Era of Extreme Quantization},
author={Mohamed Mekkouri and Marc Sun and Leandro von Werra and Thomas Wolf},
year={2024},
}