Llama3-8B-1.58-100B-tokens Open Source Large Language Model

Llama3 8B 1.58 100B Tokens

Developed by HF1BitLLM

A large language model fine-tuned from the Llama-3-8B-Instruct base model onto the BitNet 1.58b architecture, using extreme quantization techniques

Large Language Model

Transformers

#1.58-bit quantization #Efficient fine-tuning #Education domain optimization

Downloads 2,427

Release Time: 9/10/2024

Model Overview

Llama3-8B-1.58 is a large language model that uses 1.58-bit quantization. It was trained on 100 billion tokens, which significantly reduces storage and compute requirements while keeping performance close to that of half-precision models on some metrics.
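As a rough illustration of the storage savings, the sketch below compares weight storage for an 8-billion-parameter model at 16 bits per weight with ternary weights. The parameter count is a round-number assumption, the 2-bit packing is one plausible storage layout rather than a confirmed detail of this checkpoint, and layers that may stay in higher precision (such as embeddings) are ignored, so real file sizes will differ.

```python
import math

PARAMS = 8.0e9  # assumed round figure for an 8B-parameter model


def weight_storage_gb(num_params, bits_per_weight):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Half precision (bf16/fp16) uses 16 bits per weight.
bf16_gb = weight_storage_gb(PARAMS, 16)
# Information-theoretic minimum for three-valued weights: log2(3) bits.
ternary_ideal_gb = weight_storage_gb(PARAMS, math.log2(3))
# Assumption: ternary values packed four to a byte (2 bits each).
ternary_packed_gb = weight_storage_gb(PARAMS, 2)

print(f"bf16:           {bf16_gb:.2f} GB")
print(f"ternary, ideal: {ternary_ideal_gb:.2f} GB")
print(f"ternary, 2-bit: {ternary_packed_gb:.2f} GB")
```

Under these assumptions the quantized weights take roughly one eighth to one tenth of the half-precision storage.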

Model Features

Extreme quantization technology

Employs 1.58-bit quantization architecture, significantly reducing model storage and computational requirements

Large-scale training

Trained on 100 billion tokens, bringing performance close to half-precision models on some metrics

Efficient inference

Reduces resource consumption while maintaining good performance
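The "1.58-bit" label comes from restricting each weight to three values (-1, 0, +1), which carries log2(3) ≈ 1.58 bits of information. The following is a minimal pure-Python sketch of the absmean rounding scheme described in the paper listed under Model Sources. The function names and sample weights are illustrative and are not taken from the model's own code.

```python
import math


def absmean_ternary_quantize(weights, eps=1e-5):
    """Quantize a list of floats to {-1, 0, +1} with one shared scale.

    Divides by the mean absolute value, rounds to the nearest integer,
    then clips the result to the range [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Map ternary codes back to approximate float weights."""
    return [t * scale for t in ternary]


# Three possible values per weight carry log2(3) bits of information.
bits_per_weight = math.log2(3)

weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]  # illustrative values only
ternary, scale = absmean_ternary_quantize(weights)
print(ternary)                     # codes drawn from -1, 0, +1
print(dequantize(ternary, scale))  # coarse reconstruction of the weights
print(round(bits_per_weight, 2))
```

Small weights collapse to 0 and large ones saturate at ±1, so only the sign pattern and a single scale per tensor need to be stored.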

Model Capabilities

Text generation

Q&A systems

Logical reasoning

Use Cases

Education

Reasoning Q&A

Solving multi-step reasoning problems, such as tracking character position changes

Capable of correctly answering reasoning questions involving multi-step position changes

Research

Quantization technology research

Exploring the performance boundaries of LLMs under extreme quantization conditions

Performance close to half-precision models

🚀 Llama3-8B-1.58 Models

The Llama3-8B-1.58 models are large language models fine-tuned on the BitNet 1.58b architecture, offering high performance with extreme quantization.

🚀 Quick Start

You can easily load and test our model in Transformers. Just follow the steps below:

Start by installing the transformers version with the correct configuration to load bitnet models:

pip install git+https://github.com/huggingface/transformers.git@refs/pull/33410/head

Then load the model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 1.58-bit model; the tokenizer comes from the base Llama 3 Instruct model.
model = AutoModelForCausalLM.from_pretrained("HF1BitLLM/Llama3-8B-1.58-100B-tokens", device_map="cuda", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

input_text = "Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:"

# max_new_tokens limits only the generated continuation; max_length would also count the prompt tokens.
input_ids = tokenizer.encode(input_text, return_tensors="pt").cuda()
output = model.generate(input_ids, max_new_tokens=10, do_sample=False)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
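To judge whether the generated answer is correct, it can help to compute the expected answer independently. The sketch below is a hypothetical reference tracker for position-tracking prompts of this kind. It is not part of the model or of Transformers, and its regular expression covers only the movement verbs that appear in the example prompt.

```python
import re

# Matches sentences such as "Mary went back to the garden".
MOVE = re.compile(r"(\w+) (?:went(?: back)?|travelled|journeyed) to the (\w+)")


def last_location(story, person):
    """Return the last place `person` moved to in `story`, or None."""
    location = None
    for name, place in MOVE.findall(story):
        if name == person:
            location = place  # later moves overwrite earlier ones
    return location


story = (
    "Daniel went back to the garden. Mary travelled to the kitchen. "
    "Sandra journeyed to the kitchen. Sandra went to the hallway. "
    "John went to the bedroom. Mary went back to the garden."
)
print(last_location(story, "Mary"))
```

Comparing this reference answer with the text the model appends after "Answer:" gives a quick correctness check for the example.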

✨ Features

The Llama3-8B-1.58 models are large language models fine-tuned on the BitNet 1.58b architecture, starting from the base model Llama-3-8B-Instruct. For a deeper dive into the methods and results, check out our blog post.

📚 Documentation

Model Details

Model Sources

Repository: Model
Paper: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Training Details

Training Data

The model was trained on a subset of FineWeb-edu

Training Process

Starting Point: Best-performing checkpoint from the 10 billion token runs with a linear lambda scheduler
Training Duration: Fine-tuned for an additional 45,000 steps, reaching a total of 100 billion tokens
Dataset: FineWeb-edu dataset
Batch Size: 2 million tokens per step, giving 90 billion tokens for this run (45,000 steps × 2 million tokens); together with the initial 10 billion tokens, this reaches 100 billion
Learning Rate Experiments: Tested various learning rates to find the optimal setting. In these experiments, the best-performing peak learning rate was 1e-5
Performance: Close to Llama3 8B on some metrics, but behind Llama3 8B in overall average performance
Evaluation: Metrics included perplexity, MMLU scores, and other standard benchmarks
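The token budget in the list above can be checked with a few lines of arithmetic. The constants are the figures quoted in the list; the variable names are only illustrative.

```python
TOKENS_PER_STEP = 2_000_000       # batch size quoted above
EXTRA_STEPS = 45_000              # additional fine-tuning steps
INITIAL_TOKENS = 10_000_000_000   # tokens already seen by the starting checkpoint

extra_tokens = EXTRA_STEPS * TOKENS_PER_STEP
total_tokens = INITIAL_TOKENS + extra_tokens

print(f"extra tokens: {extra_tokens / 1e9:.0f}B")
print(f"total tokens: {total_tokens / 1e9:.0f}B")
```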

These extended training runs on 100 billion tokens pushed the boundaries of highly quantized models, bringing performance closer to half-precision models like Llama3.
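The training process names a linear lambda scheduler without defining it. One way to read the term is as a coefficient that ramps linearly from 0 to 1 and blends full-precision values with their quantized counterparts, so that quantization is introduced gradually. The sketch below assumes that interpretation. The warmup length, function names, and sample values are hypothetical and are not taken from the training configuration, and the gradient handling a real training loop would need is omitted.

```python
def linear_lambda(step, warmup_steps=1000):
    """Ramp linearly from 0 to 1 over warmup_steps, then stay at 1."""
    return min(step / warmup_steps, 1.0)


def blend(full_precision, quantized, lam):
    """Interpolate element-wise between full-precision and quantized values."""
    return [w + lam * (q - w) for w, q in zip(full_precision, quantized)]


weights = [0.9, -0.05, 0.4]    # illustrative full-precision weights
quantized = [0.5, 0.0, 0.5]    # stand-in for their dequantized ternary values

for step in (0, 500, 1000, 2000):
    lam = linear_lambda(step)
    print(step, lam, blend(weights, quantized, lam))
```

At step 0 the layer behaves like the original full-precision model; once lambda reaches 1 it uses only the quantized values.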

Evaluation

The models were evaluated on the nanotron checkpoints using LightEval: results

📄 License

No license information provided in the original document.

📚 Citation

@misc{,
      title={1.58-Bit LLM: A New Era of Extreme Quantization}, 
      author={Mohamed Mekkouri and Marc Sun and Leandro von Werra and Thomas Wolf},
      year={2024},
}
