🚀 LLaMA-2-7B-32K
LLaMA-2-7B-32K is an open-source, long-context language model fine-tuned from Meta's Llama-2 7B model, aiming to contribute to the open-source large language model ecosystem.
🚀 Quick Start
You can use the Together API to try out LLaMA-2-7B-32K for inference. To run the model locally, install the dependencies listed under 📦 Installation and run the example script under 💻 Usage Examples below.
✨ Features
This model introduces several improvements and new features:
- Extended Context: The model handles context lengths of up to 32K tokens, a significant improvement over previous versions (a quick token-count check is sketched after this list).
- Pre-training and Instruction Tuning: The data recipe, a mixture of pre-training and instruction-tuning data, is shared.
- Fine-tuning Examples: Examples of fine-tuning the model for specific applications, such as book summarization and long-context question answering, are provided.
- Software Support: Both the inference and training stacks have been updated for efficient inference and fine-tuning with 32K context.
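As a quick illustration of the extended window, the snippet below tokenizes a long document and checks that it fits within the 32K context before generation. This is a minimal sketch: the 32,768-token figure and the file path are assumptions for illustration.

```python
from transformers import AutoTokenizer

# Illustrative path to a long document on disk.
with open("long_document.txt") as f:
    long_text = f.read()

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")

# Count tokens and compare against the 32K window (32,768 tokens assumed here).
n_tokens = len(tokenizer.encode(long_text))
max_context = 32 * 1024
print(f"{n_tokens} tokens; fits in the 32K window: {n_tokens <= max_context}")
```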
📦 Installation
To run the model locally, you need to install some dependencies. Here are the commands:
```bash
# Please update the path of `CUDA_HOME`
export CUDA_HOME=/usr/local/cuda-11.8
pip install transformers==4.31.0
pip install sentencepiece
pip install ninja
pip install flash-attn --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
```
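After installation, a quick sanity check (a minimal sketch, not part of the official instructions) is to confirm that CUDA is visible to PyTorch and that the FlashAttention extension imports cleanly:

```python
import torch

# Confirm a CUDA device is available for fp16 inference.
print("CUDA available:", torch.cuda.is_available())

# Confirm the FlashAttention kernels installed above can be imported.
import flash_attn
print("flash-attn version:", flash_attn.__version__)
```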
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt")

# Sample up to 128 new tokens; `do_sample=True` is needed for `temperature` to take effect.
output = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
```
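For long prompts, running the fp16 model on a GPU is effectively required. A hedged variant is shown below; it assumes the `accelerate` package is installed for `device_map="auto"` and that the model's custom code supports device placement.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",  # requires `pip install accelerate`
)

# Move the tokenized prompt to the same device as the model.
input_ids = tokenizer.encode("Your long text here", return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```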
Advanced Usage
Long Context QA
We take the multi-document question-answering task from the paper "Lost in the Middle: How Language Models Use Long Contexts" as an example. With OpenChatKit (OCK), you can fine-tune the model using the following command:
```bash
bash training/finetune_llama-2-7b-32k-mqa.sh
```
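For inference on this task, a prompt typically concatenates several retrieved documents followed by the question. The sketch below reuses the tokenizer and model loaded above and is purely illustrative; the document delimiters and wording are assumptions, not the exact template used during fine-tuning.

```python
# Hypothetical multi-document QA prompt; the template is an illustration only.
documents = [
    "Document 1 text ...",
    "Document 2 text ...",
    "Document 3 text ...",
]
question = "Your question here"

prompt = "\n\n".join(f"Document [{i + 1}]: {doc}" for i, doc in enumerate(documents))
prompt += f"\n\nQuestion: {question}\nAnswer:"

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens after the prompt.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```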
Summarization
For the BookSum dataset, which targets long-form narrative summarization, you can fine-tune the model with OpenChatKit (OCK) using the following command:
```bash
bash training/finetune_llama-2-7b-32k-booksum.sh
```
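At inference time, one straightforward way to request a summary is to append an instruction to the long input and decode greedily, again reusing the tokenizer and model loaded above. The instruction wording and file path here are assumptions, not the official BookSum prompt format.

```python
# Hypothetical summarization prompt; the instruction wording is an assumption.
with open("chapter.txt") as f:  # illustrative path to a long narrative text
    chapter = f.read()

prompt = f"{chapter}\n\nSummarize the text above in a few paragraphs:\n"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```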
📚 Documentation
Model Description
LLaMA-2-7B-32K is an open-source, long-context language model developed by Together, fine-tuned from Meta's original Llama-2 7B model. It aims to contribute to the open-source ecosystem of large language models. The model has been extended to a context length of 32K with position interpolation, enabling applications such as multi-document QA and long-text summarization.
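Position interpolation, roughly speaking, rescales token positions so that a 32K-long sequence is mapped back into the position range the base model was pre-trained on (4K for Llama-2). The sketch below illustrates the idea only; the scaling factor and function name are assumptions, not the model's actual implementation.

```python
import torch

def interpolated_position_ids(seq_len: int,
                              trained_ctx: int = 4096,
                              target_ctx: int = 32768) -> torch.Tensor:
    """Linearly rescale positions 0..seq_len-1 into the pre-trained range.

    With trained_ctx=4096 and target_ctx=32768 the scale is 1/8, so position
    32767 is seen by the rotary embeddings as (fractional) position ~4096.
    """
    scale = trained_ctx / target_ctx
    return torch.arange(seq_len, dtype=torch.float32) * scale

print(interpolated_position_ids(8))          # 0.000, 0.125, 0.250, ...
print(interpolated_position_ids(32768)[-1])  # ~4095.875
```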
Model Architecture
The model follows the architecture of Llama-2-7B and extends it to handle a longer context. It uses FlashAttention-2 and other optimizations to improve the speed and efficiency of inference and training.
Training and Fine-tuning
The model is trained with a mixture of pre-training and instruction-tuning data.
- Continued Pre-training: The data mixture contains 25% RedPajama Book, 25% RedPajama ArXiv (including abstracts), 25% other RedPajama data, and 25% UL2 Oscar Data (a sketch of this mixture appears below). Data shorter than 2K words is excluded to enhance long-context ability.
- Fine-tuning: We fine-tune the model to focus on its few-shot capacity under long context. The data includes 20% Natural Instructions (NI), 20% Public Pool of Prompts (P3), and 20% the Pile. All data is decontaminated against HELM core scenarios. We also incorporate 20% RedPajama-Data Book and 20% RedPajama-Data ArXiv to maintain the learned knowledge.
The example datasets are available in togethercomputer/Long-Data-Collections. You can use OpenChatKit to fine-tune your own 32K model over LLaMA-2-7B-32K; refer to OpenChatKit for step-by-step instructions.
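To make the pre-training mixture concrete, the following sketch interleaves RedPajama subsets with sampling probabilities using the `datasets` library. It is an illustration under assumptions, not the exact training recipe: the configuration names are guesses, "common_crawl" stands in for "other RedPajama data", the UL2 Oscar portion is omitted (it has no single public dataset id here), and the weights are renormalized accordingly.

```python
from datasets import load_dataset, interleave_datasets

# Illustrative only: config names are assumptions; streaming avoids a full download.
streams = [
    load_dataset("togethercomputer/RedPajama-Data-1T", "book", split="train", streaming=True),
    load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv", split="train", streaming=True),
    load_dataset("togethercomputer/RedPajama-Data-1T", "common_crawl", split="train", streaming=True),
]
weights = [0.25, 0.25, 0.25]                  # stated mixture weights (UL2 Oscar part omitted)
probs = [w / sum(weights) for w in weights]   # renormalize so the probabilities sum to 1

mixture = interleave_datasets(streams, probabilities=probs, seed=42)
print(next(iter(mixture)).keys())
```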
🔧 Technical Details
The model leverages FlashAttention-2 and a range of other optimizations to improve the speed and efficiency of inference and training. The training-data mixture and fine-tuning strategies are carefully designed to enhance the model's long-context ability. For example, in the continued pre-training phase, the inclusion of UL2 Oscar Data helps the model read and use long-range context, and during fine-tuning, data decontamination and specific data ratios are used to ensure the model's performance.
📄 License
The model is under the llama2 license.
| Property | Details |
|----------|---------|
| Model Type | LLaMA-2-7B-32K, an open-source long-context language model |
| Training Data | togethercomputer/RedPajama-Data-1T, togethercomputer/RedPajama-Data-Instruct, EleutherAI/pile, togethercomputer/Long-Data-Collections |
⚠️ Important Note
As with all language models, LLaMA-2-7B-32K may generate incorrect or biased content. It's important to keep this in mind when using the model.
💡 Usage Tip
Join us on Together Discord to get more support and share your experiences.