🚀 LLaMA-2-7B-32K
LLaMA-2-7B-32K is an open-source, long-context language model fine-tuned from Meta's Llama-2 7B model, aiming to contribute to the open-source large language model ecosystem.
🚀 Quick Start
You can use the Together API to try out LLaMA-2-7B-32K for inference. To run the model locally, install the dependencies listed under 📦 Installation and run the example script under 💻 Usage Examples below.
✨ Features
This model introduces several improvements and new features:
- Extended Context: The model handles context lengths of up to 32K tokens, a significant improvement over previous versions (a quick token-count check is sketched after this list).
- Pre-training and Instruction Tuning: The data recipe, a mixture of pre-training and instruction-tuning data, is shared.
- Fine-tuning Examples: Examples of fine-tuning the model for specific applications, such as book summarization and long-context question answering, are provided.
- Software Support: Both the inference and training stacks have been updated for efficient inference and fine-tuning with 32K context.
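As a quick illustration of the extended window, the snippet below tokenizes a long document and checks that it fits within the 32K context before generation. This is a minimal sketch: the 32,768-token figure and the file path are assumptions for illustration.

```python
from transformers import AutoTokenizer

# Illustrative path to a long document on disk.
with open("long_document.txt") as f:
    long_text = f.read()

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")

# Count tokens and compare against the 32K window (32,768 tokens assumed here).
n_tokens = len(tokenizer.encode(long_text))
max_context = 32 * 1024
print(f"{n_tokens} tokens; fits in the 32K window: {n_tokens <= max_context}")
```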
📦 Installation
To run the model locally, you need to install some dependencies. Here are the commands:
```bash
# Please update the path of `CUDA_HOME`
export CUDA_HOME=/usr/local/cuda-11.8
pip install transformers==4.31.0
pip install sentencepiece
pip install ninja
pip install flash-attn --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
```
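After installation, a quick sanity check (a minimal sketch, not part of the official instructions) is to confirm that CUDA is visible to PyTorch and that the FlashAttention extension imports cleanly:

```python
import torch

# Confirm a CUDA device is available for fp16 inference.
print("CUDA available:", torch.cuda.is_available())

# Confirm the FlashAttention kernels installed above can be imported.
import flash_attn
print("flash-attn version:", flash_attn.__version__)
```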
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt")

# Sample up to 128 new tokens; `do_sample=True` is needed for `temperature` to take effect.
output = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
```
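For long prompts, running the fp16 model on a GPU is effectively required. A hedged variant is shown below; it assumes the `accelerate` package is installed for `device_map="auto"` and that the model's custom code supports device placement.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",  # requires `pip install accelerate`
)

# Move the tokenized prompt to the same device as the model.
input_ids = tokenizer.encode("Your long text here", return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```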
Advanced Usage
Long Context QA
We take the multi-document question-answering task from the paper "Lost in the Middle: How Language Models Use Long Contexts" as an example. With OpenChatKit (OCK), you can fine-tune the model using the following command:
```bash
bash training/finetune_llama-2-7b-32k-mqa.sh
```
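For inference on this task, a prompt typically concatenates several retrieved documents followed by the question. The sketch below reuses the tokenizer and model loaded above and is purely illustrative; the document delimiters and wording are assumptions, not the exact template used during fine-tuning.

```python
# Hypothetical multi-document QA prompt; the template is an illustration only.
documents = [
    "Document 1 text ...",
    "Document 2 text ...",
    "Document 3 text ...",
]
question = "Your question here"

prompt = "\n\n".join(f"Document [{i + 1}]: {doc}" for i, doc in enumerate(documents))
prompt += f"\n\nQuestion: {question}\nAnswer:"

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens after the prompt.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```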
Summarization
For the BookSum dataset, which targets long-form narrative summarization, you can fine-tune the model with OpenChatKit (OCK) using the following command:
```bash
bash training/finetune_llama-2-7b-32k-booksum.sh
```
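At inference time, one straightforward way to request a summary is to append an instruction to the long input and decode greedily, again reusing the tokenizer and model loaded above. The instruction wording and file path here are assumptions, not the official BookSum prompt format.

```python
# Hypothetical summarization prompt; the instruction wording is an assumption.
with open("chapter.txt") as f:  # illustrative path to a long narrative text
    chapter = f.read()

prompt = f"{chapter}\n\nSummarize the text above in a few paragraphs:\n"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```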
📚 Documentation
Model Description
LLaMA-2-7B-32K is an open-source, long-context language model developed by Together, fine-tuned from Meta's original Llama-2 7B model. It aims to contribute to the open-source ecosystem of large language models. The model has been extended to a context length of 32K with position interpolation, enabling applications such as multi-document QA and long-text summarization.
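Position interpolation, roughly speaking, rescales token positions so that a 32K-long sequence is mapped back into the position range the base model was pre-trained on (4K for Llama-2). The sketch below illustrates the idea only; the scaling factor and function name are assumptions, not the model's actual implementation.

```python
import torch

def interpolated_position_ids(seq_len: int,
                              trained_ctx: int = 4096,
                              target_ctx: int = 32768) -> torch.Tensor:
    """Linearly rescale positions 0..seq_len-1 into the pre-trained range.

    With trained_ctx=4096 and target_ctx=32768 the scale is 1/8, so position
    32767 is seen by the rotary embeddings as (fractional) position ~4096.
    """
    scale = trained_ctx / target_ctx
    return torch.arange(seq_len, dtype=torch.float32) * scale

print(interpolated_position_ids(8))          # 0.000, 0.125, 0.250, ...
print(interpolated_position_ids(32768)[-1])  # ~4095.875
```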
Model Architecture
The model follows the architecture of Llama-2-7B and extends it to handle a longer context. It uses FlashAttention-2 and other optimizations to improve the speed and efficiency of inference and training.
Training and Fine-tuning
The model is trained with a mixture of pre-training and instruction-tuning data.
- Continued Pre-training: The data mixture contains 25% RedPajama Book, 25% RedPajama ArXiv (including abstracts), 25% other RedPajama data, and 25% UL2 Oscar Data (a sketch of this mixture appears below). Data shorter than 2K words is excluded to enhance long-context ability.
- Fine-tuning: We fine-tune the model to focus on its few-shot capacity under long context. The data includes 20% Natural Instructions (NI), 20% Public Pool of Prompts (P3), and 20% the Pile. All data is decontaminated against HELM core scenarios. We also incorporate 20% RedPajama-Data Book and 20% RedPajama-Data ArXiv to maintain the learned knowledge.
The example datasets are available in togethercomputer/Long-Data-Collections. You can use OpenChatKit to fine-tune your own 32K model over LLaMA-2-7B-32K; refer to OpenChatKit for step-by-step instructions.
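To make the pre-training mixture concrete, the following sketch interleaves RedPajama subsets with sampling probabilities using the `datasets` library. It is an illustration under assumptions, not the exact training recipe: the configuration names are guesses, "common_crawl" stands in for "other RedPajama data", the UL2 Oscar portion is omitted (it has no single public dataset id here), and the weights are renormalized accordingly.

```python
from datasets import load_dataset, interleave_datasets

# Illustrative only: config names are assumptions; streaming avoids a full download.
streams = [
    load_dataset("togethercomputer/RedPajama-Data-1T", "book", split="train", streaming=True),
    load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv", split="train", streaming=True),
    load_dataset("togethercomputer/RedPajama-Data-1T", "common_crawl", split="train", streaming=True),
]
weights = [0.25, 0.25, 0.25]                  # stated mixture weights (UL2 Oscar part omitted)
probs = [w / sum(weights) for w in weights]   # renormalize so the probabilities sum to 1

mixture = interleave_datasets(streams, probabilities=probs, seed=42)
print(next(iter(mixture)).keys())
```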
🔧 Technical Details
The model leverages FlashAttention-2 and a range of other optimizations to improve the speed and efficiency of inference and training. The training-data mixture and fine-tuning strategies are carefully designed to enhance the model's long-context ability. For example, in the continued pre-training phase, the inclusion of UL2 Oscar Data helps the model read and use long-range context, and during fine-tuning, data decontamination and specific data ratios are used to ensure the model's performance.
📄 License
The model is under the llama2 license.
| Property | Details |
|----------|---------|
| Model Type | LLaMA-2-7B-32K, an open-source long-context language model |
| Training Data | togethercomputer/RedPajama-Data-1T, togethercomputer/RedPajama-Data-Instruct, EleutherAI/pile, togethercomputer/Long-Data-Collections |
⚠️ Important Note
As with all language models, LLaMA-2-7B-32K may generate incorrect or biased content. It's important to keep this in mind when using the model.
💡 Usage Tip
Join us on Together Discord to get more support and share your experiences.