🚀 MistralLite Model
MistralLite is a fine-tuned Mistral-7B-v0.1 language model with enhanced long-context processing, supporting up to 32K tokens. By using an adapted Rotary Embedding and a sliding window during fine-tuning, MistralLite performs significantly better on several long-context retrieval and answering tasks while keeping the simple structure of the original model. It is useful for applications such as long-context line and topic retrieval, summarization, and question answering. MistralLite can be deployed on a single AWS g5.2x instance with a SageMaker [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) endpoint, making it suitable for high-performance applications in resource-constrained environments. It also supports other serving methods such as [vLLM](https://github.com/vllm-project/vllm), and can be used in Python with HuggingFace transformers and the [FlashAttention-2](https://github.com/Dao-AILab/flash-attention) library.
✨ Features
MistralLite is similar to Mistral-7B-Instruct-v0.1, and their similarities and differences are summarized in the following table:

| Property | Mistral-7B-Instruct-v0.1 | MistralLite |
|---|---|---|
| Base model | Mistral-7B-v0.1 | Mistral-7B-v0.1 |
| Fine-tuned on long contexts | up to 8K tokens | up to 16K tokens |
| Max context length | 32K | 32K |
| RotaryEmbedding adaptation | rope_theta = 10000 | rope_theta = 1000000 |
| Sliding Window Size | 4096 | 16384 |
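
These settings are stored in the published model configuration, so they can be checked directly. The snippet below is a minimal sketch using the transformers AutoConfig API; the attribute names follow the MistralConfig class, and the printed values are expectations based on the table above.

```python
from transformers import AutoConfig

# Sketch: read the long-context settings above from the model config.
# Assumes access to the Hugging Face Hub (or a local copy of amazon/MistralLite).
config = AutoConfig.from_pretrained("amazon/MistralLite")
print(config.rope_theta)               # adapted RoPE base, expected 1000000
print(config.sliding_window)           # sliding window used in fine-tuning, expected 16384
print(config.max_position_embeddings)  # maximum context length, expected 32768
```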
⚠️ Important Note
Use the following prompt template for MistralLite:
```
<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>
```
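
For example, a user question can be wrapped into this template with a small helper like the one below (the function name is illustrative and not part of the model package):

```python
def build_prompt(question: str) -> str:
    # Illustrative helper: wrap a question in the <|prompter|> ... </s><|assistant|> template.
    return f"<|prompter|>{question}</s><|assistant|>"

print(build_prompt("What are the main challenges to support a long context for LLM?"))
```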
📚 Documentation
Motivation for Developing MistralLite
Since the release of Mistral-7B-Instruct-v0.1, the model has become increasingly popular due to its strong performance on a wide range of benchmarks. However, most of these benchmarks are evaluated on short contexts, and little research has examined its performance on long-context tasks. We evaluated Mistral-7B-Instruct-v0.1 against benchmarks specifically designed to assess the capabilities of LLMs in handling longer contexts. Although the model's performance on contexts shorter than 4096 tokens was fairly competitive, there were limitations in its performance on longer contexts. Motivated by improving its long-context performance, we fine-tuned the Mistral 7B model and produced MistralLite. The model significantly boosts long-context handling compared to Mistral-7B-Instruct-v0.1. The detailed long-context evaluation results are as follows:
- [Topic Retrieval](https://lmsys.org/blog/2023-06-29-longchat/)

  | Model Name | Input length: 2851 | Input length: 5568 | Input length: 8313 | Input length: 11044 | Input length: 13780 |
  |----------|-------------:|-------------:|------------:|-----------:|-----------:|
  | Mistral-7B-Instruct-v0.1 | 100% | 50% | 2% | 0% | 0% |
  | MistralLite | 100% | 100% | 100% | 100% | 98% |

- [Line Retrieval](https://lmsys.org/blog/2023-06-29-longchat/#longeval-results)

  | Model Name | Input length: 3818 | Input length: 5661 | Input length: 7505 | Input length: 9354 | Input length: 11188 | Input length: 12657 |
  |----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
  | Mistral-7B-Instruct-v0.1 | 98% | 62% | 42% | 42% | 32% | 30% |
  | MistralLite | 98% | 92% | 88% | 76% | 70% | 60% |

- [Pass key Retrieval](https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101)

  | Model Name | Input length: 3264 | Input length: 5396 | Input length: 8329 | Input length: 10197 |
  |----------|-------------:|-------------:|------------:|-----------:|
  | Mistral-7B-Instruct-v0.1 | 100% | 50% | 20% | 30% |
  | MistralLite | 100% | 100% | 100% | 100% |

- [Question Answering with Long Input Texts](https://nyu-mll.github.io/quality/)

  | Model Name | Test set Accuracy | Hard subset Accuracy |
  |----------|-------------:|-------------:|
  | Mistral-7B-Instruct-v0.1 | 44.3% | 39.7% |
  | MistralLite | 64.4% | 56.2% |
Example test on a long context of 13400 tokens
- Context is from Amazon Aurora FAQs
- Question: please tell me how does pgvector help with Generative AI and give me some examples.
- Answer from MistralLite:

  pgvector is an open-source extension for PostgreSQL supported by Amazon Aurora PostgreSQL-Compatible Edition. You can use pgvector to store, search, index, and query billions of embeddings that are generated from machine learning (ML) and artificial intelligence (AI) models in your database, such as those from Amazon Bedrock (limited preview) or Amazon SageMaker. A vector embedding is a numerical representation that represents the semantic meaning of content such as text, images, and video. With pgvector, you can query embeddings in your Aurora PostgreSQL database to perform efficient semantic similarity searches of these data types, represented as vectors, combined with other tabular data in Aurora. This enables the use of generative AI and other AI/ML systems for new types of applications such as personalized recommendations based on similar text descriptions or images, candidate match based on interview notes, customer service next best action recommendations based on successful transcripts or chat session dialogs, and more.
Model Details
| Property | Details |
|---|---|
| Developed by | [AWS Contributors](https://github.com/orgs/aws-samples/teams/aws-prototype-ml-apac) |
| Model type | Mistral-7B-v0.1 |
| Language | English |
| Finetuned from weights | Mistral-7B-v0.1 |
| Finetuned on data | SLidingEncoder and Decoder (SLED); [(Long) Natural Questions (NQ)](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections#multi-passage-qa-from-natural-questions); OpenAssistant Conversations Dataset (OASST1) |
| Supported Serving Framework | [Text-Generation-Inference 1.1.0](https://github.com/huggingface/text-generation-inference/tree/v1.1.0); [vLLM](https://github.com/vllm-project/vllm); HuggingFace transformers; [HuggingFace Text Generation Inference (TGI) container on SageMaker](https://github.com/awslabs/llm-hosting-container) |
| Model License | Apache 2.0 |
| Contact | [GitHub issues](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/issues) |
| Inference Code | [GitHub Repo](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/) |
MistralLite LM-Eval Results
Methodology
- Please see https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- revision = 4ececff
- Note: we used --model hf-causal-experimental instead of --model hf-causal
Results
| Average | hellaswag | arc_challenge | truthful_qa (mc2) | MMLU (acc) |
|---|---|---|---|---|
| 0.57221 | 0.81617 | 0.58874 | 0.38275 | 0.5012 |
💻 Usage Examples
How to Use MistralLite from Python Code (HuggingFace transformers)
⚠️ Important Note
For an end-to-end example Jupyter notebook, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/huggingface-transformers/example_usage.ipynb).
Install the necessary packages
Requires: transformers 4.34.0 or later, [flash-attn](https://pypi.org/project/flash-attn/) 2.3.1.post1 or later, and accelerate 0.23.0 or later.

```bash
pip install transformers==4.34.0
pip install flash-attn==2.3.1.post1 --no-build-isolation
pip install accelerate==0.23.0
```
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import torch

model_id = "amazon/MistralLite"

# Load the tokenizer and the model with FlashAttention-2 in bfloat16.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.bfloat16,
                                             use_flash_attention_2=True,
                                             device_map="auto",)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Prompts must follow the <|prompter|> ... </s><|assistant|> template.
prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"

sequences = pipeline(
    prompt,
    max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"{seq['generated_text']}")
```
⚠️ Important Note
Use the following prompt template for MistralLite:
```
<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>
```
How to Serve MistralLite on TGI
⚠️ Important Note
- For an end-to-end example Jupyter notebook using the native TGI container, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/tgi/example_usage.ipynb).
- If the input context length is greater than 12K tokens, it is recommended to use a custom TGI container; please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/tgi-custom/example_usage.ipynb).
Start TGI server
Use TGI version 1.1.0 or later. The official Docker container is ghcr.io/huggingface/text-generation-inference:1.1.0.

Example Docker parameters:

```bash
docker run -d --gpus all --shm-size 1g -p 443:80 -v $(pwd)/models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 \
      --model-id amazon/MistralLite \
      --max-input-length 16000 \
      --max-total-tokens 16384 \
      --max-batch-prefill-tokens 16384 \
      --trust-remote-code
```
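
Once the container is running, the endpoint can be sanity-checked with a plain HTTP request before moving to the client library in the next section. This is a minimal sketch that assumes the port mapping from the command above (container port 80 published on host port 443, plain HTTP) and uses TGI's standard /generate route:

```python
import requests

# Sketch: call the TGI /generate route directly; host and port follow the docker command above.
response = requests.post(
    "http://localhost:443/generate",
    json={
        "inputs": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
        "parameters": {"max_new_tokens": 400, "do_sample": False},
    },
    timeout=60,
)
print(response.json()["generated_text"])
```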
Perform Inference
Example Python code for inference with TGI (requires text_generation 0.6.1 or later):

```bash
pip install text_generation==0.6.1
```
```python
from text_generation import Client

SERVER_PORT = 443
SERVER_HOST = "localhost"
SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
tgi_client = Client(f"http://{SERVER_URL}", timeout=60)

def invoke_tgi(prompt,
               random_seed=1,
               max_new_tokens=400,
               print_stream=True,
               assist_role=True):
    # Wrap the question in the MistralLite prompt template, then stream tokens from TGI.
    if (assist_role):
        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
    output = ""
    for response in tgi_client.generate_stream(
        prompt,
        do_sample=False,
        max_new_tokens=max_new_tokens,
        return_full_text=False,
        #temperature=None,
        #truncate=None,
        #seed=random_seed,
        #typical_p=0.2,
    ):
        if hasattr(response, "token"):
            if not response.token.special:
                snippet = response.token.text
                output += snippet
                if (print_stream):
                    print(snippet, end='', flush=True)
    return output

prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_tgi(prompt)
```
⚠️ Important Note
When using MistralLite for inference for the first time, there may be a brief warm-up period that can take tens of seconds. Subsequent inferences are faster and return results more promptly. This warm-up is normal and does not affect overall system performance once initialization has completed.
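
MistralLite can also be served with vLLM, which is listed among the supported serving frameworks above. The following is a minimal offline-inference sketch using vLLM's LLM and SamplingParams classes; the sampling values are illustrative, and you should check the vLLM documentation for version compatibility with the model:

```python
from vllm import LLM, SamplingParams

# Sketch: offline batch inference with vLLM; sampling parameters are illustrative.
prompts = ["<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"]
sampling_params = SamplingParams(temperature=0, max_tokens=400)

llm = LLM(model="amazon/MistralLite")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```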
How to Deploy MistralLite on Amazon SageMaker
⚠️ Important Note
- For an end-to-end example Jupyter notebook using the SageMaker built-in container, please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/sagemaker-tgi/example_usage.ipynb).
- If the input context length is greater than 12K tokens, it is recommended to use a custom Docker container; please refer to [this link](https://github.com/awslabs/extending-the-context-length-of-open-source-llms/blob/main/MistralLite/sagemaker-tgi-custom/example_usage.ipynb).
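
For reference, a deployment with the SageMaker Python SDK and the HuggingFace LLM (TGI) container typically looks like the sketch below. The instance type, container version, and environment values are illustrative assumptions; the linked notebooks contain the tested settings.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Sketch only: values below are assumptions; see the linked notebooks for tested settings.
role = sagemaker.get_execution_role()  # assumes execution inside SageMaker
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "amazon/MistralLite",
        "MAX_INPUT_LENGTH": "16000",
        "MAX_TOTAL_TOKENS": "16384",
        "MAX_BATCH_PREFILL_TOKENS": "16384",
    },
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
)

result = predictor.predict({
    "inputs": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
    "parameters": {"max_new_tokens": 400, "do_sample": False},
})
print(result)
```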
📄 License
The model is licensed under Apache 2.0.

