Elastic-Llama-3.1-8B-Instruct Open Source Model - Supports Self-Deployment with Optional Variants of Diverse Precision and Speed

Elastic Llama 3.1 8B Instruct

Developed by TheStageAI

An elastically optimized version of Meta-Llama-3.1-8B-Instruct, offering model variants with different speed and precision levels, suitable for self-deployment scenarios.

Large Language Model Open Source License: Apache-2.0 #Elastic Inference #Multilingual Generation #Quantization Optimization

Release Time: 4/13/2025

Model Overview

This model is a quantized version of Meta-Llama-3.1-8B-Instruct, generated via ANNA (Automated Neural Network Accelerator), providing four optimized variants: XL, L, M, and S. Users can flexibly choose between speed and quality based on their needs.

Model Features

Elastic Adjustment

Easily adjust model size, latency, and quality with a simple slider control, offering four optimized variants: XL, L, M, and S.

High-Performance Optimization

The XL variant is compiled with a DNN compiler into a mathematically equivalent neural network, which speeds up inference without changing quality. The L, M, and S variants trade a small amount of accuracy for additional speed.

Multi-Hardware Support

Supports various hardware platforms, including H100/L40s GPUs and AMD/Intel CPUs, with pre-compilation eliminating the need for just-in-time (JIT) compilation.

Compatibility

Compatible with HF libraries (transformers/diffusers), callable with a single line of code, and supports multilingual text generation.

Model Capabilities

Multilingual Text Generation

High-Quality Inference

Low-Latency Response

Elastic Model Adjustment

Use Cases

Search Engines

Q&A Systems

Serves as a search engine to answer user queries, providing high-quality multilingual responses.

The S variant scores 65.8 on MMLU, a general-knowledge benchmark, compared with 68.2 for the original model.

Education

Concept Explanation

Explains complex concepts, such as the basic principles of DNN quantization.

The S variant scores 77.6 on PIQA, a physical commonsense reasoning benchmark, compared with 79.8 for the original model.

🚀 Elastic model: Meta-Llama-3.1-8B-Instruct. Fastest and most flexible models for self-serving.

Elastic models are designed to offer the fastest and most flexible solutions for self-serving. They are produced by TheStage AI ANNA (Automated Neural Networks Accelerator), which lets you control model size, latency, and quality by moving a single slider. For each base model, ANNA generates a series of optimized models:

XL: A mathematically equivalent neural network, optimized using our DNN compiler.
L: A near-lossless model, with less than 1% degradation on corresponding benchmarks.
M: A faster model, with accuracy degradation less than 1.5%.
S: The fastest model, with accuracy degradation less than 2%.
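
As a rough illustration of how the four variants trade speed for quality, the sketch below picks the fastest variant whose stated degradation bound fits a given budget. The helper `pick_mode` and both lookup tables are hypothetical and not part of elastic_models; the bounds are the nominal figures from the list above, and the degradation actually measured on a particular benchmark may differ, so check the published benchmarks for your task.

```python
# Nominal upper bounds on accuracy degradation (percent), taken from the list above.
# XL is described as mathematically equivalent, so its bound is assumed to be 0.
DEGRADATION_BOUND = {"XL": 0.0, "L": 1.0, "M": 1.5, "S": 2.0}

# Variants ordered from fastest to slowest, following the descriptions above.
FASTEST_FIRST = ["S", "M", "L", "XL"]


def pick_mode(max_degradation_pct: float) -> str:
    """Return the fastest variant whose nominal bound fits the budget.

    Hypothetical helper for illustration only.
    """
    if max_degradation_pct < 0:
        raise ValueError("max_degradation_pct must be non-negative")
    for mode in FASTEST_FIRST:
        if DEGRADATION_BOUND[mode] <= max_degradation_pct:
            return mode
    return "XL"  # not reached: XL has a bound of 0 and always fits


print(pick_mode(1.2))
```

With a budget of 1.2%, the S and M bounds are too loose, so the helper returns "L". The returned string could then be passed as the `mode` argument when loading the model.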

Goals of elastic models

Provide flexibility in cost vs quality selection for inference.
Offer clear quality and latency benchmarks.
Provide an interface with HF libraries (transformers and diffusers) in a single line of code.
Support a wide range of hardware, with pre-compiled models that require no JIT.
Provide the best models and service for self-hosting.

⚠️ Important Note

Quality degradation varies from model to model. For instance, an S model may show as little as 0.5% degradation.

🚀 Quick Start

📦 Installation

To work with our models, follow these steps:

Install the necessary packages:

pip install thestage
pip install elastic_models[nvidia] --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple --extra-index-url https://pypi.nvidia.com --extra-index-url https://pypi.org/simple

pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall apex

Go to app.thestage.ai, log in, and generate an API token from your profile page. Then set up the API token:

thestage config set --api-token <YOUR_API_TOKEN>

💻 Usage Examples

Basic Usage

To run inference with our models, replace the transformers import with elastic_models.transformers:

import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# A Hugging Face token is currently required because the original
# weights are used for some layers, along with the original
# model configuration
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
hf_token = ''  # set your Hugging Face access token here
device = torch.device("cuda")

# Create the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference works the same way as with the transformers library
prompt = "Describe basics of DNNs quantization."
messages = [
  {
    "role": "system",
    "content": "You are a search bot, answer on user text queries."
  },
  {
    "role": "user",
    "content": prompt
  }
]

chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = tokenizer(chat_prompt, return_tensors="pt").to(device)

with torch.inference_mode():
    # max_length counts the prompt tokens plus the generated tokens
    generate_ids = model.generate(**inputs, max_length=500)

input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=False
)[0]

# Print the question and the generated answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
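
To check throughput on your own hardware, a minimal timing sketch follows. `measure_tok_per_s` is a hypothetical helper, not part of elastic_models or transformers. It assumes generation is wrapped in a function that returns the number of newly generated tokens, and it uses a stand-in function here so that the example runs without a GPU.

```python
import time
from typing import Callable, Optional


def measure_tok_per_s(
    generate_fn: Callable[[], int],
    warmup: int = 1,
    runs: int = 3,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Average generated tokens per second over `runs` timed calls."""
    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        generate_fn()
    if sync is not None:
        sync()  # wait for pending device work before starting the clock

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate_fn()
    if sync is not None:
        sync()  # wait for device work to finish before stopping the clock
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


# Stand-in for a real generation call: pretends to produce 10 tokens in about 10 ms.
def fake_generate() -> int:
    time.sleep(0.01)
    return 10


rate = measure_tok_per_s(fake_generate)
print(f"{rate:.0f} tok/s")
```

With the model loaded as above, `generate_fn` could call `model.generate(**inputs, max_length=500)` and return `generate_ids.shape[1] - input_len`, with `sync=torch.cuda.synchronize` so that asynchronous GPU work is included in the measurement.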

System Requirements

Property	Details
GPUs	H100, L40s
CPU	AMD, Intel
Python	3.10 - 3.12

📚 Documentation

📊 Benchmarks

Benchmarking is a crucial step in model acceleration, and we aim to provide clear performance metrics for models produced with our algorithms. The "W8A8, int8" column is a baseline in which W8A8 quantization with the int8 data type is applied to all linear layers, using the same calibration data as for ANNA. The S model runs at practically the same speed as this baseline but with much higher quality, because ANNA improves quantization quality on sensitive layers.

Quality benchmarks

Metric/Model	S	M	L	XL	Original	W8A8, int8
MMLU	65.8	66.8	67.5	68.2	68.2	24.3
PIQA	77.6	79.3	79.8	79.8	79.8	64.6
Arc Challenge	50.7	50.3	52.3	51.7	51.7	29.6
Winogrande	72.5	72.0	73.3	73.9	73.9	62.8

MMLU: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows the model's ability to handle diverse academic topics.
PIQA: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows the model's understanding of real-world physics concepts.
Arc Challenge: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows the model's ability to solve complex reasoning tasks.
Winogrande: Evaluates commonsense reasoning through sentence completion tasks. Shows the model's capability to understand context and resolve ambiguity.
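
To compare each variant against the original model, the scores in the quality table can be turned into per-benchmark drops. The sketch below assumes a relative drop computed separately for each benchmark; the source does not say whether its stated degradation bounds are relative or absolute, or whether they are averaged across benchmarks. `relative_drop_pct` is a hypothetical helper.

```python
# Scores copied from the quality table above (W8A8 baseline omitted).
SCORES = {
    "MMLU": {"S": 65.8, "M": 66.8, "L": 67.5, "XL": 68.2, "Original": 68.2},
    "PIQA": {"S": 77.6, "M": 79.3, "L": 79.8, "XL": 79.8, "Original": 79.8},
    "Arc Challenge": {"S": 50.7, "M": 50.3, "L": 52.3, "XL": 51.7, "Original": 51.7},
    "Winogrande": {"S": 72.5, "M": 72.0, "L": 73.3, "XL": 73.9, "Original": 73.9},
}


def relative_drop_pct(score: float, baseline: float) -> float:
    """Relative drop versus the baseline, in percent (negative means improvement)."""
    return 100.0 * (baseline - score) / baseline


for benchmark, row in SCORES.items():
    baseline = row["Original"]
    drops = {
        variant: round(relative_drop_pct(score, baseline), 2)
        for variant, score in row.items()
        if variant != "Original"
    }
    print(benchmark, drops)
```

A negative value means the variant scored above the original on that benchmark, as the L variant does on Arc Challenge.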

Latency benchmarks

Throughput in tokens per second (tok/s) with 100 input tokens and 300 output tokens:

GPU/Model	S	M	L	XL	Original	W8A8, int8
H100	189	175	159	132	60	191
L40s	73	64	57	45	40	77
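
The throughput figures can be expressed as speedups over the original model. The sketch below does that arithmetic on the numbers from the table; `speedup` is a hypothetical helper and the dictionary simply restates the table.

```python
# Throughput in tok/s copied from the latency table above.
TOK_PER_S = {
    "H100": {"S": 189, "M": 175, "L": 159, "XL": 132, "Original": 60, "W8A8, int8": 191},
    "L40s": {"S": 73, "M": 64, "L": 57, "XL": 45, "Original": 40, "W8A8, int8": 77},
}


def speedup(gpu: str, variant: str) -> float:
    """Throughput of `variant` divided by the original model's throughput on `gpu`."""
    row = TOK_PER_S[gpu]
    return row[variant] / row["Original"]


for gpu in TOK_PER_S:
    for variant in ("S", "M", "L", "XL"):
        print(f"{gpu} {variant}: {speedup(gpu, variant):.2f}x")
```

For example, on H100 the S variant comes out at 189 / 60 = 3.15 times the original throughput, and on L40s the XL variant at 45 / 40 = 1.125 times.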

📄 License

This project is licensed under the Apache 2.0 license.

🔗 Links

Platform: app.thestage.ai
Subscribe for updates: TheStageAI X
Contact email: contact@thestage.ai
