# Elastic model: Qwen2.5-7B-Instruct. Fastest and most flexible models for self-serving.
Elastic models are produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA enables you to control model size, latency, and quality with a simple slider movement. For each model, ANNA generates a series of optimized models:
- XL: A mathematically equivalent neural network, optimized by our DNN compiler.
- L: A near-lossless model, with less than 1% degradation on the corresponding benchmarks.
- M: A faster model, with accuracy degradation less than 1.5%.
- S: The fastest model, with accuracy degradation less than 2%.
Goals of elastic models:
- Offer flexibility in cost vs quality selection for inference.
- Provide clear quality and latency benchmarks.
- Provide an interface for HF libraries (transformers and diffusers) with a single line of code.
- Provide models supported on a wide range of hardware, which are pre-compiled and do not require JIT compilation.
- Provide the best models and service for self-hosting.
## Important Note
The actual quality degradation varies from model to model; for instance, an S model can show as little as 0.5% degradation.

## Quick Start
⨠Features
Elastic models come as a series of optimized variants with different accuracy and speed trade-offs, so users can choose according to their needs. They also offer clear benchmarks and an easy-to-use interface for HF libraries.
### Installation
To work with our models, you need to run the following commands in your terminal:
```bash
pip install thestage
pip install elastic_models[nvidia] \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple
pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall apex
```
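If you want a quick sanity check that the package is importable (a convenience step, assuming the install exposes the `elastic_models` module used in the example below):

```bash
python -c "import elastic_models.transformers; print('elastic_models OK')"
```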
Then go to app.thestage.ai, log in, and generate an API token from your profile page. Set up the API token as follows:
```bash
thestage config set --api-token <YOUR_API_TOKEN>
```
### Usage Examples
#### Basic Usage
To run inference with our models, simply replace the `transformers` import with `elastic_models.transformers`:
```python
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

model_name = "Qwen/Qwen2.5-7B-Instruct"
hf_token = ''
device = torch.device("cuda")

# The tokenizer comes from the original HF repository.
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)

# `mode` selects the elastic variant: 'XL', 'L', 'M' or 'S'.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

prompt = "Describe basics of DNNs quantization."
messages = [
    {
        "role": "system",
        "content": "You are a search bot, answer on user text queries."
    },
    {
        "role": "user",
        "content": prompt
    }
]

# Build the chat prompt and tokenize it.
chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
inputs = tokenizer(chat_prompt, return_tensors="pt")
inputs = inputs.to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

# Decode only the newly generated tokens, skipping the prompt.
input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```
## Technical Details
System requirements:
- GPUs: H100, L40S
- CPUs: AMD, Intel
- Python: 3.10-3.12
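A quick way to check whether the current environment matches these requirements (a convenience sketch, not part of the package):

```python
import sys
import torch

print("Python:", sys.version.split()[0])          # expected: 3.10-3.12
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # expected: H100 or L40S
```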
## Documentation
### Benchmarks
Benchmarking is a crucial part of model acceleration, and we aim to provide clear performance metrics for models produced with our algorithms. The `W8A8, int8` column shows the result of applying W8A8 quantization with the int8 data type to all linear layers, using the same calibration data as ANNA. The S model achieves practically identical speed but much higher quality, because ANNA improves quantization quality on the most sensitive layers.
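For intuition, below is a minimal sketch of what W8A8 (int8 weights and activations) quantization of a single linear layer looks like; it is an illustrative simplification, not ANNA's or the baseline's actual implementation (per-channel scales, calibration, and fused int8 GPU kernels are omitted):

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor int8 quantization: returns the int8 tensor and its scale."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # W8A8: both activations (A8) and weights (W8) pass through int8.
    # Real int8 kernels accumulate in int32 on the GPU; here we dequantize and
    # use a float matmul purely to show the numerical effect of quantization.
    qx, sx = quantize_int8(x)
    qw, sw = quantize_int8(weight)
    return (qx.float() * sx) @ (qw.float() * sw).t()

# Compare the quantized layer against the full-precision reference.
x = torch.randn(4, 512)
w = torch.randn(256, 512)
err = (w8a8_linear(x, w) - x @ w.t()).abs().max()
print(f"max abs error: {err:.4f}")
```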
#### Quality benchmarks
| Property | Details |
|---|---|
| Model Type | Elastic models based on Qwen2.5-7B-Instruct |
| Training Data | Not specified |
| Metric/Model | S | M | L | XL | Original | W8A8, int8 |
|---|---|---|---|---|---|---|
| arc_challenge | 49.10 | 50.10 | 53.20 | 52.60 | 52.60 | 41.70 |
| mmlu | 71.70 | 73.00 | 74.10 | 73.50 | 73.50 | 64.60 |
| piqa | 77.00 | 78.20 | 78.80 | 79.50 | 79.50 | 67.10 |
| winogrande | 66.20 | 69.10 | 71.50 | 70.60 | 70.60 | 53.10 |
- MMLU: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows the model's ability to handle diverse academic topics.
- PIQA: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows the model's understanding of real-world physics concepts.
- Arc Challenge: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows the model's ability to solve complex reasoning tasks.
- Winogrande: Evaluates commonsense reasoning through sentence completion tasks. Shows the model's capability to understand context and resolve ambiguity.
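The card does not state which harness produced these scores. As an assumption, a common way to reproduce metrics of this kind for the original HF model is EleutherAI's lm-evaluation-harness, for example:

```bash
# Hypothetical reproduction command for the original model (harness not specified in this card).
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,dtype=bfloat16 \
  --tasks arc_challenge,mmlu,piqa,winogrande \
  --batch_size 8
```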
#### Latency benchmarks
Throughput measured with 100 input tokens and 300 output tokens, in tokens per second (tok/s):
| GPU/Model | S | M | L | XL | Original | W8A8, int8 |
|---|---|---|---|---|---|---|
| H100 | 201 | 173 | 162 | 135 | 62 | 201 |
| L40S | 76 | 67 | 61 | 47 | 43 | 78 |
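Below is a minimal sketch of how a tokens-per-second figure like this can be measured; the exact benchmarking setup behind the table is not specified in this card, so the warm-up, batch size, and prompt length here are assumptions:

```python
import time
import torch

# Assumes `model`, `tokenizer`, and `device` are set up as in the usage example above.
prompt = "Describe basics of DNNs quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(device)
max_new_tokens = 300  # matches the 300-output-token setting of the table

# Warm-up run so CUDA kernels and caches are initialized before timing.
with torch.inference_mode():
    model.generate(**inputs, max_new_tokens=max_new_tokens)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

generated = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{generated / elapsed:.1f} tok/s")
```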
## License
The model is licensed under the Apache 2.0 license.
## Links