Elastic-Mistral-7B-Instruct-v0.3 Open-Source Model - Free Deployment, Supports Multilingual Text Generation

Elastic Mistral 7B Instruct V0.3

Developed by TheStageAI

Mistral-7B-Instruct-v0.3 is an instruction-tuned model based on Mistral-7B, supporting multilingual text generation tasks.

Large Language Model. Open-source license: Apache-2.0. Tags: Elastic Inference, Multilingual Generation, Quantization Acceleration


Release date: 4/2/2025

Model Overview

This is a 7B-parameter, instruction-tuned large language model suited to text generation in multiple languages. With elastic model technology, users can choose among several optimized versions to balance inference speed against output quality.

Model Features

Elastic Model Optimization

Offers four optimized versions (XL, L, M, S), so users can trade off model size, latency, and quality according to their requirements.

Multilingual Support

Supports text generation in 13 languages, including major languages such as Chinese, English, and French.

High-performance Inference

Reaches up to 186 tokens/s on an H100 GPU (S variant), compared with 48 tokens/s for the original model.

Ease of Use

Compatible with the Hugging Face transformers library; switching between optimized versions takes a single line of code.

Model Capabilities

Multilingual Text Generation

Instruction Understanding and Execution

Knowledge Q&A

Content Creation

Use Cases

Intelligent Assistant

Search Engine Assistant

Responds to user queries with accurate information.

As shown in examples, it can generate professional responses that fit the context.

Education

Concept Explanation

Explains professional concepts and principles.

Can clearly explain technical concepts such as DNN quantization.

🚀 Elastic model: Mistral-7B-Instruct-v0.3

Fastest and most flexible models for self-serving.

Elastic models are produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA enables you to control model size, latency, and quality with a simple slider movement. For each model, ANNA generates a series of optimized models:

XL: A mathematically equivalent neural network, optimized with our DNN compiler.
L: A near-lossless model, with less than 1% degradation on corresponding benchmarks.
M: A faster model, with accuracy degradation less than 1.5%.
S: The fastest model, with accuracy degradation less than 2%.
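
The tier descriptions above can be read as a simple lookup. The following is a minimal sketch, not part of the elastic_models package: the helper name `pick_mode` and both tables are hypothetical, and the bounds are the nominal figures from the list, while actual degradation varies from model to model.

```python
# Nominal upper bounds on benchmark degradation per tier (percent),
# taken from the list above. XL is described as mathematically
# equivalent, so its bound is assumed to be 0 here.
MAX_DEGRADATION = {"XL": 0.0, "L": 1.0, "M": 1.5, "S": 2.0}

# Tiers ordered from fastest to slowest, following the descriptions above.
FASTEST_FIRST = ["S", "M", "L", "XL"]


def pick_mode(tolerated_degradation):
    """Return the fastest tier whose nominal bound fits the tolerance."""
    for mode in FASTEST_FIRST:
        if MAX_DEGRADATION[mode] <= tolerated_degradation:
            return mode
    raise ValueError("tolerated_degradation must be non-negative")


for tolerance in (0.0, 1.2, 1.5, 2.0):
    print(tolerance, "->", pick_mode(tolerance))
```

The returned string could then be passed as the `mode` argument of `AutoModelForCausalLM.from_pretrained` in the usage example.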

Goals of elastic models:

Provide flexibility in cost vs quality selection for inference.
Provide clear quality and latency benchmarks.
Provide an interface to HF libraries (transformers and diffusers) with a single line of code.
Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
Provide the best models and service for self-hosting.

⚠️ Important Note

Specific quality degradation can vary from model to model. For example, an S model may have 0.5% degradation.

🚀 Quick Start

✨ Features

Flexible Configuration: Control model size, latency, and quality easily.
Multiple Model Variants: XL, L, M, and S models with different trade-offs between speed and accuracy.
Simple Integration: Replace the transformers import with elastic_models.transformers for inference.
Broad Hardware Support: Pre-compiled models for various hardware.

📦 Installation

To work with our models, run the following commands in your terminal:

pip install thestage
pip install 'elastic_models[nvidia]' \
 --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
 --extra-index-url https://pypi.nvidia.com \
 --extra-index-url https://pypi.org/simple

pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall apex

Then go to app.thestage.ai, log in, and generate an API token from your profile page. Set up the API token as follows:

thestage config set --api-token <YOUR_API_TOKEN>

💻 Usage Examples

Basic Usage

To run inference with our models, replace the transformers import with elastic_models.transformers:

import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# An HF token is currently required, because the original weights
# are used for some layers and for the model configuration
model_name = "mistralai/Mistral-7B-Instruct-v0.3"
hf_token = ''  # your Hugging Face access token
device = torch.device("cuda")

# Create the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference is as simple as with the transformers library
prompt = "Describe basics of DNNs quantization."
messages = [
  {
    "role": "system",
    "content": "You are a search bot, answer on user text queries."
  },
  {
    "role": "user",
    "content": prompt
  }
]

chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = tokenizer(chat_prompt, return_tensors="pt")
inputs.to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=False
)[0]

# Print the question and the generated answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
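
To compare modes on your own hardware, generation can be timed with a small helper. This is a sketch under assumptions: `tokens_per_second` is a hypothetical helper, not part of transformers or elastic_models, and the stand-in callable below only sleeps so that the block runs without a GPU.

```python
import time


def tokens_per_second(generate_fn, new_tokens, warmup=1, repeats=3):
    """Time a zero-argument callable that produces `new_tokens` tokens per call."""
    for _ in range(warmup):
        generate_fn()  # warm-up runs are excluded from the timing
    start = time.perf_counter()
    for _ in range(repeats):
        generate_fn()
    elapsed = time.perf_counter() - start
    return new_tokens * repeats / elapsed


# Stand-in for a real generation call, used only to keep this example runnable.
# With the model from the example above, the callable could instead be
# (assumption: forcing exactly 300 new tokens so the count is known):
#   lambda: model.generate(**inputs, min_new_tokens=300, max_new_tokens=300)
def fake_generate():
    time.sleep(0.01)


tps = tokens_per_second(fake_generate, new_tokens=300)
print(f"{tps:.0f} tok/s")
```

Numbers measured this way depend on prompt length, batch size, and hardware, so they are only comparable across modes when those are held fixed.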

📚 Documentation

Benchmarks

Benchmarking is a crucial step in model acceleration, and we aim to provide clear performance metrics for models optimized with our algorithms. The "W8A8, int8" column is a baseline in which W8A8 quantization (8-bit weights and 8-bit activations) with the int8 data type is applied to all linear layers, using the same calibration data as for ANNA. The S model runs at practically the same speed as this baseline (186 vs. 192 tok/s on H100) but with much higher quality (59.7 vs. 28 on MMLU), because ANNA improves quantization quality on sensitive layers.
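
For readers unfamiliar with the term, W8A8 means that both the weights and the activations of a layer are represented as 8-bit integers. The sketch below shows a generic symmetric per-tensor scheme in plain Python. It is an illustration only, not ANNA's algorithm or the exact baseline used here, and the function names and sample values are made up for the example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def w8a8_dot(weights, activations):
    """Dot product with both operands quantized to int8 (hence W8A8)."""
    q_w, scale_w = quantize_int8(weights)
    q_a, scale_a = quantize_int8(activations)
    # Integer multiply-accumulate, then one rescale back to floating point.
    accumulator = sum(w * a for w, a in zip(q_w, q_a))
    return accumulator * scale_w * scale_a


weights = [0.12, -0.5, 0.33, 0.9]
activations = [1.5, -2.0, 0.25, 0.75]

exact = sum(w * a for w, a in zip(weights, activations))
approx = w8a8_dot(weights, activations)
print(f"exact={exact:.4f} int8={approx:.4f} error={abs(exact - approx):.4f}")
```

The rounding error is small for this toy vector, but it accumulates across layers, which is consistent with the large quality drop of the uniform W8A8 baseline in the tables below.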

Quality benchmarks

Property	Details
Model Type	Elastic models based on Mistral-7B-Instruct-v0.3
Training Data	Not specified in the original document

Metric/Model	S	M	L	XL	Original	W8A8, int8
MMLU	59.7	60.1	60.8	61.4	61.4	28
PIQA	80.8	82	81.7	81.5	81.5	65.3
Arc Challenge	56.6	55.1	56.8	57.4	57.4	33.2
Winogrande	73.2	72.3	73.2	74.1	74.1	57

MMLU: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows the model's ability to handle diverse academic topics.
PIQA: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows the model's understanding of real-world physics concepts.
Arc Challenge: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows the model's ability to solve complex reasoning tasks.
Winogrande: Evaluates commonsense reasoning through sentence completion tasks. Shows the model's capability to understand context and resolve ambiguity.
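
The quality table can be turned into point drops relative to the original model with a few lines of Python. This is a small sketch for reading the numbers: the scores are copied from the table, while the helper names `drop_vs_original` and `mean_drop` are hypothetical.

```python
# Scores copied from the quality table above.
QUALITY = {
    "MMLU": {"S": 59.7, "M": 60.1, "L": 60.8, "XL": 61.4,
             "Original": 61.4, "W8A8, int8": 28.0},
    "PIQA": {"S": 80.8, "M": 82.0, "L": 81.7, "XL": 81.5,
             "Original": 81.5, "W8A8, int8": 65.3},
    "Arc Challenge": {"S": 56.6, "M": 55.1, "L": 56.8, "XL": 57.4,
                      "Original": 57.4, "W8A8, int8": 33.2},
    "Winogrande": {"S": 73.2, "M": 72.3, "L": 73.2, "XL": 74.1,
                   "Original": 74.1, "W8A8, int8": 57.0},
}


def drop_vs_original(scores):
    """Points lost relative to the original model (negative means a gain)."""
    base = scores["Original"]
    return {name: round(base - value, 1)
            for name, value in scores.items() if name != "Original"}


drops = {bench: drop_vs_original(scores) for bench, scores in QUALITY.items()}


def mean_drop(variant):
    """Average point drop of one variant across the four benchmarks."""
    return sum(d[variant] for d in drops.values()) / len(drops)


for bench, d in drops.items():
    print(bench, d)
print("mean drop, S:", round(mean_drop("S"), 2))
print("mean drop, W8A8 int8:", round(mean_drop("W8A8, int8"), 2))
```

Read this way, the S model loses about one point on average across the four benchmarks, while the W8A8 baseline loses more than twenty.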

Latency benchmarks

Throughput in tokens per second (tok/s) for 100 input tokens and 300 output tokens:

GPU/Model	S	M	L	XL	Original	W8A8, int8
H100	186	180	168	136	48	192
L40s	79	68	59	47	38	82
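
The throughput figures are easier to compare as speedups over the original model. The following sketch copies the numbers from the table; the helper name `speedup_vs_original` is hypothetical.

```python
# Throughput in tok/s, copied from the latency table above.
THROUGHPUT = {
    "H100": {"S": 186, "M": 180, "L": 168, "XL": 136,
             "Original": 48, "W8A8, int8": 192},
    "L40s": {"S": 79, "M": 68, "L": 59, "XL": 47,
             "Original": 38, "W8A8, int8": 82},
}


def speedup_vs_original(row):
    """Ratio of each variant's throughput to the original model's."""
    base = row["Original"]
    return {name: value / base
            for name, value in row.items() if name != "Original"}


for gpu, row in THROUGHPUT.items():
    ratios = speedup_vs_original(row)
    print(gpu, {name: f"{ratio:.2f}x" for name, ratio in ratios.items()})
```

On these figures the S model is roughly 3.9x faster than the original on H100 and roughly 2.1x faster on L40s.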

📄 License

This project is licensed under the Apache-2.0 license.

🔗 Links

Platform: app.thestage.ai
Subscribe for updates: TheStageAI X
Contact email: contact@thestage.ai
