Elastic-DeepSeek-R1-Distill-Llama-8B Open-Source Model - Multi-version Adaptation for Multiple Scenarios, Supports Multilingual Text Generation

Elastic DeepSeek R1 Distill Llama 8B

Developed by TheStageAI

An elastic model generated by TheStage AI's ANNA, offering multiple optimized versions to adapt to different scenario requirements, supporting multilingual text generation.

Large Language Model
Supports Multiple Languages
Open Source License: Apache-2.0
#Elastic Inference #Multilingual Generation #Low Latency Optimization

Downloads 60

Release Time : 4/24/2025

Model Overview

DeepSeek-R1-Distill-Llama-8B is an 8B-parameter large language model based on the Llama architecture, providing multiple optimized versions (XL/L/M/S) via ANNA technology for efficient inference in self-hosting scenarios.

Model Features

Elastic Version Selection

Offers four optimized versions (XL/L/M/S), allowing users to flexibly balance between model quality and inference speed based on needs.

Multi-Hardware Support

Supports H100 and L40S GPUs and AMD/Intel CPUs. Models are pre-compiled, so no just-in-time compilation is needed.

Multilingual Capabilities

Supports text generation tasks in 13 languages.

Quantization Optimization

ANNA technology optimizes the quantization of sensitive layers. As a result, the S version runs at nearly the same speed as plain W8A8 int8 quantization while scoring much higher on quality benchmarks.

Model Capabilities

Multilingual Text Generation

Knowledge Q&A

Common-Sense Reasoning

Context Understanding

Use Cases

Intelligent Assistant

Search Q&A Assistant

Answers various knowledge-based questions from users

Scores 52.7 to 55.5 points (out of 100) on MMLU, depending on the version.

Content Generation

Multilingual Content Creation

Generates marketing copy or social media content in 13 languages

🚀 Elastic model: DeepSeek-R1-Distill-Llama-8B. The fastest and most flexible models for self-hosting.

Elastic models are produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA enables you to control model size, latency, and quality with a simple slider movement. For each model, ANNA generates a series of optimized models:

XL: A mathematically equivalent neural network optimized by our DNN compiler.
L: A near lossless model with less than 1% degradation on corresponding benchmarks.
M: A faster model with accuracy degradation less than 1.5%.
S: The fastest model with accuracy degradation less than 2%.
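
As an illustration of how these tiers might be used, here is a minimal sketch that picks the fastest version whose stated degradation bound fits a given budget. The helper name pick_mode and the two constants are hypothetical and not part of elastic_models; the bound of 0.0 for XL is an assumption based on it being described as mathematically equivalent.

```python
# Stated upper bounds on accuracy degradation, in percent, taken from the
# list above. XL is described as mathematically equivalent, so 0.0 is assumed.
STATED_MAX_DEGRADATION = {"XL": 0.0, "L": 1.0, "M": 1.5, "S": 2.0}

# Versions ordered from fastest to slowest, following the descriptions above.
FASTEST_FIRST = ["S", "M", "L", "XL"]


def pick_mode(max_degradation_pct: float) -> str:
    """Return the fastest version whose stated bound fits the budget.

    Hypothetical helper for illustration only; it is not part of any library.
    """
    for mode in FASTEST_FIRST:
        if STATED_MAX_DEGRADATION[mode] <= max_degradation_pct:
            return mode
    raise ValueError("No version satisfies a negative degradation budget")
```

The returned string could then be passed as the mode argument when loading the model. Because actual degradation varies from model to model, the stated bounds are only a starting point and the published benchmarks remain the better guide.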

Goals of elastic models:

Offer flexibility in cost-quality selection for inference.
Provide clear quality and latency benchmarks.
Provide an interface for HF libraries (transformers and diffusers) with a single line of code.
Provide models supported on a wide range of pre-compiled hardware, requiring no JIT.
Provide the best models and service for self-hosting.

⚠️ Important Note

The actual quality degradation varies from model to model. For instance, an S model may show as little as 0.5% degradation.

Performance Graph

🚀 Quick Start

📦 Installation

To work with our models, follow these steps:

Install the necessary packages in your terminal:

pip install thestage
pip install elastic_models[nvidia]\
 --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple\
 --extra-index-url https://pypi.nvidia.com\
 --extra-index-url https://pypi.org/simple

pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall apex

Go to app.thestage.ai, log in, and generate an API token from your profile page. Then set up the API token as follows:

thestage config set --api-token <YOUR_API_TOKEN>

💻 Usage Examples

Basic Usage

To run inference with our models, replace the transformers import of AutoModelForCausalLM with the one from elastic_models.transformers:

import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# An HF token is currently required because the original weights
# are used for some of the layers, along with the original
# model configuration
model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
hf_token = ''
device = torch.device("cuda")

# Create the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference works the same way as with the transformers library
prompt = "Describe basics of DNNs quantization."
messages = [
  {
    "role": "system",
    "content": "You are a search bot, answer on user text queries."
  },
  {
    "role": "user",
    "content": prompt
  }
]

chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = tokenizer(chat_prompt, return_tensors="pt").to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Print the question and the generated answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
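
To compare versions on your own hardware, a rough throughput measurement can be wrapped around the generation call. The sketch below is an assumption-laden illustration: measure_tokens_per_second is a hypothetical helper, not part of elastic_models or transformers, and it takes any zero-argument callable that returns the generated token IDs of one run.

```python
import time
from typing import Callable, Sequence


def measure_tokens_per_second(
    generate_fn: Callable[[], Sequence[int]],
    warmup: int = 1,
    runs: int = 3,
) -> float:
    """Estimate throughput as generated tokens per second of wall-clock time.

    generate_fn is a hypothetical zero-argument callable that performs one
    generation and returns the sequence of newly generated token IDs.
    """
    # Warm-up runs are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        generate_fn()

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += len(generate_fn())
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed
```

With the example above, generate_fn could wrap the model.generate call and return the sliced generate_ids[0]. On a GPU it is safer to call torch.cuda.synchronize() inside the callable before returning so that pending work is included in the timing.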

System Requirements

GPUs: H100, L40S
CPU: AMD, Intel
Python: 3.10 - 3.12
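
A quick way to verify the Python requirement before installing is sketched below. The function name python_supported is hypothetical, and the range is assumed to be inclusive on both ends, which is the usual reading of "3.10 - 3.12".

```python
import sys

# Supported range as listed above, assumed inclusive on both ends.
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 12)


def python_supported(version=None) -> bool:
    """Check whether a (major, minor, ...) version falls in the listed range.

    Defaults to the running interpreter when no version is given.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


if __name__ == "__main__":
    print("Python version supported:", python_supported())
```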

📚 Documentation

Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The "W8A8, int8" column is a baseline in which W8A8 quantization with the int8 data type was applied to all linear layers, using the same calibration data as for ANNA. The S model runs at practically the same speed as this baseline but with much higher quality, because ANNA improves quantization quality on sensitive layers.

Quality benchmarks

Metric/Model	S	M	L	XL	Original	W8A8, int8
arc_challenge	38.70	40.40	40.40	40.50	40.50	19.30
mmlu	52.70	54.70	55.50	54.80	54.80	47.70
piqa	76.30	75.90	75.70	76.10	76.10	55.00
winogrande	66.60	66.20	67.80	68.00	68.00	56.10

MMLU: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows the model's ability to handle diverse academic topics.
PIQA: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows the model's understanding of real-world physics concepts.
Arc Challenge: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows the model's ability to solve complex reasoning tasks.
Winogrande: Evaluates commonsense reasoning through sentence completion tasks. Shows the model's capability to understand context and resolve ambiguity.
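
To make the quality table easier to read, the sketch below computes how many points each version loses relative to the original model, per metric and on average. The numbers are copied from the table above; the names SCORES, score_drop, and mean_drop are hypothetical, and the drop is measured in absolute points, which may differ from how the stated degradation percentages were defined.

```python
# Scores copied from the quality benchmark table above.
SCORES = {
    "arc_challenge": {"S": 38.70, "M": 40.40, "L": 40.40, "XL": 40.50,
                      "Original": 40.50, "W8A8, int8": 19.30},
    "mmlu": {"S": 52.70, "M": 54.70, "L": 55.50, "XL": 54.80,
             "Original": 54.80, "W8A8, int8": 47.70},
    "piqa": {"S": 76.30, "M": 75.90, "L": 75.70, "XL": 76.10,
             "Original": 76.10, "W8A8, int8": 55.00},
    "winogrande": {"S": 66.60, "M": 66.20, "L": 67.80, "XL": 68.00,
                   "Original": 68.00, "W8A8, int8": 56.10},
}


def score_drop(metric: str, model: str) -> float:
    """Points lost versus the original model (negative means a gain)."""
    return SCORES[metric]["Original"] - SCORES[metric][model]


def mean_drop(model: str) -> float:
    """Average points lost across the four benchmarks."""
    drops = [score_drop(metric, model) for metric in SCORES]
    return sum(drops) / len(drops)


if __name__ == "__main__":
    for name in ["S", "M", "L", "XL", "W8A8, int8"]:
        print(f"{name}: average drop {mean_drop(name):.2f} points")
```

For the S version this gives an average drop of about 1.3 points, against about 15.3 points for the W8A8 int8 baseline.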

Latency benchmarks

Throughput in tokens per second (tok/s) for 100 input tokens and 300 output tokens:

GPU/Model	S	M	L	XL	Original	W8A8, int8
H100	194	191	161	131	58	198
L40S	72	70	56	44	40	74
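
The throughput figures can be turned into speedup factors over the original model with a few lines of arithmetic. This is a small sketch using the values from the table above; TOKENS_PER_SECOND and speedup are hypothetical names introduced for illustration.

```python
# Throughput in tokens per second, copied from the latency table above.
TOKENS_PER_SECOND = {
    "H100": {"S": 194, "M": 191, "L": 161, "XL": 131,
             "Original": 58, "W8A8, int8": 198},
    "L40S": {"S": 72, "M": 70, "L": 56, "XL": 44,
             "Original": 40, "W8A8, int8": 74},
}


def speedup(gpu: str, model: str) -> float:
    """Throughput of a version divided by that of the original model."""
    row = TOKENS_PER_SECOND[gpu]
    return row[model] / row["Original"]


if __name__ == "__main__":
    for gpu, row in TOKENS_PER_SECOND.items():
        for model in row:
            print(f"{gpu} {model}: {speedup(gpu, model):.2f}x")
```

On the H100 row the S version comes out at roughly 3.3x the original, and on L40S at 1.8x.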

📄 License

This project is licensed under the Apache 2.0 license.

🔗 Links

Platform: app.thestage.ai
Subscribe for updates: TheStageAI X
Contact email: contact@thestage.ai
