# vLLM optimization

The following is a catalog of quantized model builds suited to efficient inference with vLLM, with license, publisher, and download statistics as listed.

## Gemma 3 4B It Quantized.w4a16
A quantized version of google/gemma-3-4b-it using INT4 (W4A16) weight quantization, with activations kept in 16-bit precision, to optimize inference efficiency.

Image-to-Text · Transformers · Publisher: RedHatAI · Downloads: 195 · Likes: 1
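As a minimal sketch of what W4A16 means in practice: weights are quantized to 4-bit integers with a per-group scale, while activations stay in 16-bit floating point. The NumPy snippet below illustrates symmetric per-group INT4 weight quantization; the group size and rounding scheme are illustrative assumptions, not the exact recipe used for this checkpoint.

```python
import numpy as np

def quantize_w4(w, group_size=128):
    """Symmetric per-group INT4 quantization (illustrative sketch).

    Real kernels pack two 4-bit values per byte; here int8 stands in
    for the 4-bit storage to keep the example simple.
    """
    groups = w.reshape(-1, group_size)
    # One scale per group, mapping the group's max magnitude to 7
    # (the positive end of the INT4 range [-8, 7]).
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_w4(q, scale, shape):
    # Recover approximate FP32 weights for the matmul against
    # 16-bit activations (the "A16" half of W4A16).
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, scale = quantize_w4(w)
w_hat = dequantize_w4(q, scale, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(q.dtype, int(q.min()), int(q.max()), max_err)
```

The reconstruction error stays bounded by half a quantization step per group, which is why weight-only 4-bit schemes preserve quality far better than naive whole-tensor quantization.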
## Bielik 4.5B V3.0 Instruct FP8 Dynamic
License: Apache-2.0

FP8-quantized version of Bielik-4.5B-v3.0-Instruct. AutoFP8 is used to quantize both weights and activations to the FP8 data type, reducing disk space and GPU memory requirements by approximately 50%.

Large Language Model · Other · Publisher: speakleash · Downloads: 74 · Likes: 1
## QwQ 32B FP8 Dynamic
License: MIT

FP8-quantized version of QwQ-32B using dynamic quantization, reducing storage and memory requirements by 50% while maintaining 99.75% of the original model's accuracy.

Large Language Model · Transformers · Publisher: RedHatAI · Downloads: 3,107 · Likes: 8
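The ~50% figure follows directly from the data-type widths: FP8 stores one byte per parameter versus two bytes for BF16/FP16. A back-of-envelope check, using a rounded 32-billion-parameter count and ignoring activations, KV cache, and runtime overhead:

```python
# Illustrative arithmetic for the ~50% memory-saving claim.
params = 32e9                    # rounded parameter count for a 32B model
bf16_bytes = params * 2          # 16-bit float: 2 bytes per parameter
fp8_bytes = params * 1           # FP8: 1 byte per parameter
saving = 1 - fp8_bytes / bf16_bytes
print(bf16_bytes / 1e9, fp8_bytes / 1e9, saving)  # 64.0 32.0 0.5
```

"Dynamic" here means activation scales are computed at runtime per batch rather than calibrated offline, so no calibration dataset is needed.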
## Meta Llama 3.1 70B FP8
FP8-quantized version of Meta-Llama-3.1-70B with both weights and activations quantized to FP8, reducing storage and memory requirements by approximately 50%; suited to multilingual commercial and research applications.

Large Language Model · Transformers · Multilingual · Publisher: RedHatAI · Downloads: 191 · Likes: 2
## Meta Llama 3.1 8B FP8
FP8-quantized version of Meta-Llama-3.1-8B, suited to multilingual commercial and research applications.

Large Language Model · Transformers · Multilingual · Publisher: RedHatAI · Downloads: 4,154 · Likes: 7
## Meta Llama 3.1 70B Instruct FP8
FP8-quantized version of Meta-Llama-3.1-70B-Instruct, suited to multilingual commercial and research use and especially well matched to assistant-style chat scenarios.

Large Language Model · Transformers · Multilingual · Publisher: RedHatAI · Downloads: 71.73k · Likes: 45
## Dolphin 2.9 Llama3 70B AWQ
AWQ-quantized version of Dolphin 2.9 Llama3 70B, compatible with vLLM and other inference engines.

Large Language Model · Transformers · Publisher: julep-ai · Downloads: 19 · Likes: 5
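A sketch of how an AWQ checkpoint like this might be served with vLLM's OpenAI-compatible server. The repo id below is a placeholder, not the actual hub path, and the GPU count is an assumption; vLLM usually detects AWQ from the checkpoint config, but the method can also be pinned explicitly with `--quantization`.

```shell
# Placeholder repo id -- look up the real path on the model hub first.
vllm serve <publisher>/<dolphin-2.9-llama3-70b-awq-repo> \
    --quantization awq \
    --tensor-parallel-size 4   # a 70B model, even at 4-bit, spans multiple GPUs
```

Once running, the server accepts standard OpenAI-style `/v1/chat/completions` requests, which is what makes AWQ builds like this convenient drop-in replacements for full-precision deployments.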