# vLLM optimization

The following is a catalog of quantized model builds suited to efficient inference with vLLM, with license, publisher, and download statistics as listed.

## Gemma 3 4B It Quantized.w4a16
A quantized version of google/gemma-3-4b-it using INT4 (W4A16) weight quantization, with activations kept in 16-bit precision, to optimize inference efficiency.

Image-to-Text · Transformers · Publisher: RedHatAI · Downloads: 195 · Likes: 1
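As a minimal sketch of what W4A16 means in practice: weights are quantized to 4-bit integers with a per-group scale, while activations stay in 16-bit floating point. The NumPy snippet below illustrates symmetric per-group INT4 weight quantization; the group size and rounding scheme are illustrative assumptions, not the exact recipe used for this checkpoint.

```python
import numpy as np

def quantize_w4(w, group_size=128):
    """Symmetric per-group INT4 quantization (illustrative sketch).

    Real kernels pack two 4-bit values per byte; here int8 stands in
    for the 4-bit storage to keep the example simple.
    """
    groups = w.reshape(-1, group_size)
    # One scale per group, mapping the group's max magnitude to 7
    # (the positive end of the INT4 range [-8, 7]).
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_w4(q, scale, shape):
    # Recover approximate FP32 weights for the matmul against
    # 16-bit activations (the "A16" half of W4A16).
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, scale = quantize_w4(w)
w_hat = dequantize_w4(q, scale, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(q.dtype, int(q.min()), int(q.max()), max_err)
```

The reconstruction error stays bounded by half a quantization step per group, which is why weight-only 4-bit schemes preserve quality far better than naive whole-tensor quantization.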
## Bielik 4.5B V3.0 Instruct FP8 Dynamic
License: Apache-2.0

FP8-quantized version of Bielik-4.5B-v3.0-Instruct. AutoFP8 is used to quantize both weights and activations to the FP8 data type, reducing disk space and GPU memory requirements by approximately 50%.

Large Language Model · Other · Publisher: speakleash · Downloads: 74 · Likes: 1
## QwQ 32B FP8 Dynamic
License: MIT

FP8-quantized version of QwQ-32B using dynamic quantization, reducing storage and memory requirements by 50% while maintaining 99.75% of the original model's accuracy.

Large Language Model · Transformers · Publisher: RedHatAI · Downloads: 3,107 · Likes: 8
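The ~50% figure follows directly from the data-type widths: FP8 stores one byte per parameter versus two bytes for BF16/FP16. A back-of-envelope check, using a rounded 32-billion-parameter count and ignoring activations, KV cache, and runtime overhead:

```python
# Illustrative arithmetic for the ~50% memory-saving claim.
params = 32e9                    # rounded parameter count for a 32B model
bf16_bytes = params * 2          # 16-bit float: 2 bytes per parameter
fp8_bytes = params * 1           # FP8: 1 byte per parameter
saving = 1 - fp8_bytes / bf16_bytes
print(bf16_bytes / 1e9, fp8_bytes / 1e9, saving)  # 64.0 32.0 0.5
```

"Dynamic" here means activation scales are computed at runtime per batch rather than calibrated offline, so no calibration dataset is needed.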
## Meta Llama 3.1 70B FP8
FP8-quantized version of Meta-Llama-3.1-70B with both weights and activations quantized to FP8, reducing storage and memory requirements by approximately 50%; suited to multilingual commercial and research applications.

Large Language Model · Transformers · Multilingual · Publisher: RedHatAI · Downloads: 191 · Likes: 2
## Meta Llama 3.1 8B FP8
FP8-quantized version of Meta-Llama-3.1-8B, suited to multilingual commercial and research applications.

Large Language Model · Transformers · Multilingual · Publisher: RedHatAI · Downloads: 4,154 · Likes: 7
## Meta Llama 3.1 70B Instruct FP8
FP8-quantized version of Meta-Llama-3.1-70B-Instruct, suited to multilingual commercial and research use and especially well matched to assistant-style chat scenarios.

Large Language Model · Transformers · Multilingual · Publisher: RedHatAI · Downloads: 71.73k · Likes: 45
## Dolphin 2.9 Llama3 70B AWQ
AWQ-quantized version of Dolphin 2.9 Llama3 70B, compatible with vLLM and other inference engines.

Large Language Model · Transformers · Publisher: julep-ai · Downloads: 19 · Likes: 5
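A sketch of how an AWQ checkpoint like this might be served with vLLM's OpenAI-compatible server. The repo id below is a placeholder, not the actual hub path, and the GPU count is an assumption; vLLM usually detects AWQ from the checkpoint config, but the method can also be pinned explicitly with `--quantization`.

```shell
# Placeholder repo id -- look up the real path on the model hub first.
vllm serve <publisher>/<dolphin-2.9-llama3-70b-awq-repo> \
    --quantization awq \
    --tensor-parallel-size 4   # a 70B model, even at 4-bit, spans multiple GPUs
```

Once running, the server accepts standard OpenAI-style `/v1/chat/completions` requests, which is what makes AWQ builds like this convenient drop-in replacements for full-precision deployments.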