Qwen2.5-VL-72B-Instruct GGUF Models
This project provides GGUF builds of Qwen2.5-VL-72B-Instruct, a multimodal (vision-language) model. The files cover a range of quantization options to suit different hardware and memory requirements.
Quick Start
How to Use Qwen 2.5 VL Instruct with llama.cpp (current as of 10 May 2025)
1. Download the Qwen 2.5 VL gguf file (or fetch both files programmatically; see the Python sketch after these steps):
   - Visit https://huggingface.co/Mungert/Qwen2.5-VL-72B-Instruct-GGUF/tree/main.
   - Choose a gguf file without mmproj in the name, for example: https://huggingface.co/Mungert/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/Qwen2.5-VL-72B-Instruct-q8_0.gguf.
   - Copy this file to your chosen folder.
2. Download the Qwen 2.5 VL mmproj file:
   - Visit https://huggingface.co/Mungert/Qwen2.5-VL-72B-Instruct-GGUF/tree/main.
   - Choose a file with mmproj in the name, for example: https://huggingface.co/Mungert/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/Qwen2.5-VL-72B-Instruct-mmproj-f16.gguf.
   - Copy this file to your chosen folder.
3. Copy your images to the same folder as the gguf files, or adjust the paths accordingly:
   - In the example below, the gguf files, the images, and llama-mtmd-cli are all in the same folder.
   - Example image: https://huggingface.co/Mungert/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/car-1.jpg.
   - Copy this file to your chosen folder.
4. Run the CLI tool from your chosen folder:
   llama-mtmd-cli -m Qwen2.5-VL-72B-Instruct-q8_0.gguf --mmproj Qwen2.5-VL-72B-Instruct-mmproj-f16.gguf -p "Describe this image." --image ./car-1.jpg
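If you prefer to script these steps, the sketch below fetches the two GGUF files plus the sample image with huggingface_hub and then shells out to the same llama-mtmd-cli command. It assumes huggingface_hub is installed (pip install huggingface_hub) and that llama-mtmd-cli, built from llama.cpp, is on your PATH; swap in whichever quantization you actually downloaded.

```python
# Minimal sketch: download the model, the mmproj file, and the example image,
# then run the same llama-mtmd-cli command shown above.
import subprocess
from huggingface_hub import hf_hub_download

repo_id = "Mungert/Qwen2.5-VL-72B-Instruct-GGUF"

model_path = hf_hub_download(repo_id, "Qwen2.5-VL-72B-Instruct-q8_0.gguf")
mmproj_path = hf_hub_download(repo_id, "Qwen2.5-VL-72B-Instruct-mmproj-f16.gguf")
image_path = hf_hub_download(repo_id, "car-1.jpg")

subprocess.run(
    [
        "llama-mtmd-cli",          # assumed to be on PATH after building llama.cpp
        "-m", model_path,
        "--mmproj", mmproj_path,
        "-p", "Describe this image.",
        "--image", image_path,
    ],
    check=True,
)
```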
Features
Ultra-Low-Bit Quantization with IQ-DynamicGate (1-2 bit)
Our latest quantization method introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on Llama-3-8B. This approach uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.
Benchmark Context
All tests were conducted on Llama-3-8B-Instruct using:
- Standard perplexity evaluation pipeline
- 2048-token context window
- The same prompt set across all quantizations
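As a rough illustration of that setup, a comparable run can be scripted around llama.cpp's llama-perplexity tool, as sketched below. The evaluation text file and model filenames are placeholders, not the exact files behind the published numbers, so treat this as an outline rather than the original benchmark harness.

```python
# Illustrative sketch: drive llama.cpp's llama-perplexity binary from Python to
# compare a standard and a DynamicGate quantization at a 2048-token context.
# The corpus and model filenames are assumptions; adapt them to your own files.
import subprocess

def run_perplexity(model_path: str, text_file: str = "wiki.test.raw") -> None:
    subprocess.run(
        [
            "llama-perplexity",
            "-m", model_path,   # GGUF file to evaluate
            "-f", text_file,    # plain-text evaluation corpus
            "-c", "2048",       # 2048-token context window, matching the benchmarks
        ],
        check=True,
    )

# Hypothetical filenames for a standard vs. DynamicGate IQ1_M pair:
for model in ("llama-3-8b-iq1_m.gguf", "llama-3-8b-iq1_m-dynamicgate.gguf"):
    run_perplexity(model)
```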
Method
- Dynamic Precision Allocation:
  - First/last 25% of layers → IQ4_XS (selected layers)
  - Middle 50% → IQ2_XXS/IQ3_S (increases efficiency)
- Critical Component Protection:
  - Embeddings/output layers use Q5_K
  - Reduces error propagation by 38% compared to standard 1-2 bit quantization
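The published results come from the author's own quantization pipeline, but the allocation policy above can be illustrated with a small mapping from layer index to quant type. Everything in this sketch (function names, tensor names, the 80-layer example) is an assumption for illustration, not the actual implementation.

```python
# Illustrative sketch of the layer-wise policy described above; not the real
# quantization code. Layer counts and tensor names are assumptions.
def pick_quant_type(layer_idx: int, n_layers: int) -> str:
    """Map a transformer block index to a quant type per the policy above."""
    first_quarter = n_layers // 4
    last_quarter = n_layers - first_quarter
    if layer_idx < first_quarter or layer_idx >= last_quarter:
        return "IQ4_XS"   # first/last 25% of layers keep more precision
    return "IQ2_XXS"      # middle 50% traded for memory efficiency (or IQ3_S)

def pick_special_type(tensor_name: str):
    """Embeddings and output layers are protected with Q5_K."""
    if tensor_name in ("token_embd.weight", "output.weight"):  # typical GGUF names
        return "Q5_K"
    return None

# Example: an 80-layer network
print([pick_quant_type(i, 80) for i in (0, 19, 20, 59, 60, 79)])
# -> ['IQ4_XS', 'IQ4_XS', 'IQ2_XXS', 'IQ2_XXS', 'IQ4_XS', 'IQ4_XS']
```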
Quantization Performance Comparison (Llama-3-8B)
| Quantization | Standard PPL | DynamicGate PPL | Δ PPL | Std Size | DG Size | Δ Size | Std Speed | DG Speed |
|---|---|---|---|---|---|---|---|---|
| IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
| IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
| IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
| IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
| IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |
Key:
- PPL = Perplexity (lower is better)
- Δ PPL = Percentage change from standard to DynamicGate
- Speed = Inference time (CPU AVX2, 2048-token context)
- Size differences reflect mixed quantization overhead
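Δ PPL is the relative change from the standard to the DynamicGate perplexity; the quick check below reproduces two of the table's percentages.

```python
# How the Δ PPL column is derived: relative change from standard to DynamicGate.
def delta_ppl(standard: float, dynamicgate: float) -> float:
    return (dynamicgate - standard) / standard * 100

print(f"{delta_ppl(11.30, 9.84):.1f}%")   # -12.9% (IQ2_XXS row)
print(f"{delta_ppl(27.46, 15.41):.1f}%")  # -43.9% (IQ1_M row)
```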
Key Improvements:
- IQ1_M shows a massive 43.9% perplexity reduction (27.46 → 15.41)
- IQ2_S cuts perplexity by 36.9% while adding only 0.2 GB
- IQ1_S maintains 39.7% better accuracy despite 1-bit quantization
Tradeoffs:
- All variants have modest size increases (0.1-0.3 GB)
- Inference speeds remain comparable (<5% difference)
When to Use These Models
- Fitting models into GPU VRAM
- Memory-constrained deployments
- CPU and edge devices where 1-2 bit errors can be tolerated
- Research into ultra-low-bit quantization
Choosing the Right Model Format
Selecting the correct model format depends on your hardware capabilities and memory constraints.
BF16 (Brain Float 16) – Use if BF16 acceleration is available
- A 16-bit floating-point format designed for faster computation while retaining good precision.
- Provides a similar dynamic range to FP32 but with lower memory usage.
- Recommended if your hardware supports BF16 acceleration (check your device's specs).
- Ideal for high-performance inference with a reduced memory footprint compared to FP32.
Use BF16 if:
- Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
- You want higher precision while saving memory.
- You plan to requantize the model into another format.
Avoid BF16 if:
- Your hardware does not support BF16 (it may fall back to FP32 and run slower).
- You need compatibility with older devices that lack BF16 optimization.
F16 (Float 16) – More widely supported than BF16
- A 16-bit floating-point format with high precision but a smaller range of values than BF16.
- Works on most devices with FP16 acceleration support (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16, but generally sufficient for inference.
Use F16 if:
- Your hardware supports FP16 but not BF16.
- You need a balance between speed, memory usage, and accuracy.
- You are running on a GPU or another device optimized for FP16 computations.
Avoid F16 if:
- Your device lacks native FP16 support (it may run slower than expected).
- You have memory limitations.
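If you are unsure whether your GPU has native BF16 support, and you happen to have PyTorch installed, the quick probe below is one way to check. It is only a convenience; running these GGUF files with llama.cpp does not require PyTorch.

```python
# Optional probe: report FP16/BF16 capability if PyTorch with CUDA is available.
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Native BF16 support:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA GPU detected; consider the quantized (Q4_K/Q6_K/Q8_0) files for CPU.")
```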
Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference
Quantization reduces model size and memory usage while preserving as much accuracy as possible.
- Lower-bit models (Q4_K) → Best for minimal memory usage; may have lower precision.
- Higher-bit models (Q6_K, Q8_0) → Better accuracy, but require more memory.
Use Quantized Models if:
- You are running inference on a CPU and need an optimized model.
- Your device has low VRAM and cannot load full-precision models.
- You want to reduce memory footprint while keeping reasonable accuracy.
Avoid Quantized Models if:
- You need maximum accuracy (full-precision models are better for this).
- Your hardware has enough VRAM for higher-precision formats (BF16/F16).
Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)
These models are optimized for extreme memory efficiency, making them ideal for low-power devices or large-scale deployments where memory is a critical constraint.
- IQ3_XS: Ultra-low-bit quantization (3-bit) with extreme memory efficiency.
  - Use case: Best for ultra-low-memory devices where even Q4_K is too large.
  - Trade-off: Lower accuracy compared to higher-bit quantizations.
- IQ3_S: Small block size for maximum memory efficiency.
  - Use case: Best for low-memory devices where IQ3_XS is too aggressive.
- IQ3_M: Medium block size for better accuracy than IQ3_S.
  - Use case: Suitable for low-memory devices where IQ3_S is too limiting.
- Q4_K: 4-bit quantization with block-wise optimization for better accuracy.
  - Use case: Best for low-memory devices where Q6_K is too large.
- Q4_0: Pure 4-bit quantization, optimized for ARM devices.
  - Use case: Best for ARM-based devices or low-memory environments.
Summary Table: Model Format Selection
| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|---|---|---|---|---|
| BF16 | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| F16 | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
| Q4_K | Medium-Low | Low | CPU or low-VRAM devices | Memory-constrained environments |
| Q6_K | Medium | Moderate | CPU with more memory | Better accuracy while still quantized |
| Q8_0 | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| IQ3_XS | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency at lower accuracy |
| Q4_0 | Low | Low | ARM or low-memory devices | ARM devices, where llama.cpp can apply hardware-specific optimizations |
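The table can be condensed into a small decision helper like the one sketched below. It simply mirrors the guidance in this section; the function, its inputs, and its outputs are illustrative only, not an official recommendation.

```python
# Decision helper mirroring the Summary Table above; purely illustrative.
def choose_format(has_bf16: bool, has_fp16: bool,
                  memory_constrained: bool, extremely_constrained: bool) -> str:
    """Suggest a GGUF variant following the guidance in this section."""
    if extremely_constrained:
        return "iq3_xs"   # extreme memory efficiency, lowest accuracy
    if memory_constrained:
        return "q4_k"     # good default for CPU / low-VRAM setups
    if has_bf16:
        return "bf16"     # highest precision, requantize-friendly
    if has_fp16:
        return "f16"
    return "q8_0"         # best accuracy among the quantized files

print(choose_format(has_bf16=False, has_fp16=True,
                    memory_constrained=True, extremely_constrained=False))  # -> q4_k
```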
Documentation
Included Files & Details
Qwen2.5-VL-72B-Instruct-bf16.gguf
- Model weights are preserved in BF16.
- Use this if you want to requantize the model into a different format.
- Best if your device supports BF16 acceleration.
Qwen2.5-VL-72B-Instruct-f16.gguf
- Model weights are stored in F16.
- Use if your device supports FP16, especially if BF16 is not available.
Qwen2.5-VL-72B-Instruct-bf16-q8_0.gguf
- Output & embeddings remain in BF16.
- All other layers are quantized to Q8_0.
- Use if your device supports BF16 and you want a quantized version.
Qwen2.5-VL-72B-Instruct-f16-q8_0.gguf
- Output & embeddings remain in F16.
- All other layers are quantized to Q8_0.
Qwen2.5-VL-72B-Instruct-q4_k.gguf
- Output & embeddings are quantized to Q8_0.
- All other layers are quantized to Q4_K.
- Good for CPU inference with limited memory.
Qwen2.5-VL-72B-Instruct-q4_k_s.gguf
- The smallest Q4_K variant, using less memory at the cost of accuracy.
- Best for very low-memory setups.
Qwen2.5-VL-72B-Instruct-q6_k.gguf
- Output & embeddings are quantized to Q8_0.
- All other layers are quantized to Q6_K.
Qwen2.5-VL-72B-Instruct-q8_0.gguf
- A fully Q8 quantized model for better accuracy.
- Requires more memory but offers higher precision.
Qwen2.5-VL-72B-Instruct-iq3_xs.gguf
- IQ3_XS quantization, optimized for extreme memory efficiency.
- Best for ultra-low-memory devices.
Qwen2.5-VL-72B-Instruct-iq3_m.gguf
- IQ3_M quantization, offering a medium block size for better accuracy.
- Suitable for low-memory devices.
Qwen2.5-VL-72B-Instruct-q4_0.gguf
- Pure Q4_0 quantization, optimized for ARM devices.
- Best for low-memory environments.
- Prefer IQ4_NL for better accuracy.
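To see which of these files fits your memory budget, you can list them together with their sizes straight from the repository. The snippet below uses huggingface_hub's model_info with file metadata; treat it as a convenience sketch.

```python
# Convenience sketch: list the GGUF files in the repo with their approximate
# sizes so you can pick one that fits your memory budget.
from huggingface_hub import HfApi

info = HfApi().model_info("Mungert/Qwen2.5-VL-72B-Instruct-GGUF", files_metadata=True)

for f in sorted(info.siblings, key=lambda s: s.size or 0):
    if f.rfilename.endswith(".gguf"):
        print(f"{f.rfilename:60s} {(f.size or 0) / 1e9:8.1f} GB")
```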
License
The project uses the Qwen license.
If you find these models useful
Please click Like ❤️. I'd also really appreciate it if you could test my Network Monitor Assistant.
Click the chat icon (bottom right of the main and dashboard pages). Choose an LLM and toggle between the LLM types: TurboLLM → FreeLLM → TestLLM.
What I'm Testing
I'm experimenting with function calling against my network monitoring service, using small open-source models. I'm interested in the question: how small can a model go and still function?
TestLLM – Runs the current testing model using llama.cpp on six threads of a CPU VM (it should take about 15 s to load; inference is quite slow, and it only processes one user prompt at a time, so I'm still working on scaling). If you're curious, I'd be happy to share how it works!
Other Available AI Assistants
TurboLLM – Uses gpt-4o-mini. Fast! Note: tokens are limited since OpenAI models are pricey, but you can log in or download the Free Network Monitor agent to get more tokens. Alternatively, use the TestLLM.
HugLLM – Runs open-source Hugging Face models. Fast, but uses small models (≈8B), so quality is lower. Get 2x more tokens (subject to Hugging Face API availability).







