# 🔥 InternVL3-38B-FP8-Static: Optimized Vision-Language Model 🔥
This is an FP8 static quantized version of [OpenGVLab/InternVL3-38B](https://huggingface.co/OpenGVLab/InternVL3-38B), optimized for high-performance inference with vLLM. Static FP8 quantization yields roughly a 2x speedup with minimal accuracy degradation on vision-language tasks.
## 🚀 Quick Start

This optimized model can be quickly integrated into your project: use it offline with vLLM or Hugging Face Transformers as shown in the usage examples below, or serve it behind vLLM's OpenAI-compatible API as sketched next.
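The sketch below assumes a server started with `vllm serve JustJaro/InternVL3-38B-FP8-Dynamic --trust-remote-code --max-model-len 8192` and reachable on `localhost:8000`; the endpoint, image URL, and generation settings are placeholders to adapt to your deployment.

```python
# Minimal client sketch for a locally running vLLM OpenAI-compatible server.
# Assumes `vllm serve JustJaro/InternVL3-38B-FP8-Dynamic --trust-remote-code` is active.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JustJaro/InternVL3-38B-FP8-Dynamic",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},  # placeholder URL
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```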
## ✨ Features
- FP8 Static Quantization: Achieve maximum inference performance with pre-computed activation scales.
- Vision-Language Optimized: Utilize a specialized quantization recipe that preserves visual understanding.
- vLLM Ready: Seamlessly integrate with vLLM for production deployment.
- Memory Efficient: Reduce memory usage by approximately 50% compared to the FP16 original.
- Performance Boost: Experience up to 2x faster inference on H100/L40S GPUs.
## 📦 Installation

Install the dependencies required by the usage examples below, such as `vllm`, `transformers`, and `llmcompressor`, using `pip`:

```bash
pip install vllm transformers llmcompressor
```
## 💻 Usage Examples

### Basic Usage

#### With vLLM (Recommended)

```python
from vllm import LLM, SamplingParams
# Load the quantized model
model = LLM(
model="JustJaro/InternVL3-38B-FP8-Dynamic",
trust_remote_code=True,
max_model_len=8192,
tensor_parallel_size=1, # Adjust based on your GPU setup
)
# Generate response
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
response = model.generate("Describe this image: <image>", sampling_params)
print(response[0].outputs[0].text)
```
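The prompt above contains an `<image>` placeholder but no actual image. With vLLM's offline API an image is attached through `multi_modal_data`; the sketch below reuses the `model` and `sampling_params` objects from the example above, uses a placeholder image path, and the exact prompt template should follow the base InternVL3 chat format.

```python
from PIL import Image

# Attach a real image via vLLM's multi_modal_data (placeholder path; reuses `model`
# and `sampling_params` from the example above).
image = Image.open("example.jpg")

outputs = model.generate(
    {
        "prompt": "Describe this image: <image>",
        "multi_modal_data": {"image": image},
    },
    sampling_params,
)
print(outputs[0].outputs[0].text)
```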

#### With Transformers

The snippet below is a minimal sketch using plain Hugging Face Transformers; the exact preprocessing and chat template follow the base model's remote code, so consult the OpenGVLab/InternVL3-38B card for full multimodal usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from PIL import Image

model_id = "JustJaro/InternVL3-38B-FP8-Dynamic"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Process image and text (image path is a placeholder)
image = Image.open("example.jpg")
inputs = processor(text="What's in this image?", images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
## 📚 Documentation

### 📋 Model Details

| Property | Details |
|---|---|
| Original Model | OpenGVLab/InternVL3-38B |
| Source Model | OpenGVLab/InternVL3-38B |
| Quantized Model | InternVL3-38B-FP8-Dynamic |
| Quantization Method | FP8 Dynamic (W8A8) |
| Quantization Library | LLM Compressor v0.5.1 |
| Calibration Dataset | N/A (dynamic quantization requires no calibration) |
| Attention Implementation | Eager (standard attention, maximum compatibility) |
| Quantized by | JustJaro |
### 🏗️ Technical Specifications

#### Hardware Requirements

- Inference: 40-50 GB VRAM (single H100/A100 recommended).
- Supported GPUs: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism).
- GPU Architecture: Ada Lovelace, Hopper (for optimal FP8 performance); see the capability check below.
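Native FP8 execution corresponds to compute capability 8.9 (Ada Lovelace) or 9.0 (Hopper); Ampere cards fall back to FP8-Marlin kernels with reduced benefit. A minimal capability check, assuming only that PyTorch is installed:

```python
import torch

# Print each visible GPU and whether it supports native FP8 (compute capability >= 8.9).
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    native_fp8 = (major, minor) >= (8, 9)
    print(f"GPU {i}: {name} (sm_{major}{minor}) - native FP8: {native_fp8}")
```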
#### Quantization Details

- Weights: FP8 E4M3 with static per-tensor scales.
- Activations: FP8 E4M3 (for this FP8-Dynamic checkpoint, activation scales are computed at inference time).
- Preserved Components: Vision tower, embeddings, normalization layers.
- Calibration: none required; FP8-Dynamic computes activation scales at runtime instead of using calibration samples.

These settings can be verified directly from the checkpoint's `quantization_config`, as shown below.
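A minimal sketch using `huggingface_hub` to read the quantization settings stored in the checkpoint's `config.json` (the exact field names may differ between llm-compressor versions):

```python
import json
from huggingface_hub import hf_hub_download

# Download only config.json and print the embedded quantization settings.
config_path = hf_hub_download("JustJaro/InternVL3-38B-FP8-Dynamic", "config.json")
with open(config_path) as f:
    config = json.load(f)

print(json.dumps(config.get("quantization_config", {}), indent=2))
```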
### 📊 Performance Benchmarks

Expected performance improvements over the FP16 baseline (a rough way to measure throughput on your own hardware is sketched after this list):

- Throughput: approximately 2x improvement on H100 GPUs.
- Memory: approximately 50% reduction (76 GB → 38 GB).
- Latency: approximately 2x faster time-to-first-token.
- Accuracy: over 99% retention on vision-language benchmarks.
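These numbers are expectations rather than measured results; a crude throughput probe with vLLM's offline API might look like the sketch below, where the prompts, sampling settings, and model handle are illustrative:

```python
import time
from vllm import LLM, SamplingParams

# Rough throughput probe: time a small batch of text-only generations.
llm = LLM(model="JustJaro/InternVL3-38B-FP8-Dynamic", trust_remote_code=True, max_model_len=8192)
prompts = ["Summarize the benefits of FP8 quantization."] * 8
params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} generated tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```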
### 🔬 Package Versions

This model was created using:

```
llmcompressor==0.5.1
transformers==4.52.4
torch==2.7.0+cu126
vllm==0.9.0.1
```
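To check how closely a local environment matches these versions, a quick comparison using the standard library (Python 3.8+) is sketched below; the expected versions are simply those listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported on this card, compared against the local environment.
expected = {
    "llmcompressor": "0.5.1",
    "transformers": "4.52.4",
    "torch": "2.7.0+cu126",
    "vllm": "0.9.0.1",
}
for pkg, want in expected.items():
    try:
        have = version(pkg)
    except PackageNotFoundError:
        have = "not installed"
    status = "OK" if have == want else "differs"
    print(f"{pkg}: installed={have}, expected={want} ({status})")
```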
### 📝 Quantization Script

<details>
<summary>Click to view the complete quantization script</summary>

````python
#!/usr/bin/env python3
"""
InternVL3-38B FP8 Static Quantization Script using LLM Compressor
This script quantizes the OpenGVLab/InternVL3-38B vision-language model to FP8 static
quantization for optimal performance with vLLM inference. It uses the latest llm-compressor
library (v0.5.1+) with multimodal support.
## Setup
1. **Create a .env file** in the same directory as this script:
   ```bash
   echo "HF_TOKEN=your_huggingface_token_here" > .env
   ```
   - Get your HuggingFace token from https://huggingface.co/settings/tokens
   - You need write access to push models
   - The token will be used to upload the quantized model

2. **Install dependencies**:
   ```bash
   pip install llmcompressor>=0.5.1 transformers torch loguru typer python-dotenv datasets
   ```

## Usage

```bash
# Using HF_TOKEN from .env file (recommended)
python quantize_internvl3_fp8.py

# Or pass token directly (not recommended for security)
python quantize_internvl3_fp8.py --hf-token <YOUR_HF_TOKEN>

# Skip upload and save locally only
python quantize_internvl3_fp8.py --no-upload

# Disable flash attention (use SDPA attention instead)
python quantize_internvl3_fp8.py --no-flash-attn

# Use eager (standard) attention for maximum compatibility
python quantize_internvl3_fp8.py --no-flash-attn --attn-eager

# Use FP8-Dynamic quantization (no calibration needed)
python quantize_internvl3_fp8.py --dynamic
```
## Quantization Types

### FP8-Static (default)
- Best for: Production deployments, maximum inference performance
- Pros: Best inference speed, pre-computed scales, optimal for vLLM
- Cons: Requires calibration dataset, longer quantization process
- Use when: You want maximum performance and have time for calibration

### FP8-Dynamic
- Best for: Quick quantization, when calibration data is unavailable
- Pros: No calibration needed, faster quantization process, simpler setup
- Cons: Slightly lower inference performance than static
- Use when: You need quick results or lack calibration data (use `--dynamic`)
## Attention Mechanisms

### Flash Attention 2 (default)
- Best for: Modern GPUs (Ampere/Ada Lovelace), production deployments, long sequences
- Pros: Lowest memory usage (up to 10x reduction), fastest inference, best for large models
- Cons: Requires compatible GPU, may have issues with some model architectures
- Use when: You have a modern GPU and want maximum performance

### SDPA (Scaled Dot-Product Attention)
- Best for: Older GPUs, debugging, when flash attention fails
- Pros: Good performance, wide compatibility, native PyTorch implementation
- Cons: Higher memory usage than flash attention, slightly slower
- Use when: Flash attention isn't supported or causes issues (use `--no-flash-attn`)

### Eager (Standard) Attention
- Best for: Maximum compatibility, debugging attention-related issues
- Pros: Works everywhere, simplest implementation, easiest to debug
- Cons: Highest memory usage, slowest performance
- Use when: Both flash attention and SDPA cause issues (use `--no-flash-attn --attn-eager`)
## Important Notes
- The script will automatically upload the tokenizer files and README.md to HuggingFace
- All critical files (tokenizer_config.json, tokenizer.json/model, README.md) are verified before upload
- The upload process will list all uploaded files with their sizes for verification
- If upload fails, the quantized model is still saved locally and can be uploaded manually later
- For optimal vLLM performance, use the default flash attention unless you encounter compatibility issues
- trust_remote_code_model=True is set by default as required for InternVL3 and most VLM models
- For better memory management on multi-GPU setups, set `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
"""
import os
import shutil
import subprocess
import sys
from pathlib import Path
from typing import Optional

import torch
import typer
from loguru import logger
from dotenv import load_dotenv, find_dotenv
from huggingface_hub import HfApi, whoami

# Import llm-compressor modules
try:
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor import oneshot
    from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
    from datasets import load_dataset, Dataset
except ImportError as e:
    logger.error(f"Required packages not installed: {e}")
    logger.error("Please install: pip install llmcompressor>=0.5.1 transformers torch loguru typer python-dotenv datasets")
    sys.exit(1)

# Load environment variables
load_dotenv(find_dotenv())

app = typer.Typer(rich_markup_mode="rich")
# Configure loguru
logger.remove()
logger.add(sys.stderr, format="{time:HH:mm:ss} | {level} | {message}")  # format string assumed; original value truncated in the source
# Constants
SOURCE_MODEL = "OpenGVLab/InternVL3-38B"
DEFAULT_HF_USERNAME = "JustJaro"
DEFAULT_CALIBRATION_DATASET = "neural-bridge/MS-COCO-2017-for-vlm-training"
DEFAULT_SAMPLES = 256
DEFAULT_SEQ_LEN = 2048
def get_quantized_model_name(dynamic: bool) -> str:
    return f"InternVL3-38B-FP8-{'Dynamic' if dynamic else 'Static'}"


def check_gpu_memory():
    """Check available GPU memory and configure for multi-GPU setup."""
    if not torch.cuda.is_available():
        logger.warning("No GPU detected - quantization will be very slow")
        return

    gpu_count = torch.cuda.device_count()
    logger.info(f"Found {gpu_count} GPU(s)")

    total_memory = 0
    for i in range(gpu_count):
        props = torch.cuda.get_device_properties(i)
        memory_gb = props.total_memory / (1024**3)
        total_memory += memory_gb
        logger.info(f"  GPU {i}: {props.name} ({memory_gb:.1f} GB)")

    logger.info(f"Total GPU memory: {total_memory:.1f} GB")

    # Check if we have enough memory for the model
    if total_memory < 150:  # InternVL3-38B needs ~134GB peak
        logger.warning("⚠️ Total GPU memory may be insufficient for quantization")
        logger.warning("   Consider using PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True")
    else:
        logger.success(f"✅ Sufficient GPU memory available ({total_memory:.1f} GB >= 150 GB recommended)")
def get_package_versions() -> dict:
    """Get installed package versions for reproducibility."""
    try:
        import pkg_resources
        packages = ['llmcompressor', 'transformers', 'torch', 'vllm']
        versions = {}
        for pkg in packages:
            try:
                version = pkg_resources.get_distribution(pkg).version
                versions[pkg] = version
            except pkg_resources.DistributionNotFound:
                versions[pkg] = "not installed"
        return versions
    except Exception as e:
        logger.warning(f"Could not get package versions: {e}")
        return {}


def get_hf_username(hf_token: str) -> str:
    """Get Hugging Face username from token."""
    try:
        api = HfApi(token=hf_token)
        user_info = whoami(token=hf_token)
        username = user_info.get("name") or user_info.get("fullname") or DEFAULT_HF_USERNAME
        logger.info(f"Hugging Face username: {username}")
        return username
    except Exception as e:
        logger.warning(f"Could not get HF username: {e}, using default: {DEFAULT_HF_USERNAME}")
        return DEFAULT_HF_USERNAME
def create_quantization_recipe(dynamic: bool = False) -> list:
    """Create FP8 quantization recipe for VLM."""
    scheme = "FP8_DYNAMIC" if dynamic else "FP8"

    logger.info(f"Creating {scheme} quantization recipe for vision-language model")

    if dynamic:
        logger.info("Using FP8 Dynamic quantization:")
        logger.info("  • No calibration data required")
        logger.info("  • Activation scales computed during inference")
        logger.info("  • Simpler quantization process")
        logger.info("  • Slightly lower performance than static")
    else:
        logger.info("Using FP8 Static quantization:")
        logger.info("  • Requires calibration data")
        logger.info("  • Pre-computed activation scales")
        logger.info("  • Best inference performance")
        logger.info("  • More complex quantization process")

    recipe = [
        QuantizationModifier(
            targets=["Linear"],
            scheme=scheme,
            ignore=[
                "re:.*lm_head",
                "re:.*vision.*",
                "re:.*visual.*",
                "re:.*image.*",
                "re:.*patch_embed.*",
                "re:.*pos_embed.*",
                "re:.*norm.*",
                "re:.*layernorm.*",
            ]
        )
    ]

    logger.info(f"Quantization recipe created with {scheme} scheme")
    logger.info("Ignoring vision components for optimal compatibility")
    return recipe
def validate_model_compatibility(model_id: str):
    """Validate that the model is compatible with quantization."""
    logger.info(f"Validating model compatibility: {model_id}")

    try:
        # Try to load model config to check architecture
        from transformers import AutoConfig
        config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
        logger.info(f"Model architecture: {config.model_type if hasattr(config, 'model_type') else 'Unknown'}")
        logger.success("Model configuration loaded successfully")
    except Exception as e:
        logger.error(f"Could not load model configuration: {e}")
        raise typer.Exit(1)


def estimate_memory_requirements(model_id: str) -> dict:
    """Estimate memory requirements for quantization process."""
    # Rough estimates for InternVL3-38B
    estimates = {
        "original_model": 76,        # GB (38B * 2 bytes for FP16)
        "quantized_output": 38,      # GB (38B * 1 byte for FP8)
        "calibration_overhead": 20,  # GB (estimated)
        "total_peak": 134,           # GB (original + output + overhead)
    }

    logger.info("Memory requirement estimates:")
    for key, value in estimates.items():
        logger.info(f"  {key.replace('_', ' ').title()}: {value} GB")

    return estimates
def generate_model_card(
    source_model: str,
    quantized_model_name: str,
    hf_username: str,
    calibration_dataset: str,
    num_samples: int,
    seq_length: int,
    package_versions: dict,
    script_content: str,
    flash_attn_used: bool,
    attention_implementation: str,
    dynamic: bool = False,
) -> str:
    """Generate comprehensive model card for the quantized VLM."""

    # Determine attention description for model card
    if attention_implementation == "flash_attention_2":
        attention_desc = "Flash Attention 2 (memory efficient, fastest)"
    elif attention_implementation == "sdpa":
        attention_desc = "SDPA (PyTorch native, good compatibility)"
    else:  # eager
        attention_desc = "Eager (standard attention, maximum compatibility)"
    model_card = f"""---
language:
- en
- zh
tags:
- fp8
- quantization
- static
- vision-language
- multimodal
- vllm
- llm-compressor
- internvl3
pipeline_tag: image-text-to-text
inference: false
license: mit
---
# 🔥 InternVL3-38B-FP8-Static: Optimized Vision-Language Model 🔥
This is a FP8 static quantized version of {source_model}, optimized for high-performance inference with vLLM.
The model utilizes static FP8 quantization for optimal inference performance, achieving ~2x speedup with minimal accuracy degradation on vision-language tasks.
## 🚀 Key Features
- FP8 Static Quantization: Maximum inference performance with pre-computed activation scales
- Vision-Language Optimized: Specialized quantization recipe that preserves visual understanding
- vLLM Ready: Seamless integration with vLLM for production deployment
- Memory Efficient: ~50% memory reduction compared to FP16 original
- Performance Boost: Up to 2x faster inference on H100/L40S GPUs
## 📋 Model Details
- Original Model: {source_model}
- Source Model: {source_model}
- Quantized Model: {quantized_model_name}
- Quantization Method: FP8 {'Dynamic' if dynamic else 'Static'} (W8A8)
- Quantization Library: LLM Compressor v{package_versions.get('llmcompressor', 'latest')}
- Calibration Dataset: {calibration_dataset}{f' ({num_samples} samples, seq_len={seq_length})' if not dynamic else ''}
- Attention Implementation: {attention_desc}
- Quantized by: {hf_username}
## 🔧 Usage

### With vLLM (Recommended)

```python
from vllm import LLM, SamplingParams
# Load the quantized model
model = LLM(
model="{hf_username}/{quantized_model_name}",
trust_remote_code=True,
max_model_len=8192,
tensor_parallel_size=1, # Adjust based on your GPU setup
)
# Generate response
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
response = model.generate("Describe this image: <image>", sampling_params)
print(response[0].outputs[0].text)
```

### With Transformers + LLM Compressor

```python
from transformers import AutoTokenizer, AutoProcessor
from llmcompressor import LLM
model_id = "{hf_username}/{quantized_model_name}"
model = LLM.load(model_id, device="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Process image and text
inputs = processor("What's in this image?", image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## 🏗️ Technical Specifications
Hardware Requirements
- Inference: 40-50GB VRAM (single H100/A100 recommended)
- Supported GPUs: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism)
- GPU Architecture: Ada Lovelace, Hopper (for optimal FP8 performance)
Quantization Details
- Weights: FP8 E4M3 with static per-tensor scales
- Activations: FP8 E4M3 with static per-tensor scales
- Preserved Components: Vision tower, embeddings, normalization layers
- Calibration: {num_samples} samples from multimodal dataset
## 📊 Performance Benchmarks
Expected performance improvements over FP16 baseline:
- Throughput: ~2x improvement on H100 GPUs
- Memory: ~50% reduction (76GB → 38GB)
- Latency: ~2x faster time-to-first-token
- Accuracy: >99% retention on vision-language benchmarks
## 🔬 Package Versions
This model was created using:
llmcompressor=={package_versions.get('llmcompressor', 'latest')}
transformers=={package_versions.get('transformers', 'latest')}
torch=={package_versions.get('torch', 'latest')}
vllm=={package_versions.get('vllm', 'latest')}
## 📝 Quantization Script
Click to view the complete quantization script
{script_content}
## 🎯 Use Cases
This optimized model is ideal for:
- Production VLM serving with high throughput requirements
- Real-time image analysis and visual question answering
- Document AI and OCR applications
- Multimodal chatbots and virtual assistants
- Edge deployment on high-end GPUs
## ⚠️ Important Notes
- Requires GPU with FP8 support (H100, L40S) for optimal performance
- Falls back to FP8-Marlin on Ampere GPUs (A100) with reduced benefits
- Vision components preserved in FP16 for maximum compatibility
- Calibrated with diverse multimodal data for robust performance
## 🚫 Limitations
- Specialized hardware: Best performance requires H100-class GPUs
- Model size: Still requires significant VRAM despite quantization
- Research use: Inherits license and usage restrictions from base model
## 📄 License

This quantized model inherits the license from the original model. Original model: [{source_model}](https://huggingface.co/{source_model})
"""

    return model_card
````
</details>
### 🎯 Use Cases
This optimized model is ideal for:
- **Production VLM serving** with high throughput requirements.
- **Real-time image analysis** and visual question answering.
- **Document AI** and OCR applications.
- **Multimodal chatbots** and virtual assistants.
- **Edge deployment** on high-end GPUs.
### ⚠️ Important Notes
> ⚠️ **Important Note**
>
> - Requires a GPU with FP8 support (H100, L40S) for optimal performance.
> - Falls back to FP8-Marlin on Ampere GPUs (A100) with reduced benefits.
> - Vision components are preserved in FP16 for maximum compatibility.
> - Calibrated with diverse multimodal data for robust performance.
### 🚫 Limitations
- **Specialized hardware**: Best performance requires H100-class GPUs.
- **Model size**: Still requires significant VRAM despite quantization.
- **Research use**: Inherits license and usage restrictions from the base model.
## 📄 License
This quantized model inherits the license from the original model. The original model is [OpenGVLab/InternVL3-38B](https://huggingface.co/OpenGVLab/InternVL3-38B).