# 🔥 InternVL3-38B-FP8-Static: Optimized Vision-Language Model 🔥
This is an FP8 static quantized version of [OpenGVLab/InternVL3-38B](https://huggingface.co/OpenGVLab/InternVL3-38B), optimized for high-performance inference with vLLM. Static FP8 quantization yields roughly a 2x speedup with minimal accuracy degradation on vision-language tasks.
## 🚀 Quick Start

This optimized model can be quickly integrated into your project: use it offline with vLLM or Hugging Face Transformers as shown in the usage examples below, or serve it behind vLLM's OpenAI-compatible API as sketched next.
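The sketch below assumes a server started with `vllm serve JustJaro/InternVL3-38B-FP8-Dynamic --trust-remote-code --max-model-len 8192` and reachable on `localhost:8000`; the endpoint, image URL, and generation settings are placeholders to adapt to your deployment.

```python
# Minimal client sketch for a locally running vLLM OpenAI-compatible server.
# Assumes `vllm serve JustJaro/InternVL3-38B-FP8-Dynamic --trust-remote-code` is active.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JustJaro/InternVL3-38B-FP8-Dynamic",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},  # placeholder URL
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```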
## ✨ Features
- FP8 Static Quantization: Achieve maximum inference performance with pre-computed activation scales.
- Vision-Language Optimized: Utilize a specialized quantization recipe that preserves visual understanding.
- vLLM Ready: Seamlessly integrate with vLLM for production deployment.
- Memory Efficient: Reduce memory usage by approximately 50% compared to the FP16 original.
- Performance Boost: Experience up to 2x faster inference on H100/L40S GPUs.
## 📦 Installation

Install the dependencies required by the usage examples below, such as `vllm`, `transformers`, and `llmcompressor`, using `pip`:

```bash
pip install vllm transformers llmcompressor
```
## 💻 Usage Examples

### Basic Usage

#### With vLLM (Recommended)

```python
from vllm import LLM, SamplingParams
# Load the quantized model
model = LLM(
model="JustJaro/InternVL3-38B-FP8-Dynamic",
trust_remote_code=True,
max_model_len=8192,
tensor_parallel_size=1, # Adjust based on your GPU setup
)
# Generate response
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
response = model.generate("Describe this image: <image>", sampling_params)
print(response[0].outputs[0].text)
```
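The prompt above contains an `<image>` placeholder but no actual image. With vLLM's offline API an image is attached through `multi_modal_data`; the sketch below reuses the `model` and `sampling_params` objects from the example above, uses a placeholder image path, and the exact prompt template should follow the base InternVL3 chat format.

```python
from PIL import Image

# Attach a real image via vLLM's multi_modal_data (placeholder path; reuses `model`
# and `sampling_params` from the example above).
image = Image.open("example.jpg")

outputs = model.generate(
    {
        "prompt": "Describe this image: <image>",
        "multi_modal_data": {"image": image},
    },
    sampling_params,
)
print(outputs[0].outputs[0].text)
```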

#### With Transformers

The snippet below is a minimal sketch using plain Hugging Face Transformers; the exact preprocessing and chat template follow the base model's remote code, so consult the OpenGVLab/InternVL3-38B card for full multimodal usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from PIL import Image

model_id = "JustJaro/InternVL3-38B-FP8-Dynamic"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Process image and text (image path is a placeholder)
image = Image.open("example.jpg")
inputs = processor(text="What's in this image?", images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
## 📚 Documentation

### 📋 Model Details

| Property | Details |
|---|---|
| Original Model | OpenGVLab/InternVL3-38B |
| Source Model | OpenGVLab/InternVL3-38B |
| Quantized Model | InternVL3-38B-FP8-Dynamic |
| Quantization Method | FP8 Dynamic (W8A8) |
| Quantization Library | LLM Compressor v0.5.1 |
| Calibration Dataset | N/A (dynamic quantization requires no calibration) |
| Attention Implementation | Eager (standard attention, maximum compatibility) |
| Quantized by | JustJaro |
### 🏗️ Technical Specifications

#### Hardware Requirements

- Inference: 40-50 GB VRAM (single H100/A100 recommended).
- Supported GPUs: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism).
- GPU Architecture: Ada Lovelace, Hopper (for optimal FP8 performance); see the capability check below.
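Native FP8 execution corresponds to compute capability 8.9 (Ada Lovelace) or 9.0 (Hopper); Ampere cards fall back to FP8-Marlin kernels with reduced benefit. A minimal capability check, assuming only that PyTorch is installed:

```python
import torch

# Print each visible GPU and whether it supports native FP8 (compute capability >= 8.9).
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    native_fp8 = (major, minor) >= (8, 9)
    print(f"GPU {i}: {name} (sm_{major}{minor}) - native FP8: {native_fp8}")
```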
#### Quantization Details

- Weights: FP8 E4M3 with static per-tensor scales.
- Activations: FP8 E4M3 (for this FP8-Dynamic checkpoint, activation scales are computed at inference time).
- Preserved Components: Vision tower, embeddings, normalization layers.
- Calibration: none required; FP8-Dynamic computes activation scales at runtime instead of using calibration samples.

These settings can be verified directly from the checkpoint's `quantization_config`, as shown below.
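A minimal sketch using `huggingface_hub` to read the quantization settings stored in the checkpoint's `config.json` (the exact field names may differ between llm-compressor versions):

```python
import json
from huggingface_hub import hf_hub_download

# Download only config.json and print the embedded quantization settings.
config_path = hf_hub_download("JustJaro/InternVL3-38B-FP8-Dynamic", "config.json")
with open(config_path) as f:
    config = json.load(f)

print(json.dumps(config.get("quantization_config", {}), indent=2))
```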
### 📊 Performance Benchmarks

Expected performance improvements over the FP16 baseline (a rough way to measure throughput on your own hardware is sketched after this list):

- Throughput: approximately 2x improvement on H100 GPUs.
- Memory: approximately 50% reduction (76 GB → 38 GB).
- Latency: approximately 2x faster time-to-first-token.
- Accuracy: over 99% retention on vision-language benchmarks.
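These numbers are expectations rather than measured results; a crude throughput probe with vLLM's offline API might look like the sketch below, where the prompts, sampling settings, and model handle are illustrative:

```python
import time
from vllm import LLM, SamplingParams

# Rough throughput probe: time a small batch of text-only generations.
llm = LLM(model="JustJaro/InternVL3-38B-FP8-Dynamic", trust_remote_code=True, max_model_len=8192)
prompts = ["Summarize the benefits of FP8 quantization."] * 8
params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} generated tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```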
### 🔬 Package Versions

This model was created using:

```
llmcompressor==0.5.1
transformers==4.52.4
torch==2.7.0+cu126
vllm==0.9.0.1
```
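To check how closely a local environment matches these versions, a quick comparison using the standard library (Python 3.8+) is sketched below; the expected versions are simply those listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported on this card, compared against the local environment.
expected = {
    "llmcompressor": "0.5.1",
    "transformers": "4.52.4",
    "torch": "2.7.0+cu126",
    "vllm": "0.9.0.1",
}
for pkg, want in expected.items():
    try:
        have = version(pkg)
    except PackageNotFoundError:
        have = "not installed"
    status = "OK" if have == want else "differs"
    print(f"{pkg}: installed={have}, expected={want} ({status})")
```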
### 📝 Quantization Script

<details>
<summary>Click to view the complete quantization script</summary>

````python
#!/usr/bin/env python3
"""
InternVL3-38B FP8 Static Quantization Script using LLM Compressor
This script quantizes the OpenGVLab/InternVL3-38B vision-language model to FP8 static
quantization for optimal performance with vLLM inference. It uses the latest llm-compressor
library (v0.5.1+) with multimodal support.
## Setup
1. **Create a .env file** in the same directory as this script:
   ```bash
   echo "HF_TOKEN=your_huggingface_token_here" > .env
   ```
   - Get your HuggingFace token from https://huggingface.co/settings/tokens
   - You need write access to push models
   - The token will be used to upload the quantized model

2. **Install dependencies**:
   ```bash
   pip install llmcompressor>=0.5.1 transformers torch loguru typer python-dotenv datasets
   ```

## Usage

```bash
# Using HF_TOKEN from .env file (recommended)
python quantize_internvl3_fp8.py

# Or pass token directly (not recommended for security)
python quantize_internvl3_fp8.py --hf-token <YOUR_HF_TOKEN>

# Skip upload and save locally only
python quantize_internvl3_fp8.py --no-upload

# Disable flash attention (use SDPA attention instead)
python quantize_internvl3_fp8.py --no-flash-attn

# Use eager (standard) attention for maximum compatibility
python quantize_internvl3_fp8.py --no-flash-attn --attn-eager

# Use FP8-Dynamic quantization (no calibration needed)
python quantize_internvl3_fp8.py --dynamic
```
## Quantization Types

### FP8-Static (default)
- Best for: Production deployments, maximum inference performance
- Pros: Best inference speed, pre-computed scales, optimal for vLLM
- Cons: Requires calibration dataset, longer quantization process
- Use when: You want maximum performance and have time for calibration

### FP8-Dynamic
- Best for: Quick quantization, when calibration data is unavailable
- Pros: No calibration needed, faster quantization process, simpler setup
- Cons: Slightly lower inference performance than static
- Use when: You need quick results or lack calibration data (use `--dynamic`)
## Attention Mechanisms

### Flash Attention 2 (default)
- Best for: Modern GPUs (Ampere/Ada Lovelace), production deployments, long sequences
- Pros: Lowest memory usage (up to 10x reduction), fastest inference, best for large models
- Cons: Requires compatible GPU, may have issues with some model architectures
- Use when: You have a modern GPU and want maximum performance

### SDPA (Scaled Dot-Product Attention)
- Best for: Older GPUs, debugging, when flash attention fails
- Pros: Good performance, wide compatibility, native PyTorch implementation
- Cons: Higher memory usage than flash attention, slightly slower
- Use when: Flash attention isn't supported or causes issues (use `--no-flash-attn`)

### Eager (Standard) Attention
- Best for: Maximum compatibility, debugging attention-related issues
- Pros: Works everywhere, simplest implementation, easiest to debug
- Cons: Highest memory usage, slowest performance
- Use when: Both flash attention and SDPA cause issues (use `--no-flash-attn --attn-eager`)
## Important Notes
- The script will automatically upload the tokenizer files and README.md to HuggingFace
- All critical files (tokenizer_config.json, tokenizer.json/model, README.md) are verified before upload
- The upload process will list all uploaded files with their sizes for verification
- If upload fails, the quantized model is still saved locally and can be uploaded manually later
- For optimal vLLM performance, use the default flash attention unless you encounter compatibility issues
- trust_remote_code_model=True is set by default as required for InternVL3 and most VLM models
- For better memory management on multi-GPU setups, set `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
"""
import os
import shutil
import subprocess
import sys
from pathlib import Path
from typing import Optional

import torch
import typer
from loguru import logger
from dotenv import load_dotenv, find_dotenv
from huggingface_hub import HfApi, whoami

# Import llm-compressor modules
try:
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor import oneshot
    from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
    from datasets import load_dataset, Dataset
except ImportError as e:
    logger.error(f"Required packages not installed: {e}")
    logger.error("Please install: pip install llmcompressor>=0.5.1 transformers torch loguru typer python-dotenv datasets")
    sys.exit(1)

# Load environment variables
load_dotenv(find_dotenv())

app = typer.Typer(rich_markup_mode="rich")
# Configure loguru
logger.remove()
logger.add(sys.stderr, format="{time:HH:mm:ss} | {level} | {message}")  # format string assumed; original value truncated in the source
# Constants
SOURCE_MODEL = "OpenGVLab/InternVL3-38B"
DEFAULT_HF_USERNAME = "JustJaro"
DEFAULT_CALIBRATION_DATASET = "neural-bridge/MS-COCO-2017-for-vlm-training"
DEFAULT_SAMPLES = 256
DEFAULT_SEQ_LEN = 2048
def get_quantized_model_name(dynamic: bool) -> str:
    return f"InternVL3-38B-FP8-{'Dynamic' if dynamic else 'Static'}"


def check_gpu_memory():
    """Check available GPU memory and configure for multi-GPU setup."""
    if not torch.cuda.is_available():
        logger.warning("No GPU detected - quantization will be very slow")
        return

    gpu_count = torch.cuda.device_count()
    logger.info(f"Found {gpu_count} GPU(s)")

    total_memory = 0
    for i in range(gpu_count):
        props = torch.cuda.get_device_properties(i)
        memory_gb = props.total_memory / (1024**3)
        total_memory += memory_gb
        logger.info(f"  GPU {i}: {props.name} ({memory_gb:.1f} GB)")

    logger.info(f"Total GPU memory: {total_memory:.1f} GB")

    # Check if we have enough memory for the model
    if total_memory < 150:  # InternVL3-38B needs ~134GB peak
        logger.warning("⚠️ Total GPU memory may be insufficient for quantization")
        logger.warning("   Consider using PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True")
    else:
        logger.success(f"✅ Sufficient GPU memory available ({total_memory:.1f} GB >= 150 GB recommended)")
def get_package_versions() -> dict:
    """Get installed package versions for reproducibility."""
    try:
        import pkg_resources
        packages = ['llmcompressor', 'transformers', 'torch', 'vllm']
        versions = {}
        for pkg in packages:
            try:
                version = pkg_resources.get_distribution(pkg).version
                versions[pkg] = version
            except pkg_resources.DistributionNotFound:
                versions[pkg] = "not installed"
        return versions
    except Exception as e:
        logger.warning(f"Could not get package versions: {e}")
        return {}


def get_hf_username(hf_token: str) -> str:
    """Get Hugging Face username from token."""
    try:
        api = HfApi(token=hf_token)
        user_info = whoami(token=hf_token)
        username = user_info.get("name") or user_info.get("fullname") or DEFAULT_HF_USERNAME
        logger.info(f"Hugging Face username: {username}")
        return username
    except Exception as e:
        logger.warning(f"Could not get HF username: {e}, using default: {DEFAULT_HF_USERNAME}")
        return DEFAULT_HF_USERNAME
def create_quantization_recipe(dynamic: bool = False) -> list:
    """Create FP8 quantization recipe for VLM."""
    scheme = "FP8_DYNAMIC" if dynamic else "FP8"

    logger.info(f"Creating {scheme} quantization recipe for vision-language model")

    if dynamic:
        logger.info("Using FP8 Dynamic quantization:")
        logger.info("  • No calibration data required")
        logger.info("  • Activation scales computed during inference")
        logger.info("  • Simpler quantization process")
        logger.info("  • Slightly lower performance than static")
    else:
        logger.info("Using FP8 Static quantization:")
        logger.info("  • Requires calibration data")
        logger.info("  • Pre-computed activation scales")
        logger.info("  • Best inference performance")
        logger.info("  • More complex quantization process")

    recipe = [
        QuantizationModifier(
            targets=["Linear"],
            scheme=scheme,
            ignore=[
                "re:.*lm_head",
                "re:.*vision.*",
                "re:.*visual.*",
                "re:.*image.*",
                "re:.*patch_embed.*",
                "re:.*pos_embed.*",
                "re:.*norm.*",
                "re:.*layernorm.*",
            ]
        )
    ]

    logger.info(f"Quantization recipe created with {scheme} scheme")
    logger.info("Ignoring vision components for optimal compatibility")
    return recipe
def validate_model_compatibility(model_id: str):
    """Validate that the model is compatible with quantization."""
    logger.info(f"Validating model compatibility: {model_id}")

    try:
        # Try to load model config to check architecture
        from transformers import AutoConfig
        config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
        logger.info(f"Model architecture: {config.model_type if hasattr(config, 'model_type') else 'Unknown'}")
        logger.success("Model configuration loaded successfully")
    except Exception as e:
        logger.error(f"Could not load model configuration: {e}")
        raise typer.Exit(1)


def estimate_memory_requirements(model_id: str) -> dict:
    """Estimate memory requirements for quantization process."""
    # Rough estimates for InternVL3-38B
    estimates = {
        "original_model": 76,        # GB (38B * 2 bytes for FP16)
        "quantized_output": 38,      # GB (38B * 1 byte for FP8)
        "calibration_overhead": 20,  # GB (estimated)
        "total_peak": 134,           # GB (original + output + overhead)
    }

    logger.info("Memory requirement estimates:")
    for key, value in estimates.items():
        logger.info(f"  {key.replace('_', ' ').title()}: {value} GB")

    return estimates
def generate_model_card(
    source_model: str,
    quantized_model_name: str,
    hf_username: str,
    calibration_dataset: str,
    num_samples: int,
    seq_length: int,
    package_versions: dict,
    script_content: str,
    flash_attn_used: bool,
    attention_implementation: str,
    dynamic: bool = False,
) -> str:
    """Generate comprehensive model card for the quantized VLM."""

    # Determine attention description for model card
    if attention_implementation == "flash_attention_2":
        attention_desc = "Flash Attention 2 (memory efficient, fastest)"
    elif attention_implementation == "sdpa":
        attention_desc = "SDPA (PyTorch native, good compatibility)"
    else:  # eager
        attention_desc = "Eager (standard attention, maximum compatibility)"
    model_card = f"""---
language:
- en
- zh
tags:
- fp8
- quantization
- static
- vision-language
- multimodal
- vllm
- llm-compressor
- internvl3
pipeline_tag: image-text-to-text
inference: false
license: mit
---
# 🔥 InternVL3-38B-FP8-Static: Optimized Vision-Language Model 🔥
This is a FP8 static quantized version of {source_model}, optimized for high-performance inference with vLLM.
The model utilizes static FP8 quantization for optimal inference performance, achieving ~2x speedup with minimal accuracy degradation on vision-language tasks.
## 🚀 Key Features
- FP8 Static Quantization: Maximum inference performance with pre-computed activation scales
- Vision-Language Optimized: Specialized quantization recipe that preserves visual understanding
- vLLM Ready: Seamless integration with vLLM for production deployment
- Memory Efficient: ~50% memory reduction compared to FP16 original
- Performance Boost: Up to 2x faster inference on H100/L40S GPUs
## 📋 Model Details
- Original Model: {source_model}
- Source Model: {source_model}
- Quantized Model: {quantized_model_name}
- Quantization Method: FP8 {'Dynamic' if dynamic else 'Static'} (W8A8)
- Quantization Library: LLM Compressor v{package_versions.get('llmcompressor', 'latest')}
- Calibration Dataset: {calibration_dataset}{f' ({num_samples} samples, seq_len={seq_length})' if not dynamic else ''}
- Attention Implementation: {attention_desc}
- Quantized by: {hf_username}
## 🔧 Usage

### With vLLM (Recommended)

```python
from vllm import LLM, SamplingParams
# Load the quantized model
model = LLM(
model="{hf_username}/{quantized_model_name}",
trust_remote_code=True,
max_model_len=8192,
tensor_parallel_size=1, # Adjust based on your GPU setup
)
# Generate response
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
response = model.generate("Describe this image: <image>", sampling_params)
print(response[0].outputs[0].text)
```

### With Transformers + LLM Compressor

```python
from transformers import AutoTokenizer, AutoProcessor
from llmcompressor import LLM
model_id = "{hf_username}/{quantized_model_name}"
model = LLM.load(model_id, device="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Process image and text
inputs = processor("What's in this image?", image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## 🏗️ Technical Specifications
Hardware Requirements
- Inference: 40-50GB VRAM (single H100/A100 recommended)
- Supported GPUs: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism)
- GPU Architecture: Ada Lovelace, Hopper (for optimal FP8 performance)
Quantization Details
- Weights: FP8 E4M3 with static per-tensor scales
- Activations: FP8 E4M3 with static per-tensor scales
- Preserved Components: Vision tower, embeddings, normalization layers
- Calibration: {num_samples} samples from multimodal dataset
## 📊 Performance Benchmarks
Expected performance improvements over FP16 baseline:
- Throughput: ~2x improvement on H100 GPUs
- Memory: ~50% reduction (76GB → 38GB)
- Latency: ~2x faster time-to-first-token
- Accuracy: >99% retention on vision-language benchmarks
## 🔬 Package Versions
This model was created using:
llmcompressor=={package_versions.get('llmcompressor', 'latest')}
transformers=={package_versions.get('transformers', 'latest')}
torch=={package_versions.get('torch', 'latest')}
vllm=={package_versions.get('vllm', 'latest')}
## 📝 Quantization Script
Click to view the complete quantization script
{script_content}
## 🎯 Use Cases
This optimized model is ideal for:
- Production VLM serving with high throughput requirements
- Real-time image analysis and visual question answering
- Document AI and OCR applications
- Multimodal chatbots and virtual assistants
- Edge deployment on high-end GPUs
## ⚠️ Important Notes
- Requires GPU with FP8 support (H100, L40S) for optimal performance
- Falls back to FP8-Marlin on Ampere GPUs (A100) with reduced benefits
- Vision components preserved in FP16 for maximum compatibility
- Calibrated with diverse multimodal data for robust performance
## 🚫 Limitations
- Specialized hardware: Best performance requires H100-class GPUs
- Model size: Still requires significant VRAM despite quantization
- Research use: Inherits license and usage restrictions from base model
## 📄 License

This quantized model inherits the license from the original model. Original model: [{source_model}](https://huggingface.co/{source_model})
"""

    return model_card
````
</details>
### 🎯 Use Cases
This optimized model is ideal for:
- **Production VLM serving** with high throughput requirements.
- **Real-time image analysis** and visual question answering.
- **Document AI** and OCR applications.
- **Multimodal chatbots** and virtual assistants.
- **Edge deployment** on high-end GPUs.
### ⚠️ Important Notes
> ⚠️ **Important Note**
>
> - Requires a GPU with FP8 support (H100, L40S) for optimal performance.
> - Falls back to FP8-Marlin on Ampere GPUs (A100) with reduced benefits.
> - Vision components are preserved in FP16 for maximum compatibility.
> - Calibrated with diverse multimodal data for robust performance.
### 🚫 Limitations
- **Specialized hardware**: Best performance requires H100-class GPUs.
- **Model size**: Still requires significant VRAM despite quantization.
- **Research use**: Inherits license and usage restrictions from the base model.
## 📄 License
This quantized model inherits the license from the original model. The original model is [OpenGVLab/InternVL3-38B](https://huggingface.co/OpenGVLab/InternVL3-38B).