Internvl3 38B FP8 Dynamic

ConfidentialMindによって開発

これはOpenGVLab/InternVL3-38BのFP8静的量子化バージョンで、vLLMを使用した高性能推論に最適化されており、ビジュアル言語タスクで約2倍の高速化を実現し、同時に精度の損失は極めて少ないです。

テキスト生成画像

Safetensors

複数言語対応オープンソースライセンス:MIT #FP8量子化加速 #マルチモーダル推論 #ビジュアル言語大規模モデル

ダウンロード数 5,173

リリース時間 : 5/31/2025

モデル概要

最適化されたビジュアル言語モデルで、FP8静的量子化により高性能推論を実現し、マルチモーダルタスクに適しています。

モデル特徴

FP8静的量子化

事前計算された活性化尺度により最大の推論性能を実現します

ビジュアル言語最適化

専用の量子化方法で、ビジュアル理解能力を保持します

vLLMサポート

vLLMとシームレスに統合でき、本番環境でのデプロイが容易です

メモリ効率化

元のFP16バージョンと比較して、メモリ使用量が約50％削減されます

性能向上

H100/L40S GPUでの推論速度が最大2倍に向上します

モデル能力

画像理解

テキスト生成

ビジュアル質問応答

マルチモーダル推論

使用事例

本番環境サービス

リアルタイム画像分析

高スループットが必要なビジュアル言語モデルサービスに使用されます

約2倍の推論速度向上

文書処理

文書AIとOCR

画像とテキストを含む文書を処理します

インタラクティブアプリケーション

マルチモーダルチャットボット

画像とテキストを理解できるバーチャルアシスタントを構築します

🚀 🔥InternVL3-38B-FP8-Static: 最適化されたビジョン言語モデル🔥

これは、OpenGVLab/InternVL3-38B の FP8静的量子化 バージョンで、vLLMを用いた高性能推論に最適化されています。このモデルは、静的FP8量子化 を利用して最適な推論性能を実現し、ビジョン言語タスクで精度の低下を最小限に抑えながら、約2倍の高速化を達成します。

🚀 クイックスタート

このモデルは、vLLMを用いた高性能推論に最適化されたビジョン言語モデルです。以下のセクションでは、このモデルの主な機能、インストール方法、使用例、技術詳細などについて説明します。

✨ 主な機能

FP8静的量子化：事前計算されたアクティベーションスケールにより、最大限の推論性能を実現します。
ビジョン言語最適化：視覚理解を維持するための特殊な量子化レシピが適用されています。
vLLM対応：vLLMとのシームレスな統合により、本番環境でのデプロイが容易です。
メモリ効率化：FP16のオリジナルモデルと比較して、約50%のメモリ削減が実現されています。
性能向上：H100/L40S GPUでは、最大2倍の高速な推論が可能です。

📦 インストール

このモデルを使用するには、必要なライブラリをインストールする必要があります。以下のコマンドを実行して、必要なライブラリをインストールしてください。

llmcompressor==0.5.1
transformers==4.52.4
torch==2.7.0+cu126
vllm==0.9.0.1

💻 使用例

基本的な使用法

vLLMを使用する場合（推奨）

from vllm import LLM, SamplingParams

# Load the quantized model
model = LLM(
    model="JustJaro/InternVL3-38B-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=8192,
    tensor_parallel_size=1,  # Adjust based on your GPU setup
)

# Generate response
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
response = model.generate("Describe this image: <image>", sampling_params)
print(response[0].outputs[0].text)

Transformers + LLM Compressorを使用する場合

from transformers import AutoTokenizer, AutoProcessor
from llmcompressor import LLM

model_id = "JustJaro/InternVL3-38B-FP8-Dynamic"
model = LLM.load(model_id, device="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Process image and text
inputs = processor("What's in this image?", image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

📚 ドキュメント

📊 モデル詳細

プロパティ	詳細
元のモデル	OpenGVLab/InternVL3-38B
ソースモデル	OpenGVLab/InternVL3-38B
量子化モデル	InternVL3-38B-FP8-Dynamic
量子化方法	FP8 Dynamic (W8A8)
量子化ライブラリ	LLM Compressor v0.5.1
キャリブレーションデータセット	N/A
アテンション実装	Eager (標準アテンション、最大の互換性)
量子化者	JustJaro

🏗️ 技術仕様

ハードウェア要件

推論：40 - 50GBのVRAM（単一のH100/A100を推奨）
対応GPU：H100、L40S、A100 (80GB)、RTX 4090 (テンソル並列用に2台)
GPUアーキテクチャ：Ada Lovelace、Hopper（最適なFP8性能のため）

量子化詳細

重み：静的なテンソルごとのスケールを持つFP8 E4M3
アクティベーション：静的なテンソルごとのスケールを持つFP8 E4M3
保持されるコンポーネント：ビジョンタワー、埋め込み、正規化レイヤー
キャリブレーション：マルチモーダルデータセットから0サンプル

📈 性能ベンチマーク

FP16ベースラインと比較した予想される性能向上：

スループット：H100 GPUで約2倍の改善
メモリ：約50%の削減（76GB → 38GB）
レイテンシー：最初のトークンまでの時間が約2倍高速化
精度：ビジョン言語ベンチマークで99%以上の維持

🔬 パッケージバージョン

このモデルは、以下のパッケージバージョンを使用して作成されました。

llmcompressor==0.5.1
transformers==4.52.4
torch==2.7.0+cu126
vllm==0.9.0.1

📋 量子化スクリプト

完全な量子化スクリプトを表示するにはクリック

#!/usr/bin/env python3
"""
InternVL3-38B FP8 Static Quantization Script using LLM Compressor

This script quantizes the OpenGVLab/InternVL3-38B vision-language model to FP8 static 
quantization for optimal performance with vLLM inference. It uses the latest llm-compressor
library (v0.5.1+) with multimodal support.

## Setup

1. **Create a .env file** in the same directory as this script:
   ```bash
   echo "HF_TOKEN=your_huggingface_token_here" > .env

Get your HuggingFace token from https://huggingface.co/settings/tokens
- You need write access to push models
- The token will be used to upload the quantized model

Install dependencies:

pip install llmcompressor>=0.5.1 transformers torch loguru typer python-dotenv datasets

Usage

# Using HF_TOKEN from .env file (recommended)
python quantize_internvl3_fp8.py

# Or pass token directly (not recommended for security)
python quantize_internvl3_fp8.py --hf-token <YOUR_HF_TOKEN>

# Skip upload and save locally only
python quantize_internvl3_fp8.py --no-upload

# Disable flash attention (use SDPA attention instead)
python quantize_internvl3_fp8.py --no-flash-attn

# Use eager (standard) attention for maximum compatibility
python quantize_internvl3_fp8.py --no-flash-attn --attn-eager

# Use FP8-Dynamic quantization (no calibration needed)
python quantize_internvl3_fp8.py --dynamic

Quantization Types

FP8-Static (default)

Best for: Production deployments, maximum inference performance
Pros: Best inference speed, pre-computed scales, optimal for vLLM
Cons: Requires calibration dataset, longer quantization process
Use when: You want maximum performance and have time for calibration

FP8-Dynamic

Best for: Quick quantization, when calibration data is unavailable
Pros: No calibration needed, faster quantization process, simpler setup
Cons: Slightly lower inference performance than static
Use when: You need quick results or lack calibration data (use --dynamic)

Attention Mechanisms

Flash Attention 2 (default)

Best for: Modern GPUs (Ampere/Ada Lovelace), production deployments, long sequences
Pros: Lowest memory usage (up to 10x reduction), fastest inference, best for large models
Cons: Requires compatible GPU, may have issues with some model architectures
Use when: You have a modern GPU and want maximum performance

SDPA (Scaled Dot-Product Attention)

Best for: Older GPUs, debugging, when flash attention fails
Pros: Good performance, wide compatibility, native PyTorch implementation
Cons: Higher memory usage than flash attention, slightly slower
Use when: Flash attention isn't supported or causes issues (use --no-flash-attn)

Eager (Standard) Attention

Best for: Maximum compatibility, debugging attention-related issues
Pros: Works everywhere, simplest implementation, easiest to debug
Cons: Highest memory usage, slowest performance
Use when: Both flash attention and SDPA cause issues (use --no-flash-attn --attn-eager)

Important Notes

The script will automatically upload the tokenizer files and README.md to HuggingFace
All critical files (tokenizer_config.json, tokenizer.json/model, README.md) are verified before upload
The upload process will list all uploaded files with their sizes for verification
If upload fails, the quantized model is still saved locally and can be uploaded manually later
For optimal vLLM performance, use the default flash attention unless you encounter compatibility issues
trust_remote_code_model=True is set by default as required for InternVL3 and most VLM models
For better memory management on multi-GPU setups, set: export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True """

import os import shutil import subprocess import sys from pathlib import Path from typing import Optional

import torch import typer from loguru import logger from dotenv import load_dotenv, find_dotenv from huggingface_hub import HfApi, whoami

Import llm-compressor modules

try: from llmcompressor.modifiers.quantization import QuantizationModifier from llmcompressor import oneshot from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor from datasets import load_dataset, Dataset except ImportError as e: logger.error(f"Required packages not installed: {e}") logger.error("Please install: pip install llmcompressor>=0.5.1 transformers torch loguru typer python-dotenv datasets") sys.exit(1)

Load environment variables

load_dotenv(find_dotenv())

app = typer.Typer(rich_markup_mode="rich")

Configure loguru

logger.remove() logger.add(sys.stderr, format="{time:YYYY-MM-DD HH:mm:ss} | {level: <8} | {name}:{function}:{line} - {message}") logger.add("quantization.log", format="{time:YYYY-MM-DD HH:mm:ss} | {level: <8} | {name}:{function}:{line} - {message}")

Constants

SOURCE_MODEL = "OpenGVLab/InternVL3-38B" DEFAULT_HF_USERNAME = "JustJaro" DEFAULT_CALIBRATION_DATASET = "neural-bridge/MS-COCO-2017-for-vlm-training" DEFAULT_SAMPLES = 256 DEFAULT_SEQ_LEN = 2048

def get_quantized_model_name(dynamic: bool) -> str: return f"InternVL3-38B-FP8-{'Dynamic' if dynamic else 'Static'}"

def check_gpu_memory(): """Check available GPU memory and configure for multi-GPU setup.""" if not torch.cuda.is_available(): logger.warning("No GPU detected - quantization will be very slow") return

gpu_count = torch.cuda.device_count()
logger.info(f"Found {gpu_count} GPU(s)")

total_memory = 0
for i in range(gpu_count):
    props = torch.cuda.get_device_properties(i)
    memory_gb = props.total_memory / (1024**3)
    total_memory += memory_gb
    logger.info(f"  GPU {i}: {props.name} ({memory_gb:.1f} GB)")

logger.info(f"Total GPU memory: {total_memory:.1f} GB")

# Check if we have enough memory for the model
if total_memory < 150:  # InternVL3-38B needs ~134GB peak
    logger.warning("⚠️  Total GPU memory may be insufficient for quantization")
    logger.warning("   Consider using PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True")
else:
    logger.success(f"✅ Sufficient GPU memory available ({total_memory:.1f} GB >= 150 GB recommended)")

def get_package_versions() -> dict: """Get installed package versions for reproducibility.""" try: import pkg_resources packages = ['llmcompressor', 'transformers', 'torch', 'vllm'] versions = {} for pkg in packages: try: version = pkg_resources.get_distribution(pkg).version versions[pkg] = version except pkg_resources.DistributionNotFound: versions[pkg] = "not installed" return versions except Exception as e: logger.warning(f"Could not get package versions: {e}") return {}

def get_hf_username(hf_token: str) -> str: """Get Hugging Face username from token.""" try: api = HfApi(token=hf_token) user_info = whoami(token=hf_token) username = user_info.get("name") or user_info.get("fullname") or DEFAULT_HF_USERNAME logger.info(f"Hugging Face username: {username}") return username except Exception as e: logger.warning(f"Could not get HF username: {e}, using default: {DEFAULT_HF_USERNAME}") return DEFAULT_HF_USERNAME

def create_quantization_recipe(dynamic: bool = False) -> list: """Create FP8 quantization recipe for VLM.""" scheme = "FP8_DYNAMIC" if dynamic else "FP8"

logger.info(f"Creating {scheme} quantization recipe for vision-language model")

if dynamic:
    logger.info("Using FP8 Dynamic quantization:")
    logger.info("  • No calibration data required")
    logger.info("  • Activation scales computed during inference")
    logger.info("  • Simpler quantization process")
    logger.info("  • Slightly lower performance than static")
else:
    logger.info("Using FP8 Static quantization:")
    logger.info("  • Requires calibration data")
    logger.info("  • Pre-computed activation scales")
    logger.info("  • Best inference performance")
    logger.info("  • More complex quantization process")

recipe = [
    QuantizationModifier(
        targets=["Linear"],
        scheme=scheme,
        ignore=[
            "re:.*lm_head",
            "re:.*vision.*",
            "re:.*visual.*",  
            "re:.*image.*",
            "re:.*patch_embed.*",
            "re:.*pos_embed.*",
            "re:.*norm.*",
            "re:.*layernorm.*",
        ]
    )
]

logger.info(f"Quantization recipe created with {scheme} scheme")
logger.info("Ignoring vision components for optimal compatibility")

return recipe

def validate_model_compatibility(model_id: str): """Validate that the model is compatible with quantization.""" logger.info(f"Validating model compatibility: {model_id}")

try:
    # Try to load model config to check architecture
    from transformers import AutoConfig
    config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    logger.info(f"Model architecture: {config.model_type if hasattr(config, 'model_type') else 'Unknown'}")
    logger.success("Model configuration loaded successfully")
except Exception as e:
    logger.error(f"Could not load model configuration: {e}")
    raise typer.Exit(1)

def estimate_memory_requirements(model_id: str) -> dict: """Estimate memory requirements for quantization process.""" # Rough estimates for InternVL3-38B estimates = { "original_model": 76, # GB (38B * 2 bytes for FP16) "quantized_output": 38, # GB (38B * 1 byte for FP8) "calibration_overhead": 20, # GB (estimated) "total_peak": 134 # GB (original + output + overhead) }

logger.info("Memory requirement estimates:")
for key, value in estimates.items():
    logger.info(f"  {key.replace('_', ' ').title()}: {value} GB")

return estimates

def generate_model_card( source_model: str, quantized_model_name: str, hf_username: str, calibration_dataset: str, num_samples: int, seq_length: int, package_versions: dict, script_content: str, flash_attn_used: bool, attention_implementation: str, dynamic: bool = False ) -> str: """Generate comprehensive model card for the quantized VLM."""

# Determine attention description for model card
if attention_implementation == "flash_attention_2":
    attention_desc = "Flash Attention 2 (memory efficient, fastest)"
elif attention_implementation == "sdpa":
    attention_desc = "SDPA (PyTorch native, good compatibility)"
else:  # eager
    attention_desc = "Eager (standard attention, maximum compatibility)"

model_card = f"""---

language:

en
zh tags:
fp8
quantization
static
vision-language
multimodal
vllm
llm-compressor
internvl3 pipeline_tag: image-text-to-text inference: false license: mit

🔥 InternVL3-38B-FP8-Static: Optimized Vision-Language Model 🔥

This is a FP8 static quantized version of {source_model}, optimized for high-performance inference with vLLM.

The model utilizes static FP8 quantization for optimal inference performance, achieving ~2x speedup with minimal accuracy degradation on vision-language tasks.

🚀 Key Features

FP8 Static Quantization: Maximum inference performance with pre-computed activation scales
Vision-Language Optimized: Specialized quantization recipe that preserves visual understanding
vLLM Ready: Seamless integration with vLLM for production deployment
Memory Efficient: ~50% memory reduction compared to FP16 original
Performance Boost: Up to 2x faster inference on H100/L40S GPUs

📊 Model Details

Original Model: {source_model}
Source Model: {source_model}
Quantized Model: {quantized_model_name}
Quantization Method: FP8 {'Dynamic' if dynamic else 'Static'} (W8A8)
Quantization Library: LLM Compressor v{package_versions.get('llmcompressor', 'latest')}
Calibration Dataset: {calibration_dataset}{f' ({num_samples} samples, seq_len={seq_length})' if not dynamic else ''}
Attention Implementation: {attention_desc}
Quantized by: {hf_username}

🔧 Usage

With vLLM (Recommended)

from vllm import LLM, SamplingParams

# Load the quantized model
model = LLM(
    model="{hf_username}/{quantized_model_name}",
    trust_remote_code=True,
    max_model_len=8192,
    tensor_parallel_size=1,  # Adjust based on your GPU setup
)

# Generate response
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
response = model.generate("Describe this image: <image>", sampling_params)
print(response[0].outputs[0].text)

With Transformers + LLM Compressor

from transformers import AutoTokenizer, AutoProcessor
from llmcompressor import LLM

model_id = "{hf_username}/{quantized_model_name}"
model = LLM.load(model_id, device="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Process image and text
inputs = processor("What's in this image?", image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

🏗️ Technical Specifications

Hardware Requirements

Inference: 40-50GB VRAM (single H100/A100 recommended)
Supported GPUs: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism)
GPU Architecture: Ada Lovelace, Hopper (for optimal FP8 performance)

Quantization Details

Weights: FP8 E4M3 with static per-tensor scales
Activations: FP8 E4M3 with static per-tensor scales
Preserved Components: Vision tower, embeddings, normalization layers
Calibration: {num_samples} samples from multimodal dataset

📈 Performance Benchmarks

Expected performance improvements over FP16 baseline:

Throughput: ~2x improvement on H100 GPUs
Memory: ~50% reduction (76GB → 38GB)
Latency: ~2x faster time-to-first-token
Accuracy: >99% retention on vision-language benchmarks

🔬 Package Versions

This model was created using:

llmcompressor=={package_versions.get('llmcompressor', 'latest')}
transformers=={package_versions.get('transformers', 'latest')}
torch=={package_versions.get('torch', 'latest')}
vllm=={package_versions.get('vllm', 'latest')}

📋 Quantization Script

Click to view the complete quantization script

{script_content}

🎯 Use Cases

This optimized model is ideal for:

Production VLM serving with high throughput requirements
Real-time image analysis and visual question answering
Document AI and OCR applications
Multimodal chatbots and virtual assistants
Edge deployment on high-end GPUs

⚠️ Important Notes

Requires GPU with FP8 support (H100, L40S) for optimal performance
Falls back to FP8-Marlin on Ampere GPUs (A100) with reduced benefits
Vision components preserved in FP16 for maximum compatibility
Calibrated with diverse multimodal data for robust performance

🚫 Limitations

Specialized hardware: Best performance requires H100-class GPUs
Model size: Still requires significant VRAM despite quantization
Research use: Inherits license and usage restrictions from base model

📄 License

This quantized model inherits the license from the original model. Original model: [{source_model}](https://huggingface.co/{source_mo


</details>

## 📄 ライセンス
この量子化モデルは、元のモデルのライセンスを引き継いでいます。
元のモデル: [OpenGVLab/InternVL3-38B](https://huggingface.co/OpenGVLab/InternVL3-38B)

おすすめAIモデル

Llama 3 Typhoon V1.5x 8b Instruct

タイ語専用に設計された80億パラメータの命令モデルで、GPT-3.5-turboに匹敵する性能を持ち、アプリケーションシナリオ、検索拡張生成、制限付き生成、推論タスクを最適化

Cadet-TinyはSODAデータセットでトレーニングされた超小型対話モデルで、エッジデバイス推論向けに設計されており、体積はCosmo-3Bモデルの約2％です。

Roberta Base Chinese Extractive Qa

RoBERTaアーキテクチャに基づく中国語抽出型QAモデルで、与えられたテキストから回答を抽出するタスクに適しています。

質問応答システム中国語

uer

2,694

未来を切り開く、あなたのAIソリューション知識ベース

English 简体中文繁體中文にほんご