# 🚀 Molmo-7B-D-0924 4-Bit Quantization
This project focuses on the 4-bit quantization of the Molmo-7B-D-0924 model, aiming to reduce model size and VRAM usage while maintaining performance.
## 🚀 Quick Start

### Model Information

- Base model: [allenai/Molmo-7B-D-0924](https://huggingface.co/allenai/Molmo-7B-D-0924)
- Model size (disk): 30 GB original → 6.2 GB quantized
- VRAM usage: ~7 GB with the model loaded, up to ~10 GB during inference (4K image input)
This quantization uses NF4 for most weights while keeping key modules in FP16 to avoid degrading performance. Compared to full 4-bit quantization, the additional VRAM cost is small, and the goal is to strike a good performance/memory balance. The model also loads significantly faster than the original, making it suitable for serverless hosting. It fits on a 12 GB GPU for serving and allows batching on a T4 (16 GB).
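If you want to verify the footprint on your own hardware, a minimal sketch like the following reports the GPU memory occupied after loading (assuming a single CUDA device; the check covers only the loaded weights, not inference activations):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the quantized model as in the usage example below.
model = AutoModelForCausalLM.from_pretrained(
    "Scoolar/Molmo-7B-D-0924-NF4",
    trust_remote_code=True,
    device_map="auto",
)

# Rough VRAM footprint of the loaded weights (~7 GB expected);
# inference on large images can push usage up to ~10 GB.
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```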
## 💻 Usage Examples

### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests
import torch

MODEL_PATH = "Scoolar/Molmo-7B-D-0924-NF4"

# Load the processor and the quantized model.
processor = AutoProcessor.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map='auto'
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map='auto',
)

# Process an example image together with a text prompt.
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe this image."
)

# Move inputs to the model device and add a batch dimension.
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate in FP16 autocast.
with torch.autocast(device_type="cuda", enabled=True, dtype=torch.float16):
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer
    )

# Decode only the newly generated tokens.
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)
```
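Note that the FP16 autocast matches the `bnb_4bit_compute_dtype` used during quantization (see the conversion config below), so dequantized weights and activations run in the same precision at inference time.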
## 📚 Documentation

### How was the model converted to NF4?
I decided to write this down since I would have been happy to have something like this, so enjoy :)
To convert the model, you need to load the weights with the desired data types/quantization settings and save them again. This process produces SafeTensors files along with some configuration files. Any missing files can be copied from the original model repository; you only need to remove the local file path in `config.json`. The applied quantization strategy can also be inspected in `config.json` under `quantization_config`.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

MODEL_PATH = "allenai/Molmo-7B-D-0924"
YOUR_OUTPUT_PATH = "enter_local_model_output_path"
DEFAULT_DTYPE = torch.float16

# NF4 quantization config: the vision backbone and the transformer
# output layers are skipped and therefore stay in FP16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=DEFAULT_DTYPE,
    llm_int8_skip_modules=[
        "model.vision_backbone", "model.transformer.ff_out", "model.transformer.ln_f"
    ]
)

# Load the original model with on-the-fly quantization...
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map='auto',
    torch_dtype=DEFAULT_DTYPE,
    quantization_config=nf4_config,
)

# ...and save the quantized weights as sharded SafeTensors files.
model.save_pretrained(
    save_directory=YOUR_OUTPUT_PATH,
    safe_serialization=True,
    max_shard_size="4GB"
)
```
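To finish assembling the repository, the non-weight files (tokenizer, processor, and the remote-code `*.py` files) can be pulled from the original repo, for example with `huggingface_hub`. This is a minimal sketch; the `ignore_patterns` list is an assumption about which files your export already contains, so check what is actually missing:

```python
from huggingface_hub import snapshot_download

# Copy everything except the original full-precision weights, their index,
# and config.json, which are replaced by the freshly saved NF4 output.
# (The pattern list is an assumption; adjust it to your export.)
snapshot_download(
    repo_id="allenai/Molmo-7B-D-0924",
    local_dir="enter_local_model_output_path",  # same directory used in save_pretrained above
    ignore_patterns=["*.safetensors", "model.safetensors.index.json", "config.json"],
)
```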
### Details
Inspired by observations from SeanScripts/Molmo-72B-0924-nf4, I experimented with keeping certain modules in FP16, particularly the `vision_backbone`. The vision backbone has relatively few parameters but deteriorates significantly under NF4. I also found that the transformer output layers are crucial, whereas the other layer-normalization layers within the transformer stack had no significant impact. Layers can easily be inspected in `model.safetensors.index.json` or analyzed in more detail in `modeling_molmo.py`.
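For a quick look at the layer names without loading the full model, the weight map inside `model.safetensors.index.json` can be listed directly. A minimal sketch (the local path is a placeholder):

```python
import json

# Placeholder path: point this at a locally downloaded copy of the model.
with open("Molmo-7B-D-0924/model.safetensors.index.json") as f:
    index = json.load(f)

# "weight_map" maps every parameter name to the shard file that stores it,
# which makes it easy to spot modules like model.vision_backbone.* or
# model.transformer.ff_out.
for name in sorted(index["weight_map"]):
    print(name)
```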
## 📄 License
This project is licensed under the Apache-2.0 license.