# 🚀 Molmo-7B-D-0924 4-Bit Quantization
This project focuses on the 4-bit quantization of the Molmo-7B-D-0924 model, aiming to reduce model size and VRAM usage while maintaining performance.
## 🚀 Quick Start

### Model Information

- Base model: [allenai/Molmo-7B-D-0924](https://huggingface.co/allenai/Molmo-7B-D-0924)
- Model size (disk): 30 GB original → 6.2 GB quantized
- VRAM usage: ~7 GB with the model loaded, up to ~10 GB during inference (4K image input)
This quantization uses NF4 for most weights while keeping key modules in FP16 to avoid degrading performance. Compared to full 4-bit quantization, the additional VRAM cost is small, and the goal is to strike a good performance/memory balance. The model also loads significantly faster than the original, making it suitable for serverless hosting. It fits on a 12 GB GPU for serving and allows batching on a T4 (16 GB).
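If you want to verify the footprint on your own hardware, a minimal sketch like the following reports the GPU memory occupied after loading (assuming a single CUDA device; the check covers only the loaded weights, not inference activations):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the quantized model as in the usage example below.
model = AutoModelForCausalLM.from_pretrained(
    "Scoolar/Molmo-7B-D-0924-NF4",
    trust_remote_code=True,
    device_map="auto",
)

# Rough VRAM footprint of the loaded weights (~7 GB expected);
# inference on large images can push usage up to ~10 GB.
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```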
## 💻 Usage Examples

### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests
import torch

MODEL_PATH = "Scoolar/Molmo-7B-D-0924-NF4"

# Load the processor and the quantized model.
processor = AutoProcessor.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map='auto'
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map='auto',
)

# Process an example image together with a text prompt.
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe this image."
)

# Move inputs to the model device and add a batch dimension.
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate in FP16 autocast.
with torch.autocast(device_type="cuda", enabled=True, dtype=torch.float16):
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer
    )

# Decode only the newly generated tokens.
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)
```
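Note that the FP16 autocast matches the `bnb_4bit_compute_dtype` used during quantization (see the conversion config below), so dequantized weights and activations run in the same precision at inference time.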
## 📚 Documentation

### How was the model converted to NF4?
I decided to write this down since I would have been happy to have something like this, so enjoy :)
To convert the model, you need to load the weights with the desired data types/quantization settings and save them again. This process produces SafeTensors files along with some configuration files. Any missing files can be copied from the original model repository; you only need to remove the local file path in `config.json`. The applied quantization strategy can also be inspected in `config.json` under `quantization_config`.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

MODEL_PATH = "allenai/Molmo-7B-D-0924"
YOUR_OUTPUT_PATH = "enter_local_model_output_path"
DEFAULT_DTYPE = torch.float16

# NF4 quantization config: the vision backbone and the transformer
# output layers are skipped and therefore stay in FP16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=DEFAULT_DTYPE,
    llm_int8_skip_modules=[
        "model.vision_backbone", "model.transformer.ff_out", "model.transformer.ln_f"
    ]
)

# Load the original model with on-the-fly quantization...
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map='auto',
    torch_dtype=DEFAULT_DTYPE,
    quantization_config=nf4_config,
)

# ...and save the quantized weights as sharded SafeTensors files.
model.save_pretrained(
    save_directory=YOUR_OUTPUT_PATH,
    safe_serialization=True,
    max_shard_size="4GB"
)
```
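To finish assembling the repository, the non-weight files (tokenizer, processor, and the remote-code `*.py` files) can be pulled from the original repo, for example with `huggingface_hub`. This is a minimal sketch; the `ignore_patterns` list is an assumption about which files your export already contains, so check what is actually missing:

```python
from huggingface_hub import snapshot_download

# Copy everything except the original full-precision weights, their index,
# and config.json, which are replaced by the freshly saved NF4 output.
# (The pattern list is an assumption; adjust it to your export.)
snapshot_download(
    repo_id="allenai/Molmo-7B-D-0924",
    local_dir="enter_local_model_output_path",  # same directory used in save_pretrained above
    ignore_patterns=["*.safetensors", "model.safetensors.index.json", "config.json"],
)
```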
### Details
Inspired by observations from SeanScripts/Molmo-72B-0924-nf4, I experimented with keeping certain modules in FP16, particularly the `vision_backbone`. The vision backbone has relatively few parameters but deteriorates significantly under NF4. I also found that the transformer output layers are crucial, whereas the other layer-normalization layers within the transformer stack had no significant impact. Layers can easily be inspected in `model.safetensors.index.json` or analyzed in more detail in `modeling_molmo.py`.
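For a quick look at the layer names without loading the full model, the weight map inside `model.safetensors.index.json` can be listed directly. A minimal sketch (the local path is a placeholder):

```python
import json

# Placeholder path: point this at a locally downloaded copy of the model.
with open("Molmo-7B-D-0924/model.safetensors.index.json") as f:
    index = json.load(f)

# "weight_map" maps every parameter name to the shard file that stores it,
# which makes it easy to spot modules like model.vision_backbone.* or
# model.transformer.ff_out.
for name in sorted(index["weight_map"]):
    print(name)
```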
## 📄 License
This project is licensed under the Apache-2.0 license.