
nanoVLM 222M

Developed by lusxvr
nanoVLM is a minimal, lightweight vision-language model (VLM) designed for efficient training and experimentation.
Downloads: 2,441
Release Date: 5/1/2025

Model Overview

nanoVLM combines a ViT-based image encoder with a lightweight causal language model to form a compact 222-million-parameter model, suitable for VLM research and development in low-resource environments.
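
As a rough illustration of that wiring (hypothetical module names and sizes, not the actual nanoVLM code), the following PyTorch sketch prepends projected ViT patch embeddings to the text token embeddings and decodes them with a small causal language model. For simplicity it applies the causal mask to the image tokens as well; the real model's layer sizes, projection, and masking details differ.

```python
# Illustrative sketch only -- hypothetical modules, not the nanoVLM source.
import torch
import torch.nn as nn

class TinyViTEncoder(nn.Module):
    """Turns an image into a sequence of patch embeddings."""
    def __init__(self, patch_size=16, dim=256, depth=4):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                       # (B, 3, H, W)
        x = self.patchify(images)                    # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)             # (B, num_patches, dim)
        return self.encoder(x)

class TinyCausalLM(nn.Module):
    """A small decoder-only language model operating on token embeddings."""
    def __init__(self, vocab_size=32000, dim=256, depth=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, input_embeds):                 # (B, T, dim)
        T = input_embeds.size(1)
        # Additive causal mask: -inf above the diagonal blocks future positions.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.lm_head(self.blocks(input_embeds, mask=mask))

class TinyVLM(nn.Module):
    """Projects image patches into the LM embedding space and prepends them."""
    def __init__(self):
        super().__init__()
        self.vision = TinyViTEncoder()
        self.projection = nn.Linear(256, 256)        # modality projector
        self.lm = TinyCausalLM()

    def forward(self, images, input_ids):
        img_tokens = self.projection(self.vision(images))   # (B, P, dim)
        txt_tokens = self.lm.embed(input_ids)                # (B, T, dim)
        return self.lm(torch.cat([img_tokens, txt_tokens], dim=1))

logits = TinyVLM()(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 8)))
print(logits.shape)                                  # (1, 196 + 8, 32000)
```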

Model Features

Lightweight Design
The entire model architecture and training logic are implemented in roughly 750 lines of code, and the model has only 222 million parameters.
Efficient Training
Training can be completed in about 6 hours on a single H100 GPU, making it well suited to rapid experimentation; a single training step is sketched after this list.
Multimodal Architecture
Combines a vision Transformer and a causal language model to achieve joint processing of images and text.
Low-Resource Research Baseline
Achieves 35.3% accuracy on the MMStar benchmark, providing a reference for low-resource VLM research.
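
To make the training setup concrete, here is one illustrative training step, reusing the hypothetical TinyVLM class from the sketch under Model Overview. It uses dummy data and assumed hyperparameters; the actual training script differs.

```python
# Continues the TinyVLM sketch above; dummy data, illustrative only.
import torch
import torch.nn.functional as F

model = TinyVLM()                                   # defined in the earlier sketch
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

images = torch.randn(4, 3, 224, 224)                # dummy image batch
input_ids = torch.randint(0, 32000, (4, 8))         # dummy caption tokens

logits = model(images, input_ids)                   # (B, P + T, vocab)
num_img = logits.size(1) - input_ids.size(1)        # number of image positions

# Next-token prediction on the text portion only: position P+j predicts token j+1.
text_logits = logits[:, num_img:-1, :]
targets = input_ids[:, 1:]
loss = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                       targets.reshape(-1))

loss.backward()
optimizer.step()
optimizer.zero_grad()
```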

Model Capabilities

Vision-Language Understanding
Image-Text Generation (a checkpoint-loading sketch follows this list)
Multimodal Task Processing
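
For image-text generation with the published checkpoint, the nanoVLM GitHub repository provides a VisionLanguageModel class with a from_pretrained helper. The snippet below follows that repository's README as I understand it; the import path and checkpoint name are assumptions and should be checked against the current code.

```python
# Run from a clone of github.com/huggingface/nanoVLM; import path and
# checkpoint name are assumptions based on its README -- verify before use.
from models.vision_language_model import VisionLanguageModel

model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M")
model.eval()

# The repository's generate.py shows the full image + prompt -> text pipeline.
```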

Use Cases

Research
Vision-Language Model Research
Used as a lightweight baseline model for studying VLM architectures and training methods.
Provides a reference accuracy of 35.3% on the MMStar benchmark.
Education
Multimodal Learning
Used for teaching and demonstrating the fundamentals of multimodal models.