
nanoVLM 222M

Developed by lusxvr
nanoVLM is a minimal, lightweight vision-language model (VLM) designed for efficient training and experimentation.
Downloads: 2,441
Release Date: 5/1/2025

Model Overview

nanoVLM combines a ViT-based image encoder with a lightweight causal language model to form a compact 222-million-parameter model, suitable for VLM research and development in low-resource environments.
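
As a rough illustration of that wiring (hypothetical module names and sizes, not the actual nanoVLM code), the following PyTorch sketch prepends projected ViT patch embeddings to the text token embeddings and decodes them with a small causal language model. For simplicity it applies the causal mask to the image tokens as well; the real model's layer sizes, projection, and masking details differ.

```python
# Illustrative sketch only -- hypothetical modules, not the nanoVLM source.
import torch
import torch.nn as nn

class TinyViTEncoder(nn.Module):
    """Turns an image into a sequence of patch embeddings."""
    def __init__(self, patch_size=16, dim=256, depth=4):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                       # (B, 3, H, W)
        x = self.patchify(images)                    # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)             # (B, num_patches, dim)
        return self.encoder(x)

class TinyCausalLM(nn.Module):
    """A small decoder-only language model operating on token embeddings."""
    def __init__(self, vocab_size=32000, dim=256, depth=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, input_embeds):                 # (B, T, dim)
        T = input_embeds.size(1)
        # Additive causal mask: -inf above the diagonal blocks future positions.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.lm_head(self.blocks(input_embeds, mask=mask))

class TinyVLM(nn.Module):
    """Projects image patches into the LM embedding space and prepends them."""
    def __init__(self):
        super().__init__()
        self.vision = TinyViTEncoder()
        self.projection = nn.Linear(256, 256)        # modality projector
        self.lm = TinyCausalLM()

    def forward(self, images, input_ids):
        img_tokens = self.projection(self.vision(images))   # (B, P, dim)
        txt_tokens = self.lm.embed(input_ids)                # (B, T, dim)
        return self.lm(torch.cat([img_tokens, txt_tokens], dim=1))

logits = TinyVLM()(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 8)))
print(logits.shape)                                  # (1, 196 + 8, 32000)
```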

Model Features

Lightweight Design
The entire model architecture and training logic are implemented in roughly 750 lines of code, and the model has only 222 million parameters.
Efficient Training
Training can be completed in about 6 hours on a single H100 GPU, making it well suited to rapid experimentation; a single training step is sketched after this list.
Multimodal Architecture
Combines a vision Transformer and a causal language model to achieve joint processing of images and text.
Low-Resource Research Baseline
Achieves 35.3% accuracy on the MMStar benchmark, providing a reference for low-resource VLM research.
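
To make the training setup concrete, here is one illustrative training step, reusing the hypothetical TinyVLM class from the sketch under Model Overview. It uses dummy data and assumed hyperparameters; the actual training script differs.

```python
# Continues the TinyVLM sketch above; dummy data, illustrative only.
import torch
import torch.nn.functional as F

model = TinyVLM()                                   # defined in the earlier sketch
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

images = torch.randn(4, 3, 224, 224)                # dummy image batch
input_ids = torch.randint(0, 32000, (4, 8))         # dummy caption tokens

logits = model(images, input_ids)                   # (B, P + T, vocab)
num_img = logits.size(1) - input_ids.size(1)        # number of image positions

# Next-token prediction on the text portion only: position P+j predicts token j+1.
text_logits = logits[:, num_img:-1, :]
targets = input_ids[:, 1:]
loss = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                       targets.reshape(-1))

loss.backward()
optimizer.step()
optimizer.zero_grad()
```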

Model Capabilities

Vision-Language Understanding
Image-Text Generation (a checkpoint-loading sketch follows this list)
Multimodal Task Processing
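
For image-text generation with the published checkpoint, the nanoVLM GitHub repository provides a VisionLanguageModel class with a from_pretrained helper. The snippet below follows that repository's README as I understand it; the import path and checkpoint name are assumptions and should be checked against the current code.

```python
# Run from a clone of github.com/huggingface/nanoVLM; import path and
# checkpoint name are assumptions based on its README -- verify before use.
from models.vision_language_model import VisionLanguageModel

model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M")
model.eval()

# The repository's generate.py shows the full image + prompt -> text pipeline.
```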

Use Cases

Research
Vision-Language Model Research
Used as a lightweight baseline model for studying VLM architectures and training methods.
Provides a reference accuracy of 35.3% on the MMStar benchmark.
Education
Multimodal Learning
Used for teaching and demonstrating the fundamentals of multimodal models.