nano-image-captioning: an open-source, 40MB image captioning model that runs quickly on CPU

Nano Image Captioning

Developed by cnmoro

A lightweight image captioning model based on bert-tiny and vit-tiny. It is only 40MB in size and runs inference quickly on CPU.

Downloads: 184

Release date: 1/28/2025

Model Overview

The model combines a visual encoder (ViT-tiny) and a text decoder (BERT-tiny) to generate concise descriptive captions for input images.

Lightweight and Efficient

The model is only 40MB in size and achieves fast inference on CPU (approximately 0.075 seconds per image).
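The per-image latency figure can be checked on your own hardware with a small timing helper. This is a minimal sketch: `caption_fn` is a hypothetical stand-in for whatever callable runs the model on one image, and the warm-up count is an arbitrary assumption.

```python
import time
from typing import Callable, Iterable


def mean_latency(caption_fn: Callable[[object], object],
                 images: Iterable[object],
                 warmup: int = 1) -> float:
    """Return the mean wall-clock seconds per call of caption_fn over images."""
    images = list(images)
    if not images:
        raise ValueError("need at least one image to time")
    # Warm-up calls are excluded from timing (lazy initialisation, caches).
    for image in images[:warmup]:
        caption_fn(image)
    start = time.perf_counter()
    for image in images:
        caption_fn(image)
    elapsed = time.perf_counter() - start
    return elapsed / len(images)


def images_per_second(seconds_per_image: float) -> float:
    """Convert a per-image latency into throughput."""
    return 1.0 / seconds_per_image


# The quoted 0.075 s per image corresponds to roughly 13 images per second.
print(round(images_per_second(0.075), 1))
```

Results will vary with CPU, image size, and generation settings, so the quoted figure is best treated as a reference point rather than a guarantee.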

Dual Tiny Architecture

Uses vit-tiny-patch16-224 as the visual encoder and bert_uncased_L-2_H-128_A-2 as the text decoder.
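A ViT encoder paired with a BERT decoder can be driven through the `VisionEncoderDecoderModel` class in `transformers`. The sketch below assumes the checkpoint is published under the repo id `cnmoro/nano-image-captioning` (inferred from the developer and model names, not confirmed here) and that it ships a compatible image processor and tokenizer. The helper names `load_captioner` and `caption_image` are hypothetical.

```python
from typing import Any

# Assumed Hugging Face repo id, inferred from the developer and model names.
MODEL_ID = "cnmoro/nano-image-captioning"


def load_captioner(model_id: str = MODEL_ID):
    """Load model, image processor and tokenizer (needs transformers and torch; downloads weights)."""
    # Imported lazily so the rest of this module works without these packages.
    from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    processor = AutoImageProcessor.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model.eval()
    return model, processor, tokenizer


def caption_image(image: Any, model: Any, processor: Any, tokenizer: Any,
                  **generate_kwargs: Any) -> str:
    """Encode one image with the vision encoder and decode a caption with the text decoder."""
    pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]
    output_ids = model.generate(pixel_values, **generate_kwargs)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```

With Pillow installed, a call might look like `caption_image(Image.open("photo.jpg").convert("RGB"), *load_captioner(), max_new_tokens=32)`, where `photo.jpg` is a placeholder path.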

Optimized Inference Settings

Provides multiple generation strategies including temperature sampling, top-p/top-k filtering, and beam search.
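These strategies correspond to standard keyword arguments of the `generate` method in `transformers`. The presets below are a sketch: the numeric values are illustrative assumptions, not settings recommended by the model's author, and the `greedy` preset is included only as a baseline for comparison.

```python
from typing import Any, Dict


def generation_kwargs(strategy: str, max_new_tokens: int = 32) -> Dict[str, Any]:
    """Return keyword arguments for `model.generate` for a named decoding strategy."""
    presets: Dict[str, Dict[str, Any]] = {
        # Greedy decoding: deterministic and the cheapest option.
        "greedy": {"do_sample": False, "num_beams": 1},
        # Temperature sampling restricted by top-k and top-p (nucleus) filtering.
        "sampling": {"do_sample": True, "temperature": 0.7, "top_k": 50, "top_p": 0.9},
        # Beam search: tracks several candidate captions and returns the best-scoring one.
        "beam": {"do_sample": False, "num_beams": 3, "early_stopping": True},
    }
    if strategy not in presets:
        raise ValueError(f"unknown strategy {strategy!r}; choose from {sorted(presets)}")
    return {"max_new_tokens": max_new_tokens, **presets[strategy]}


print(generation_kwargs("beam"))
```

Sampling tends to give more varied captions, while beam search is deterministic but costs more compute per image, which matters when CPU latency is the main constraint.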

Capabilities: Image Understanding, Natural Language Generation, Real-Time Caption Generation

Accessibility Technology

Image Description Generation

Automatically generates text descriptions of images for visually impaired users.

Produces concise and accurate image descriptions (e.g., 'A group of people standing in a city center').

Content Management

Automatic Image Tagging

Automatically generates tags and descriptions for gallery or social media images.

Quickly generates searchable metadata.
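One simple way to derive searchable tags from a generated caption is to keep its content words. The following is a hypothetical post-processing sketch, not part of the model itself: the stopword list is a small illustrative assumption, and `caption_to_tags` is an invented helper name.

```python
import re
from typing import List

# Small illustrative stopword list; a real deployment would likely use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "with", "and", "to", "is", "are"}


def caption_to_tags(caption: str) -> List[str]:
    """Turn a caption into lowercase, de-duplicated keyword tags, preserving word order."""
    words = re.findall(r"[a-z0-9]+", caption.lower())
    tags: List[str] = []
    for word in words:
        if word not in STOPWORDS and word not in tags:
            tags.append(word)
    return tags


print(caption_to_tags("A group of people standing in a city center"))
```

For the example caption this yields the tags `group`, `people`, `standing`, `city`, and `center`, which could be stored alongside the full caption as image metadata.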

Property	Details
Base Model	WinKawaks/vit-tiny-patch16-224, google/bert_uncased_L-2_H-128_A-2
Pipeline Tag	image-to-text
Library Name	transformers
Tags	vit, bert, vision, caption, captioning, image
