Tiny Image Captioning: an open-source image description model, only 100MB, with fast inference on CPU

Tiny Image Captioning

Developed by cnmoro

A lightweight image captioning model based on bert-tiny and vit-small, weighing only 100MB, with extremely fast performance on CPU.

Image-to-Text

Transformers

English

Open Source License: Apache-2.0

#Lightweight Image Captioning #CPU-Efficient Inference #Multimodal Small Model

Downloads 4,298

Release Time: 1/28/2025

Model Overview

This model combines Vision Transformer (ViT) and BERT architectures to generate concise textual descriptions of input images. It is suited to applications that need fast image understanding.

Model Features

Lightweight & Efficient

The model is only 100MB in size and runs quickly on CPU (the usage example reports about 0.11 s to generate one caption).

Dual-Model Architecture

Combines Vision Transformer (ViT-small) and a streamlined BERT (bert-tiny) to balance performance and efficiency.

Adjustable Parameters

Supports tuning of generation parameters such as temperature, top_p, top_k, and the number of beams used for beam search.
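As a minimal sketch of how these parameters can be grouped, the helper below returns keyword arguments for `model.generate` for three decoding modes. The function name `build_generate_kwargs` and the preset names are hypothetical and not part of the model or of transformers. The sampling values mirror the suggested settings in the usage example. In transformers, temperature, top_p and top_k only take effect when `do_sample=True`, so the sketch sets them only in the sampling preset.

```python
# Hypothetical helper: the function and preset names are illustrative only,
# they are not provided by the model or by the transformers library.
def build_generate_kwargs(mode: str = "beam") -> dict:
    """Return keyword arguments for `model.generate` for a decoding mode."""
    if mode == "greedy":
        # Fastest option: a single beam, no sampling, deterministic output.
        return {"num_beams": 1, "do_sample": False}
    if mode == "beam":
        # Deterministic beam search; sampling parameters are left out
        # because they have no effect without do_sample=True.
        return {"num_beams": 3, "do_sample": False}
    if mode == "sample":
        # Stochastic decoding: temperature/top_p/top_k are applied
        # only because do_sample=True is set.
        return {
            "do_sample": True,
            "temperature": 0.7,
            "top_p": 0.8,
            "top_k": 50,
            "num_beams": 1,
        }
    raise ValueError(f"unknown decoding mode: {mode!r}")
```

With a loaded model, a call could then look like `model.generate(pixel_values, **build_generate_kwargs("beam"))`.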

Model Capabilities

Image Understanding

Automatic Caption Generation

Visual Content Description

Use Cases

Accessibility Technology

Assistive Image Description

Automatically generates text descriptions of web images for visually impaired users.

Produces concise and accurate scene descriptions (e.g., 'A group of people walking in a city center').
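One way a generated caption could be used for accessibility is as the `alt` text of an image tag. The sketch below assumes the caption string has already been produced by the model. The helper name `img_tag_with_alt` is hypothetical, and it only shows safe escaping of the caption, not a complete accessibility pipeline.

```python
from html import escape


# Hypothetical helper: builds an <img> tag whose alt text is a model caption.
def img_tag_with_alt(src: str, caption: str) -> str:
    # Collapse stray whitespace and capitalize the first letter for readability.
    alt = " ".join(caption.split())
    alt = alt[:1].upper() + alt[1:]
    # Escape both attributes so quotes or angle brackets cannot break the markup.
    return f'<img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}">'
```

For example, `img_tag_with_alt("/photo.jpg", caption)` returns a tag that a screen reader can announce using the caption.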

Content Management

Media Library Auto-Tagging

Automatically generates search tags for large volumes of unlabeled images.

Quickly creates searchable image metadata.
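Turning a caption into search tags can be as simple as dropping common function words. The sketch below is one possible approach and not the author's method. The stop-word list is a small illustrative assumption, and `caption_to_tags` is a hypothetical name.

```python
import re

# Small illustrative stop-word list (an assumption; a real pipeline
# would likely use a fuller list or a proper NLP library).
STOP_WORDS = {
    "a", "an", "the", "of", "in", "on", "at",
    "and", "with", "to", "is", "are", "for",
}


def caption_to_tags(caption: str) -> list:
    """Return unique, lowercase content words from a caption, in order."""
    words = re.findall(r"[a-z0-9]+", caption.lower())
    tags = []
    for word in words:
        # Skip function words and words that were already collected.
        if word not in STOP_WORDS and word not in tags:
            tags.append(word)
    return tags
```

Applied to the caption from the usage example, this yields the tags group, people, walking, middle and city, which could be stored as searchable metadata next to the image.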

🚀 Tiny Image Captioning Model

An ultra-lightweight image captioning model based on bert-tiny and vit-small that weighs only 100MB and runs very fast on CPU!

🚀 Quick Start

This image captioning model pairs a vit-small image encoder with a bert-tiny text decoder. It weighs only 100MB and generates captions efficiently, even on a CPU.

✨ Features

Lightweight: Based on bert-tiny and vit-small, the model weighs only 100MB.
Fast Inference: Runs very fast on CPU, making it accessible for a wide range of applications.

📦 Installation

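This model runs through the transformers library with PyTorch as the backend (the example below requests PyTorch tensors), so install both along with the helper packages via pip:

pip install transformers torch requests pillow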

💻 Usage Examples

Basic Usage

import time

import requests
from PIL import Image
from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel

model_path = "cnmoro/tiny-image-captioning"

# Load the image captioning model with its tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

# Download and preprocess an image; convert to RGB so grayscale or RGBA inputs also work
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/New_york_times_square-terabass.jpg/800px-New_york_times_square-terabass.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = image_processor(image, return_tensors="pt").pixel_values

start = time.perf_counter()

# Generate a caption with the suggested settings.
# Note: temperature, top_p and top_k only take effect when sampling is enabled
# (do_sample=True, set here or in the model's generation config).
generated_ids = model.generate(
    pixel_values,
    temperature=0.7,
    top_p=0.8,
    top_k=50,
    num_beams=3,  # use 1 for even faster inference with a small drop in quality
)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

end = time.perf_counter()

print(generated_text)
# a group of people walking in the middle of a city.

print(f"Time taken: {end - start} seconds")
# Time taken: 0.11215853691101074 seconds (measured on CPU)
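The timing in the example above comes from a single call, which can vary from run to run. A more stable measurement can be sketched as follows, assuming a few warm-up calls followed by the median of several timed runs. The helper name `benchmark` and its default counts are hypothetical choices, not part of the original example.

```python
import statistics
import time


# Hypothetical helper: times any zero-argument callable.
def benchmark(fn, warmup: int = 2, runs: int = 10) -> dict:
    # Warm-up calls are executed but not timed, so one-off setup costs
    # do not distort the measurement.
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)

    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
        "runs": runs,
    }
```

With the model and `pixel_values` from the example loaded, it could be called as `benchmark(lambda: model.generate(pixel_values, num_beams=3))`. The resulting numbers depend on the machine.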

📄 License

This project is licensed under the Apache-2.0 license.
