Mini-image-captioning: An Open-source Image Captioning Model - Lightweight, Free, and Incredibly Fast on CPU!

Mini Image Captioning

Developed by cnmoro

A lightweight image captioning model based on bert-mini and vit-small, weighing only 130MB, with extremely fast performance on CPU.

Image-to-Text

Transformers

Language: English
License: Apache-2.0
Tags: Lightweight Image Captioning, CPU-Efficient Inference, Multimodal Generation

Downloads 292

Release Date: 1/27/2025

Model Overview

The model pairs a lightweight vision encoder (ViT) with a lightweight text decoder (BERT) to generate descriptive text captions for input images.

Model Features

Lightweight and Efficient

The model is only about 130 MB in size and runs quickly on CPU: the usage example reports roughly 0.19 seconds to generate one caption.

Dual-Modal Architecture

Uses a Vision Transformer (ViT) as the image encoder and BERT as the text decoder.

Adjustable Generation

Supports various generation strategies such as temperature sampling, top-p/top-k filtering, and beam search.
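
To make the sampling options above more concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p filtering can narrow a next-token distribution. It is a simplified illustration, not the transformers implementation; the function name `filter_next_token_probs` and the toy logits are hypothetical.

```python
import math
from typing import Dict


def filter_next_token_probs(
    logits: Dict[str, float],
    temperature: float = 1.0,
    top_k: int = 0,
    top_p: float = 1.0,
) -> Dict[str, float]:
    """Return the renormalised distribution left after temperature, top-k and top-p."""
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Top-k: keep only the k most probable tokens (0 disables the filter), then renormalise.
    if top_k > 0:
        ranked = ranked[:top_k]
        kept_mass = sum(prob for _, prob in ranked)
        ranked = [(tok, prob / kept_mass) for tok, prob in ranked]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


if __name__ == "__main__":
    # Toy logits for illustration only; a real model scores its whole vocabulary.
    toy_logits = {"city": 2.0, "street": 1.0, "beach": 0.0, "forest": -1.0}
    print(filter_next_token_probs(toy_logits, temperature=0.7, top_k=3, top_p=0.8))
```

With these toy values only "city" and "street" survive the filters. In transformers, sampling filters of this kind apply when sampling is enabled (`do_sample=True`), while beam search (`num_beams`) is a separate search strategy.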

Model Capabilities

Image Understanding

Natural Language Generation

Scene Description

Multimodal Processing

Use Cases

Content Generation

Social Media Image Tagging

Automatically generates descriptive text for uploaded social media images.

Produces short descriptions such as 'a large group of people walking through a busy city.' (the output of the usage example).

Accessibility

Visual Impairment Assistance

Provides text descriptions of image content that a screen reader or text-to-speech system can read aloud to visually impaired users.

🚀 Mini Image Captioning Model

An image captioning model based on bert-mini and vit-small, only about 130 MB in size and very fast on CPU.

🚀 Quick Start

This image captioning model combines bert-mini and vit-small. It is lightweight (about 130 MB) and runs efficiently on a CPU. To try it, install the dependencies listed under Installation and run the snippet under Usage Examples.

✨ Features

Lightweight: only about 130 MB in size.
Fast on CPU: generates captions quickly even without a GPU.
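
As a rough sanity check on the 130 MB figure, the sketch below converts a checkpoint size into an approximate parameter count. It assumes the weights are stored as 32-bit floats (4 bytes each) and that 1 MB means 1,000,000 bytes; neither detail is stated in the model card, and the helper name `approx_param_count` is hypothetical.

```python
def approx_param_count(size_mb: float, bytes_per_param: int = 4) -> float:
    """Estimate the number of parameters from a checkpoint size.

    Assumes 1 MB = 1_000_000 bytes and a fixed number of bytes per parameter
    (4 for float32, 2 for float16). Both are assumptions, not facts taken
    from the model card.
    """
    return size_mb * 1_000_000 / bytes_per_param


if __name__ == "__main__":
    params = approx_param_count(130)
    print(f"~{params / 1e6:.1f}M parameters")  # ~32.5M parameters
```

Under those assumptions the estimate lands in the low tens of millions of parameters, which is roughly what one would expect from a small ViT encoder paired with a 4-layer, 256-hidden BERT decoder.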

📦 Installation

The model runs on the transformers library with a PyTorch backend (the usage example requests "pt" tensors). Install the dependencies with the following command:

pip install transformers torch requests pillow

💻 Usage Examples

Basic Usage

import time

import requests
from PIL import Image
from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel

model_path = "cnmoro/mini-image-captioning"

# load the image captioning model and corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

# preprocess an image
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/New_york_times_square-terabass.jpg/800px-New_york_times_square-terabass.jpg"
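response = requests.get(url, stream=True, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
image = Image.open(response.raw).convert("RGB")  # make sure the image has 3 RGB channels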
pixel_values = image_processor(image, return_tensors="pt").pixel_values

start = time.time()

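# generate caption - suggested settings
# note: temperature, top_p and top_k only take effect when sampling is enabled
# (do_sample=True, either passed here or set in the model's generation config);
# plain beam search ignores them
generated_ids = model.generate(
    pixel_values,
    temperature=0.7,
    top_p=0.8,
    top_k=50,
    num_beams=3,  # use 1 for even faster inference with a small drop in quality
)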
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

end = time.time()

print(generated_text)
# a large group of people walking through a busy city.

print(f"Time taken: {end - start} seconds")
# Time taken: 0.19002342224121094 seconds
# on CPU !
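
The timing above comes from a single call, which can be noisy. A slightly more robust measurement is sketched below: it runs a warm-up call, repeats the measurement, and reports the median. The `benchmark` helper is hypothetical and uses a stand-in workload; in practice the callable would wrap the `model.generate(...)` call from the example.

```python
import statistics
import time
from typing import Callable, List


def benchmark(fn: Callable[[], object], warmup: int = 1, repeats: int = 5) -> List[float]:
    """Call fn several times and return the per-call wall-clock durations in seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in workload (an assumption for illustration); replace the lambda with
    # something like: lambda: model.generate(pixel_values, num_beams=3)
    timings = benchmark(lambda: sum(range(100_000)), warmup=1, repeats=5)
    print(f"median: {statistics.median(timings):.4f} s, min: {min(timings):.4f} s")
```

Reporting the median over several runs makes the number less sensitive to one-off effects such as lazy initialisation on the first call.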

📚 Documentation

Model Information

Property	Details
Model Type	Image Captioning Model
Base Model	google/bert_uncased_L-4_H-256_A-4, WinKawaks/vit-small-patch16-224
Pipeline Tag	image-to-text
Library Name	transformers
Tags	vit, bert, vision, caption, captioning, image

📄 License

This project is licensed under the Apache 2.0 license.
