🚀 Lightweight Image Captioning Model
This is an image captioning model based on bert-tiny and vit-small, weighing in at only about 100 MB. It runs very fast on CPU and handles image captioning efficiently.
🚀 Quick Start
Install dependencies
Make sure the `transformers` library is installed. If it is not, install it together with the other dependencies:
```bash
pip install transformers requests pillow
```
Run the example code
```python
from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel
import requests, time
from PIL import Image

# Load the model, tokenizer, and image processor from the Hub
model_path = "cnmoro/tiny-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

# Download a sample image and preprocess it into pixel values
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/New_york_times_square-terabass.jpg/800px-New_york_times_square-terabass.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = image_processor(image, return_tensors="pt").pixel_values

# Generate a caption with beam search and time the call
start = time.time()
generated_ids = model.generate(
    pixel_values,
    temperature=0.7,
    top_p=0.8,
    top_k=50,
    num_beams=3
)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
end = time.time()

print(generated_text)
print(f"Time taken: {end - start} seconds")
```
✨ Key Features
- Lightweight: the model is only about 100 MB and uses few resources (a quick size check is sketched below).
- Efficient inference: fast inference even on CPU.
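If you want to verify the footprint yourself, the following minimal sketch (my addition, not part of the upstream example) counts the model's parameters and estimates its fp32 size; the exact figure may differ slightly from the on-disk checkpoint.

```python
from transformers import VisionEncoderDecoderModel

# Rough size check: count parameters and estimate the fp32 memory footprint.
model = VisionEncoderDecoderModel.from_pretrained("cnmoro/tiny-image-captioning")
num_params = model.num_parameters()
approx_mb = num_params * 4 / (1024 ** 2)  # 4 bytes per fp32 parameter
print(f"Parameters: {num_params:,} (~{approx_mb:.0f} MB in fp32)")
```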
📦 Installation Guide
Using pip
Install the required libraries:
```bash
pip install transformers requests pillow
```
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel
import requests, time
from PIL import Image

# Load the model, tokenizer, and image processor from the Hub
model_path = "cnmoro/tiny-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

# Download a sample image and preprocess it into pixel values
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/New_york_times_square-terabass.jpg/800px-New_york_times_square-terabass.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = image_processor(image, return_tensors="pt").pixel_values

# Generate a caption with beam search and time the call
start = time.time()
generated_ids = model.generate(
    pixel_values,
    temperature=0.7,
    top_p=0.8,
    top_k=50,
    num_beams=3
)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
end = time.time()

print(generated_text)
print(f"Time taken: {end - start} seconds")
```
Advanced Usage
If you want to speed up inference further, you can set the `num_beams` parameter to 1 (greedy decoding), at the cost of slightly lower caption quality:
```python
# Greedy decoding: num_beams=1 is faster than beam search, but quality may drop slightly
generated_ids = model.generate(
    pixel_values,
    temperature=0.7,
    top_p=0.8,
    top_k=50,
    num_beams=1
)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
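To see the speed/quality trade-off on your own hardware, a quick back-to-back timing comparison can help. This is my own sketch (not from the upstream card); it assumes `model` and `pixel_values` are already in scope, and timings will vary by CPU:

```python
import time

# Time beam search (num_beams=3) against greedy decoding (num_beams=1) on the same input.
for beams in (3, 1):
    start = time.time()
    model.generate(pixel_values, num_beams=beams)
    print(f"num_beams={beams}: {time.time() - start:.3f} s")
```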
📚 Documentation
Model Information

| Property | Details |
|----------|---------|
| Base models | WinKawaks/vit-small-patch16-224, google/bert_uncased_L-2_H-128_A-2 |
| Model type | Image captioning model |
| Library name | transformers |
| Tags | vit, bert, vision, caption, captioning, image |
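As an optional sanity check (my own sketch, not from the upstream card), you can confirm the encoder/decoder pairing by inspecting the composite configuration exposed by `transformers`:

```python
from transformers import VisionEncoderDecoderModel

# The composite config exposes the ViT encoder and BERT decoder sub-configs.
model = VisionEncoderDecoderModel.from_pretrained("cnmoro/tiny-image-captioning")
print("Encoder:", model.config.encoder.model_type, "| hidden size:", model.config.encoder.hidden_size)
print("Decoder:", model.config.decoder.model_type, "| hidden size:", model.config.decoder.hidden_size)
```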
📄 License
This project is licensed under the Apache 2.0 license.