nano-image-captioning open-source model - lightweight at 40 MB, fast image captioning on CPU

Nano Image Captioning

Developed by cnmoro

A lightweight image captioning model built on bert-tiny and vit-tiny. It weighs only 40 MB and runs very fast on CPU.

Image-to-Text

Transformers

Language: English
License: Apache-2.0
Tags: #Lightweight image captioning #Efficient CPU inference #Multi-scenario use

Downloads: 184

Released: 1/28/2025

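Model Overview

The model pairs a vision encoder (ViT-tiny) with a text decoder (BERT-tiny) to generate short descriptive captions for input images.

Model Features

Lightweight and efficient

The model is only 40 MB and delivers fast inference even on CPU (about 0.075 seconds per image).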
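
As a rough sanity check on the 40 MB figure, the sketch below converts between parameter count and weight size. It assumes the weights are stored as 32-bit floats (4 bytes each) and ignores file-format overhead; neither detail is stated in the model card, and the function names are illustrative only.

```python
def size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate weight size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


def params_for_size(size_in_mb: float, bytes_per_param: int = 4) -> int:
    """Approximate number of parameters that fit in the given size."""
    return round(size_in_mb * 1e6 / bytes_per_param)


# Assumption: float32 weights, no container overhead.
print(params_for_size(40))     # 10000000, i.e. roughly 10 million parameters
print(size_mb(10_000_000, 2))  # 20.0, the same weights at 16-bit precision
```

Under these assumptions, 40 MB corresponds to roughly 10 million parameters.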

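Dual tiny architecture

Uses vit-tiny-patch16-224 as the vision encoder and bert_uncased_L-2_H-128_A-2 as the text decoder.

Tuned inference settings

Supports several generation strategies, including temperature sampling, top-p/top-k filtering, and beam search.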
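
To make the sampling options concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p (nucleus) filtering on a toy score vector. It is a simplified illustration of the idea, not the Transformers implementation, and the function names are made up for this example. Beam search is not covered here.

```python
import math


def softmax_with_temperature(logits, temperature=0.7):
    """Turn raw scores into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(probs, top_k=50, top_p=0.8):
    """Return renormalized {token_index: prob} after top-k, then top-p filtering."""
    # Rank token indices from most to least probable and keep the k best.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)

    # Keep the smallest prefix whose renormalized cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


probs = softmax_with_temperature([2.0, 1.0, 0.0, -1.0], temperature=0.7)
print(top_k_top_p_filter(probs, top_k=3, top_p=0.8))
```

A token would then be sampled from the returned distribution. Lower `top_p` or `top_k` values shrink the candidate set and make the output more conservative.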

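Model Capabilities

Image understanding

Natural language generation

Real-time caption generation

Use Cases

Accessibility

Image description generation

Automatically generates text descriptions of images for visually impaired users.

Produces concise, accurate descriptions (for example: 'a group of people are in the middle of a city').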
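
One way a generated caption could reach screen-reader users is as the `alt` attribute of an image tag. The sketch below is an assumption about how such an integration might look, not part of the model; the helper name `img_tag` and the file name are hypothetical.

```python
from html import escape


def img_tag(src: str, caption: str) -> str:
    """Build an <img> tag whose alt text is the (lightly cleaned) caption."""
    alt = caption.strip().rstrip(".")
    alt = alt[:1].upper() + alt[1:]  # capitalize the first letter only
    # Escape both values so quotes or angle brackets cannot break the markup.
    return f'<img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}">'


print(img_tag("times_square.jpg", "a group of people are in the middle of a city."))
# <img src="times_square.jpg" alt="A group of people are in the middle of a city">
```

Escaping matters here because captions are model output and should be treated as untrusted text.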

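Content management

Automatic image tagging

Automatically generates tags and descriptions for photo libraries or social media images.

Quickly produces searchable metadata.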
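
As an illustration of searchable metadata, the following sketch builds a small inverted index from captions and answers keyword queries. The stopword list, the function names, and the second caption are assumptions made for this example; a real system would likely use a proper search engine.

```python
import re
from collections import defaultdict

# Small illustrative stopword list (an assumption, not from the model card).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "and", "with"}


def keywords(text):
    """Lowercase, split into alphanumeric words, and drop stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def build_index(captions):
    """Map each keyword to the set of image ids whose caption contains it."""
    index = defaultdict(set)
    for image_id, caption in captions.items():
        for word in keywords(caption):
            index[word].add(image_id)
    return index


def search(index, query):
    """Return the image ids whose captions contain every keyword in the query."""
    terms = keywords(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


captions = {
    "times_square.jpg": "a group of people are in the middle of a city.",
    "beach.jpg": "a dog is running on the beach.",  # hypothetical caption
}
index = build_index(captions)
print(search(index, "people city"))  # {'times_square.jpg'}
```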

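🚀 Nano Image Captioning Model

An image captioning model based on BERT-Tiny and ViT-Tiny that weighs only 40 MB. It runs fast even on CPU, making it an efficient way to add descriptions to images.

🚀 Quick Start

The model generates image descriptions quickly. The following code shows how to use it: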

from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel
import requests, time
from PIL import Image

model_path = "cnmoro/nano-image-captioning"

# Load the image captioning model with its tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

# Download and preprocess the image (convert to RGB so grayscale or RGBA inputs also work)
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/New_york_times_square-terabass.jpg/800px-New_york_times_square-terabass.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = image_processor(image, return_tensors="pt").pixel_values

start = time.time()

# Generate the caption - recommended settings
# Note: temperature, top_p and top_k only affect the output when sampling is
# enabled (do_sample=True, either here or in the model's generation config)
generated_ids = model.generate(
    pixel_values,
    temperature=0.7,
    top_p=0.8,
    top_k=50,
    num_beams=3 # use 1 for faster inference at slightly lower quality
)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

end = time.time()

print(generated_text)
# a group of people are in the middle of a city.

print(f"Time taken: {end - start} seconds")
# Time taken: 0.07550048828125 seconds
# on CPU !
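
A single timed call can be noisy, since the first run often includes one-time setup costs. The sketch below shows one way to get a steadier number: discard a few warm-up runs and report the median of several timed runs. The helper name `benchmark` and its defaults are assumptions for illustration, and the workload is a stand-in so the snippet runs on its own.

```python
import statistics
import time


def benchmark(fn, warmup=2, runs=10):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):  # warm-up calls are executed but not timed
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload; with the model loaded as in the quick start, pass
# lambda: model.generate(pixel_values, num_beams=3) instead.
median_s = benchmark(lambda: sum(range(100_000)))
print(f"median: {median_s:.6f} s")
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring short durations.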

💻 Usage Examples

Advanced usage

# To process multiple images, store the image URLs in a list and loop over them.
from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel
import requests, time
from PIL import Image

model_path = "cnmoro/nano-image-captioning"

# Load the image captioning model with its tokenizer and image processor (once, outside the loop)
model = VisionEncoderDecoderModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

image_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/New_york_times_square-terabass.jpg/800px-New_york_times_square-terabass.jpg",
    "https://example.com/another_image.jpg"  # placeholder, replace with a real image URL
]

for url in image_urls:
    # Convert to RGB so grayscale or RGBA inputs also work
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    pixel_values = image_processor(image, return_tensors="pt").pixel_values

    start = time.time()

    # Generate the caption - recommended settings
    generated_ids = model.generate(
        pixel_values,
        temperature=0.7,
        top_p=0.8,
        top_k=50,
        num_beams=3 # use 1 for faster inference at slightly lower quality
    )
    generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    end = time.time()

    print(generated_text)
    print(f"Time taken: {end - start} seconds")
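
The loop above captions one image per `generate` call. As an alternative sketch, several images can be passed through the model together, since the image processor accepts a list of images. The helpers below are hypothetical (`caption_batch`, `caption_in_chunks`, and the `batch_size` default are not from the model card) and take the already loaded model, tokenizer, and image processor as arguments.

```python
def caption_batch(images, model, tokenizer, image_processor, **generate_kwargs):
    """Caption a list of PIL images with a single generate() call."""
    if not images:
        return []
    # The processor stacks the images into one tensor of shape (batch, 3, H, W).
    pixel_values = image_processor(images, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values, **generate_kwargs)
    captions = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return [caption.strip() for caption in captions]


def caption_in_chunks(images, model, tokenizer, image_processor,
                      batch_size=8, **generate_kwargs):
    """Process images in fixed-size chunks to keep memory use bounded."""
    captions = []
    for start in range(0, len(images), batch_size):
        chunk = images[start:start + batch_size]
        captions.extend(
            caption_batch(chunk, model, tokenizer, image_processor, **generate_kwargs)
        )
    return captions
```

With the objects from the example above, a call might look like `caption_in_chunks(images, model, tokenizer, image_processor, num_beams=3)`, where `images` is a list of RGB PIL images. Whether batching is actually faster on CPU for a model this small is worth measuring rather than assuming.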

📄 License

This project is licensed under Apache-2.0.

📚 Detailed Documentation

Property	Details
Base models	WinKawaks/vit-tiny-patch16-224, google/bert_uncased_L-2_H-128_A-2
Task type	Image-to-text
Library	Transformers
Tags	ViT, BERT, vision, caption, captioning, image