CogFlorence-2.1-Large: an open-source model for practical image-to-text

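CogFlorence 2.1 Large

Developed by thwri

This model is a fine-tuned version of microsoft/Florence-2-large. It was trained on a 40,000-image subset of the Ejafa/ye-pop dataset, with captions generated by THUDM/cogvlm2-llama3-chat-19B, and focuses on image-to-text tasks.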

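Image-to-Text

Transformers

Multilingual support | License: MIT | #fine-grained image captioning #multimodal generation #artistic scene understanding

Downloads: 2,541

Released: 7/27/2024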

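Model overview

The model is mainly used for image-to-text tasks and generates detailed image descriptions. Fine-tuning on a 40,000-image subset of Ejafa/ye-pop aims to improve its captioning ability.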

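Model features

High-quality image captioning

Generates detailed, accurate image descriptions for images on a wide range of subjects.

Fine-tuned on a 40,000-image subset

Fine-tuned on a 40,000-image subset of the Ejafa/ye-pop dataset to improve generalization.

Frozen vision encoder

The vision encoder was frozen during training, which preserves the visual feature extraction of the original model.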

Model capabilities

Image description generation

Image analysis across many subjects

High-quality text output

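Use cases

Image captioning

Detailed image description

Generates a detailed text description for an image, suitable for content management and retrieval.

The description covers details such as color, shape, and background.

Content management

Automated image tagging

Generates tags automatically for large numbers of images, making content management more efficient.

Produces accurate image tags quickly and reduces manual annotation work.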

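🚀 microsoft/Florence-2-large fine-tuned on Ejafa/ye-pop with CogVLM2-generated captions

This repository contains a fine-tuned version of the microsoft/Florence-2-large model. The model was fine-tuned on a 40,000-image subset of the Ejafa/ye-pop dataset, with captions generated by THUDM/cogvlm2-llama3-chat-19B.

🚀 Quick start

To use this model, load it directly from the Hugging Face model hub: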

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "thwri/CogFlorence-2.1-Large"

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# trust_remote_code=True lets transformers run the model code shipped in the repository
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True).to(device).eval()
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)


def run_example(task_prompt, image):
    """Run the model on one PIL image and return the parsed answer for task_prompt."""
    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")
    inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(device)
    # Beam search combined with sampling, so the caption can differ between runs
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=True,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Returns a dict keyed by the task prompt
    return processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )


url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
result = run_example("<MORE_DETAILED_CAPTION>", image)
print(result)
# Example output (sampling is enabled, so the exact text varies):
# {'<MORE_DETAILED_CAPTION>': 'A vivid, close-up photograph of a classic car, specifically a Volkswagen Beetle, parked on a cobblestone street. The car is painted in a striking shade of turquoise, with a glossy finish that reflects the surrounding environment. The vehicle's rounded shape is accentuated by its rounded tires and chrome detailing. The background reveals a weathered yellow wall with a rustic wooden door, adding to the rustic charm of the scene. The sky above is clear, suggesting a sunny day. The overall style of the image is candid, capturing a moment in time without any posed or staged elements.'}
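For the content-management use case, the same call can be applied to a whole folder. The following is a minimal sketch, not part of the original README: `caption_directory`, `IMAGE_SUFFIXES`, and the sidecar `.txt` layout are hypothetical choices made for illustration. The captioning function is passed in as a parameter, so the helper itself needs only the standard library.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# Assumption: which file extensions count as images; extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def caption_directory(
    image_dir: Path,
    caption_fn: Callable[[Path], str],
    suffixes: Iterable[str] = IMAGE_SUFFIXES,
) -> Dict[str, str]:
    """Caption every image in image_dir and write one sidecar .txt per image.

    caption_fn receives the image path and returns the caption text.
    Returns a mapping from image file name to caption.
    """
    allowed = {s.lower() for s in suffixes}
    captions: Dict[str, str] = {}
    # sorted() materializes the listing before any .txt files are written
    for path in sorted(image_dir.iterdir()):
        if not path.is_file() or path.suffix.lower() not in allowed:
            continue
        caption = caption_fn(path).strip()
        path.with_suffix(".txt").write_text(caption, encoding="utf-8")
        captions[path.name] = caption
    return captions


# Hypothetical wiring to the run_example function defined above:
# task = "<MORE_DETAILED_CAPTION>"
# captions = caption_directory(
#     Path("images"),
#     lambda p: run_example(task, Image.open(p))[task],
# )
```

Injecting `caption_fn` keeps the file handling separate from the model call, so the loop can be checked with a stub before a GPU run.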

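✨ Key features

Fine-tuned from microsoft/Florence-2-large to improve image caption generation.
Trained on a 40,000-image subset of the Ejafa/ye-pop dataset, which covers a wide variety of images.
Training captions were generated with THUDM/cogvlm2-llama3-chat-19B to provide high-quality targets.

📦 Installation

This README does not give specific installation steps. If needed, refer to the Hugging Face documentation for loading the model and installing dependencies.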

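📚 Documentation

Training details

Vision encoder: frozen during training.
Batch size: 64
Gradient accumulation steps: 16
Learning rate: 5.12e-05
Optimizer: AdamW
Scheduler: polynomial
Epochs: 7.37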
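The figures above can be combined into a rough estimate of the effective batch size and the number of optimizer steps. This is back-of-the-envelope arithmetic only. It assumes that 64 is the batch processed per forward pass on a single device, so that gradient accumulation multiplies it by 16. If 64 is already the effective batch, the step counts are 16 times larger.

```python
import math

# Values reported in the training details.
BATCH_SIZE = 64
GRAD_ACCUM_STEPS = 16
NUM_IMAGES = 40_000
EPOCHS = 7.37

# Assumption: 64 is the per-step batch on one device, not the accumulated total.
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS
steps_per_epoch = math.ceil(NUM_IMAGES / effective_batch)
approx_total_steps = round(NUM_IMAGES * EPOCHS / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"approximate optimizer steps in total: {approx_total_steps}")
```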

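Dataset

Fine-tuning used a 40,000-image subset of the Ejafa/ye-pop dataset. The dataset contains images on a wide range of subjects, which gives the model a broad training base for caption generation.

Caption generation

Captions were generated with THUDM/cogvlm2-llama3-chat-19B.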

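🔧 Technical details

The vision encoder was frozen during training. With the batch size, gradient accumulation steps, learning rate, optimizer, and scheduler listed above, the model was trained for 7.37 epochs on the Ejafa/ye-pop subset to improve its performance on image captioning. The training captions were generated with THUDM/cogvlm2-llama3-chat-19B, which is intended to keep them high in quality and varied.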
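The source names a polynomial scheduler but gives no further parameters. As an illustration of how such a schedule behaves, here is a minimal sketch that assumes no warmup, a final learning rate of 0, and a power of 1.0 (which reduces to linear decay). The function name and these defaults are hypothetical, not the author's training configuration.

```python
def polynomial_decay_lr(
    step: int,
    total_steps: int,
    base_lr: float = 5.12e-05,  # learning rate reported in the training details
    end_lr: float = 0.0,        # assumption: decay all the way to zero
    power: float = 1.0,         # assumption: power 1.0, i.e. linear decay
) -> float:
    """Return the learning rate at `step` under polynomial decay without warmup."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, total_steps] stay at the boundary values
    progress = min(max(step, 0), total_steps) / total_steps
    return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr


# Hypothetical run of 100 optimizer steps
for s in (0, 25, 50, 75, 100):
    print(s, polynomial_decay_lr(s, total_steps=100))
```

A larger `power` makes the rate fall faster early in training, while a non-zero `end_lr` sets a floor for the final steps.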