CogFlorence-2.1-Large: an open-source model for efficient image-to-text

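CogFlorence 2.1 Large

Developed by thwri

This model is a fine-tuned version of microsoft/Florence-2-large, trained on a 40,000-image subset of the Ejafa/ye-pop dataset with captions generated by THUDM/cogvlm2-llama3-chat-19B. It focuses on image-to-text tasks.

Image-to-Text

Transformers

Supports multiple languages. License: MIT. Tags: #FineGrainedImageCaptioning #MultimodalGeneration #ArtSceneUnderstanding

Downloads: 2,541

Released: 7/27/2024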

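Model Overview

The model is intended mainly for image-to-text tasks and generates detailed image descriptions. Fine-tuning on a 40,000-image dataset subset is meant to improve its captioning ability.

Model Features

High-quality image captioning

Aims to produce detailed, accurate descriptions for images across a wide range of subjects.

Training on a sizable dataset

Fine-tuned on a 40,000-image subset of the Ejafa/ye-pop dataset, with the goal of improving generalization.

Frozen vision encoder

The vision encoder was frozen during training, so the base model's visual feature extraction is preserved.

Model Capabilities

Image caption generation

Analysis of images across many subjects

High-quality text output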

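Use Cases

Image annotation

Detailed image descriptions

Generate detailed text descriptions of images for content management and retrieval.

Descriptions include details such as color, shape, and background.

Content management

Automated image tagging

Automatically generate tags for large numbers of images to make content management more efficient.

Tags are produced quickly, reducing manual annotation work.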

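🚀 A fine-tune of microsoft/Florence-2-large on the Ejafa/ye-pop dataset with CogVLM2-generated captions

This repository contains a fine-tuned version of the microsoft/Florence-2-large model. It was fine-tuned on a 40,000-image subset of the Ejafa/ye-pop dataset, with captions generated by THUDM/cogvlm2-llama3-chat-19B.

🚀 Quick Start

To use the model, load it directly from the Hugging Face model hub: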

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Use the GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("thwri/CogFlorence-2.1-Large", trust_remote_code=True).to(device).eval()
processor = AutoProcessor.from_pretrained("thwri/CogFlorence-2.1-Large", trust_remote_code=True)

def run_example(task_prompt, image):
    """Run one task prompt on a PIL image and return the parsed result dict."""
    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")
    inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=True,
    )
    # Special tokens are kept here and handled by post_process_generation.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
result = run_example("<MORE_DETAILED_CAPTION>", image)
print(result)
# Example output (wording can vary between runs because do_sample=True):
# {'<MORE_DETAILED_CAPTION>': 'A vivid, close-up photograph of a classic car, specifically a Volkswagen Beetle, parked on a cobblestone street. The car is painted in a striking shade of turquoise, with a glossy finish that reflects the surrounding environment. The vehicle's rounded shape is accentuated by its rounded tires and chrome detailing. The background reveals a weathered yellow wall with a rustic wooden door, adding to the rustic charm of the scene. The sky above is clear, suggesting a sunny day. The overall style of the image is candid, capturing a moment in time without any posed or staged elements.'}

✨ Key Features

Fine-tuned from microsoft/Florence-2-large to improve image captioning.
Trained on a 40,000-image subset of the Ejafa/ye-pop dataset, which covers a wide variety of subjects.
Training captions were generated with THUDM/cogvlm2-llama3-chat-19B.

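📦 Installation

The README gives no installation steps. The quick start code imports transformers, torch, PIL (Pillow) and requests, so these packages need to be installed; see the Hugging Face documentation for details on loading models and installing dependencies.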
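As a rough starting point, the sketch below checks whether the modules imported by the quick start code are available in the current environment. The `einops` and `timm` entries are an assumption: Florence-2's remote code commonly needs them, but this README does not say so, and the helper name `missing_packages` is purely illustrative.

```python
import importlib.util

# Modules imported by the quick start code, mapped to their pip package names.
REQUIRED = {
    "transformers": "transformers",
    "torch": "torch",
    "PIL": "pillow",
    "requests": "requests",
}
# Assumption (not stated in the README): often needed by Florence-2 remote code.
ASSUMED = {"einops": "einops", "timm": "timm"}


def missing_packages(modules):
    """Return pip names of modules that cannot be found in this environment."""
    return [
        pip_name
        for module, pip_name in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages({**REQUIRED, **ASSUMED})
    if missing:
        print("Missing packages; try: pip install " + " ".join(missing))
    else:
        print("All listed packages are importable.")
```

If the assumed packages turn out not to be needed in your setup, remove them from the check.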

💻 Usage Examples

Basic usage

The basic usage is identical to the quick start example above.

Advanced usage

The README includes no advanced usage examples. For more complex scenarios, consult the model's API documentation.
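One more involved scenario is the bulk tagging use case described earlier. The sketch below walks a folder and writes one caption file per image. The function `caption_directory`, its `caption_fn` parameter and the `.txt` sidecar layout are hypothetical choices for illustration, not part of the model's API; `caption_fn` is expected to wrap the quick start's `run_example`.

```python
from pathlib import Path

# File extensions treated as images (an illustrative choice).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def caption_directory(image_dir, caption_fn, overwrite=False):
    """Write <image stem>.txt next to each image and return the files written.

    caption_fn is a hypothetical callable that takes an image path and
    returns the caption as a string.
    """
    written = []
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        out_path = path.with_suffix(".txt")
        if out_path.exists() and not overwrite:
            continue  # keep captions from an earlier run
        out_path.write_text(caption_fn(path).strip() + "\n", encoding="utf-8")
        written.append(out_path)
    return written
```

With the quick start code loaded, `caption_fn` could be `lambda p: run_example("<MORE_DETAILED_CAPTION>", Image.open(p))["<MORE_DETAILED_CAPTION>"]`. Skipping existing caption files lets an interrupted run resume without repeating work.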

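📚 Documentation

Training Details

Vision encoder: frozen during training
Batch size: 64
Gradient accumulation steps: 16
Learning rate: 5.12e-05
Optimizer: AdamW
Scheduler: polynomial
Epochs: 7.37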
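For a sense of scale, the figures above can be combined into an approximate optimizer step count. This is back-of-the-envelope arithmetic only: the README does not say whether 64 is the size of each forward pass or already the total after accumulation, so both readings are shown, and a single training process is assumed.

```python
import math

NUM_IMAGES = 40_000      # subset size stated in the README
BATCH_SIZE = 64          # stated; its exact meaning is an assumption below
GRAD_ACCUM_STEPS = 16    # stated
EPOCHS = 7.37            # stated


def approx_optimizer_steps(effective_batch):
    """Approximate number of optimizer updates over the whole run."""
    return math.floor(NUM_IMAGES * EPOCHS / effective_batch)


# Reading A (assumption): 64 samples per forward pass, accumulated 16 times.
steps_a = approx_optimizer_steps(BATCH_SIZE * GRAD_ACCUM_STEPS)
# Reading B (assumption): 64 is already the effective batch size.
steps_b = approx_optimizer_steps(BATCH_SIZE)

print(f"reading A: effective batch {BATCH_SIZE * GRAD_ACCUM_STEPS}, ~{steps_a} steps")
print(f"reading B: effective batch {BATCH_SIZE}, ~{steps_b} steps")
```

Under reading A the run has roughly 287 updates, under reading B roughly 4,606; which one applies cannot be determined from the README alone.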

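Dataset

The model was fine-tuned on a 40,000-image subset of the Ejafa/ye-pop dataset. The dataset contains images covering a wide range of subjects, which gives the model varied material for learning to caption.

Caption Generation

Captions were generated with THUDM/cogvlm2-llama3-chat-19B.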

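🔧 Technical Details

The vision encoder was frozen during training. With a batch size of 64, 16 gradient accumulation steps, a learning rate of 5.12e-05, the AdamW optimizer and a polynomial scheduler, the model was trained for 7.37 epochs on the Ejafa/ye-pop subset to improve its image captioning performance. The training captions were generated with THUDM/cogvlm2-llama3-chat-19B, with the aim of providing high-quality, varied captions.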
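To make the scheduler setting concrete, here is a minimal sketch of polynomial learning-rate decay starting from the stated 5.12e-05. The README only says "polynomial"; the power (1.0, which gives linear decay), the end learning rate (0), the absence of warmup and the total step count are all assumptions for illustration, and `polynomial_lr` is a hypothetical helper rather than the training code.

```python
def polynomial_lr(step, total_steps, base_lr=5.12e-05, end_lr=0.0, power=1.0):
    """Learning rate at `step` under polynomial decay from base_lr to end_lr."""
    # Clamp so steps outside [0, total_steps] stay on the schedule's endpoints.
    step = min(max(step, 0), total_steps)
    remaining = 1.0 - step / total_steps
    return (base_lr - end_lr) * remaining ** power + end_lr


if __name__ == "__main__":
    total = 100  # hypothetical step count, not from the README
    for step in (0, total // 2, total):
        print(step, polynomial_lr(step, total))
```

A power above 1.0 makes the rate fall faster early in training, while a power below 1.0 keeps it higher for longer.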