CogFlorence-2-Large-Freeze Open-Source Model - Free, Accurate Image-to-Text

Cogflorence 2 Large Freeze

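Developed by thwri

This is a fine-tuned version of the microsoft/Florence-2-large model. It was trained on a 38,000-image subset of the Ejafa/ye-pop dataset, with captions generated by CogVLM2, and focuses on image-to-text tasks.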

Image-to-Text

Transformers

Supports multiple languages
License: MIT
#FineGrainedImageCaptioning #MultimodalUnderstanding #ArtImageAnalysis

Downloads: 419

Release date: 7/4/2024

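Model Overview

This is a vision-language model that generates detailed text descriptions from input images. It is fine-tuned from Florence-2-large to strengthen its image captioning ability.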

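Model Features

High-quality image captioning

Generates detailed, accurate image descriptions that capture the key elements and details of the image

Large-scale fine-tuning data

Trained on 38,000 diverse images to improve the model's generalization

Frozen vision encoder

The vision encoder's parameters are kept fixed during training, so fine-tuning concentrates on the text generation side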
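
To illustrate what freezing an encoder means in practice, here is a minimal PyTorch sketch. It uses a toy two-layer model, not Florence-2 itself: the class `ToyCaptioner` and the attribute names `vision_encoder` and `text_decoder` are hypothetical stand-ins, and the source does not describe the author's actual training script. The idea is simply to turn off gradients for the encoder and give the optimizer only the remaining parameters.

```python
import torch
from torch import nn

torch.manual_seed(0)


class ToyCaptioner(nn.Module):
    """Hypothetical stand-in for a vision-language model (not Florence-2's real layout)."""

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(16, 8)  # plays the role of the image encoder
        self.text_decoder = nn.Linear(8, 4)     # plays the role of the text generator

    def forward(self, x):
        return self.text_decoder(self.vision_encoder(x))


def freeze(module: nn.Module) -> None:
    """Disable gradient tracking for every parameter in the module."""
    for param in module.parameters():
        param.requires_grad = False


model = ToyCaptioner()
freeze(model.vision_encoder)

# Only parameters that still require gradients are handed to the optimizer.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

vision_before = model.vision_encoder.weight.clone()
decoder_before = model.text_decoder.bias.clone()

# One dummy training step on random input.
loss = model(torch.randn(2, 16)).sum()
loss.backward()
optimizer.step()

print(trainable)  # ['text_decoder.weight', 'text_decoder.bias']
print(torch.equal(vision_before, model.vision_encoder.weight))  # True: encoder unchanged
```

After the step, the frozen encoder's weights are identical to their starting values, while the decoder's parameters have moved.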

Model Capabilities

Image understanding

Detailed image description generation

Multi-element scene analysis

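Use Cases

Content generation

Automatic image captioning

Automatically generate detailed descriptions for the pictures in an image library

Improve image retrieval efficiency and accessibility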
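
As a rough sketch of the captioning-for-retrieval use case, the snippet below builds a caption index for a set of images and runs a simple keyword search over it. The helpers `build_caption_index` and `search` are hypothetical names introduced for illustration. The captioning step is passed in as a function, so in real use it would wrap the model call; here a stub with made-up captions stands in for it so the example runs without downloading the model.

```python
from typing import Callable, Dict, Iterable, List


def build_caption_index(
    image_paths: Iterable[str], caption_fn: Callable[[str], str]
) -> Dict[str, str]:
    """Map each image path to the caption returned by caption_fn."""
    return {path: caption_fn(path) for path in image_paths}


def search(index: Dict[str, str], query: str) -> List[str]:
    """Return paths whose caption contains every query term (case-insensitive substring match)."""
    terms = query.lower().split()
    return [
        path
        for path, caption in index.items()
        if all(term in caption.lower() for term in terms)
    ]


def fake_caption(path: str) -> str:
    """Stub captioner with invented captions; replace with a real model call."""
    canned = {
        "car.jpg": "a teal car parked on a cobblestone street",
        "cat.jpg": "a grey cat sleeping on a sofa",
    }
    return canned[path]


index = build_caption_index(["car.jpg", "cat.jpg"], fake_caption)
print(search(index, "teal car"))  # ['car.jpg']
print(search(index, "sofa"))      # ['cat.jpg']
```

Substring matching is deliberately naive (a query for "car" would also match "carpet"); a real image library would more likely use a full-text or embedding-based search over the same captions.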

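Assistive technology

Visual assistance

Provide detailed spoken descriptions of image content for visually impaired users

Improve the accessibility of digital content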

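🚀 microsoft/Florence-2-large fine-tuned on Ejafa/ye-pop with CogVLM2 captions

This repository contains a fine-tuned version of the microsoft/Florence-2-large model. The model was fine-tuned on a 38,000-image subset of the Ejafa/ye-pop dataset, with captions generated by THUDM/cogvlm2-llama3-chat-19B.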

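🚀 Quick Start

The model in this repository is a fine-tuned version of microsoft/Florence-2-large and can be used for image-to-text tasks. Fine-tuning on this dataset is aimed at improving the model's image captioning ability.

✨ Key Features

Fine-tuned from the microsoft/Florence-2-large model.
Trained on a 38,000-image subset of the Ejafa/ye-pop dataset.
Captions generated with THUDM/cogvlm2-llama3-chat-19B.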

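📦 Installation

The original documentation gives no specific installation steps; follow the installation instructions for the transformers library.

💻 Usage Example

Basic usage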

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned model and its processor; the repository relies on custom code,
# hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained("thwri/CogFlorence-2-Large-Freeze", trust_remote_code=True).to(device).eval()
processor = AutoProcessor.from_pretrained("thwri/CogFlorence-2-Large-Freeze", trust_remote_code=True)

def run_example(task_prompt, image):
    """Run one task prompt on a PIL image and return the parsed result."""
    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=True,  # sampling is enabled, so the caption can differ between runs
    )
    # Decode with special tokens kept, then let the processor parse the text for this task
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
result = run_example("<MORE_DETAILED_CAPTION>", image)
print(result)

# Example output (wording may vary from run to run):
# {'<MORE_DETAILED_CAPTION>': "a turquoise volkswagen beetle parked on a cobblestone street in front of a yellow wall with two wooden doors. the car's body is painted in a vibrant shade of teal, with a glossy finish that reflects the sunlight, and the wheels are polished with a silver hubcap. the building behind the car has a weathered, aged appearance, with visible cracks and peeling paint. the sky above is clear and blue, suggesting a sunny day."}
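
As the example output shows, the parsed result is a dictionary keyed by the task prompt. A small sketch of pulling out the caption string and saving it next to the image as a text file might look like the following. The helpers `caption_text` and `write_sidecar` are hypothetical names, and writing a same-named `.txt` file is only one common convention for caption datasets, not something the source prescribes.

```python
from pathlib import Path

TASK = "<MORE_DETAILED_CAPTION>"


def caption_text(parsed: dict, task: str = TASK) -> str:
    """Extract the caption string from the dict returned by run_example."""
    return parsed[task].strip()


def write_sidecar(image_path: str, caption: str) -> Path:
    """Write the caption to a .txt file with the same stem as the image."""
    out_path = Path(image_path).with_suffix(".txt")
    out_path.write_text(caption + "\n", encoding="utf-8")
    return out_path


# A shortened, hand-written result in the same shape as the output above.
parsed = {TASK: " a turquoise volkswagen beetle parked on a cobblestone street. "}
print(caption_text(parsed))
```

Calling `write_sidecar("photos/car.jpg", caption_text(parsed))` would then create `photos/car.txt`, assuming the `photos` directory already exists.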