Florence-2-FT-DocVQA: Open-Source Document Visual Question Answering Model

Florence 2 FT DocVQA

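Developed by sahilnishad

A document visual question answering model fine-tuned from Florence-2-base, built specifically for answering questions about document images.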

Image-to-Text

Transformers

Language: English  License: MIT  #document-image-QA #multimodal-processing #Florence-2-fine-tuning

Downloads: 4,928

Released: 11/2/2024

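Model Overview

The model was fine-tuned on the DocumentVQA dataset. It reads the content of a document image and answers questions about it, which makes it suitable for a range of document analysis scenarios.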

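Model Features

Document image understanding

Parses and understands the content and structure of document images

Question answering

Answers questions grounded in the document's content

Multimodal processing

Processes visual and textual information together for cross-modal understanding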

Model Capabilities

Document image analysis

Visual question answering

Text extraction

Cross-modal understanding

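Use Cases

Document processing

Contract analysis

Extract key terms and conditions from contract documents

Invoice processing

Identify amounts, dates, and vendor information on invoices

Education

Exam grading

Extract answers from student answer sheets to support automated grading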

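🚀 Florence-2 Fine-Tuned on the DocumentVQA Dataset

This project fine-tunes the Florence-2 model on the DocumentVQA dataset so that it can answer questions about document images. The model is multimodal and can be used for tasks such as image-to-text conversion and visual question answering.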

🚀 Quick Start

Install dependencies

!pip install torch transformers datasets flash_attn

Load the model and processor

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# The checkpoint ships custom modeling code, so trust_remote_code=True is required.
model = AutoModelForCausalLM.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Run inference

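def run_inference(task_prompt, question, image):
    """Answer one question about a PIL document image and return the decoded text."""
    # The task token is prepended directly to the question, e.g. "<DocVQA>What is the date?"
    prompt = task_prompt + question

    # The processor expects 3-channel input; convert grayscale or RGBA scans.
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

    # Gradients are not needed for generation.
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3
        )
    # Decode the first (and only) sequence in the batch, dropping special tokens.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return generated_text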
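
`run_inference` answers one question per call. For documents where several fields are needed (for example the date and total on an invoice), a thin wrapper can loop over the questions. The following is a minimal sketch, not part of the original project: `ask_many` and `fake_inference` are hypothetical names, and `fake_inference` only stands in for `run_inference` so the snippet runs without downloading the model.

```python
from typing import Any, Callable, Dict, Iterable


def ask_many(
    answer_fn: Callable[[str, str, Any], str],
    image: Any,
    questions: Iterable[str],
    task_prompt: str = "<DocVQA>",
) -> Dict[str, str]:
    """Ask several questions about one document image.

    answer_fn is assumed to have the same signature as run_inference:
    (task_prompt, question, image) -> generated text.
    """
    answers: Dict[str, str] = {}
    for question in questions:
        question = question.strip()
        if not question:
            # Skip empty questions instead of sending a bare task prompt.
            continue
        answers[question] = answer_fn(task_prompt, question, image).strip()
    return answers


# Hypothetical stand-in for run_inference; it just echoes the prompt.
def fake_inference(task_prompt: str, question: str, image: Any) -> str:
    return f" echo: {task_prompt}{question} "


fields = ask_many(
    fake_inference,
    image=None,
    questions=["What is the invoice date?", "  ", "What is the total amount?"],
)
```

With the real model loaded, passing `run_inference` as `answer_fn` and a PIL image as `image` should give one answer per question. Each question still triggers its own `generate` call, so the cost grows linearly with the number of questions.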

Example

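from PIL import Image
from datasets import load_dataset

# Downloads the DocumentVQA dataset from the Hugging Face Hub on first use.
data = load_dataset("HuggingFaceM4/DocumentVQA")

question = "What do you see in this image?"
image = data['train'][0]['image']  # first training sample, already a PIL image
print(run_inference("<DocVQA>", question, image))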
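
To judge whether a generated answer matches the ground truth, a string-similarity score tends to be more forgiving than exact match. The sketch below implements ANLS (Average Normalized Levenshtein Similarity, per-sample form), the metric commonly reported for DocVQA. The source does not say which metric the author used, so treat this as an illustration; `levenshtein` and `anls_score` are hypothetical helper names, and the assumption that each dataset sample carries a list of reference answers under an `answers` field should be checked against the dataset card.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def anls_score(
    prediction: str, references: Sequence[str], threshold: float = 0.5
) -> float:
    """Best normalized similarity against any reference answer.

    Scores whose normalized edit distance is not below the threshold are
    set to 0, following the usual ANLS definition.
    """
    pred = prediction.strip().lower()
    best = 0.0
    for ref in references:
        ref = ref.strip().lower()
        longest = max(len(pred), len(ref))
        similarity = 1.0 if longest == 0 else 1.0 - levenshtein(pred, ref) / longest
        best = max(best, similarity)
    return best if (1.0 - best) < threshold else 0.0
```

Averaging `anls_score` over a validation split would give the dataset-level ANLS. Comparison here is case-insensitive and ignores leading and trailing whitespace, which is an assumption about normalization rather than something stated in the source.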

📚 Documentation

Project GitHub repository: click to view

📄 License

This project is released under the MIT License.

📚 Citation

@misc{sahilnishad_florence_2_ft_docvqa,
  author       = {Sahil Nishad},
  title        = {Fine-Tuning Florence-2 For Document Visual Question-Answering},
  year         = {2024},
  url          = {https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA},
  note         = {Model available on HuggingFace Hub},
  howpublished = {\url{https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA}},
}

📦 Model Information

Property	Details
Model type	Fine-tuned model based on Florence-2
Training data	HuggingFaceM4/DocumentVQA
Base model	microsoft/Florence-2-base
Tags	transformers, florence2, document-vqa, vqa, image-to-text, multimodal, question-answering