Florence-2-FT-DocVQA: Open-Source Document Visual Question Answering Model

Developed by sahilnishad

A document visual question answering model fine-tuned from Florence-2-base and designed for question answering over document images.

Image-to-Text

Transformers

English

Open Source License: MIT

#Document Image QA #Multimodal Processing #Florence-2 Fine-tuning

Downloads 4,928

Release Time: 11/2/2024

Model Overview

This model is fine-tuned on the DocumentVQA dataset. It reads the content of a document image and answers questions about it, which makes it suitable for a range of document analysis scenarios.

Model Features

Document Image Understanding

Capable of parsing and understanding content and structure in document images.

Question Answering Capability

Answers natural-language questions about the content of a document image.

Multimodal Processing

Simultaneously processes visual and textual information for cross-modal understanding.

Model Capabilities

Document Image Analysis

Visual Question Answering

Text Extraction

Cross-modal Understanding

Use Cases

Document Processing

Contract Analysis

Extract key terms and conditions from contract documents

Invoice Processing

Identify amounts, dates, and supplier information in invoices

Education

Exam Paper Grading

Automatically grade student answer sheets and extract answers
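
As an illustration of the invoice use case above, here is a minimal sketch that is not part of the original model card. It assumes an inference function with the same signature as the `run_inference` helper from the Quick Start section, passed in as `infer`. The `INVOICE_QUESTIONS` mapping, the `extract_fields` name, and the question wording are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical field-to-question mapping; adjust the wording to your documents.
INVOICE_QUESTIONS: Dict[str, str] = {
    "total_amount": "What is the total amount due?",
    "invoice_date": "What is the invoice date?",
    "supplier": "What is the name of the supplier?",
}


def extract_fields(
    infer: Callable[[str, str, Any], str],
    image: Any,
    questions: Dict[str, str] = INVOICE_QUESTIONS,
    task_prompt: str = "<DocVQA>",
) -> Dict[str, str]:
    """Ask one question per field and collect the whitespace-stripped answers."""
    return {
        field: infer(task_prompt, question, image).strip()
        for field, question in questions.items()
    }
```

With the model loaded, a call might look like `extract_fields(run_inference, image)`. Passing the inference function as an argument keeps the sketch independent of how the model is loaded.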

🚀 Transformers - Fine-tuned Florence-2 for DocumentVQA

This project presents a Florence-2 model fine-tuned on the DocumentVQA dataset for question answering on document images. It offers a practical starting point for multimodal question-answering tasks.

🚀 Quick Start

📦 Installation

!pip install torch transformers datasets flash_attn

💻 Loading model and processor

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# trust_remote_code=True lets transformers run the custom Florence-2 code hosted with the model
model = AutoModelForCausalLM.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)

# Run on the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

💻 Running inference

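def run_inference(task_prompt, question, image):
    """Answer `question` about a PIL `image`, using `task_prompt` (e.g. "<DocVQA>") as prefix."""
    # The model expects the task token immediately followed by the question
    prompt = task_prompt + question

    # The processor expects 3-channel input, so convert grayscale or RGBA images
    if image.mode != "RGB":
        image = image.convert("RGB")

    # Tokenize the prompt, preprocess the image, and move both to the model's device
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

    # Generate with beam search; gradients are not needed at inference time
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3
        )
    # Decode the first (and only) sequence in the batch to plain text
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return generated_text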

💻 Example

from datasets import load_dataset

# Download the DocumentVQA dataset from the Hugging Face Hub
data = load_dataset("HuggingFaceM4/DocumentVQA")

question = "What do you see in this image?"
# Images in the dataset are decoded as PIL images
image = data['train'][0]['image']
print(run_inference("<DocVQA>", question, image))
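
To check a generated answer against a set of reference answers, one option is a normalized exact-match comparison. The sketch below is an illustration under stated assumptions, not the evaluation used by the author: the helper names `normalize` and `matches_any` are hypothetical, and the comparison is a simple sanity check rather than an official benchmark metric.

```python
import string
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and trim punctuation from both ends."""
    collapsed = " ".join(text.lower().split())
    return collapsed.strip(string.punctuation + " ")


def matches_any(prediction: str, references: Iterable[str]) -> bool:
    """Return True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)
```

Assuming each dataset sample exposes `question` and `answers` fields (an assumption about the dataset schema that should be verified before use), a call might look like `matches_any(run_inference("<DocVQA>", sample["question"], sample["image"]), sample["answers"])`.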

📚 Documentation

Model Description

This is a Florence-2 model fine-tuned on the DocumentVQA dataset to perform question answering on document images.

📄 License

This project is licensed under the MIT license.

📖 BibTeX

@misc{sahilnishad_florence_2_ft_docvqa,
  author       = {Sahil Nishad},
  title        = {Fine-Tuning Florence-2 For Document Visual Question-Answering},
  year         = {2024},
  url          = {https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA},
  note         = {Model available on HuggingFace Hub},
  howpublished = {\url{https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA}},
}

📋 Information Table

Property	Details
Library Name	transformers
License	MIT
Datasets	HuggingFaceM4/DocumentVQA
Base Model	microsoft/Florence-2-base
Tags	transformers, florence2, document-vqa, vqa, image-to-text, multimodal, question-answering
