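🚀 Florence-2 Fine-Tuned on the DocumentVQA Dataset
This project fine-tunes the Florence-2 model on the DocumentVQA dataset so that it can answer questions about document images. The model is multimodal and can be used for tasks such as image-to-text conversion and visual question answering.
🚀 Quick Start
Install dependencies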
!pip install torch transformers datasets flash_attn
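Before loading the model, it can help to confirm that the packages installed above are importable in the current environment. The following is a minimal sketch using only the standard library. `missing_packages` and `REQUIRED` are hypothetical names introduced here for illustration, and the import names are assumed to match the pip names.

```python
import importlib.util

# Import names of the packages installed above
# (assumption: each import name matches its pip name).
REQUIRED = ["torch", "transformers", "datasets", "flash_attn"]


def missing_packages(names):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported as missing, rerun the install command before continuing.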
Load the model and processor
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
model = AutoModelForCausalLM.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
Run inference
def run_inference(task_prompt, question, image):
    # The prompt is the task token (e.g. "<DocVQA>") followed by the question.
    prompt = task_prompt + question
    # Convert to RGB if the image is in another mode (e.g. grayscale).
    if image.mode != "RGB":
        image = image.convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    # No gradients are needed for inference; decode with beam search (3 beams).
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3,
        )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return generated_text
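When several questions concern the same document image, a thin wrapper around the inference function keeps the calls tidy. The sketch below is illustrative only: `ask_many` is a hypothetical helper, and it takes the inference function as an argument (expected to have the same `(task_prompt, question, image)` signature as `run_inference` above) so that it does not depend on a loaded model.

```python
from typing import Any, Callable, Dict, Iterable


def ask_many(
    infer: Callable[[str, str, Any], str],
    image: Any,
    questions: Iterable[str],
    task_prompt: str = "<DocVQA>",
) -> Dict[str, str]:
    """Ask each non-empty question about one image and collect the answers."""
    answers: Dict[str, str] = {}
    for question in questions:
        question = question.strip()
        if not question:
            # Skip blank entries rather than sending an empty prompt.
            continue
        answers[question] = infer(task_prompt, question, image).strip()
    return answers
```

A call might look like `ask_many(run_inference, image, ["What is the date?", "Who is the sender?"])`, where the two questions are placeholders rather than questions taken from the dataset.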
Example
from PIL import Image
from datasets import load_dataset
data = load_dataset("HuggingFaceM4/DocumentVQA")
question = "What do you see in this image?"
image = data['train'][0]['image']
print(run_inference("<DocVQA>", question, image))
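To compare a prediction with a sample's reference answers, one option is a normalized exact-match check. This is a rough sketch, not the official DocVQA metric. It assumes each dataset sample provides its reference answers as a list of strings (for example under an `answers` key), which should be verified against the dataset card. `normalize` and `matches_any` are hypothetical helpers.

```python
import re


def normalize(text):
    """Lowercase, trim, and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text.strip().lower())


def matches_any(prediction, references):
    """Return True if the prediction equals any reference after normalization."""
    target = normalize(prediction)
    return any(target == normalize(ref) for ref in references)
```

Under the assumption above, a check on one sample could be written as `matches_any(run_inference("<DocVQA>", sample["question"], sample["image"]), sample["answers"])`. Exact match is strict, so near-miss answers (for example, small OCR differences) count as wrong.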
📚 Documentation
📄 License
This project is released under the MIT License.
📚 Citation
@misc{sahilnishad_florence_2_ft_docvqa,
author = {Sahil Nishad},
title = {Fine-Tuning Florence-2 For Document Visual Question-Answering},
year = {2024},
url = {https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA},
note = {Model available on HuggingFace Hub},
howpublished = {\url{https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA}},
}
📦 Model Information
| Property | Details |
| --- | --- |
| Model type | Fine-tuned model based on Florence-2 |
| Training data | HuggingFaceM4/DocumentVQA |
| Base model | microsoft/Florence-2-base |
| Tags | transformers, florence2, document-vqa, vqa, image-to-text, multimodal, question-answering |