InfiMM-HD Open-Source Multimodal Model - Free Deployment for Combined Image-Text Understanding and Generation

Developed by Infi-MM

InfiMM-HD is a high-resolution multimodal model that can understand and generate content combining images and text.

Image-to-Text

Transformers

Language: English

Tags: #high-resolution-multimodal #image-to-text #multimodal-understanding

Downloads: 17

Released: 3/3/2024

Model Overview

The model focuses on high-resolution multimodal understanding and handles joint image-text tasks such as image captioning.

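Model Features

High-resolution image understanding

Processes high-resolution images and extracts rich visual information.

Multimodal fusion

Fuses visual and textual information for cross-modal understanding.

Chinese optimization

Specifically optimized for Chinese-language scenarios.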

Model Capabilities

Image captioning

Visual question answering

Multimodal content understanding

Image-to-text

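Use Cases

Content generation

Automatic image description

Generates detailed Chinese descriptions for images.

Produces accurate, detailed image descriptions.

Assistive tools

Visual assistance

Helps visually impaired users understand image content.

Provides detailed text descriptions of images.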

🚀 InfiMM-HD

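InfiMM-HD is a model for high-resolution multimodal understanding. It takes text and image inputs and performs image-to-text generation. It is pre-trained on several large-scale datasets and supports research and applications in the multimodal domain.

🚀 Quick Start

Use the following code to get started with the base model: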

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("Infi-MM/infimm-hd", trust_remote_code=True)

prompts = [
    {
        "role": "user",
        "content": [
            {"image": "/xxx/test.jpg"},  # replace with the path to your image
            "Please describe the image in detail.",
        ],
    }
]
inputs = processor(prompts)
# Load the model in bfloat16 on GPU 0 and switch it to eval mode
model = AutoModelForCausalLM.from_pretrained(
    "Infi-MM/infimm-hd",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(0).eval()

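# Cast the image tensor to bfloat16 to match the model weights,
# then move every input tensor to the model's device.
inputs["batch_images"] = inputs["batch_images"].to(torch.bfloat16)
for k in inputs:
    inputs[k] = inputs[k].to(model.device)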

generated_ids = model.generate(
    **inputs,
    min_new_tokens=0,
    max_new_tokens=256,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text)
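
The prompt layout in the quick start (a single user turn whose content is an image entry followed by a text instruction) can be wrapped in a small helper. The following is a minimal sketch, not part of the InfiMM-HD code: `build_prompt` is a hypothetical name, and the question used for the visual question answering case is an illustrative assumption. The helper only builds plain Python data and never calls the processor or the model.

```python
from pathlib import Path
from typing import Any, Dict, List, Union


def build_prompt(image_path: Union[str, Path], instruction: str) -> List[Dict[str, Any]]:
    """Return a single-turn user prompt in the layout used by the quick start.

    Hypothetical helper: it builds plain Python data only and does not
    touch the InfiMM-HD processor or model.
    """
    if not instruction.strip():
        raise ValueError("instruction must not be empty")
    return [
        {
            "role": "user",
            "content": [
                {"image": str(image_path)},  # image entry comes first
                instruction,                 # followed by the text instruction
            ],
        }
    ]


# Image captioning: ask for a free-form description (same text as the quick start).
caption_prompt = build_prompt("/xxx/test.jpg", "Please describe the image in detail.")

# Visual question answering: same layout, but the text is a question (assumed wording).
vqa_prompt = build_prompt("/xxx/test.jpg", "How many people are in the image?")

print(caption_prompt[0]["content"][1])
```

Either list can be passed to `processor(...)` in place of `prompts` in the quick start; the rest of the pipeline is unchanged.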

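📚 Documentation

More details are available in our paper (https://arxiv.org/abs/2403.01487). We have released the pre-trained model and PyTorch code at https://github.com/InfiMM/infimm-hd/. You can build your own models on top of our pre-trained model.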

📄 License

This project is licensed under CC BY-NC 4.0.

Copyright of the images belongs to the original authors.

See LICENSE for more information.

📞 Contact Us

If you have any questions, feel free to email us at infimmbytedance@gmail.com.

📑 Citation

@misc{liu2024infimmhd,
      title={InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding}, 
      author={Haogeng Liu and Quanzeng You and Xiaotian Han and Yiqi Wang and Bohan Zhai and Yongfei Liu and Yunzhe Tao and Huaibo Huang and Ran He and Hongxia Yang},
      year={2024},
      eprint={2403.01487},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}