InfiMM-HD Open-source Multimodal Model - Free Deployment for Understanding and Generating Text-Image Combined Content

Developed by Infi-MM

InfiMM-HD is a high-resolution multimodal model that takes combined image and text input and generates text.

Image-to-Text

Transformers

English

#High-Resolution Multimodal #Image-to-Text #Multimodal Understanding

Release Time: 3/3/2024

Model Overview

This model focuses on high-resolution multimodal understanding and can handle joint tasks involving images and text, such as image caption generation.

Model Features

High-Resolution Image Understanding

Capable of processing high-resolution images to extract rich visual information.

Multimodal Fusion

Effectively integrates visual and textual information for cross-modal understanding.

Chinese Optimization

Specially optimized for Chinese language scenarios.

Model Capabilities

Image Caption Generation

Visual Question Answering

Multimodal Content Understanding

Image-to-Text

Use Cases

Content Generation

Automatic Image Captioning

Generates detailed Chinese descriptions for images.

Produces accurate and rich image descriptions.

Assistive Tools

Visual Assistance

Helps visually impaired individuals understand image content.

Provides detailed textual descriptions of images.

🚀 InfiMM-HD

InfiMM-HD is a multimodal model that combines text and image data, enabling high-resolution multimodal understanding and text generation.

🚀 Quick Start

Use the code below to get started with the base model:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("Infi-MM/infimm-hd", trust_remote_code=True)

prompts = [
    {
        "role": "user",
        "content": [
            {"image": "/xxx/test.jpg"},  # replace with the path to your image
            "Please describe the image in detail.",
        ],
    }
]
inputs = processor(prompts)
# Load the model in bfloat16 on GPU 0 and switch it to eval mode
model = AutoModelForCausalLM.from_pretrained(
    "Infi-MM/infimm-hd",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(0).eval()

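# Cast the image tensor to bfloat16 to match the model weights,
# then move every input tensor to the model's device
inputs["batch_images"] = inputs["batch_images"].to(torch.bfloat16)
for k in inputs:
    inputs[k] = inputs[k].to(model.device)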

generated_ids = model.generate(
    **inputs,
    min_new_tokens=0,
    max_new_tokens=256,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text)
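
The prompt passed to the processor above is a plain list of message dictionaries, so it can be assembled programmatically when captioning or questioning many images. The sketch below is only an illustration: `build_prompt` is a hypothetical helper, not part of the InfiMM-HD code base, and it assumes the processor accepts exactly the single-turn structure shown in the Quick Start example.

```python
from typing import Any, Dict, List


def build_prompt(image_path: str, question: str) -> List[Dict[str, Any]]:
    """Build a single-turn prompt in the structure used in the Quick Start.

    Hypothetical helper for illustration; it only mirrors the dictionary
    layout from the example above and does not call the model.
    """
    if not question.strip():
        raise ValueError("question must not be empty")
    return [
        {
            "role": "user",
            "content": [
                {"image": image_path},  # path to a local image file
                question,               # text instruction for the model
            ],
        }
    ]


# Same prompt as in the Quick Start, built through the helper
prompts = build_prompt("/xxx/test.jpg", "Please describe the image in detail.")
```

The resulting `prompts` value can then be passed to `processor(prompts)` in place of the hand-written list.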

📚 Documentation

More details can be found in our paper at https://arxiv.org/abs/2403.01487. We have released the pretrained model and the PyTorch code at https://github.com/InfiMM/infimm-hd/. Feel free to build your own model on top of our pretrained model.

📄 License

This project is licensed under CC BY-NC 4.0.

The copyright of the images belongs to the original authors.

See LICENSE for more information.

📞 Contact Us

If you have any questions, feel free to email us at infimmbytedance@gmail.com.

📖 Citation

@misc{liu2024infimmhd,
      title={InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding}, 
      author={Haogeng Liu and Quanzeng You and Xiaotian Han and Yiqi Wang and Bohan Zhai and Yongfei Liu and Yunzhe Tao and Huaibo Huang and Ran He and Hongxia Yang},
      year={2024},
      eprint={2403.01487},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Property	Details
Tags	multimodal, text, image, image-to-text
Datasets	HuggingFaceM4/OBELICS, laion/laion2B-en, coyo-700m, mmc4
Pipeline Tag	text-generation
Inference	true
