ReflectiVA: Open-Source Multimodal Large Language Model for Knowledge-Based Visual Question Answering

ReflectiVA

Developed by aimagelab

ReflectiVA is a multimodal large language model that enhances visual question answering capabilities by integrating external knowledge sources and a reflection token mechanism.

Image-Text-to-Text

Transformers

Open Source License: Apache-2.0 #Multimodal Knowledge Enhancement #Dynamic Knowledge Retrieval #Visual Question Answering Optimization

Downloads 46

Release Time: 11/25/2024

Model Overview

ReflectiVA is an innovative multimodal large language model capable of processing both text and image inputs. It dynamically determines whether external knowledge is needed through reflection tokens and retrieves relevant information from external databases when required, thereby improving performance in knowledge-based visual question answering tasks.

Model Features

Reflection Token Mechanism

Dynamically determines the need for external knowledge through specially designed reflection tokens, enabling intelligent knowledge retrieval.

Two-stage Training

Trains the reflective tokens with a two-stage, two-model recipe, so the model learns to use external knowledge while preserving fluency and performance on tasks that do not need it.

Knowledge Enhancement

Effectively integrates external knowledge sources to improve accuracy in complex visual question answering tasks.

Model Capabilities

Multimodal Understanding

Visual Question Answering

External Knowledge Retrieval

Image-Text Joint Processing

Use Cases

Education

Complex Visual Question Answering

Answering image-related questions that require external knowledge

The authors report that it outperforms existing methods on knowledge-based visual question answering tasks

Research

Multimodal Research

Exploring mechanisms of joint visual and language understanding

🚀 Model Card: Reflective LLaVA (ReflectiVA)

Multimodal LLMs (MLLMs) extend large language models to handle text and image data. This project introduces ReflectiVA, a novel approach to enhance MLLM adaptability by integrating external knowledge sources.

🚀 Quick Start

Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility.

In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge and predict the relevance of information retrieved from an external database. The tokens are trained following a two-stage, two-model training recipe. This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed.
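The control flow described above can be illustrated with a minimal Python sketch. It is not the ReflectiVA implementation: the token strings, the `ReflectiveVQA` class, and the toy stand-in functions are all hypothetical, chosen only to show how a "retrieve or not" decision and a per-passage relevance judgment could gate what the model sees before it answers.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical token names, used here for illustration only.
RETRIEVE, NO_RETRIEVE = "<RET>", "<NORET>"
RELEVANT, NOT_RELEVANT = "<REL>", "<NOREL>"


@dataclass
class ReflectiveVQA:
    """Illustrative wrapper: each callable stands in for a model or retriever call."""
    decide_retrieval: Callable[[str, str], str]       # (image, question) -> RETRIEVE | NO_RETRIEVE
    retrieve: Callable[[str, str], List[str]]         # (image, question) -> candidate passages
    judge_passage: Callable[[str, str, str], str]     # (image, question, passage) -> RELEVANT | NOT_RELEVANT
    generate: Callable[[str, str, List[str]], str]    # (image, question, passages) -> answer

    def answer(self, image: str, question: str) -> str:
        # Step 1: decide whether external knowledge is needed at all.
        if self.decide_retrieval(image, question) == NO_RETRIEVE:
            return self.generate(image, question, [])
        # Step 2: query the external database, then keep only passages judged relevant.
        passages = self.retrieve(image, question)
        kept = [p for p in passages
                if self.judge_passage(image, question, p) == RELEVANT]
        # Step 3: answer conditioned on the filtered passages.
        return self.generate(image, question, kept)


def build_toy_pipeline() -> ReflectiveVQA:
    """Wire up trivial rule-based stand-ins so the control flow can be run."""
    corpus = ["The building was designed by architect X.",
              "Unrelated text about cooking."]
    return ReflectiveVQA(
        decide_retrieval=lambda img, q: RETRIEVE if q.startswith("Who") else NO_RETRIEVE,
        retrieve=lambda img, q: list(corpus),
        judge_passage=lambda img, q, p: (
            RELEVANT if "building" in p.lower() and "building" in q.lower()
            else NOT_RELEVANT),
        generate=lambda img, q, ps: f"answer using {len(ps)} passages",
    )


if __name__ == "__main__":
    vqa = build_toy_pipeline()
    print(vqa.answer("img", "What color is the car?"))
    print(vqa.answer("img", "Who designed this building?"))
```

In this sketch a question that needs no outside knowledge skips retrieval entirely, which mirrors the stated goal of preserving performance on tasks where external knowledge is not needed. In the actual model these decisions are made by tokens the MLLM itself generates rather than by separate rule-based functions.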

The efficacy of ReflectiVA for knowledge-based visual question answering is demonstrated, highlighting its superior performance compared to existing methods.

In this model space, you will find the Overall Model (stage two) weights of ReflectiVA.

For more information, visit our ReflectiVA repository, our project page and the dataset.

📚 Documentation

Model Information

Property	Details
Library Name	transformers
Pipeline Tag	image-text-to-text
License	apache-2.0

📄 License

This project is licensed under the Apache-2.0 license.

📚 Documentation

Citation

If you make use of our work, please cite our paper:

@inproceedings{cocchi2024augmenting,
  title={{Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering}},
  author={Cocchi, Federico and Moratelli, Nicholas and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

Paper page

Paper can be found at https://huggingface.co/papers/2411.16863.
