RexSeek-3B Open-Source Model - A Practical Tool for Converting Image-Text Inputs into Text Outputs

RexSeek-3B

Developed by IDEA-Research

This is an image-to-text conversion model capable of processing both image and text inputs to generate corresponding text outputs.

Image-to-Text

Transformers

Open Source License: Other #Image-to-Text Generation #Multimodal Conversion #Visual Language Understanding

Downloads 186

Release Time: 3/10/2025

Model Overview

This model is primarily designed for tasks combining images and text, capable of understanding image content and generating relevant textual descriptions or responses.

Model Features

Multimodal Processing

Capable of simultaneously processing image and text inputs to achieve cross-modal understanding and generation.

Text Generation

Generates relevant textual descriptions or answers based on image content.

Model Capabilities

Image Understanding

Text Generation

Multimodal Task Processing

Use Cases

Content Generation

Image Captioning

Generates detailed textual descriptions for images

Produces text descriptions that accurately reflect image content

Visual Question Answering

Answers natural language questions about image content

Provides accurate answers related to the image

Assistive Tools

Accessibility Applications

Provides image content descriptions for visually impaired individuals

Enhances information accessibility for visually impaired users

🚀 RexSeek

RexSeek is a Multimodal Large Language Model (MLLM) designed to detect people or objects in images based on natural language descriptions. It excels at multi-instance referring tasks, where a single description can match several instances, which gives it an advantage over traditional single-instance detection models.

🚀 Quick Start

First, install the environment and download the pre-trained model. You can then combine RexSeek with tools such as GroundingDINO, spaCy, and SAM to perform various tasks, or use the Gradio demo to interact with the model more intuitively.

✨ Features

Multi-Instance Detection: Can identify multiple matching instances in a single image.
Robust Perception: Powered by state-of-the-art person detection models.
Strong Language Understanding: Leverages advanced LLM capabilities for complex description comprehension.

📦 Installation

Install the basic environment

conda create -n rexseek python=3.9 -y
conda activate rexseek
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
pip install -v -e .
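
After running these commands, it can help to confirm that the pinned PyTorch builds are the ones present in the active environment. The following is a minimal sketch and not part of the RexSeek repository: `check_versions` and `EXPECTED` are hypothetical names, and the pinned versions are taken from the install command above. It relies only on the standard library, so it also runs when a package is missing.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple

# Versions pinned by the pip command in this section.
EXPECTED = {"torch": "2.1.2", "torchvision": "0.16.2"}


def check_versions(expected: Dict[str, str]) -> Dict[str, Tuple[Optional[str], bool]]:
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu121" are ignored in the comparison.
        ok = installed is not None and installed.split("+")[0] == wanted
        report[package] = (installed, ok)
    return report


if __name__ == "__main__":
    for package, (installed, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{package}: installed={installed} expected={EXPECTED[package]} [{status}]")
```

A mismatch does not necessarily mean the model will fail to load, only that the environment differs from the one described here.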

Download Pre-trained Models

We provide a model checkpoint for RexSeek-3B. You can download it from the following link:

[RexSeek-3B Checkpoint](https://huggingface.co/IDEA-Research/RexSeek-3B)

Alternatively, use the following commands to download the checkpoint:

# Download the RexSeek checkpoint from Hugging Face
git lfs install
git clone https://huggingface.co/IDEA-Research/RexSeek-3B IDEA-Research/RexSeek-3B

Verify Installation

To verify the installation, run the following command:

python tests/test_local_load.py

If the installation is successful, a visualization image will be saved in the tests/images folder.

💻 Usage Examples

Model Architecture

TL;DR: RexSeek relies on a separate detector to propose object boxes first; the LLM then selects the boxes that match the referring expression.

RexSeek consists of three key components:

Vision Encoders: Dual-resolution feature extraction (CLIP + ConvNeXt).
Person Detector: DINO-X for generating high-quality object proposals.
Language Model: Qwen2.5 for understanding complex referring expressions.

Inputs:
- Image: The source image containing people/objects.
- Text: Natural language description of target objects.
- Boxes: Object proposals from the DINO-X detector (can be replaced with custom boxes).
Outputs:
- Object indices corresponding to the referring expression in format:
```
<ground>referring text</ground><objects><obj1><obj2>...</objects>
```
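
The output string above can be turned into Python data with a small regular-expression parser. This is a minimal sketch rather than the repository's own decoding code: `parse_rexseek_output` is a hypothetical helper, and it assumes that each `<objN>` tag carries the integer index N of one of the input boxes. Whether indexing starts at 0 or 1 is not stated here, so the indices are returned unchanged.

```python
import re
from typing import Dict, List

# One <ground>...</ground> phrase followed by its <objects>...</objects> group.
GROUND_RE = re.compile(r"<ground>(.*?)</ground>\s*<objects>(.*?)</objects>", re.DOTALL)
# Individual object tags such as <obj3>.
OBJ_RE = re.compile(r"<obj(\d+)>")


def parse_rexseek_output(text: str) -> Dict[str, List[int]]:
    """Map each referring phrase to the list of object indices that follow it."""
    result: Dict[str, List[int]] = {}
    for phrase, objects in GROUND_RE.findall(text):
        indices = [int(i) for i in OBJ_RE.findall(objects)]
        result.setdefault(phrase.strip(), []).extend(indices)
    return result


if __name__ == "__main__":
    raw = "<ground>person that is giving a proposal</ground><objects><obj1><obj3></objects>"
    print(parse_rexseek_output(raw))
```

The returned indices can then be used to pick the matching entries from the list of proposal boxes that was passed to the model.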

Combine RexSeek with GroundingDINO

Install GroundingDINO

cd demos/
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
pip install -v -e .
mkdir weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth -P weights
# Return to the repository root (two levels up from demos/GroundingDINO)
cd ../..

Run the Demo

python demos/rexseek_grounding_dino.py \
    --image demos/demo_images/demo1.jpg \
    --output demos/demo_images/demo1_result.jpg \
    --referring "person that is giving a proposal" \
    --objects "person" \
    --text-threshold 0.25 \
    --box-threshold 0.25

Combine RexSeek with GroundingDINO and spaCy

Install Dependencies

pip install spacy
python -m spacy download en_core_web_sm

Run the Demo

python demos/rexseek_grounding_dino_spacy.py \
    --image demos/demo_images/demo1.jpg \
    --output demos/demo_images/demo1_result.jpg \
    --referring "person that is giving a proposal" \
    --text-threshold 0.25 \
    --box-threshold 0.25

Combine RexSeek with GroundingDINO, spaCy and SAM

Install Dependencies

cd demos/
git clone https://github.com/IDEA-Research/SAM.git
cd SAM
pip install -v -e .
mkdir weights
wget -q https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth -P weights
# Return to the repository root (two levels up from the cloned directory)
cd ../..

Run the Demo

python demos/rexseek_grounding_dino_spacy_sam.py \
    --image demos/demo_images/demo1.jpg \
    --output demos/demo_images/demo1_result.jpg \
    --referring "person that is giving a proposal" \
    --text-threshold 0.25 \
    --box-threshold 0.25

Gradio Demo for RexSeek + GroundingDINO + SAM

We provide a Gradio demo for RexSeek + GroundingDINO + SAM. You can start it with the following command:

python demos/gradio_demo.py \
    --rexseek-path "IDEA-Research/RexSeek-3B" \
    --gdino-config "demos/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py" \
    --gdino-weights "demos/GroundingDINO/weights/groundingdino_swint_ogc.pth" \
    --sam-weights "demos/segment-anything/weights/sam_vit_h_4b8939.pth"

📚 Documentation

HumanRef Benchmark

HumanRef is a large-scale human-centric referring expression dataset designed for multi-instance human referring in natural scenes. Key features of HumanRef include:

Multi-Instance Referring: A single referring expression can correspond to multiple individuals, better reflecting real-world scenarios.
Diverse Referring Types: Covers 6 major types of referring expressions:
- Attribute-based (e.g., gender, age, clothing).
- Position-based (relative positions between humans or with the environment).
- Interaction-based (human-human or human-environment interactions).
- Reasoning-based (complex logical combinations).
- Celebrity Recognition.
- Rejection Cases (non-existent references).
High-Quality Data:
- 34,806 high-resolution images (>1000×1000 pixels).
- 103,028 referring expressions in the training set.
- 6,000 carefully curated expressions in the benchmark set.
- Average of 8.6 persons per image.
- Average of 2.2 target boxes per referring expression.

Download

You can download the HumanRef Benchmark at [https://huggingface.co/datasets/IDEA-Research/HumanRef](https://huggingface.co/datasets/IDEA-Research/HumanRef).

Visualization

The HumanRef Benchmark contains 6 domains, and each domain may have multiple sub-domains.

Domain	Subdomain	Num Referrings
attribute	1000_attribute_retranslated_with_mask	1000
position	500_inner_position_data_with_mask	500
position	500_outer_position_data_with_mask	500
celebrity	1000_celebrity_data_with_mask	1000
interaction	500_inner_interaction_data_with_mask	500
interaction	500_outer_interaction_data_with_mask	500
reasoning	229_outer_position_two_stage_with_mask	229
reasoning	271_positive_then_negative_reasoning_with_mask	271
reasoning	500_inner_position_two_stage_with_mask	500
rejection	1000_rejection_referring_with_mask	1000
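
The subdomain counts in the table can be tallied per domain to confirm that each of the six domains contributes 1,000 referring expressions, which adds up to the 6,000 benchmark expressions mentioned earlier. The rows below are copied from the table; `referrings_per_domain` is only an illustrative helper name.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (domain, subdomain, number of referring expressions), copied from the table.
SUBDOMAINS = [
    ("attribute", "1000_attribute_retranslated_with_mask", 1000),
    ("position", "500_inner_position_data_with_mask", 500),
    ("position", "500_outer_position_data_with_mask", 500),
    ("celebrity", "1000_celebrity_data_with_mask", 1000),
    ("interaction", "500_inner_interaction_data_with_mask", 500),
    ("interaction", "500_outer_interaction_data_with_mask", 500),
    ("reasoning", "229_outer_position_two_stage_with_mask", 229),
    ("reasoning", "271_positive_then_negative_reasoning_with_mask", 271),
    ("reasoning", "500_inner_position_two_stage_with_mask", 500),
    ("rejection", "1000_rejection_referring_with_mask", 1000),
]


def referrings_per_domain(rows: Iterable[Tuple[str, str, int]]) -> Dict[str, int]:
    """Sum the referring-expression counts of all subdomains in each domain."""
    totals: Counter = Counter()
    for domain, _subdomain, count in rows:
        totals[domain] += count
    return dict(totals)


if __name__ == "__main__":
    totals = referrings_per_domain(SUBDOMAINS)
    for domain, count in totals.items():
        print(f"{domain}: {count}")
    print("total:", sum(totals.values()))
```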

To visualize the dataset, you can run the following command:

# Valid --domain_anme values: attribute, position, interaction, reasoning, celebrity, rejection
# Valid --sub_domain_anme values: 1000_attribute_retranslated_with_mask, 500_inner_position_data_with_mask,
#   500_outer_position_data_with_mask, 1000_celebrity_data_with_mask, 500_inner_interaction_data_with_mask,
#   500_outer_interaction_data_with_mask, 229_outer_position_two_stage_with_mask,
#   271_positive_then_negative_reasoning_with_mask, 500_inner_position_two_stage_with_mask,
#   1000_rejection_referring_with_mask
# --vis_mask accepts True or False
# The flag names --domain_anme and --sub_domain_anme are reproduced as documented for the script.
python rexseek/tools/visualize_humanref.py \
    --anno_path "IDEA-Research/HumanRef/annotations.jsonl" \
    --image_root_dir "IDEA-Research/HumanRef/images" \
    --domain_anme "attribute" \
    --sub_domain_anme "1000_attribute_retranslated_with_mask" \
    --vis_path "IDEA-Research/HumanRef/visualize" \
    --num_images 50 \
    --vis_mask True

Evaluation

Metrics

We evaluate the referring task using three main metrics: Precision, Recall, and DensityF1 Score.

Precision & Recall: For each referring expression, a predicted bounding box is considered correct if its IoU with any ground truth box exceeds a threshold. We report average performance across IoU thresholds from 0.5 to 0.95 in steps of 0.05.
Point-based Evaluation: For models that only output points (e.g., Molmo), a prediction is considered correct if the predicted point falls within the mask of the corresponding instance.
Rejection Accuracy: For the rejection subset, we calculate:
```
Rejection Accuracy = Number of correctly rejected expressions / Total number of expressions
```
where a correct rejection means the model predicts no boxes for a non-existent reference.
DensityF1 Score:

DensityF1 = (1/N) * Σ [2 * (Precision_i * Recall_i)/(Precision_i + Recall_i) * D_i]

where D_i is the density penalty factor:

D_i = min(1.0, GT_Count_i / Predicted_Count_i)

where:

N is the number of referring expressions.
GT_Count_i is the total number of persons in image i.
Predicted_Count_i is the number of predicted boxes for referring expression i.
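
The metric definitions above can be illustrated with a short, self-contained sketch. It is not the repository's evaluation script (`rexseek/metric/recall_precision_densityf1.py` is the reference), and all function names here are hypothetical. Several details are assumptions: recall is computed symmetrically to the stated precision rule (a ground-truth box counts as found if any prediction exceeds the IoU threshold), the density factor is taken as 1.0 when there are no predictions, and masks are represented as nested lists indexed as `mask[y][x]`.

```python
from typing import Iterable, List, Sequence, Tuple

Box = Sequence[float]  # [x1, y1, x2, y2]
IOU_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]  # 0.50 ... 0.95


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(pred: List[Box], gt: List[Box], thr: float) -> Tuple[float, float]:
    """Precision follows the text; the recall rule is an assumption (see lead-in)."""
    hit_pred = sum(any(iou(p, g) > thr for g in gt) for p in pred)
    hit_gt = sum(any(iou(p, g) > thr for p in pred) for g in gt)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gt / len(gt) if gt else 0.0
    return precision, recall


def averaged_precision_recall(pred: List[Box], gt: List[Box]) -> Tuple[float, float]:
    """Average precision and recall over IoU thresholds 0.5 to 0.95 (step 0.05)."""
    pairs = [precision_recall(pred, gt, t) for t in IOU_THRESHOLDS]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n


def point_in_mask(point: Sequence[float], mask: Sequence[Sequence[int]]) -> bool:
    """Point-based check: True if (x, y) lands on a non-zero mask pixel."""
    x, y = int(point[0]), int(point[1])
    return 0 <= y < len(mask) and 0 <= x < len(mask[y]) and bool(mask[y][x])


def density_f1(samples: Iterable[Tuple[float, float, int, int]]) -> float:
    """samples: (precision_i, recall_i, gt_count_i, predicted_count_i) per expression."""
    total, n = 0.0, 0
    for p, r, gt_count, pred_count in samples:
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        d = min(1.0, gt_count / pred_count) if pred_count > 0 else 1.0
        total += f1 * d
        n += 1
    return total / n if n else 0.0
```

For example, an expression with precision 0.5 and recall 1.0 that predicts 8 boxes in an image containing 4 persons has an F1 of about 0.667 and a density factor of 0.5, so it contributes about 0.333 to the average. The penalty only applies when a model outputs more boxes than there are persons in the image.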

Evaluation Script

Prediction Format

Before running the evaluation, you need to prepare your model's predictions in the correct format. Each prediction should be a JSON line in a JSONL file with the following structure:

{
  "id": "image_id",
  "extracted_predictions": [[x1, y1, x2, y2], [x1, y1, x2, y2], ...]
}

Where:

id: The image identifier matching the ground truth data.
extracted_predictions: A list of bounding boxes in [x1, y1, x2, y2] format or points in [x, y] format.

For rejection cases, you should either:

Include an empty list: "extracted_predictions": [].
Include a list with an empty box: "extracted_predictions": [[]].
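
A prediction file in this format can be produced with the standard `json` module. The sketch below is illustrative only: `write_predictions` and `is_rejection` are hypothetical helper names, and the ids used in the example are placeholders. `is_rejection` treats both accepted encodings, `[]` and `[[]]`, as "no prediction".

```python
import json
from typing import Dict, Sequence


def write_predictions(path: str, predictions: Dict[str, Sequence[Sequence[float]]]) -> None:
    """Write one JSON object per line: {"id": ..., "extracted_predictions": [...]}."""
    with open(path, "w", encoding="utf-8") as f:
        for sample_id, items in predictions.items():
            record = {
                "id": sample_id,
                # Each item is a box [x1, y1, x2, y2] or a point [x, y].
                "extracted_predictions": [list(item) for item in items],
            }
            f.write(json.dumps(record) + "\n")


def is_rejection(record: dict) -> bool:
    """True for an empty list or a list that contains only empty boxes."""
    return all(len(item) == 0 for item in record.get("extracted_predictions", []))


if __name__ == "__main__":
    write_predictions(
        "predictions.jsonl",
        {
            "example_id_1": [[10, 20, 110, 220], [300, 40, 380, 260]],  # two boxes
            "example_id_2": [],  # rejection case: nothing predicted
        },
    )
```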

Running the Evaluation

You can run the evaluation script using the following command:

python rexseek/metric/recall_precision_densityf1.py \
  --gt_path IDEA-Research/HumanRef/annotations.jsonl \
  --pred_path path/to/your/predictions.jsonl \
  --pred_names "Your Model Name" \
  --dump_path IDEA-Research/HumanRef/evaluation_results/your_model_results

Parameters:

--gt_path: Path to the ground truth annotations file.
--pred_path: Path to your prediction file(s). You can provide multiple paths to compare different models.
--pred_names: Names for your models (for display in the results).
--dump_path: Directory to save the evaluation results in markdown and JSON formats.

Evaluating Multiple Models:

python rexseek/metric/recall_precision_densityf1.py \
  --gt_path IDEA-Research/HumanRef/annotations.jsonl \
  --pred_path model1_results.jsonl model2_results.jsonl model3_results.jsonl \
  --pred_names "Model 1" "Model 2" "Model 3" \
  --dump_path IDEA-Research/HumanRef/evaluation_results/comparison

Programmatic Usage

from rexseek.metric.recall_precision_densityf1 import recall_precision_densityf1

recall_precision_densityf1(
    gt_path="IDEA-Research/HumanRef/annotations.jsonl",
    pred_path=["path/to/your/predictions.jsonl"],
    dump_path="IDEA-Research/HumanRef/evaluation_results/your_model_results"
)

Evaluate RexSeek

First, run the following command to generate the predictions:

python rexseek/evaluation/evaluate_rexseek.py \
    --model_path IDEA-Research/RexSeek-3B \
    --image_folder IDEA-Research/HumanRef/images \
    --question_file IDEA-Research/HumanRef/annotations.jsonl \
    --answers_file IDEA-Research/HumanRef/evaluation_results/eval_rexseek/RexSeek-3B_results.jsonl

Then run the following command to evaluate the RexSeek model:

python rexseek/metric/recall_precision_densityf1.py \
  --gt_path IDEA-Research/HumanRef/annotations.jsonl \
  --pred_path IDEA-Research/HumanRef/evaluation_results/eval_rexseek/RexSeek-3B_results.jsonl \
  --pred_names "RexSeek-3B" \
  --dump_path IDEA-Research/HumanRef/evaluation_results/comparison

📄 License

RexSeek is licensed under the IDEA License 1.0, Copyright (c) IDEA. All Rights Reserved. Note that this project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the following:

[OpenAI Terms of Use](https://openai.com/policies/terms-of-use) for the dataset.
For the LLM used in this project, the model is [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), which is licensed under the [Qwen RESEARCH LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct/blob/main/LICENSE).
For the high-resolution vision encoder, we use [laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg](https://huggingface.co/laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg), which is licensed under the MIT License.
For the low-resolution vision encoder, we use [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14), which is licensed under the MIT License.

BibTeX 📚

@misc{jiang2025referringperson,
      title={Referring to Any Person}, 
      author={Qing Jiang and Lin Wu and Zhaoyang Zeng and Tianhe Ren and Yuda Xiong and Yihao Chen and Qin Liu and Lei Zhang},
      year={2025},
      eprint={2503.08507},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.08507}, 
}
