RexSeek-3B Open-Source Model: A Practical Tool for Image-Text-to-Text Generation

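RexSeek 3B

Developed by IDEA-Research

An image-text-to-text model: it takes an image plus text as input and generates text as output.

Image-Text-to-Text

Transformers

License: Other #image-text-generation #multimodal-conversion #vision-language-understanding

Downloads: 186

Released: 3/10/2025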

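Model Overview

The model targets tasks that combine images and text: it interprets image content and generates related descriptions or answers.

Model Features

Multimodal processing

Accepts image and text inputs together for cross-modal understanding and generation.

Text generation

Generates descriptions or answers grounded in the image content.

Model Capabilities

Image understanding

Text generation

Multimodal task processing

Use Cases

Content generation

Image captioning

Generates a detailed text description of an image

Produces text that accurately reflects the image content

Visual question answering

Answers natural-language questions about image content

Provides accurate, image-grounded answers

Assistive tools

Accessibility applications

Describes image content for visually impaired users

Improves access to information for visually impaired users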

🚀 RexSeek

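RexSeek is a multimodal large language model (MLLM) designed to detect people or objects in images based on natural language descriptions. Unlike traditional referring models that focus on single-instance detection, RexSeek excels at multi-instance referring tasks: identifying multiple people or objects that match a given description.

🚀 Quick Start

The basic steps for using the model are:

Install the required dependencies and download the pre-trained model.
Run the example code for your scenario, combining RexSeek with other tools (such as GroundingDINO, spaCy, or SAM).
Optionally evaluate the model on the HumanRef benchmark.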

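✨ Key Features

Multi-instance detection: identifies multiple matching instances in a single image.
Strong perception: backed by a state-of-the-art person detection model.
Strong language understanding: uses the capabilities of a large language model (LLM) to understand complex descriptions.

📦 Installation

Environment Setup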

conda create -n rexseek python=3.9 -y
conda activate rexseek
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
pip install -v -e .

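Download the Pre-trained Model

We provide a model checkpoint for RexSeek-3B. You can download it from the following link:

[RexSeek-3B Checkpoint](https://huggingface.co/IDEA-Research/RexSeek-3B)

Alternatively, download it with the following commands:

# Download the RexSeek checkpoint from Hugging Face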
git lfs install
git clone https://huggingface.co/IDEA-Research/RexSeek-3B IDEA-Research/RexSeek-3B

Verify the Installation

To verify that the installation succeeded, run:

python tests/test_local_load.py

If it succeeded, a visualization image appears in the tests/images folder.

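💻 Usage Examples

Basic Usage

RexSeek first needs a detection model to propose object boxes; the large language model then picks the boxes that match the description. The examples below combine RexSeek with different tools:

With GroundingDINO

Install GroundingDINO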

cd demos/
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
pip install -v -e .
mkdir weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth -P weights
cd ../..

Run the Example

python demos/rexseek_grounding_dino.py \
    --image demos/demo_images/demo1.jpg \
    --output demos/demo_images/demo1_result.jpg \
    --referring "person that is giving a proposal" \
    --objects "person" \
    --text-threshold 0.25 \
    --box-threshold 0.25

With GroundingDINO and spaCy

Install Dependencies

pip install spacy
python -m spacy download en_core_web_sm

Run the Example

python demos/rexseek_grounding_dino_spacy.py \
    --image demos/demo_images/demo1.jpg \
    --output demos/demo_images/demo1_result.jpg \
    --referring "person that is giving a proposal" \
    --text-threshold 0.25 \
    --box-threshold 0.25

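In this enhanced version:

You do not need to specify the --objects argument.
spaCy automatically extracts nouns from the question (e.g. "people", "shirts", "dogs", "park").
GroundingDINO uses the extracted nouns as detection targets.
The question alone drives detection, which makes interaction more flexible and natural.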
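The noun-extraction step can be sketched roughly as follows. This is a minimal illustration, not the code in `demos/rexseek_grounding_dino_spacy.py`: the helper names `extract_nouns` and `build_detection_prompt` are hypothetical, and joining categories with ` . ` is assumed from the prompt convention GroundingDINO describes for multiple categories. The stand-in tokens mimic the `pos_` and `lemma_` attributes of spaCy tokens so the sketch runs without the language model; in real use you would pass `spacy.load("en_core_web_sm")(question)` instead.

```python
from types import SimpleNamespace


def extract_nouns(doc):
    """Return unique, lower-cased noun lemmas in order of first appearance.

    `doc` is any iterable of tokens exposing spaCy's `pos_` and `lemma_`.
    """
    nouns = []
    for token in doc:
        if token.pos_ in {"NOUN", "PROPN"}:
            lemma = token.lemma_.lower()
            if lemma not in nouns:
                nouns.append(lemma)
    return nouns


def build_detection_prompt(nouns):
    """Join categories into a 'person . dog .' style detection prompt (assumed format)."""
    return " . ".join(nouns) + " ." if nouns else ""


# Stand-in tokens for "people walking dogs in the park" (hypothetical tags).
fake_doc = [
    SimpleNamespace(pos_="NOUN", lemma_="people"),
    SimpleNamespace(pos_="VERB", lemma_="walk"),
    SimpleNamespace(pos_="NOUN", lemma_="dog"),
    SimpleNamespace(pos_="ADP", lemma_="in"),
    SimpleNamespace(pos_="DET", lemma_="the"),
    SimpleNamespace(pos_="NOUN", lemma_="park"),
]
print(build_detection_prompt(extract_nouns(fake_doc)))  # people . dog . park .
```

Deduplicating the lemmas keeps the detection prompt short when a question mentions the same category more than once.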

With GroundingDINO, spaCy, and SAM

Install Dependencies

cd demos/
git clone https://github.com/facebookresearch/segment-anything.git
cd segment-anything
pip install -v -e .
mkdir weights
wget -q https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth -P weights
cd ../..

Run the Example

python demos/rexseek_grounding_dino_spacy_sam.py \
    --image demos/demo_images/demo1.jpg \
    --output demos/demo_images/demo1_result.jpg \
    --referring "person that is giving a proposal" \
    --text-threshold 0.25 \
    --box-threshold 0.25

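📚 Detailed Documentation

Model Architecture

In short: RexSeek needs a detection model to propose object boxes first, then uses the large language model to pick the referred objects.

RexSeek consists of three key components:

Vision encoders: dual-resolution feature extraction (CLIP + ConvNeXt).
Person detector: DINO-X, which generates high-quality object proposals.
Language model: Qwen2.5, which interprets complex referring expressions.

Inputs:
- Image: the source image containing people/objects.
- Text: a natural language description of the target objects.
- Boxes: object proposals from the DINO-X detector (can be replaced with custom boxes).

Output:

The indices of the objects that correspond to the referring expression, in the following format: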

<ground>referring text</ground><objects><obj1><obj2>...</objects>
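To make the output format concrete, here is a minimal sketch of turning such a string back into box selections. The function names are hypothetical and not part of the RexSeek codebase, and whether `<obj0>` or `<obj1>` denotes the first candidate box is an assumption exposed through the `index_base` parameter.

```python
import re

PAIR_PATTERN = re.compile(
    r"<ground>(.*?)</ground>\s*<objects>(.*?)</objects>", re.DOTALL
)


def parse_rexseek_output(text, index_base=0):
    """Parse '<ground>...</ground><objects><objN>...</objects>' pairs.

    Returns a list of (referring_text, [box_indices]) tuples.
    index_base is an assumption: use 1 if <obj1> denotes the first box.
    """
    results = []
    for phrase, objects in PAIR_PATTERN.findall(text):
        indices = [int(n) - index_base for n in re.findall(r"<obj(\d+)>", objects)]
        results.append((phrase.strip(), indices))
    return results


def select_boxes(candidate_boxes, indices):
    """Keep only the candidate boxes the model referred to, skipping bad indices."""
    return [candidate_boxes[i] for i in indices if 0 <= i < len(candidate_boxes)]


# Hypothetical detector proposals in [x1, y1, x2, y2] format.
candidates = [[10, 20, 110, 220], [150, 30, 260, 240], [300, 40, 400, 250]]
answer = "<ground>person in red</ground><objects><obj0><obj2></objects>"
for phrase, indices in parse_rexseek_output(answer):
    print(phrase, select_boxes(candidates, indices))
```

An empty `<objects></objects>` element parses to an empty index list, which is how a rejection would appear under this sketch.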

Gradio Demo

We provide a Gradio demo for RexSeek + GroundingDINO + SAM. Launch it with the following command:

python demos/gradio_demo.py \
    --rexseek-path "IDEA-Research/RexSeek-3B" \
    --gdino-config "demos/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py" \
    --gdino-weights "demos/GroundingDINO/weights/groundingdino_swint_ogc.pth" \
    --sam-weights "demos/segment-anything/weights/sam_vit_h_4b8939.pth"

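HumanRef Benchmark

HumanRef is a large-scale, human-centric referring expression dataset designed for multi-instance human referring in natural scenes. Unlike traditional referring datasets that focus on one-to-one object referring, HumanRef supports referring to multiple people at once through natural language descriptions.

Key features of HumanRef include:

Multi-instance referring: a single referring expression can correspond to multiple people, which better reflects real-world scenarios.
Diverse referring types: covers 6 major types of referring expressions:
- Attribute-based (e.g. gender, age, clothing).
- Position-based (relative position between people or with respect to the environment).
- Interaction-based (human-human or human-environment interactions).
- Reasoning-based (complex logical combinations).
- Celebrity recognition.
- Rejection cases (references to people who are not present).
High-quality data:
- 34,806 high-resolution images (>1000×1000 pixels).
- 103,028 referring expressions in the training set.
- 6,000 carefully curated expressions in the benchmark set.
- An average of 8.6 people per image.
- An average of 2.2 target boxes per referring expression.

The dataset aims to advance research in human-centric visual understanding and referring expression comprehension in complex, multi-person scenes.

Download

You can download the HumanRef benchmark dataset from [https://huggingface.co/datasets/IDEA-Research/HumanRef](https://huggingface.co/datasets/IDEA-Research/HumanRef).

Visualization

The HumanRef benchmark contains 6 domains, and each domain may have multiple sub-domains.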

| Domain | Sub-domain | Number of referrings |
| --- | --- | --- |
| attribute | 1000_attribute_retranslated_with_mask | 1000 |
| position | 500_inner_position_data_with_mask | 500 |
| position | 500_outer_position_data_with_mask | 500 |
| celebrity | 1000_celebrity_data_with_mask | 1000 |
| interaction | 500_inner_interaction_data_with_mask | 500 |
| interaction | 500_outer_interaction_data_with_mask | 500 |
| reasoning | 229_outer_position_two_stage_with_mask | 229 |
| reasoning | 271_positive_then_negative_reasoning_with_mask | 271 |
| reasoning | 500_inner_position_two_stage_with_mask | 500 |
| rejection | 1000_rejection_referring_with_mask | 1000 |
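As a quick sanity check on the table, the short sketch below sums the sub-domain counts per domain. The rows are copied from the table; the helper name `totals_by_domain` is only illustrative.

```python
from collections import defaultdict

# (domain, sub-domain, referring count) rows copied from the table above.
SUBDOMAINS = [
    ("attribute", "1000_attribute_retranslated_with_mask", 1000),
    ("position", "500_inner_position_data_with_mask", 500),
    ("position", "500_outer_position_data_with_mask", 500),
    ("celebrity", "1000_celebrity_data_with_mask", 1000),
    ("interaction", "500_inner_interaction_data_with_mask", 500),
    ("interaction", "500_outer_interaction_data_with_mask", 500),
    ("reasoning", "229_outer_position_two_stage_with_mask", 229),
    ("reasoning", "271_positive_then_negative_reasoning_with_mask", 271),
    ("reasoning", "500_inner_position_two_stage_with_mask", 500),
    ("rejection", "1000_rejection_referring_with_mask", 1000),
]


def totals_by_domain(rows):
    """Sum the referring counts of all sub-domains belonging to each domain."""
    totals = defaultdict(int)
    for domain, _sub_domain, count in rows:
        totals[domain] += count
    return dict(totals)


domain_totals = totals_by_domain(SUBDOMAINS)
print(domain_totals)
print(sum(domain_totals.values()))  # 6000
```

Each of the six domains contributes 1,000 expressions, which matches the 6,000 curated benchmark expressions stated earlier.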

To visualize the dataset, run the following command:

# --domain_anme options: attribute, position, interaction, reasoning, celebrity, rejection
# --sub_domain_anme options: 1000_attribute_retranslated_with_mask, 500_inner_position_data_with_mask,
#   500_outer_position_data_with_mask, 1000_celebrity_data_with_mask, 500_inner_interaction_data_with_mask,
#   500_outer_interaction_data_with_mask, 229_outer_position_two_stage_with_mask,
#   271_positive_then_negative_reasoning_with_mask, 500_inner_position_two_stage_with_mask,
#   1000_rejection_referring_with_mask
# --vis_mask options: True, False
python rexseek/tools/visualize_humanref.py \
    --anno_path "IDEA-Research/HumanRef/annotations.jsonl" \
    --image_root_dir "IDEA-Research/HumanRef/images" \
    --domain_anme "attribute" \
    --sub_domain_anme "1000_attribute_retranslated_with_mask" \
    --vis_path "IDEA-Research/HumanRef/visualize" \
    --num_images 50 \
    --vis_mask True

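Evaluation

Evaluation Metrics

We evaluate the referring task with three main metrics: precision, recall, and the DensityF1 score.

Basic metrics:
- Precision and recall: for each referring expression, a predicted bounding box counts as correct if its IoU with any ground-truth box exceeds a threshold. Following the COCO evaluation protocol, we report performance averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
- Point-based evaluation: for models that output only points (such as Molmo), a prediction counts as correct if the predicted point falls inside the mask of the corresponding instance. Note that this is more lenient than the IoU-based metrics.
- Rejection accuracy: for the rejection subset, we compute:
```
Rejection accuracy = number of correctly rejected expressions / total number of expressions
```
  where a correct rejection means the model predicts no boxes for a non-existent reference.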
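The basic metrics can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the official script in `rexseek/metric/recall_precision_densityf1.py`: precision is taken as the share of predicted boxes that hit any ground-truth box, recall as the share of ground-truth boxes hit by any prediction, and no one-to-one matching is enforced. All function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(pred_boxes, gt_boxes, threshold):
    """Precision and recall for one expression at one IoU threshold."""
    if not pred_boxes or not gt_boxes:
        return 0.0, 0.0  # assumption: empty cases are scored via rejection accuracy
    hit_preds = sum(any(iou(p, g) > threshold for g in gt_boxes) for p in pred_boxes)
    hit_gts = sum(any(iou(p, g) > threshold for p in pred_boxes) for g in gt_boxes)
    return hit_preds / len(pred_boxes), hit_gts / len(gt_boxes)


def averaged_precision_recall(pred_boxes, gt_boxes):
    """Average over the IoU thresholds 0.5, 0.55, ..., 0.95."""
    thresholds = [0.5 + 0.05 * k for k in range(10)]
    pairs = [precision_recall(pred_boxes, gt_boxes, t) for t in thresholds]
    precision = sum(p for p, _ in pairs) / len(pairs)
    recall = sum(r for _, r in pairs) / len(pairs)
    return precision, recall


def rejection_accuracy(predictions):
    """Share of rejection expressions with no predicted box ([] or [[]])."""
    correct = sum(1 for boxes in predictions if not any(boxes))
    return correct / len(predictions)


preds = [[0, 0, 10, 10], [50, 50, 60, 60]]   # one exact hit, one false positive
gts = [[0, 0, 10, 10]]
print(averaged_precision_recall(preds, gts))
print(rejection_accuracy([[], [[]], [[1, 2, 3, 4]], []]))
```

In the usage lines, the exact hit keeps recall at 1.0 for every threshold, while the extra box halves precision.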
DensityF1 score: to penalize over-detection (predicting too many boxes), we introduce the DensityF1 score:

DensityF1 = (1/N) * Σ [2 * (Precision_i * Recall_i)/(Precision_i + Recall_i) * D_i]

where D_i is the density penalty factor:

D_i = min(1.0, GT_Count_i / Predicted_Count_i)

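where:

N is the number of referring expressions.
GT_Count_i is the total number of people in image i.
Predicted_Count_i is the number of boxes predicted for referring expression i.

The penalty factor lowers the score when a model predicts noticeably more boxes than there are people in the image, which discourages over-detection strategies.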
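A small worked example may help here. The sketch below implements the two formulas directly; the dictionary keys and the handling of expressions with zero predictions are assumptions made for illustration, not taken from the official evaluation script.

```python
def density_f1(samples):
    """Mean of F1_i * D_i over referring expressions.

    Each sample is a dict with:
      precision, recall : per-expression values
      gt_count          : total number of people in the image
      predicted_count   : number of boxes predicted for the expression
    """
    total = 0.0
    for s in samples:
        p, r = s["precision"], s["recall"]
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        if s["predicted_count"] > 0:
            density = min(1.0, s["gt_count"] / s["predicted_count"])
        else:
            density = 1.0  # assumption: no predictions, so no over-detection penalty
        total += f1 * density
    return total / len(samples)


samples = [
    # 2 boxes predicted in an image with 4 people: D = 1.0, F1 = 1.0
    {"precision": 1.0, "recall": 1.0, "gt_count": 4, "predicted_count": 2},
    # 10 boxes predicted in an image with 5 people: D = 0.5, F1 = 2/3
    {"precision": 0.5, "recall": 1.0, "gt_count": 5, "predicted_count": 10},
]
print(round(density_f1(samples), 4))  # 0.6667
```

The second expression has full recall, yet predicting twice as many boxes as there are people halves its contribution from 2/3 to 1/3.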

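Evaluation Script

Prediction Format

Before running the evaluation, prepare your model's predictions in the correct format. Each prediction should be one JSON line in a JSONL file, with the following structure: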

{
  "id": "image_id",
  "extracted_predictions": [[x1, y1, x2, y2], [x1, y1, x2, y2], ...]
}

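where:

id: the image identifier, matching the ground-truth data.
extracted_predictions: a list of bounding boxes in [x1, y1, x2, y2] format, or a list of points in [x, y] format.

For rejection cases (where no person should be detected), do one of the following:

Include an empty list: "extracted_predictions": [].
Include a list containing an empty box: "extracted_predictions": [[]].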
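A prediction file in this format can be produced with the standard `json` module. The sketch below is only an illustration: the image identifiers and coordinates are made up, and the helper names are hypothetical.

```python
import json
import os
import tempfile


def write_predictions(path, predictions):
    """Write (image_id, boxes) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for image_id, boxes in predictions:
            record = {"id": image_id, "extracted_predictions": boxes}
            f.write(json.dumps(record) + "\n")


def read_predictions(path):
    """Load a JSONL prediction file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up identifiers and coordinates, for illustration only.
predictions = [
    ("image_0001", [[12, 30, 180, 400], [220, 45, 390, 410]]),  # two boxes
    ("image_0002", [[105.5, 210.0]]),                           # one point
    ("image_0003", []),                                         # rejection: nothing predicted
]
out_path = os.path.join(tempfile.mkdtemp(), "predictions.jsonl")
write_predictions(out_path, predictions)
print(read_predictions(out_path)[2])
```

Writing each record with a single `json.dumps` call guarantees one object per line, which is what a JSONL reader expects.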

Running the Evaluation

Run the evaluation script with the following command:

python rexseek/metric/recall_precision_densityf1.py \
  --gt_path IDEA-Research/HumanRef/annotations.jsonl \
  --pred_path path/to/your/predictions.jsonl \
  --pred_names "Your Model Name" \
  --dump_path IDEA-Research/HumanRef/evaluation_results/your_model_results

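Parameters:

--gt_path: path to the ground-truth annotation file.
--pred_path: path to your prediction file. You can pass multiple paths to compare different models.
--pred_names: names of your models (used when displaying results).
--dump_path: directory where the evaluation results are saved, in Markdown and JSON formats.

Evaluating multiple models: to compare several models, pass multiple prediction files: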

python rexseek/metric/recall_precision_densityf1.py \
  --gt_path IDEA-Research/HumanRef/annotations.jsonl \
  --pred_path model1_results.jsonl model2_results.jsonl model3_results.jsonl \
  --pred_names "Model 1" "Model 2" "Model 3" \
  --dump_path IDEA-Research/HumanRef/evaluation_results/comparison

Programmatic Usage

from rexseek.metric.recall_precision_densityf1 import recall_precision_densityf1

recall_precision_densityf1(
    gt_path="IDEA-Research/HumanRef/annotations.jsonl",
    pred_path=["path/to/your/predictions.jsonl"],
    dump_path="IDEA-Research/HumanRef/evaluation_results/your_model_results"
)

Evaluating RexSeek

First, generate the predictions with the following command:

python rexseek/evaluation/evaluate_rexseek.py \
    --model_path IDEA-Research/RexSeek-3B \
    --image_folder IDEA-Research/HumanRef/images \
    --question_file IDEA-Research/HumanRef/annotations.jsonl \
    --answers_file IDEA-Research/HumanRef/evaluation_results/eval_rexseek/RexSeek-3B_results.jsonl

Then evaluate the RexSeek model with the following command:

python rexseek/metric/recall_precision_densityf1.py \
  --gt_path IDEA-Research/HumanRef/annotations.jsonl \
  --pred_path IDEA-Research/HumanRef/evaluation_results/eval_rexseek/RexSeek-3B_results.jsonl \
  --pred_names "RexSeek-3B" \
  --dump_path IDEA-Research/HumanRef/evaluation_results/comparison

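🔧 Technical Details

The key implementation details of RexSeek are:

Dual-resolution vision encoders (CLIP + ConvNeXt) extract features, capturing image information more completely.
DINO-X serves as the person detector and generates high-quality object proposals for the subsequent referring step.
The Qwen2.5 large language model interprets complex referring expressions, enabling accurate identification of multiple instances.

📄 License

The dataset is subject to the [OpenAI Terms of Use](https://openai.com/policies/terms-of-use).
The large language model used in this project is [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), released under the [Qwen Research License Agreement](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct/blob/main/LICENSE).
The high-resolution vision encoder is [laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg](https://huggingface.co/laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg), released under the MIT License.
The low-resolution vision encoder is [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14), released under the MIT License.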

BibTeX 📚

@misc{jiang2025referringperson,
      title={Referring to Any Person}, 
      author={Qing Jiang and Lin Wu and Zhaoyang Zeng and Tianhe Ren and Yuda Xiong and Yihao Chen and Qin Liu and Lei Zhang},
      year={2025},
      eprint={2503.08507},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.08507}, 
}