MSI-Net开源视觉显著性模型 - 精准预测自然图像上人类注视点

首页

MSI Net

由 alexanderkroner 开发

MSI-Net是一种视觉显著性模型，通过基于眼动数据训练的上下文编码器-解码器网络预测人类在自然图像上的注视点。

图像分割开源协议:MIT #视觉显著性预测 #多尺度特征提取 #眼动追踪分析

下载量 1,380

发布时间 : 5/10/2024

模型简介

MSI-Net是一种基于卷积神经网络架构的视觉显著性预测模型，包含ASPP模块以捕获多尺度特征，并结合全局场景信息实现准确预测。

模型特点

多尺度特征提取

通过ASPP模块中的不同膨胀率卷积层并行捕获多尺度特征

全局场景信息整合

将生成的表示与全局场景信息相结合，提高预测准确性

轻量级设计

约2500万参数，适合计算资源有限的应用场景

模型能力

视觉显著性预测

人类注视点预测

图像分析

使用案例

人机交互

界面设计评估

预测用户可能关注的界面区域，优化设计布局

提高用户界面设计的有效性

广告效果分析

广告注意力预测

分析广告图像中最可能吸引注意力的区域

优化广告内容布局

🚀 MSI-Net

🤖 MSI-Net 是一个视觉显著性模型，它利用基于眼动数据训练的上下文编解码器网络，预测人类在自然图像上的注视点。该模型采用卷积神经网络架构，结合多尺度特征捕获和全局场景信息，实现精准预测，且参数规模适合计算资源有限的应用。

📖 用于视觉显著性预测的上下文编解码器网络

🤗 可以在 HuggingFace Spaces 上找到该模型的演示。

结果示例

🚀 快速开始

安装依赖

要安装所需的依赖项，可以使用 pip 或 conda：

pip install -r requirements.txt

conda env create -f requirements.yml

使用示例

基础用法

# 导入依赖
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from huggingface_hub import snapshot_download

# 下载仓库文件
hf_dir = snapshot_download(repo_id="alexanderkroner/MSI-Net")

# 加载显著性模型
model = tf.keras.models.load_model(hf_dir)

# 加载用于预处理输入和后处理输出的函数
def get_target_shape(original_shape):
    original_aspect_ratio = original_shape[0] / original_shape[1]

    square_mode = abs(original_aspect_ratio - 1.0)
    landscape_mode = abs(original_aspect_ratio - 240 / 320)
    portrait_mode = abs(original_aspect_ratio - 320 / 240)

    best_mode = min(square_mode, landscape_mode, portrait_mode)

    if best_mode == square_mode:
        target_shape = (320, 320)
    elif best_mode == landscape_mode:
        target_shape = (240, 320)
    else:
        target_shape = (320, 240)

    return target_shape


def preprocess_input(input_image, target_shape):
    input_tensor = tf.expand_dims(input_image, axis=0)

    input_tensor = tf.image.resize(
        input_tensor, target_shape, preserve_aspect_ratio=True
    )

    vertical_padding = target_shape[0] - input_tensor.shape[1]
    horizontal_padding = target_shape[1] - input_tensor.shape[2]

    vertical_padding_1 = vertical_padding // 2
    vertical_padding_2 = vertical_padding - vertical_padding_1

    horizontal_padding_1 = horizontal_padding // 2
    horizontal_padding_2 = horizontal_padding - horizontal_padding_1

    input_tensor = tf.pad(
        input_tensor,
        [
            [0, 0],
            [vertical_padding_1, vertical_padding_2],
            [horizontal_padding_1, horizontal_padding_2],
            [0, 0],
        ],
    )

    return (
        input_tensor,
        [vertical_padding_1, vertical_padding_2],
        [horizontal_padding_1, horizontal_padding_2],
    )


def postprocess_output(
    output_tensor, vertical_padding, horizontal_padding, original_shape
):
    output_tensor = output_tensor[
        :,
        vertical_padding[0] : output_tensor.shape[1] - vertical_padding[1],
        horizontal_padding[0] : output_tensor.shape[2] - horizontal_padding[1],
        :,
    ]

    output_tensor = tf.image.resize(output_tensor, original_shape)

    output_array = output_tensor.numpy().squeeze()
    output_array = plt.cm.inferno(output_array)[..., :3]

    return output_array

# 加载并预处理示例图像
input_image = tf.keras.utils.load_img(hf_dir + "/example.jpg")
input_image = np.array(input_image, dtype=np.float32)

original_shape = input_image.shape[:2]
target_shape = get_target_shape(original_shape)

input_tensor, vertical_padding, horizontal_padding = preprocess_input(
    input_image, target_shape
)

# 将输入张量输入到模型中
output_tensor = model(input_tensor)["output"]

# 后处理并可视化输出
saliency_map = postprocess_output(
    output_tensor, vertical_padding, horizontal_padding, original_shape
)

alpha = 0.65

blended_image = alpha * saliency_map + (1 - alpha) * input_image / 255

plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.imshow(input_image / 255)
plt.title("输入图像")
plt.axis("off")

plt.subplot(1, 2, 2)
plt.imshow(blended_image)
plt.title("显著性图")
plt.axis("off")

plt.tight_layout()
plt.show()

✨ 主要特性

多尺度特征捕获：采用带有不同膨胀率的多个卷积层的 ASPP 模块，并行捕获多尺度特征。
结合全局信息：将多尺度特征表示与全局场景信息相结合，实现更准确的视觉显著性预测。
参数规模适中：大约包含 2500 万个参数，适合计算资源有限的应用。

📚 详细文档

数据集

在对注视数据进行模型训练之前，编码器的权重是从在 ImageNet 分类任务上预训练的 VGG16 骨干网络初始化的。然后，该模型在 SALICON 数据集上进行训练，该数据集包含鼠标移动记录，作为注视测量的代理。最后，可以在人类眼动跟踪数据上对权重进行微调。因此，MSI-Net 也在以下其中一个数据集上进行了训练，不过这里我们仅提供 SALICON 基础模型：

数据集	图像数量	每张图像的观察者数量	观看时长	记录类型
SALICON	10000	16	5s	鼠标跟踪
MIT1003	1003	15	3s	眼动跟踪
CAT2000	4000	24	5s	眼动跟踪
DUT-OMRON	5168	5	2s	眼动跟踪
PASCAL-S	850	8	2s	眼动跟踪
OSIE	700	15	3s	眼动跟踪
FIWI	149	11	5s	眼动跟踪

本模型的评估结果可在原始的 MIT 显著性基准和更新后的 MIT/Tübingen 显著性基准上查看。后者的结果来自预测显著性图的概率表示，并进行了特定指标的后处理，以便进行公平的模型比较。

局限性

泛化性问题：MSI-Net 是在自由观看范式下收集的人类注视数据上训练的。因此，预测的显著性图可能无法推广到在实验中接受任务指令的观察者。此外，训练数据主要由自然图像组成。因此，对于特定类型的图像（如分形、图案）或对抗性示例的注视预测可能不太准确。
潜在偏差：基于显著性的裁剪算法，曾在 2018 年至 2021 年间应用于社交媒体平台 Twitter 上上传的图像，已显示出种族和性别方面的偏差。因此，谨慎使用显著性模型并认识到其应用中可能涉及的潜在风险非常重要。

引用

如果您发现此代码或模型有用，请引用以下论文：

@article{kroner2020contextual,
  title={Contextual encoder-decoder network for visual saliency prediction},
  author={Kroner, Alexander and Senden, Mario and Driessens, Kurt and Goebel, Rainer},
  url={http://www.sciencedirect.com/science/article/pii/S0893608020301660},
  doi={https://doi.org/10.1016/j.neunet.2020.05.004},
  journal={Neural Networks},
  publisher={Elsevier},
  year={2020},
  volume={129},
  pages={261--270},
  issn={0893-6080}
}