idefics2_raven_finetuned: an open-source multimodal model for solving Raven's Progressive Matrices

Idefics2 Raven Finetuned

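Developed by HuggingFaceM4

A multimodal model specialized in solving Raven's Progressive Matrices. It is built on a vision-language foundation model and reaches 91% accuracy on the validation set.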

Multimodal fusion

Transformers

Supports multiple languages | License: Apache-2.0 | #Raven's Progressive Matrices #multimodal reasoning #high accuracy (91%)

Downloads: 235

Released: 3/10/2024

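Model Overview

The model is built on SigLIP and Mistral-7B and was trained on a dedicated dataset to solve Raven's Progressive Matrices. It reaches 91% accuracy on the validation set.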

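Model Features

High accuracy

Reaches 91% accuracy on the Raven's Progressive Matrices validation set

Multimodal foundation

Built on an early checkpoint of a vision-language foundation model

Specialized optimization

Trained and optimized specifically for Raven's Progressive Matrices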

Model Capabilities

Solving Raven's Progressive Matrices

Multimodal image understanding

Logical reasoning

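Use Cases

Cognitive ability testing

Raven's reasoning test

Solves standard Raven's Progressive Matrices problems

91% validation accuracy

Educational assessment

Cognitive ability assessment

Can be used for cognitive ability testing in educational settings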

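🚀 AI Raven Model

This model is designed to solve Raven's Progressive Matrices. It was trained from an early checkpoint of an upcoming multimodal foundation model on a dedicated dataset, and it reaches 91% accuracy on the validation set.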

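🚀 Quick Start

You can try the online demo first!

✨ Key Features

The model was trained specifically to solve Raven's Progressive Matrices.
It is built on an early checkpoint of an upcoming vision-language foundation model.
It reaches 91% accuracy on the validation set.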

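💻 Usage Examples

Basic Usage

This snippet shows how to run batched inference with the model. Much of the input preparation will be encapsulated once the model is integrated into HF Transformers.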

import os

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

from transformers.image_utils import to_numpy_array, PILImageResampling, ChannelDimension
from transformers.image_transforms import resize, to_channel_dimension_format


# Hugging Face access token. Assumption: it is stored in the HF_TOKEN environment variable.
API_TOKEN = os.environ.get("HF_TOKEN")
DEVICE = torch.device("cuda")
PROCESSOR = AutoProcessor.from_pretrained(
    "HuggingFaceM4/tr_272_bis_opt_step_15000_merge",
    token=API_TOKEN,
)
MODEL = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceM4/tr_272_bis_opt_step_15000_merge",
    token=API_TOKEN,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(DEVICE)
image_seq_len = MODEL.config.perceiver_config.resampler_n_latents
BOS_TOKEN = PROCESSOR.tokenizer.bos_token
BAD_WORDS_IDS = PROCESSOR.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids


def convert_to_rgb(image):
    # `image.convert("RGB")` would only work for .jpg images, as it creates a wrong background
    # for transparent images. The call to `alpha_composite` handles this case
    if image.mode == "RGB":
        return image

    image_rgba = image.convert("RGBA")
    background = Image.new("RGBA", image_rgba.size, (255, 255, 255))
    alpha_composite = Image.alpha_composite(background, image_rgba)
    alpha_composite = alpha_composite.convert("RGB")
    return alpha_composite


# The processor is the same as the Idefics processor except for the BILINEAR interpolation,
# so this is a hack in order to redefine ONLY the transform method
def custom_transform(x):
    x = convert_to_rgb(x)
    x = to_numpy_array(x)

    height, width = x.shape[:2]
    aspect_ratio = width / height
    if width >= height and width > 980:
        width = 980
        height = int(width / aspect_ratio)
    elif height > width and height > 980:
        height = 980
        width = int(height * aspect_ratio)
    width = max(width, 378)
    height = max(height, 378)

    x = resize(x, (height, width), resample=PILImageResampling.BILINEAR)
    x = PROCESSOR.image_processor.rescale(x, scale=1 / 255)
    x = PROCESSOR.image_processor.normalize(
        x,
        mean=PROCESSOR.image_processor.image_mean,
        std=PROCESSOR.image_processor.image_std
    )
    x = to_channel_dimension_format(x, ChannelDimension.FIRST)
    x = torch.tensor(x)
    return x


# Create text token inputs
image_seq = '<image>' * image_seq_len
inputs = PROCESSOR.tokenizer(
    [
        f"{BOS_TOKEN}User:<fake_token_around_image>{image_seq}<fake_token_around_image>Which figure should complete the logical sequence?<end_of_utterance>\nAssistant:",
        f"{BOS_TOKEN}User:<fake_token_around_image>{image_seq}<fake_token_around_image>Which figure should complete the logical sequence?<end_of_utterance>\nAssistant:",
    ],
    return_tensors="pt",
    add_special_tokens=False,
    padding=True,
)

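# Create pixel inputs: one list of PIL images per prompt.
# The names below are placeholders for your own PIL.Image objects.
raw_images = [
    [your_raven_puzzle_as_a_pil_image1],
    [your_raven_puzzle_as_a_pil_image2],
]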
output_images = [
    [PROCESSOR.image_processor(img, transform=custom_transform) for img in img_list]
    for img_list in raw_images
]
# Pad every image to the largest height and width in the batch, and record
# which pixels are real (True) versus padding (False) in a boolean mask.
total_batch_size = len(output_images)
max_num_images = max(len(img_l) for img_l in output_images)
max_height = max(i.size(2) for img_l in output_images for i in img_l)
max_width = max(i.size(3) for img_l in output_images for i in img_l)
padded_image_tensor = torch.zeros(total_batch_size, max_num_images, 3, max_height, max_width)
padded_pixel_attention_masks = torch.zeros(
    total_batch_size, max_num_images, max_height, max_width, dtype=torch.bool
)
for batch_idx, img_l in enumerate(output_images):
    for img_idx, img in enumerate(img_l):
        # Copy each image into the top-left corner of its padded slot
        im_height, im_width = img.size()[2:]
        padded_image_tensor[batch_idx, img_idx, :, :im_height, :im_width] = img
        padded_pixel_attention_masks[batch_idx, img_idx, :im_height, :im_width] = True

inputs["pixel_values"] = padded_image_tensor
inputs["pixel_attention_mask"] = padded_pixel_attention_masks
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

generated_ids = MODEL.generate(**inputs, bad_words_ids=BAD_WORDS_IDS, max_new_tokens=10)
generated_texts = PROCESSOR.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
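
The size arithmetic in the snippet above (the resize rule inside `custom_transform` and the dimensions of the padded pixel tensor) can be checked without loading the model. The following is a minimal sketch in plain Python that mirrors those two computations. The helper names `target_size` and `padded_batch_shape` and the constants `MAX_SIDE` and `MIN_SIDE` are hypothetical and are not part of the model's code; only the values 980 and 378 come from the snippet.

```python
# Sketch only: mirrors the size arithmetic of the snippet above without torch or PIL.
# `target_size` and `padded_batch_shape` are hypothetical helper names.

MAX_SIDE = 980  # longest side allowed by custom_transform
MIN_SIDE = 378  # smallest side allowed by custom_transform


def target_size(height, width):
    """Return the (height, width) that custom_transform would resize to."""
    aspect_ratio = width / height
    if width >= height and width > MAX_SIDE:
        width = MAX_SIDE
        height = int(width / aspect_ratio)
    elif height > width and height > MAX_SIDE:
        height = MAX_SIDE
        width = int(height * aspect_ratio)
    # Each side is then raised to at least MIN_SIDE independently,
    # so small images do not keep their aspect ratio.
    return max(height, MIN_SIDE), max(width, MIN_SIDE)


def padded_batch_shape(sizes, channels=3):
    """Shape of the zero-padded pixel tensor for per-sample lists of (height, width)."""
    batch_size = len(sizes)
    max_num_images = max(len(sample) for sample in sizes)
    max_height = max(h for sample in sizes for h, _ in sample)
    max_width = max(w for sample in sizes for _, w in sample)
    return (batch_size, max_num_images, channels, max_height, max_width)


if __name__ == "__main__":
    wide = target_size(1000, 2000)   # (490, 980)
    tall = target_size(2000, 1000)   # (980, 490)
    small = target_size(200, 300)    # (378, 378)
    print(wide, tall, small)
    print(padded_batch_shape([[wide], [tall]]))  # (2, 1, 3, 980, 980)
```

Batching one wide and one tall puzzle image therefore produces a 980 x 980 padded canvas per image, which is why the snippet also builds a pixel attention mask marking the real pixels.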

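Advanced Usage

The model was fine-tuned specifically to solve Raven puzzles. Without appropriate adaptation, we cannot guarantee that it will behave accurately outside this use case.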

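📚 Detailed Documentation

Model Details

Attribute	Details
Developed by	Hugging Face
Model type	Multimodal model
Language (NLP)	English
License	See the License section
Parent models	SigLIP and mistralai/Mistral-7B-v0.1
Resources for more information	RAVEN dataset: dataset card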