BLIP-Math open-source vision-language model: free to use for math problem analysis, with text generation and scoring feedback

BLIP Math

Developed by uf-aice-lab

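A vision-language model fine-tuned on a multimodal math dataset. It provides text generation and scoring, and is designed for analyzing math problems and giving feedback.

Image-to-Text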

Transformers

Open Source License: BSD-3-Clause #Math answer assessment #Multimodal image-text processing #Handwriting recognition feedback

Downloads 77

Release Time : 9/14/2023

Model Overview

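This model is a math-specific version of the BLIP framework. Fine-tuning added a scoring module, so the model can jointly analyze the problem text, the student's answer, and handwritten images, and can generate feedback suggestions.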

Model Features

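Dual output heads

Provides both text generation and scoring, producing descriptive feedback together with a quantitative score.

Math-specific optimization

Fine-tuned on a multimodal math dataset; well suited to math problems and handwritten solutions.

Multimodal input handling

Jointly analyzes four input sources: problem text, student response, problem image, and student handwriting image.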

Model Capabilities

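Image caption generation

Handwritten math formula recognition

Solution process analysis

Automatic scoring

Teaching feedback generation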

Use Cases

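Education technology

Automatic grading of math homework

Analyzes images of students' handwritten solutions together with their text answers to generate scores and feedback automatically.

Makes grading more efficient for teachers and gives students immediate feedback.

Intelligent tutoring systems

Identifies incorrect steps in a student's solution process and generates targeted guidance.

Supports personalized learning and improves learning outcomes.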

🚀 BLIP - Math

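This model is fine-tuned on a multimodal math dataset and has two output heads: text generation and scoring. We provide the weights for the text generation part of the model in pytorch_model.bin.

You need 4 input sources, two text inputs and two image inputs: problem_body, student_response, question_image, and student_image.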

For conditional text generation:

Concatenate the text as follows: text = 'problem:' + ' ' + [problem_body] + ' ' + 'student:' + [student_response] + ' ' + 'response:'
Stack [question_image] and [student_image] vertically, with [question_image] on top, and use the larger of the two image sizes.
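
The two preparation steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' preprocessing code: the helper names build_prompt, stacked_layout, and stack_images are hypothetical, and "the larger image size" is interpreted here as using the wider of the two widths for the canvas, which is an assumption. The white background used for padding is also an assumption.

```python
from typing import Tuple

Size = Tuple[int, int]  # (width, height), the order PIL uses for Image.size


def build_prompt(problem_body: str, student_response: str) -> str:
    """Concatenate the two text inputs following the documented recipe."""
    # The recipe puts a space after 'problem:' but none after 'student:'.
    return ("problem:" + " " + problem_body + " "
            + "student:" + student_response + " " + "response:")


def stacked_layout(question_size: Size, student_size: Size):
    """Return (canvas_size, question_offset, student_offset) for a vertical stack."""
    q_w, q_h = question_size
    s_w, s_h = student_size
    # Assumption: canvas width is the larger of the two widths,
    # canvas height is the sum of both heights.
    canvas = (max(q_w, s_w), q_h + s_h)
    return canvas, (0, 0), (0, q_h)


def stack_images(question_image, student_image):
    """Paste question_image above student_image on one RGB canvas (needs Pillow)."""
    from PIL import Image  # imported lazily so the helpers above run without Pillow

    canvas_size, q_off, s_off = stacked_layout(question_image.size, student_image.size)
    canvas = Image.new("RGB", canvas_size, "white")  # assumed padding color
    canvas.paste(question_image.convert("RGB"), q_off)
    canvas.paste(student_image.convert("RGB"), s_off)
    return canvas
```

The resulting prompt string and stacked image can then be passed to a BLIP processor in the same way as text and raw_image in the usage examples further down.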

For all other use cases, follow the same steps as for the BLIP model.

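🚀 Quick Start

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Model card for image captioning pretrained on the COCO dataset, base architecture (with a ViT base backbone).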


(Figure: image from the official BLIP repository.)

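TL;DR

The authors write in the abstract of the paper:

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.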

✨ Key Features

You can use this model for conditional and unconditional image captioning.

💻 Usage Examples

Basic Usage

Using the PyTorch model

Running the model on CPU

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# Unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog

Advanced Usage

Running the model on GPU

Full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# Unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog

Half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# Unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog

📚 Detailed Documentation

BibTeX and citation info

@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  url = {https://arxiv.org/abs/2201.12086},
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}