Pixtral-12b开源多模态模型 - 免费处理图像文本、生成详细描述

首页

Pixtral 12b

由 mistral-community 开发

Pixtral是一个基于Mistral架构的多模态模型，能够处理图像和文本输入，生成详细的文本描述。

图像生成文本

Transformers

开源协议:Apache-2.0 #多图文本生成 #视觉-语言交互 #复杂场景理解

下载量 31.93k

发布时间 : 9/13/2024

模型简介

Pixtral是一个12B参数的多模态模型，专为图像到文本的任务设计，能够理解图像内容并生成详细的描述或回答问题。

模型特点

多模态能力

能够同时处理图像和文本输入，生成连贯的文本输出。

高参数规模

12B参数的规模使其具备强大的理解和生成能力。

灵活的输入格式

支持通过URL或本地路径加载图像，并可通过聊天模板格式化输入。

模型能力

图像描述生成

多图像分析

图像问答

多模态对话

使用案例

内容生成

图像描述生成

为单张或多张图像生成详细的文本描述。

生成包含图像细节、背景和情感色彩的描述文本。

问答系统

图像相关问题回答

基于图像内容回答用户提出的问题。

提供与图像内容相关的准确答案和解释。

🚀 图像文本转换模型Pixtral

Pixtral是与Transformers兼容的检查点模型，可实现图像文本到文本的转换。使用前请确保从源代码安装或等待v4.45版本发布！

🚀 快速开始

在使用Pixtral模型前，请确保从源代码安装或等待v4.45版本发布。

💻 使用示例

基础用法

from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
model_id = "mistral-community/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

IMG_URLS = [
"https://picsum.photos/id/237/400/300", 
"https://picsum.photos/id/231/200/300", 
"https://picsum.photos/id/27/500/500",
"https://picsum.photos/id/17/150/600",
]
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][IMG][IMG][/INST]"

inputs = processor(text=PROMPT, images=IMG_URLS, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

运行上述代码后，你将得到类似以下的输出：

"""
Describe the images.
Sure, let's break down each image description:

1. **Image 1:**
   - **Description:** A black dog with a glossy coat is sitting on a wooden floor. The dog has a focused expression and is looking directly at the camera.
   - **Details:** The wooden floor has a rustic appearance with visible wood grain patterns. The dog's eyes are a striking color, possibly brown or amber, which contrasts with its black fur.

2. **Image 2:**
   - **Description:** A scenic view of a mountainous landscape with a winding road cutting through it. The road is surrounded by lush green vegetation and leads to a distant valley.
   - **Details:** The mountains are rugged with steep slopes, and the sky is clear, indicating good weather. The winding road adds a sense of depth and perspective to the image.

3. **Image 3:**
   - **Description:** A beach scene with waves crashing against the shore. There are several people in the water and on the beach, enjoying the waves and the sunset.
   - **Details:** The waves are powerful, creating a dynamic and lively atmosphere. The sky is painted with hues of orange and pink from the setting sun, adding a warm glow to the scene.

4. **Image 4:**
   - **Description:** A garden path leading to a large tree with a bench underneath it. The path is bordered by well-maintained grass and flowers.
   - **Details:** The path is made of small stones or gravel, and the tree provides a shaded area with the bench invitingly placed beneath it. The surrounding area is lush and green, suggesting a well-kept garden.

Each image captures a different scene, from a close-up of a dog to expansive natural landscapes, showcasing various elements of nature and human interaction with it.
"""

高级用法

你还可以使用聊天模板来格式化Pixtral的聊天历史记录。确保传递给processor的images参数中的图像顺序与聊天记录中的顺序一致，以便模型理解每个图像的位置。

from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
model_id = "mistral-community/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

url_dog = "https://picsum.photos/id/237/200/300"
url_mountain = "https://picsum.photos/seed/picsum/200/300"

chat = [
    {
      "role": "user", "content": [
        {"type": "text", "content": "Can this animal"}, 
        {"type": "image"}, 
        {"type": "text", "content": "live here?"}, 
        {"type": "image"}
      ]
    }
]

prompt = processor.apply_chat_template(chat)
inputs = processor(text=prompt, images=[url_dog, url_mountain], return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

从transformers >= v4.48版本开始，你还可以将图像URL或本地路径传递给对话历史记录，让聊天模板处理其余部分。聊天模板将为你加载图像，并返回torch.Tensor格式的输入，你可以直接将其传递给model.generate()。

chat = [
    {
      "role": "user", "content": [
        {"type": "text", "content": "Can this animal"}, 
        {"type": "image", "url": url_dog}, 
        {"type": "text", "content": "live here?"}, 
        {"type": "image", "url" : url_mountain}
      ]
    }
]

inputs = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors"pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

运行上述代码后，你应该会得到类似以下的输出：

Can this animallive here?Certainly! Here are some details about the images you provided:

### First Image
- **Description**: The image shows a black dog lying on a wooden surface. The dog has a curious expression with its head tilted slightly to one side.
- **Details**: The dog appears to be a young puppy with soft, shiny fur. Its eyes are wide and alert, and it has a playful demeanor.
- **Context**: This image could be used to illustrate a pet-friendly environment or to showcase the dog's personality.

### Second Image
- **Description**: The image depicts a serene landscape with a snow-covered hill in the foreground. The sky is painted with soft hues of pink, orange, and purple, indicating a sunrise or sunset.
- **Details**: The hill is covered in a blanket of pristine white snow, and the horizon meets the sky in a gentle curve. The scene is calm and peaceful.
- **Context**: This image could be used to represent tranquility, natural beauty, or a winter wonderland.

### Combined Context
If you're asking whether the dog can "live here," referring to the snowy landscape, it would depend on the breed and its tolerance to cold weather. Some breeds, like Huskies or Saint Bernards, are well-adapted to cold environments, while others might struggle. The dog in the first image appears to be a breed that might prefer warmer climates.

Would you like more information on any specific aspect?

虽然输入中的间距可能看起来被打乱了，但这是因为我们为了显示而跳过了特殊标记。实际上，“Can this animal”和“live here”是由图像标记正确分隔的。尝试包含特殊标记进行解码，以查看模型实际看到的内容！