Pixtral-Large-Instruct-2411
A Transformers implementation of Pixtral-Large-Instruct-2411, designed for image-text-to-text tasks.
🚀 Quick Start
21 Dec 2024: This model has been a LOT of fun to experiment and learn with. The model card below has been updated with changes made to this repo over the last week.
✨ Features
Architecture Differences from Pixtral 12B
Pixtral 12B has bias keys for the `multi_modal_projector` layers, while Pixtral Large does not. This conversion does not include those bias keys, aligning with the keys in the original Pixtral Large upload from Mistral. The model's `config.json` includes `"multimodal_projector_bias": false` to indicate this. Note: if anyone in the community confirms that initializing these keys with zero values works better, I'm happy to reupload without excluding them.
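As a quick sanity check, you can confirm the flag from the loaded config. This is a minimal sketch; the repo ID below is a placeholder for wherever you have the weights:

```python
from transformers import AutoConfig

# Placeholder repo ID; substitute the actual model path.
config = AutoConfig.from_pretrained("your-namespace/Pixtral-Large-Instruct-2411")

# The conversion sets this to False, matching Mistral's original upload.
print(config.multimodal_projector_bias)  # expected: False
```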
Tokenizer
This model uses a conversion of the Mistral v7m1 tokenizer. Pixtral 12B and Large use different tokenizers with different vocab sizes, so make sure to use the correct one.
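Since the 12B and Large tokenizers are not interchangeable, one quick check is to compare vocabulary sizes. A sketch, with placeholder/representative repo IDs:

```python
from transformers import AutoTokenizer

# Placeholder repo ID for this conversion; the 12B repo is representative.
tok_large = AutoTokenizer.from_pretrained("your-namespace/Pixtral-Large-Instruct-2411")
tok_12b = AutoTokenizer.from_pretrained("mistral-community/pixtral-12b")

# The two models use different vocab sizes, so these should differ.
print(len(tok_large), len(tok_12b))
```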
Prompting / Chat Template
The included `chat_template.json` supports all of Mistral's defined features, with some personal additions.
I believe this implementation offers great flexibility for using the model, and it has worked well in my testing.
Example (line breaks added for readability)
<s>[SYSTEM_PROMPT] <system prompt>[/SYSTEM_PROMPT]
[INST] [IMG]<user message>
[AVAILABLE_TOOLS] [<tool definitions>][/AVAILABLE_TOOLS][/INST]
[IMG]<assistant response>
[TOOL_CALLS] [<tool calls>][/TOOL_CALLS]
[TOOL_RESULTS] <tool results including images>[/TOOL_RESULTS]
</s>[INST] <user message>[/INST]
System Prompts:
Messages with the "system" role will be parsed as `[SYSTEM_PROMPT] <content>[/SYSTEM_PROMPT]` anywhere in the chat history.
This seems to work well for passing extra instructions at various depths and keeps instructions separate from the conversation.
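For example, a system message placed mid-history is rendered in place. A minimal sketch (placeholder repo ID):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-namespace/Pixtral-Large-Instruct-2411")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    # A later system message injects instructions at depth in the history.
    {"role": "system", "content": "Answer the next question in French."},
    {"role": "user", "content": "What is the capital of France?"},
]

print(tokenizer.apply_chat_template(messages, tokenize=False))
```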
Allowing Non-Alternating Roles:
Multiple consecutive user messages can be provided, each wrapped in its own `[INST]`...`[/INST]` block. This can be useful in group conversations or environments where several user messages are sent before the model is invoked. Having a `[/INST]` between messages helps the model focus on the last message while retaining knowledge of the previous ones.
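A sketch of how that looks, reusing the tokenizer from the previous snippet:

```python
# Non-alternating roles: three user turns in a row, no assistant between them.
messages = [
    {"role": "user", "content": "Alice: Has anyone tried the new build?"},
    {"role": "user", "content": "Bob: Works for me so far."},
    {"role": "user", "content": "Alice: Can you summarize the changes for us?"},
]

# Each user turn should be rendered in its own [INST]...[/INST] block.
print(tokenizer.apply_chat_template(messages, tokenize=False))
```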
Image Inputs Everywhere:
Images can now be sent in user, assistant, and tool result messages, and it actually works. I conducted tests like including an image in an assistant reply 10 - 15 messages back and asking the assistant to recall it, and it could accurately describe the image.
This flexibility enables interesting applications. For example, if you define a tool for image generation (see the sketch after this list):
- The tool invokes an image generation API/model.
- The image is returned in the tool result message.
- The model responds with a message considering the generated image.
- You can have further conversations about the generated image or make revisions with the model aware of what was created.
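A minimal sketch of that flow at the message level. The `generate_image` tool, the call ID, and the image handling are all illustrative assumptions, not part of this repo:

```python
# Hypothetical tool-call flow; adapt the schema to your serving stack.
messages = [
    {"role": "user", "content": "Draw a lighthouse at sunset."},
    {
        "role": "assistant",
        "tool_calls": [{
            "id": "call0it4x",  # illustrative 9-char ID
            "type": "function",
            "function": {"name": "generate_image",
                         "arguments": {"prompt": "a lighthouse at sunset"}},
        }],
    },
    {
        # The tool result carries the generated image back to the model.
        "role": "tool",
        "tool_call_id": "call0it4x",
        "content": [
            {"type": "text", "text": "Image generated."},
            {"type": "image", "url": "https://example.com/lighthouse.png"},
        ],
    },
]
```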
📦 Installation
No specific installation steps are provided in the original README.
💻 Usage Examples
Basic Usage
When loading the model in Transformers, you may need to add some handling to ensure that the lack of `mmproj` bias is respected for proper vision input.
Most of my testing has been done using TabbyAPI and ExLlamaV2 (dev branch) with working vision input.
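A minimal loading sketch, assuming a recent Transformers version and the Llava-style classes that Transformers uses for Pixtral (the repo ID and image URL are placeholders):

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "your-namespace/Pixtral-Large-Instruct-2411"  # placeholder

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},
        {"type": "text", "text": "Describe this image."},
    ]},
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```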

📚 Documentation
Quantizations
EXL2 quantizations are available in different sizes here. You'll need to use the dev branch of ExLlamaV2 for vision input.
📄 License
| Property | Details |
|----------|---------|
| Model Type | image-text-to-text |
| Base Model | mistralai/Pixtral-Large-Instruct-2411 |
| Library Name | transformers |
| Supported Languages | en, fr, de, es, it, pt, zh, ja, ru, ko |
| Inference | false |