Pixtral-Large-Instruct-2411
A Transformers implementation of Pixtral-Large-Instruct-2411, designed for image-text-to-text tasks.
🚀 Quick Start
21 Dec 2024: This model has been a LOT of fun to experiment and learn with. The model card below has been updated with changes made to this repo over the last week.
✨ Features
Architecture Differences from Pixtral 12B
Pixtral 12B has bias keys for the `multi_modal_projector` layers, while Pixtral Large does not. This conversion does not include those bias keys, aligning with the keys in the original Pixtral Large upload from Mistral. The model's `config.json` includes `"multimodal_projector_bias": false` to indicate this. Note: if anyone in the community confirms that initializing these keys with zero values works better, I'm happy to reupload without excluding them.
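As a quick sanity check, you can confirm the flag from the loaded config. This is a minimal sketch; the repo ID below is a placeholder for wherever you have the weights:

```python
from transformers import AutoConfig

# Placeholder repo ID; substitute the actual model path.
config = AutoConfig.from_pretrained("your-namespace/Pixtral-Large-Instruct-2411")

# The conversion sets this to False, matching Mistral's original upload.
print(config.multimodal_projector_bias)  # expected: False
```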
Tokenizer
This model uses a conversion of the Mistral v7m1 tokenizer. Pixtral 12B and Large use different tokenizers with different vocab sizes, so make sure to use the correct one.
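Since the 12B and Large tokenizers are not interchangeable, one quick check is to compare vocabulary sizes. A sketch, with placeholder/representative repo IDs:

```python
from transformers import AutoTokenizer

# Placeholder repo ID for this conversion; the 12B repo is representative.
tok_large = AutoTokenizer.from_pretrained("your-namespace/Pixtral-Large-Instruct-2411")
tok_12b = AutoTokenizer.from_pretrained("mistral-community/pixtral-12b")

# The two models use different vocab sizes, so these should differ.
print(len(tok_large), len(tok_12b))
```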
Prompting / Chat Template
The included `chat_template.json` supports all of Mistral's defined features, with some personal additions.
I believe this implementation offers great flexibility for using the model, and it has worked well in my testing.
Example (line breaks added for readability)
<s>[SYSTEM_PROMPT] <system prompt>[/SYSTEM_PROMPT]
[INST] [IMG]<user message>
[AVAILABLE_TOOLS] [<tool definitions>][/AVAILABLE_TOOLS][/INST]
[IMG]<assistant response>
[TOOL_CALLS] [<tool calls>][/TOOL_CALLS]
[TOOL_RESULTS] <tool results including images>[/TOOL_RESULTS]
</s>[INST] <user message>[/INST]
System Prompts:
Messages with the "system" role will be parsed as `[SYSTEM_PROMPT] <content>[/SYSTEM_PROMPT]` anywhere in the chat history.
This seems to work well for passing extra instructions at various depths and keeps instructions separate from the conversation.
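For example, a system message placed mid-history is rendered in place. A minimal sketch (placeholder repo ID):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-namespace/Pixtral-Large-Instruct-2411")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    # A later system message injects instructions at depth in the history.
    {"role": "system", "content": "Answer the next question in French."},
    {"role": "user", "content": "What is the capital of France?"},
]

print(tokenizer.apply_chat_template(messages, tokenize=False))
```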
Allowing Non-Alternating Roles:
Multiple consecutive user messages can be provided, each wrapped in its own `[INST]`...`[/INST]` block. This can be useful in group conversations or environments where several user messages are sent before the model is invoked. Having a `[/INST]` between messages helps the model focus on the last message while retaining knowledge of the previous ones.
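A sketch of how that looks, reusing the tokenizer from the previous snippet:

```python
# Non-alternating roles: three user turns in a row, no assistant between them.
messages = [
    {"role": "user", "content": "Alice: Has anyone tried the new build?"},
    {"role": "user", "content": "Bob: Works for me so far."},
    {"role": "user", "content": "Alice: Can you summarize the changes for us?"},
]

# Each user turn should be rendered in its own [INST]...[/INST] block.
print(tokenizer.apply_chat_template(messages, tokenize=False))
```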
Image Inputs Everywhere:
Images can now be sent in user, assistant, and tool result messages, and it actually works. I conducted tests like including an image in an assistant reply 10 - 15 messages back and asking the assistant to recall it, and it could accurately describe the image.
This flexibility enables interesting applications. For example, if you define a tool for image generation (see the sketch after this list):
- The tool invokes an image generation API/model.
- The image is returned in the tool result message.
- The model responds with a message considering the generated image.
- You can have further conversations about the generated image or make revisions with the model aware of what was created.
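A minimal sketch of that flow at the message level. The `generate_image` tool, the call ID, and the image handling are all illustrative assumptions, not part of this repo:

```python
# Hypothetical tool-call flow; adapt the schema to your serving stack.
messages = [
    {"role": "user", "content": "Draw a lighthouse at sunset."},
    {
        "role": "assistant",
        "tool_calls": [{
            "id": "call0it4x",  # illustrative 9-char ID
            "type": "function",
            "function": {"name": "generate_image",
                         "arguments": {"prompt": "a lighthouse at sunset"}},
        }],
    },
    {
        # The tool result carries the generated image back to the model.
        "role": "tool",
        "tool_call_id": "call0it4x",
        "content": [
            {"type": "text", "text": "Image generated."},
            {"type": "image", "url": "https://example.com/lighthouse.png"},
        ],
    },
]
```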
📦 Installation
No specific installation steps are provided in the original README.
💻 Usage Examples
Basic Usage
When loading the model in Transformers, you may need to add some handling to ensure that the lack of `mmproj` bias is respected for proper vision input.
Most of my testing has been done using TabbyAPI and ExLlamaV2 (dev branch) with working vision input.
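A minimal loading sketch, assuming a recent Transformers version and the Llava-style classes that Transformers uses for Pixtral (the repo ID and image URL are placeholders):

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "your-namespace/Pixtral-Large-Instruct-2411"  # placeholder

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},
        {"type": "text", "text": "Describe this image."},
    ]},
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```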

📚 Documentation
Quantizations
EXL2 quantizations are available in different sizes here. You'll need to use the dev branch of ExLlamaV2 for vision input.
📄 License
| Property | Details |
|----------|---------|
| Model Type | image-text-to-text |
| Base Model | mistralai/Pixtral-Large-Instruct-2411 |
| Library Name | transformers |
| Supported Languages | en, fr, de, es, it, pt, zh, ja, ru, ko |
| Inference | false |