
AKI 4B Phi 3.5 Mini

Developed by Sony
AKI is a multimodal foundation model that addresses vision-language misalignment by unlocking the causal attention mechanism in LLMs into cross-modal mutual attention (MMA), without adding parameters or extra training time.
Release Time: 3/12/2025

Model Overview

This model integrates the visual and textual modalities for image-to-text generation, and it excels particularly at visual scene understanding and multimodal reasoning tasks.
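A minimal inference sketch under stated assumptions is shown below. The checkpoint identifier, the use of AutoProcessor, and the trust_remote_code loading path are illustrative guesses, not a confirmed interface; consult the official AKI release for the supported loading procedure.

```python
# Hypothetical usage sketch: model id and processor interface are assumptions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Sony/AKI-4B-phi-3.5-mini"  # assumed identifier, check the official release
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

image = Image.open("park_autumn.jpg")  # any local image
prompt = "Describe the scene in this image."
inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```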

Model Features

Cross-modal Mutual Attention (MMA)
Unlocks the causal attention mechanism in LLMs so that information from the textual modality can flow into the visual modality, addressing vision-language misalignment (see the sketch after this list)
Zero Parameter Increase
The architecture achieves multimodal fusion without additional parameters or extra training time
Multi-task Adaptation
Instruction fine-tuned on 12 benchmark datasets, supporting a wide range of vision-language tasks
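To make the MMA feature concrete, here is a minimal sketch of one way a causal attention mask can be relaxed so that visual tokens also attend to later text tokens. The function name and the mask convention (True means attention is allowed) are illustrative assumptions, not the authors' implementation.

```python
import torch

def mma_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Relax causality for visual tokens (illustrative sketch, not AKI's code).

    is_image: (seq_len,) bool tensor, True at positions holding image tokens.
    Returns a (seq_len, seq_len) bool mask where True allows attention.
    Text tokens keep the standard causal mask; image-token queries may attend
    to every position, so information from later text tokens can flow into
    the visual representations.
    """
    seq_len = is_image.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Rows index query positions: open the entire row for each image query.
    return causal | is_image.unsqueeze(1)

# Example: 3 image tokens followed by 2 text tokens.
mask = mma_attention_mask(torch.tensor([True, True, True, False, False]))
print(mask.int())
```

Because only the attention mask changes, no new weights are introduced, which is consistent with the zero-parameter-increase claim above.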

Model Capabilities

Image scene description
Visual question answering
Multimodal reasoning
Image OCR understanding
Medical image analysis
3D visual understanding

Use Cases

Intelligent Assistants
Image Scene Description
Automatically generates detailed textual descriptions of image content
Example output: The picture shows a park in autumn, with colorful fallen leaves covering the path...
Medical Assistance
Multimodal Diagnosis
Analyzes medical images and generates diagnostic suggestions
Achieved 40.8% accuracy in evaluations (AKI-4B version)
EdTech
Visual Math Problem Solving
Interprets charts containing mathematical formulas and answers related questions
Achieved 32.1% accuracy in visual math evaluations (AKI-4B version)