
AKI 4B Phi 3.5 Mini

Developed by Sony
AKI is a multimodal foundation model that addresses vision-language misalignment by unlocking the causal attention mechanism in LLMs into cross-modal mutual attention (MMA), without adding parameters or extra training time.
Release Time: 3/12/2025

Model Overview

This model integrates the visual and textual modalities for image-to-text generation, and it excels particularly at visual scene understanding and multimodal reasoning tasks.
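A minimal inference sketch under stated assumptions is shown below. The checkpoint identifier, the use of AutoProcessor, and the trust_remote_code loading path are illustrative guesses, not a confirmed interface; consult the official AKI release for the supported loading procedure.

```python
# Hypothetical usage sketch: model id and processor interface are assumptions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Sony/AKI-4B-phi-3.5-mini"  # assumed identifier, check the official release
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

image = Image.open("park_autumn.jpg")  # any local image
prompt = "Describe the scene in this image."
inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```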

Model Features

Cross-modal Mutual Attention (MMA)
Unlocks the causal attention mechanism in LLMs so that information from the textual modality can flow into the visual modality, addressing vision-language misalignment (see the sketch after this list)
Zero Parameter Increase
The architecture achieves multimodal fusion without additional parameters or extra training time
Multi-task Adaptation
Instruction fine-tuned on 12 benchmark datasets, supporting a wide range of vision-language tasks
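To make the MMA feature concrete, here is a minimal sketch of one way a causal attention mask can be relaxed so that visual tokens also attend to later text tokens. The function name and the mask convention (True means attention is allowed) are illustrative assumptions, not the authors' implementation.

```python
import torch

def mma_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Relax causality for visual tokens (illustrative sketch, not AKI's code).

    is_image: (seq_len,) bool tensor, True at positions holding image tokens.
    Returns a (seq_len, seq_len) bool mask where True allows attention.
    Text tokens keep the standard causal mask; image-token queries may attend
    to every position, so information from later text tokens can flow into
    the visual representations.
    """
    seq_len = is_image.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Rows index query positions: open the entire row for each image query.
    return causal | is_image.unsqueeze(1)

# Example: 3 image tokens followed by 2 text tokens.
mask = mma_attention_mask(torch.tensor([True, True, True, False, False]))
print(mask.int())
```

Because only the attention mask changes, no new weights are introduced, which is consistent with the zero-parameter-increase claim above.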

Model Capabilities

Image scene description
Visual question answering
Multimodal reasoning
Image OCR understanding
Medical image analysis
3D visual understanding

Use Cases

Intelligent Assistants
Image Scene Description
Automatically generates detailed textual descriptions of image content
Example output: The picture shows a park in autumn, with colorful fallen leaves covering the path...
Medical Assistance
Multimodal Diagnosis
Analyzes medical images and generates diagnostic suggestions
Achieved 40.8% accuracy in evaluations (AKI-4B version)
EdTech
Visual Math Problem Solving
Interprets charts containing mathematical formulas and answers related questions
Achieved 32.1% accuracy in visual math evaluations (AKI-4B version)