Magma 8B

Developed by Microsoft
Magma is a foundational multimodal AI agent model capable of processing image and text inputs to generate text outputs, with complex interaction abilities in both virtual and real-world environments.
Downloads 4,526
Release Time: 2/23/2025

Model Overview

Magma is a foundation model for multimodal AI agents. Using the Set-of-Mark and Trace-of-Mark techniques, it learns spatiotemporal localization and planning from large amounts of unlabeled video data, which makes it suitable for agentic tasks such as UI navigation and robotic manipulation.
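The mark-and-trace idea described above (named Set-of-Mark and Trace-of-Mark by the Magma authors) can be illustrated with a toy sketch: points of interest in a video frame are given numeric marks, and the training target for each mark is its positions over the next few frames. The code below is only an illustration of that data shape under stated assumptions, not Magma's actual pipeline. The names `MarkTrace`, `build_traces`, and `to_text_target`, the hand-written `tracks` dictionary, and the text serialization format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates


@dataclass
class MarkTrace:
    """One numbered mark and where it moves in the following frames."""
    mark_id: int
    start: Point          # position of the mark in the current frame
    future: List[Point]   # positions in the next frames (the "trace")


def build_traces(tracks: Dict[int, List[Point]], t: int, horizon: int) -> List[MarkTrace]:
    """Pair each mark's position at frame t with its next `horizon` positions.

    `tracks` maps a mark id to its per-frame positions. In a real system these
    would come from a point tracker; here they are assumed to be given.
    """
    traces = []
    for mark_id, positions in sorted(tracks.items()):
        if t >= len(positions):
            continue  # this mark is not tracked at frame t
        future = positions[t + 1 : t + 1 + horizon]
        traces.append(MarkTrace(mark_id, positions[t], future))
    return traces


def to_text_target(traces: List[MarkTrace]) -> str:
    """Serialize traces as plain text (hypothetical format, for illustration only)."""
    lines = []
    for tr in traces:
        path = " -> ".join(f"({x},{y})" for x, y in tr.future)
        lines.append(f"mark {tr.mark_id} at ({tr.start[0]},{tr.start[1]}): {path}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hand-written example: mark 1 moves to the right, mark 2 stays still.
    tracks = {
        1: [(10, 10), (12, 10), (14, 11)],
        2: [(50, 40), (50, 40), (50, 40)],
    }
    print(to_text_target(build_traces(tracks, t=0, horizon=2)))
```

Because the targets are derived from the video itself, no human action labels are needed, which is the property that lets this kind of supervision scale to unlabeled video.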

Model Features

Digital & Physical World Interaction
Described by its developers as the first foundation model for multimodal AI agents that handles complex interactions in both virtual and real-world environments.
Versatile Unified Architecture
A single model with integrated capabilities for visual understanding, language generation, and action planning.
Spatiotemporal Localization & Planning
Learns spatiotemporal localization and planning from video data through the Trace-of-Mark technique.
Scalable Pretraining
Pretraining scales to massive amounts of unlabeled video data, and the resulting model generalizes well across tasks.

Model Capabilities

Image understanding
Video understanding
Text generation
UI navigation
Robotic manipulation control
Game control
Spatial reasoning
Multimodal interaction

Use Cases

Smart Device Interaction
Mobile UI Navigation
Automatically operates smartphone interfaces based on voice commands
Handled weather queries and airplane mode settings in demos
Robot Control
Object Grasping
Controls robots to grasp specific objects based on visual input
Successfully grasped hot dogs and mushrooms in demonstrations
Game AI
Game Control
Understands game states through visual input and generates control commands
Outperformed LLaVA and GPT-4o-mini in green square collection tasks