Magma is a foundation model for multimodal AI agents. It processes image and text inputs to generate text outputs and can handle complex interactions in both virtual and real-world environments. By introducing the Set-of-Mark and Trace-of-Mark techniques, it learns spatiotemporal grounding and planning from large amounts of unlabeled video data, making it suitable for agentic tasks such as UI navigation and robotic manipulation.
Model Features
Digital & Physical World Interaction
The first multimodal AI agent model capable of handling complex interactions in both virtual and real-world environments.
Versatile Unified Architecture
A single model with integrated capabilities for visual understanding, language generation, and action planning.
Spatiotemporal Localization & Planning
Learns spatiotemporal grounding and planning from video data through the Trace-of-Mark technique.
Scalable Pretraining
Scales pretraining to massive amounts of unlabeled video data, demonstrating strong generalization capabilities.
Model Capabilities
Image understanding
Video understanding
Text generation
UI navigation
Robotic manipulation control
Game control
Spatial reasoning
Multimodal interaction
Use Cases
Smart Device Interaction
Mobile UI Navigation
Automatically operates smartphone interfaces based on voice commands
Successfully demonstrated weather queries and airplane mode settings in demos
Robot Control
Object Grasping
Controls robots to grasp specific objects based on visual input
Successfully grasped hot dogs and mushrooms in demonstrations
Game AI
Game Control
Understands game states through visual input and generates control commands
Outperformed LLaVA-OneVision and GPT-4o-mini in the green block collection task
Magma-8B: A Foundation Model for Multimodal AI Agents
Magma-8B is a multimodal agentic AI model. It can generate text based on input text and images. The model is developed for research, aiming to share knowledge and speed up research in multimodal AI.
Quick Start
To start using the model, first ensure that transformers and torch are installed, and also install the following dependencies:
pip install torchvision Pillow open_clip_torch
Important Note
You need to install our customized transformers library before running the example below.
import torch
from PIL import Image
from io import BytesIO
import requests
from transformers import AutoModelForCausalLM, AutoProcessor
# Load the model and processor
dtype = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda")
# Inference
url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(BytesIO(requests.get(url, stream=True).content))
image = image.convert("RGB")
convs = [
    {"role": "system", "content": "You are agent that can see, talk and act."},
    {"role": "user", "content": "<image_start><image><image_end>\nWhat is in this image?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], texts=prompt, return_tensors="pt")
# Add a batch dimension to the visual inputs
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to("cuda").to(dtype)
generation_args = {
    "max_new_tokens": 128,
    "temperature": 0.0,
    "do_sample": False,
    "use_cache": True,
    "num_beams": 1,
}
with torch.inference_mode():
    generate_ids = model.generate(**inputs, **generation_args)

# Strip the prompt tokens and decode only the newly generated text
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1]:]
response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
print(response)
⨠Features
Agents
UI Navigation
What's weather in Seattle? & turn on flight mode
Share and message this to Bob Steve. Click send button
Robot Manipulation
Pick Place Hotdog Sausage
Put Mushroom Place Pot
Push Cloth Left to Right (Out-of-Dist.)
Gaming
Task: Model controls the robot to collect green blocks.
Magma vs. LLaVA-OneVision
Magma vs. GPT-4o-mini
Model Details
Model Description
Magma is a multimodal agentic AI model that can generate text based on input text and images. The model is designed for research purposes and aimed at knowledge-sharing and accelerating research in multimodal AI, in particular multimodal agentic AI. Its main innovation lies in two techniques, Set-of-Mark and Trace-of-Mark, and in leveraging a large amount of unlabeled video data to learn spatial-temporal grounding and planning. Please refer to our paper for more technical details.
Highlights
Digital and Physical Worlds: Magma is the first foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments!
Versatile Capabilities: As a single model, Magma not only possesses generic image and video understanding ability but also generates goal-driven visual plans and actions, making it versatile for different agentic tasks!
State-of-the-art Performance: Magma achieves state-of-the-art performance on various multimodal tasks, including UI navigation, robotic manipulation, and generic image and video understanding, in particular spatial understanding and reasoning!
Scalable Pretraining Strategy: Magma is designed to learn scalably from unlabeled videos in the wild in addition to existing agentic data, giving it strong generalization ability and making it suitable for real-world applications!
The data collection process involved sourcing information from publicly available documents, with a meticulous approach to filtering out undesirable documents and images. To safeguard privacy, we carefully filtered various image and text data sources to remove or scrub any potentially personal data from the training data.
In addition to the text-related preprocessing, we mainly undertake the following image and video preprocessing steps:
UI Grounding and Navigation Data: For each UI screenshot, we extract the bounding boxes for the UI elements, and apply Set-of-Mark Prompting to overlay numeric marks on the raw image. The model is trained to generate the UI grounding text based on the image and the Set-of-Mark prompts.
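To make the Set-of-Mark overlay concrete, here is a minimal sketch that draws numeric marks on pre-extracted UI bounding boxes. The function name, box format, and drawing style are assumptions for illustration, not the exact tooling used to build Magma's training data.

from PIL import Image, ImageDraw

def overlay_set_of_mark(image, boxes):
    """Draw a numeric mark (1, 2, ...) and an outline for each UI element box."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(idx), fill="red")
    return marked

# Usage: the numeric marks become the targets the model refers to in its grounding text.
screenshot = Image.open("screenshot.png").convert("RGB")
marked = overlay_set_of_mark(screenshot, [(10, 20, 120, 60), (10, 80, 120, 120)])
marked.save("screenshot_som.png")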
Instruction Video Data: For each video clip, we apply Co-Tracker to extract grid traces and then apply a filtering algorithm to remove noisy or static points. For videos with camera motion, we further apply a homography transformation to stabilize the clips. Finally, we assign a numeric mark to each trace, which gives us a set of trace-of-marks. The model is trained to generate the trace-of-mark given the video clips and instructional text.
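As a rough illustration of the trace filtering described above, the sketch below drops point tracks whose accumulated motion is negligible. It assumes the tracker output is a NumPy array of shape (num_points, num_frames, 2); the threshold and function name are made up for the example and are not the values used in Magma's pipeline.

import numpy as np

def filter_static_traces(traces, min_displacement=2.0):
    """Keep traces whose summed frame-to-frame motion exceeds a pixel threshold."""
    step_motion = np.linalg.norm(np.diff(traces, axis=1), axis=-1)  # (num_points, num_frames - 1)
    total_motion = step_motion.sum(axis=1)                          # (num_points,)
    return traces[total_motion > min_displacement]

# Each surviving trace is then given a numeric mark; the model is trained to
# predict these future trajectories from the video frames and the instruction.
traces = np.random.rand(64, 16, 2) * 100   # placeholder Co-Tracker-style output
moving = filter_static_traces(traces)
trace_of_marks = {mark: trace for mark, trace in enumerate(moving, start=1)}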
Robotics Manipulation Data: For robotics data in Open-X Embodiment, we extract the 7-DoF robot gripper state as well as the trace-of-mark from the video clips. The same filtering and stabilization steps are applied to the video clips. The model is trained to generate the robot manipulation action and the trace-of-mark given the video clips and instructional text.
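For intuition, the 7-DoF gripper state mentioned above can be thought of as a translation delta, a rotation delta, and a gripper open/close value. The field names and ordering below are assumptions for illustration; see the paper and Open-X Embodiment for the actual action format.

from dataclasses import dataclass

@dataclass
class GripperAction:
    """Illustrative 7-DoF end-effector action (field names are assumptions)."""
    dx: float       # translation deltas
    dy: float
    dz: float
    droll: float    # rotation deltas
    dpitch: float
    dyaw: float
    gripper: float  # e.g. 1.0 = open, 0.0 = closed

    def to_vector(self):
        return [self.dx, self.dy, self.dz, self.droll, self.dpitch, self.dyaw, self.gripper]

# One training target pairs an action like this with the trace-of-mark for the clip.
action = GripperAction(0.01, -0.02, 0.0, 0.0, 0.0, 0.05, 1.0)
print(action.to_vector())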
After all this preprocessing, we combine the results with existing text annotations to form our final multimodal training data. We refer to our paper for more technical details.
Training Hyperparameters
We used bf16 mixed precision for training on H100s and MI300s. We used the following hyperparameters for training:
Batch size: 1024
Learning rate: 1e-5
Max sequence length: 4096
Resolution: at most 1024x1024 for images, 512x512 for video frames.
Pretraining Epochs: 3
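For reference, the settings above can be collected into a single configuration object; the key names below are illustrative rather than the actual training-script arguments.

# Pretraining hyperparameters from the list above, as one illustrative config dict.
pretraining_config = {
    "precision": "bf16",                      # mixed precision on H100 / MI300
    "global_batch_size": 1024,
    "learning_rate": 1e-5,
    "max_seq_length": 4096,
    "max_image_resolution": (1024, 1024),
    "max_video_frame_resolution": (512, 512),
    "num_epochs": 3,
}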
Evaluation
Testing Data, Factors & Metrics
Zero-shot Testing Data
We evaluate the model's zero-shot performance on the benchmarks listed in the results table below. We follow each dataset's own evaluation metrics; please refer to the original datasets for more details.
Results on Agentic Intelligence
Zero-shot evaluation on agentic intelligence. We report the results for pretrained Magma without any domain-specific finetuning. Magma is the only model that can conduct the full task spectrum.
| Model | VQAv2 | TextVQA | POPE | SS-Mobile | SS-Desktop | SS-Web | VWB-Ele-G | VWB-Act-G | SE-Google Robot | SE-Bridge |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4V | 77.2 | 78.0 | n/a | 23.6 | 16.0 | 9.0 | 67.5 | 75.7 | - | - |
| GPT-4V-OmniParser | n/a | n/a | n/a | 71.1 | 45.6 | 58.5 | - | - | - | - |
| LLaVA-1.5 | 78.5 | 58.2 | 85.9 | - | - | - | 12.1 | 13.6 | - | - |
| LLaVA-Next | 81.3 | 64.9 | 86.5 | - | - | - | 15.0 | 8.7 | - | - |
| Qwen-VL | 78.8 | 63.8 | n/a | 6.2 | 6.3 | 3.0 | 14.0 | 0.7 | - | - |
| Qwen-VL-Chat | 78.2 | 61.5 | n/a | - | - | - | - | - | - | - |
| Fuyu | 74.2 | n/a | n/a | 21.2 | 20.8 | 19.2 | 19.4 | 15.5 | - | - |
| SeeClick | - | - | - | 65.0 | 51.1 | 44.1 | 9.9 | 1.9 | - | - |
| Octo | - | - | - | - | - | - | - | - | - | - |
| RT-1-X | - | - | - | - | - | - | - | - | 6.0 | 15.9 |
| OpenVLA | - | - | - | - | - | - | - | - | 34.2 | 1.1 |
| Magma-8B | 80.0 | 66.5 | 87.4 | 59.5 | 64.1 | 60.6 | 96.3 | 71.8 | 52.3 | 35.4 |
Notes: SS - ScreenSpot, VWB - VisualWebArena, SE - SimplerEnv
Technical Specifications
Model Architecture and Objective
Language Model: We use Meta Llama-3 as the backbone LLM.
Vision Encoder: We use CLIP-ConvNeXt-XXLarge, trained by the LAION team, as the vision encoder to tokenize images and videos.
The overall pipeline follows common practice for multimodal LLMs: the vision encoder tokenizes the images and videos, and the resulting visual tokens are fed into the LLM alongside the textual tokens to generate text outputs.
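Conceptually, the forward pass can be sketched as below; the class and module names are placeholders following the common vision-encoder-plus-LLM pattern, not Magma's actual implementation.

import torch
import torch.nn as nn

class MultimodalPipeline(nn.Module):
    """Placeholder sketch of a vision-encoder + projector + LLM pipeline."""
    def __init__(self, vision_encoder, projector, llm):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a CLIP-ConvNeXt backbone
        self.projector = projector            # maps visual features into the LLM embedding space
        self.llm = llm                        # e.g. a Llama-3 style decoder (HF-style interface assumed)

    def forward(self, pixel_values, input_ids):
        # Tokenize the image into visual tokens, then feed them to the LLM with the text tokens.
        visual_tokens = self.projector(self.vision_encoder(pixel_values))   # (B, N_vis, D)
        text_embeds = self.llm.get_input_embeddings()(input_ids)            # (B, N_txt, D)
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds).logits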
The model was developed by Microsoft and funded by Microsoft Research. It is shared by Microsoft Research and licensed under the MIT License.
Intended Uses
This model is intended for broad research use in English. It is designed only for research purposes and aimed at knowledge-sharing and accelerating research in multimodal AI.