Magma is a foundation model for multimodal AI agents. It processes image and text inputs to generate text outputs and can handle complex interactions in both virtual and real-world environments. By introducing the Set-of-Mark and Trace-of-Mark techniques, it learns spatiotemporal grounding and planning from large amounts of unlabeled video data, making it suitable for agentic tasks such as UI navigation and robotic manipulation.
Model Features
Digital & Physical World Interaction
The first multimodal AI agent model capable of handling complex interactions in both virtual and real-world environments.
Versatile Unified Architecture
A single model with integrated capabilities for visual understanding, language generation, and action planning.
Spatiotemporal Localization & Planning
Learns spatiotemporal grounding and planning from video data through the Trace-of-Mark technique.
Scalable Pretraining
Scales pretraining to massive amounts of unlabeled video data, demonstrating strong generalization capabilities.
Model Capabilities
Image understanding
Video understanding
Text generation
UI navigation
Robotic manipulation control
Game control
Spatial reasoning
Multimodal interaction
Use Cases
Smart Device Interaction
Mobile UI Navigation
Automatically operates smartphone interfaces based on voice commands
Successfully demonstrated weather queries and airplane mode settings in demos
Robot Control
Object Grasping
Controls robots to grasp specific objects based on visual input
Successfully grasped hot dogs and mushrooms in demonstrations
Game AI
Game Control
Understands game states through visual input and generates control commands
Outperformed LLaVA-OneVision and GPT-4o-mini in the green block collection task
Magma-8B: A Foundation Model for Multimodal AI Agents
Magma-8B is a multimodal agentic AI model. It can generate text based on input text and images. The model is developed for research, aiming to share knowledge and speed up research in multimodal AI.
Quick Start
To start using the model, first ensure that transformers and torch are installed, and also install the following dependencies:
pip install torchvision Pillow open_clip_torch
Important Note
You need to install our customized transformers library before running the example below.
import torch
from PIL import Image
from io import BytesIO
import requests
from transformers import AutoModelForCausalLM, AutoProcessor
# Load the model and processor
dtype = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda")
# Inference
url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(BytesIO(requests.get(url, stream=True).content))
image = image.convert("RGB")
convs = [
    {"role": "system", "content": "You are agent that can see, talk and act."},
    {"role": "user", "content": "<image_start><image><image_end>\nWhat is in this image?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], texts=prompt, return_tensors="pt")
# Add a batch dimension to the visual inputs
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to("cuda").to(dtype)
generation_args = {
    "max_new_tokens": 128,
    "temperature": 0.0,
    "do_sample": False,
    "use_cache": True,
    "num_beams": 1,
}
with torch.inference_mode():
    generate_ids = model.generate(**inputs, **generation_args)

# Strip the prompt tokens and decode only the newly generated text
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1]:]
response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
print(response)
⨠Features
Agents
UI Navigation
What's weather in Seattle? & turn on flight mode
Share and message this to Bob Steve. Click send button
Robot Manipulation
Pick Place Hotdog Sausage
Put Mushroom Place Pot
Push Cloth Left to Right (Out-of-Dist.)
Gaming
Task: Model controls the robot to collect green blocks.
Magma vs. LLaVA-OneVision
Magma vs. GPT-4o-mini
Model Details
Model Description
Magma is a multimodal agentic AI model that can generate text based on input text and images. The model is designed for research purposes and aimed at knowledge-sharing and accelerating research in multimodal AI, in particular multimodal agentic AI. Its main innovation lies in two techniques, Set-of-Mark and Trace-of-Mark, and in leveraging a large amount of unlabeled video data to learn spatial-temporal grounding and planning. Please refer to our paper for more technical details.
Highlights
Digital and Physical Worlds: Magma is the first foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments!
Versatile Capabilities: As a single model, Magma not only possesses generic image and video understanding ability but also generates goal-driven visual plans and actions, making it versatile for different agentic tasks!
State-of-the-art Performance: Magma achieves state-of-the-art performance on various multimodal tasks, including UI navigation, robotic manipulation, and generic image and video understanding, in particular spatial understanding and reasoning!
Scalable Pretraining Strategy: Magma is designed to learn scalably from unlabeled videos in the wild in addition to existing agentic data, giving it strong generalization ability and making it suitable for real-world applications!
The data collection process involved sourcing information from publicly available documents, with a meticulous approach to filtering out undesirable documents and images. To safeguard privacy, we carefully filtered various image and text data sources to remove or scrub any potentially personal data from the training data.
In addition to the text-related preprocessing, we mainly undertake the following image and video preprocessing steps:
UI Grounding and Navigation Data: For each UI screenshot, we extract the bounding boxes for the UI elements, and apply Set-of-Mark Prompting to overlay numeric marks on the raw image. The model is trained to generate the UI grounding text based on the image and the Set-of-Mark prompts.
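To make the Set-of-Mark overlay concrete, here is a minimal sketch that draws numeric marks on pre-extracted UI bounding boxes. The function name, box format, and drawing style are assumptions for illustration, not the exact tooling used to build Magma's training data.

from PIL import Image, ImageDraw

def overlay_set_of_mark(image, boxes):
    """Draw a numeric mark (1, 2, ...) and an outline for each UI element box."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(idx), fill="red")
    return marked

# Usage: the numeric marks become the targets the model refers to in its grounding text.
screenshot = Image.open("screenshot.png").convert("RGB")
marked = overlay_set_of_mark(screenshot, [(10, 20, 120, 60), (10, 80, 120, 120)])
marked.save("screenshot_som.png")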
Instruction Video Data: For each video clip, we apply Co-Tracker to extract grid traces and then apply a filtering algorithm to remove noisy or static points. For videos with camera motion, we further apply a homography transformation to stabilize the clips. Finally, we assign a numeric mark to each trace, which gives us a set of trace-of-marks. The model is trained to generate the trace-of-mark given the video clips and instructional text.
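As a rough illustration of the trace filtering described above, the sketch below drops point tracks whose accumulated motion is negligible. It assumes the tracker output is a NumPy array of shape (num_points, num_frames, 2); the threshold and function name are made up for the example and are not the values used in Magma's pipeline.

import numpy as np

def filter_static_traces(traces, min_displacement=2.0):
    """Keep traces whose summed frame-to-frame motion exceeds a pixel threshold."""
    step_motion = np.linalg.norm(np.diff(traces, axis=1), axis=-1)  # (num_points, num_frames - 1)
    total_motion = step_motion.sum(axis=1)                          # (num_points,)
    return traces[total_motion > min_displacement]

# Each surviving trace is then given a numeric mark; the model is trained to
# predict these future trajectories from the video frames and the instruction.
traces = np.random.rand(64, 16, 2) * 100   # placeholder Co-Tracker-style output
moving = filter_static_traces(traces)
trace_of_marks = {mark: trace for mark, trace in enumerate(moving, start=1)}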
Robotics Manipulation Data: For robotics data in Open-X Embodiment, we extract the 7-DoF robot gripper state as well as the trace-of-mark from the video clips. The same filtering and stabilization steps are applied to the video clips. The model is trained to generate the robot manipulation action and the trace-of-mark given the video clips and instructional text.
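For intuition, the 7-DoF gripper state mentioned above can be thought of as a translation delta, a rotation delta, and a gripper open/close value. The field names and ordering below are assumptions for illustration; see the paper and Open-X Embodiment for the actual action format.

from dataclasses import dataclass

@dataclass
class GripperAction:
    """Illustrative 7-DoF end-effector action (field names are assumptions)."""
    dx: float       # translation deltas
    dy: float
    dz: float
    droll: float    # rotation deltas
    dpitch: float
    dyaw: float
    gripper: float  # e.g. 1.0 = open, 0.0 = closed

    def to_vector(self):
        return [self.dx, self.dy, self.dz, self.droll, self.dpitch, self.dyaw, self.gripper]

# One training target pairs an action like this with the trace-of-mark for the clip.
action = GripperAction(0.01, -0.02, 0.0, 0.0, 0.0, 0.05, 1.0)
print(action.to_vector())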
After all this preprocessing, we combine the results with existing text annotations to form our final multimodal training data. We refer to our paper for more technical details.
Training Hyperparameters
We used bf16 mixed precision for training on H100s and MI300s. We used the following hyperparameters for training:
Batch size: 1024
Learning rate: 1e-5
Max sequence length: 4096
Resolution: at most 1024x1024 for images, 512x512 for video frames.
Pretraining Epochs: 3
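For reference, the settings above can be collected into a single configuration object; the key names below are illustrative rather than the actual training-script arguments.

# Pretraining hyperparameters from the list above, as one illustrative config dict.
pretraining_config = {
    "precision": "bf16",                      # mixed precision on H100 / MI300
    "global_batch_size": 1024,
    "learning_rate": 1e-5,
    "max_seq_length": 4096,
    "max_image_resolution": (1024, 1024),
    "max_video_frame_resolution": (512, 512),
    "num_epochs": 3,
}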
Evaluation
Testing Data, Factors & Metrics
Zero-shot Testing Data
We evaluate the model's zero-shot performance on the benchmarks listed in the results table below. We follow each dataset's own evaluation metrics; please refer to the original datasets for more details.
Results on Agentic Intelligence
Zero-shot evaluation on agentic intelligence. We report the results for pretrained Magma without any domain-specific finetuning. Magma is the only model that can conduct the full task spectrum.
| Model | VQAv2 | TextVQA | POPE | SS-Mobile | SS-Desktop | SS-Web | VWB-Ele-G | VWB-Act-G | SE-Google Robot | SE-Bridge |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4V | 77.2 | 78.0 | n/a | 23.6 | 16.0 | 9.0 | 67.5 | 75.7 | - | - |
| GPT-4V-OmniParser | n/a | n/a | n/a | 71.1 | 45.6 | 58.5 | - | - | - | - |
| LLaVA-1.5 | 78.5 | 58.2 | 85.9 | - | - | - | 12.1 | 13.6 | - | - |
| LLaVA-Next | 81.3 | 64.9 | 86.5 | - | - | - | 15.0 | 8.7 | - | - |
| Qwen-VL | 78.8 | 63.8 | n/a | 6.2 | 6.3 | 3.0 | 14.0 | 0.7 | - | - |
| Qwen-VL-Chat | 78.2 | 61.5 | n/a | - | - | - | - | - | - | - |
| Fuyu | 74.2 | n/a | n/a | 21.2 | 20.8 | 19.2 | 19.4 | 15.5 | - | - |
| SeeClick | - | - | - | 65.0 | 51.1 | 44.1 | 9.9 | 1.9 | - | - |
| Octo | - | - | - | - | - | - | - | - | - | - |
| RT-1-X | - | - | - | - | - | - | - | - | 6.0 | 15.9 |
| OpenVLA | - | - | - | - | - | - | - | - | 34.2 | 1.1 |
| Magma-8B | 80.0 | 66.5 | 87.4 | 59.5 | 64.1 | 60.6 | 96.3 | 71.8 | 52.3 | 35.4 |
Notes: SS - ScreenSpot, VWB - VisualWebArena, SE - SimplerEnv
Technical Specifications
Model Architecture and Objective
Language Model: We use Meta Llama-3 as the backbone LLM.
Vision Encoder: We use CLIP-ConvNeXt-XXLarge, trained by the LAION team, as the vision encoder to tokenize images and videos.
The overall pipeline follows common practice for multimodal LLMs: the vision encoder tokenizes the images and videos, and the resulting visual tokens are fed into the LLM alongside the textual tokens to generate text outputs.
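Conceptually, the forward pass can be sketched as below; the class and module names are placeholders following the common vision-encoder-plus-LLM pattern, not Magma's actual implementation.

import torch
import torch.nn as nn

class MultimodalPipeline(nn.Module):
    """Placeholder sketch of a vision-encoder + projector + LLM pipeline."""
    def __init__(self, vision_encoder, projector, llm):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a CLIP-ConvNeXt backbone
        self.projector = projector            # maps visual features into the LLM embedding space
        self.llm = llm                        # e.g. a Llama-3 style decoder (HF-style interface assumed)

    def forward(self, pixel_values, input_ids):
        # Tokenize the image into visual tokens, then feed them to the LLM with the text tokens.
        visual_tokens = self.projector(self.vision_encoder(pixel_values))   # (B, N_vis, D)
        text_embeds = self.llm.get_input_embeddings()(input_ids)            # (B, N_txt, D)
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds).logits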
The model was developed by Microsoft and funded by Microsoft Research. It is shared by Microsoft Research and licensed under the MIT License.
Intended Uses
This model is intended for broad research use in English. It is designed only for research purposes and aimed at knowledge-sharing and accelerating research in multimodal AI.