OmniParser-v2.0 Open-Source Screen Parsing Tool - Free Conversion of UI Screenshots to Structured Formats

OmniParser V2.0

Developed by Microsoft

OmniParser is a universal screen parsing tool that converts UI screenshots into structured formats to improve the performance of LLM-based UI agents.

Image-to-Text

Transformers

Open Source License: MIT (icon_caption model; the icon_detect model is under AGPL) #UI Element Parsing #Low Latency Processing #Multimodal Agent

Downloads 6,729

Release Time: 2/12/2025

Model Overview

OmniParser is designed to transform unstructured screenshot images into structured element lists, including interactive area locations and potential functional descriptions of icons. It is suitable for various types of screenshots (including PC and mobile) and multiple application scenarios.
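To make "structured element list" concrete, the sketch below models one possible shape for such output in Python. The `ParsedElement` class, its field names, the normalized bounding-box convention, and the `to_prompt_lines` helper are all assumptions made for illustration; they are not OmniParser's actual output schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParsedElement:
    """Hypothetical record for one detected UI element (not OmniParser's real schema)."""
    element_id: int
    kind: str                                # assumed labels, e.g. "icon" or "text"
    bbox: Tuple[float, float, float, float]  # assumed normalized (x1, y1, x2, y2) in [0, 1]
    interactable: bool
    description: str                         # caption of the element's likely function


def to_prompt_lines(elements: List[ParsedElement]) -> List[str]:
    """Render elements as short text lines that could be placed in an LLM prompt."""
    lines = []
    for e in elements:
        tag = "interactable" if e.interactable else "static"
        box = ", ".join(f"{v:.2f}" for v in e.bbox)
        lines.append(f"[{e.element_id}] {e.kind} ({tag}) at ({box}): {e.description}")
    return lines


# Made-up example data, only to show the shape of the list.
example_elements = [
    ParsedElement(0, "icon", (0.10, 0.05, 0.14, 0.09), True, "Open settings menu"),
    ParsedElement(1, "text", (0.20, 0.05, 0.60, 0.09), False, "Inbox"),
]

for line in to_prompt_lines(example_elements):
    print(line)
```

A flat, numbered list like this lets an LLM refer to an element by its identifier instead of reasoning about raw pixels.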

Model Features

Efficient Parsing

Compared to V1, latency is reduced by 60%, averaging 0.6 s/frame on an A100 and 0.8 s/frame on a single RTX 4090.

Large-scale Dataset

The training data comprises an interactive icon detection dataset and an icon description dataset, both larger and cleaner than those used for V1.

Strong Performance

Achieves an average accuracy of 39.6 on ScreenSpot Pro.

Multi-model Support

Out-of-the-box support, via OmniTool, for models from OpenAI (4o/o1/o3-mini), DeepSeek (R1), and Qwen (2.5VL), as well as Anthropic Computer Use.

Model Capabilities

UI Screenshot Parsing

Interactive Area Detection

Icon Function Description

Structured Data Conversion

Use Cases

UI Agent Development

LLM-based GUI Agent

Control a Windows 11 virtual machine using OmniParser + a chosen vision model.

Enhances the agent's understanding and operational capabilities for UIs.
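For an agent to act on a detected region, it typically needs a concrete screen coordinate. The sketch below computes the pixel center of a bounding box, assuming the box is given as normalized `(x1, y1, x2, y2)` values in `[0, 1]`. That convention and the `click_point` helper are illustrative assumptions, not a documented OmniParser or OmniTool API.

```python
from typing import Tuple


def click_point(bbox: Tuple[float, float, float, float],
                screen_width: int, screen_height: int) -> Tuple[int, int]:
    """Return the pixel center of a normalized (x1, y1, x2, y2) box (assumed format)."""
    x1, y1, x2, y2 = bbox
    if not (0.0 <= x1 <= x2 <= 1.0 and 0.0 <= y1 <= y2 <= 1.0):
        raise ValueError(f"bbox is not a valid normalized box: {bbox}")
    # Scale the box center from normalized units to pixels.
    cx = int((x1 + x2) / 2 * screen_width)
    cy = int((y1 + y2) / 2 * screen_height)
    return cx, cy


print(click_point((0.25, 0.5, 0.75, 1.0), 1920, 1080))  # (960, 810)
```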

Automated Testing

UI Element Detection

Automatically detect and describe interactive elements in applications.

Improves test coverage and efficiency.
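As a rough illustration of how parsed output could feed a UI test, the sketch below checks that every expected control appears among the interactable elements. The dictionary keys (`description`, `interactable`) and the `missing_elements` helper are hypothetical and do not reflect OmniParser's real output format.

```python
from typing import Dict, List


def missing_elements(parsed: List[Dict[str, object]], expected: List[str]) -> List[str]:
    """Return the expected names that match no interactable element's description."""
    found = {
        str(e["description"]).lower()
        for e in parsed
        if e.get("interactable")
    }
    # A name counts as present if it is a substring of any description.
    return [name for name in expected
            if not any(name.lower() in desc for desc in found)]


# Made-up parser output for a single screen.
parsed_screen = [
    {"description": "Submit form button", "interactable": True},
    {"description": "Page title", "interactable": False},
]

print(missing_elements(parsed_screen, ["submit", "cancel"]))  # ['cancel']
```

Substring matching on captions is deliberately loose here; a real test suite would likely need a stricter matching rule.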

🚀 OmniParser

OmniParser is a general screen parsing tool that interprets and converts UI screenshots into a structured format. It aims to enhance existing LLM-based UI agents, offering a practical solution for handling unstructured screenshot data.

📢 [GitHub Repo] [OmniParser V2 Blog Post] [Hugging Face Demo]

🚀 Quick Start

The model hub includes a finetuned version of YOLOv8 and a finetuned Florence-2 base model. For more details on the models used and the finetuning, please refer to the paper.

✨ Features

Model Summary

OmniParser is a general screen parsing tool that interprets and converts UI screenshots into a structured format to improve existing LLM-based UI agents. The training datasets include:

An interactable icon detection dataset, which was curated from popular web pages and automatically annotated to highlight clickable and actionable regions.
An icon description dataset, designed to associate each UI element with its corresponding function.

What's new in V2?

A larger and cleaner icon caption and grounding dataset.
60% lower latency than V1. Average latency: 0.6 s/frame on an A100 and 0.8 s/frame on a single RTX 4090.
Strong performance: 39.6 average accuracy on ScreenSpot Pro.
Your agent only needs one tool: OmniTool. Control a Windows 11 VM with OmniParser + your vision model of choice. OmniTool supports the following large language models out of the box: OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL), and Anthropic Computer Use. Check out our GitHub repo for details.
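The latency figures above can be put in more familiar terms with a little arithmetic. The sketch below converts seconds per frame to frames per second and back-computes the V1 latency that a 60% reduction would imply. The source does not state V1 latencies, so the baseline values are an inference that assumes "60%" means the new latency is 40% of the old one on each device.

```python
def frames_per_second(seconds_per_frame: float) -> float:
    """Throughput implied by a per-frame latency."""
    return 1.0 / seconds_per_frame


def implied_baseline_latency(new_latency: float, reduction: float) -> float:
    """Latency before a fractional reduction, assuming new = old * (1 - reduction)."""
    return new_latency / (1.0 - reduction)


# Latencies quoted in the text; the 0.60 reduction is applied per device as an assumption.
for device, latency in {"A100": 0.6, "RTX 4090": 0.8}.items():
    fps = round(frames_per_second(latency), 2)
    baseline = round(implied_baseline_latency(latency, 0.60), 2)
    print(f"{device}: {fps} frames/s, implied V1 latency about {baseline} s/frame")
```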

Responsible AI Considerations

Intended Use

OmniParser is designed to convert an unstructured screenshot image into a structured list of elements, including the locations of interactable regions and captions describing the potential functionality of icons.
OmniParser is intended for settings where users are already trained in responsible analytic approaches and critical reasoning is expected. OmniParser can extract information from a screenshot, but human judgement is needed when using its output.
OmniParser is intended to be used on a variety of screenshots, from both PC and phone, and across a variety of applications.

Limitations

OmniParser is designed to faithfully convert a screenshot image into structured elements describing the interactable regions and semantics of the screen. It does not detect harmful content in its input; as with any LLM, users are free to decide what input to provide, and they are expected to provide input to OmniParser that is not harmful.
Although OmniParser only converts screenshot images into text, it can be used to build an actionable, LLM-based GUI agent. Developers who build and operate an agent using OmniParser need to act responsibly and follow common safety standards.
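One way a developer might keep human judgement in the loop is to route each proposed agent action through a review step before it runs. The sketch below is a minimal illustration of that idea. The action dictionary, the `execute_with_review` function, and the `cautious_reviewer` callback are hypothetical and are not part of OmniParser or OmniTool.

```python
from typing import Callable, Dict, List


def execute_with_review(action: Dict[str, str],
                        approve: Callable[[Dict[str, str]], bool],
                        perform: Callable[[Dict[str, str]], None]) -> bool:
    """Run perform(action) only if the reviewer callback approves it."""
    if not approve(action):
        return False
    perform(action)
    return True


def cautious_reviewer(action: Dict[str, str]) -> bool:
    """Stand-in reviewer: rejects any action whose target mentions 'delete'."""
    return "delete" not in action["target"].lower()


performed: List[Dict[str, str]] = []  # records the actions that were actually run
proposed = {"type": "click", "target": "Delete account button"}

print(execute_with_review(proposed, cautious_reviewer, performed.append))  # False
```

In practice the `approve` callback could prompt a person rather than apply a keyword rule; the keyword check here only keeps the example self-contained.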

📄 License

Please note that the icon_detect model is under the AGPL license, and the icon_caption model is under the MIT license. Please refer to the LICENSE file in the folder of each model.

Property	Details
Library Name	transformers
Model Type	YOLOv8 (finetuned), Florence-2 (finetuned)
Training Data	Interactable icon detection dataset, icon description dataset
License	icon_detect: AGPL; icon_caption: MIT
Tags	endpoint-template, custom_code
