OmniParser: Open-Source Screen Parsing by Microsoft Research

OmniParser is an open-source screen parsing tool developed by Microsoft Research and Microsoft Gen AI. It converts user interface screenshots into structured, easy-to-understand elements, enabling large language models and vision-language models to serve as GUI automation agents. By bridging the gap between raw pixel data and structured element representations, OmniParser helps AI systems reliably identify interactable icons, understand the semantics of screen elements, and ground actions accurately to the correct regions on the interface.

How It Works

OmniParser operates in two stages. First, an interactable region detection model identifies clickable and actionable areas on the screen and returns a bounding box for each one. This detection model is a fine-tuned version of YOLOv8, trained on a curated dataset of 67,000 unique screenshot images from popular webpages, with bounding-box labels derived from DOM tree annotations. Second, a caption model based on BLIP-2 or Florence extracts the functional semantics of each detected element, describing what each icon or UI component does. The structured output includes both the location coordinates and a textual description of each interactable element on the screen.
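
The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration of the detect-then-caption structure, not OmniParser's actual code: the names ScreenElement, parse_screen, fake_detect, and fake_caption are hypothetical, the box layout (x_min, y_min, x_max, y_max) is an assumption, and the two stand-in functions return fixed values where the real system would run a YOLOv8 detector and a BLIP-2 or Florence captioner.

```python
from dataclasses import dataclass, asdict
from typing import Callable, List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class ScreenElement:
    """One interactable element: where it is and what it does (hypothetical schema)."""
    element_id: int
    bbox: Box
    description: str


def parse_screen(
    screenshot: object,
    detect: Callable[[object], List[Box]],
    caption: Callable[[object, Box], str],
) -> List[ScreenElement]:
    """Stage 1: detect interactable regions. Stage 2: caption each region."""
    boxes = detect(screenshot)
    return [
        ScreenElement(element_id=i, bbox=box, description=caption(screenshot, box))
        for i, box in enumerate(boxes)
    ]


# Toy stand-ins for the two models; a real pipeline would run inference here.
def fake_detect(screenshot: object) -> List[Box]:
    return [(10, 10, 42, 42), (60, 10, 180, 42)]


def fake_caption(screenshot: object, box: Box) -> str:
    labels = {(10, 10, 42, 42): "open settings", (60, 10, 180, 42): "search field"}
    return labels[box]


elements = parse_screen(None, fake_detect, fake_caption)
for element in elements:
    print(asdict(element))
```

Passing the detector and captioner in as plain callables keeps the sketch independent of any particular model library, which mirrors the idea that the caption stage can be swapped between BLIP-2 and Florence.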

Key Capabilities

Converts any UI screenshot into structured JSON-like element lists with bounding boxes and functional descriptions
Works across operating systems and applications, including PC, phone, and web interfaces
Detects small, fine-grained icons with high accuracy, with further gains in OmniParser V2
Reduces latency by 60 percent in V2 compared to the previous version by shrinking the input image size of the icon caption model
Achieves state-of-the-art accuracy on screen grounding benchmarks, including ScreenSpot Pro, when paired with a capable model such as GPT-4o
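
To make the "structured JSON-like element list" concrete, here is a minimal sketch of what such a list might look like and how it could be mapped onto a specific screen size. The field names (id, bbox, description) and the use of normalized 0-to-1 coordinates are assumptions made for illustration, and to_pixels is a hypothetical helper rather than part of OmniParser.

```python
import json
from typing import Dict, List


def to_pixels(bbox_norm: List[float], width: int, height: int) -> List[int]:
    """Scale an assumed normalized [x_min, y_min, x_max, y_max] box to pixel units."""
    x0, y0, x1, y1 = bbox_norm
    return [round(x0 * width), round(y0 * height), round(x1 * width), round(y1 * height)]


# Hypothetical parsed output for a single screenshot.
elements: List[Dict] = [
    {"id": 0, "bbox": [0.25, 0.5, 0.5, 0.75], "description": "submit the form"},
    {"id": 1, "bbox": [0.0, 0.0, 0.125, 0.25], "description": "go back to the previous page"},
]

# The list round-trips through JSON, so it can be logged or sent to a model as text.
payload = json.dumps(elements, indent=2)
restored = json.loads(payload)

# The same normalized box maps to different pixel boxes on different screens.
desktop_box = to_pixels(restored[0]["bbox"], width=1920, height=1080)
phone_box = to_pixels(restored[0]["bbox"], width=400, height=800)
print(desktop_box)
print(phone_box)
```

Keeping coordinates normalized in the stored list is one simple way to reuse the same element description across PC, phone, and web screenshots of different resolutions.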

Use Cases

GUI automation agents that browse web pages, fill forms, and navigate desktop applications
Building training data pipelines for domain-specific agents from logged OmniParser trajectories
Building AI assistants that can understand and act upon visual interfaces without needing DOM or accessibility tree access
Research into vision-based agent systems and multimodal interaction benchmarks
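
As a rough illustration of how an agent might act on a parsed screen without DOM or accessibility tree access, the sketch below picks an element by its description and turns its bounding box into a click point. The element schema, the pixel-box convention, and the helpers find_by_description and click_point are all hypothetical; in a real agent the element choice would come from a language model rather than a keyword match.

```python
from typing import Dict, List, Tuple

# Hypothetical parsed elements with pixel boxes as [x_min, y_min, x_max, y_max].
elements: List[Dict] = [
    {"id": 0, "bbox": [100, 200, 300, 260], "description": "username input field"},
    {"id": 1, "bbox": [100, 300, 200, 340], "description": "sign in button"},
]


def find_by_description(elements: List[Dict], keyword: str) -> int:
    """Return the id of the first element whose description contains the keyword."""
    for element in elements:
        if keyword.lower() in element["description"].lower():
            return element["id"]
    raise LookupError(f"no element matches {keyword!r}")


def click_point(elements: List[Dict], element_id: int) -> Tuple[int, int]:
    """Ground an action on an element by returning the center of its bounding box."""
    for element in elements:
        if element["id"] == element_id:
            x0, y0, x1, y1 = element["bbox"]
            return ((x0 + x1) // 2, (y0 + y1) // 2)
    raise LookupError(f"unknown element id {element_id}")


target_id = find_by_description(elements, "sign in")
target_point = click_point(elements, target_id)
print(target_id, target_point)
```

Having the model answer with an element id, and resolving the id to coordinates in code, avoids asking the model to predict raw pixel positions directly.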

Integration and Ecosystem

OmniParser is designed as a plugin component that works with multiple large language models and vision-language models, including GPT-4o, GPT-4V, Phi-3.5-V, Llama-3.2-V, DeepSeek R1, Qwen 2.5 VL, and Anthropic Claude. Microsoft provides OmniTool, a Dockerized Windows 11 environment that bundles OmniParser with essential agent tools for out-of-the-box experimentation. A Hugging Face Space demo is available for trying OmniParser interactively. The model checkpoints are hosted on the Hugging Face model hub under mixed licensing: the icon detection model is released under AGPL, while the icon caption models are released under MIT.