Microsoft OmniParser Open Source: Screen Parsing Surpasses GPT-4V, Revolutionizing PC Control Agents!


Since Anthropic's late-night release of a major update introducing computer use, a "super agent" capability, the idea of "controlling computers like humans" has taken off. Companies such as Zhipu have followed with AutoGLM, which automates computer and phone operations from a single command.

Microsoft OmniParser

Super Agent: Controlling Computers Like Humans!

Recently, Microsoft open-sourced OmniParser, specifically designed for parsing computer and mobile phone screen UIs. It is claimed to outperform GPT-4V in relevant screen understanding benchmarks.

OmniParser is a general-purpose screen parsing tool that converts screenshots of user interfaces (UIs) into a structured format, improving the performance of UI agents built on existing large language models (LLMs). The training data includes:

An interactive icon detection dataset, collected from popular web pages and automatically labeled to highlight clickable and actionable areas.
An icon description dataset, aimed at associating each UI element with its corresponding function.
The released models include a YOLOv8 model fine-tuned on the icon detection dataset and a BLIP-2 model fine-tuned on the icon description dataset.
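
The two-stage design described above (detect interactable regions, then describe each one) can be sketched roughly as follows. This is a minimal illustration and not OmniParser's actual code: the names `ParsedElement`, `parse_screen`, `fake_detect`, and `fake_describe` are hypothetical, and the two stand-in functions only mimic where a detection model and a captioning model would be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Bounding box as (x1, y1, x2, y2) in pixels. This layout is an assumption.
Box = Tuple[int, int, int, int]


@dataclass
class ParsedElement:
    """One detected UI element (hypothetical structure for illustration)."""
    element_id: int
    box: Box
    description: str


def parse_screen(
    screenshot: object,
    detect: Callable[[object], List[Box]],
    describe: Callable[[object, Box], str],
) -> List[ParsedElement]:
    """Stage 1: find interactable regions. Stage 2: describe each region."""
    boxes = detect(screenshot)
    return [
        ParsedElement(element_id=i, box=box, description=describe(screenshot, box))
        for i, box in enumerate(boxes)
    ]


# Stand-ins for the detection and description models (not real models).
def fake_detect(screenshot: object) -> List[Box]:
    return [(10, 10, 42, 42), (100, 20, 180, 52)]


def fake_describe(screenshot: object, box: Box) -> str:
    return f"icon at {box[0]},{box[1]}"


elements = parse_screen("screenshot.png", fake_detect, fake_describe)
for element in elements:
    print(element.element_id, element.box, element.description)
```

Passing the two models in as plain callables keeps the sketch independent of any particular model library; a real pipeline would replace the stand-ins with actual inference calls.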

Example of OmniParser's parsed screenshot image and local semantics. The input to OmniParser is a user task and a UI screenshot, from which it produces: 1) a parsed screenshot image with overlaid bounding boxes and numeric IDs, and 2) local semantics containing the extracted text and icon descriptions.
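
To make the "local semantics" output concrete, here is a small sketch of how extracted text and icon descriptions, keyed by the numeric IDs drawn on the screenshot, might be flattened into a text prompt for an LLM. The function names and the exact line format are assumptions made for illustration, not the format OmniParser is confirmed to emit.

```python
from typing import Dict, List


def format_local_semantics(
    text_boxes: Dict[int, str], icon_boxes: Dict[int, str]
) -> str:
    """Render ID-keyed text and icon descriptions, one element per line."""
    lines: List[str] = []
    for box_id in sorted(text_boxes):
        lines.append(f"Text Box ID {box_id}: {text_boxes[box_id]}")
    for box_id in sorted(icon_boxes):
        lines.append(f"Icon Box ID {box_id}: {icon_boxes[box_id]}")
    return "\n".join(lines)


def build_prompt(
    task: str, text_boxes: Dict[int, str], icon_boxes: Dict[int, str]
) -> str:
    """Combine the user task with the local semantics (hypothetical layout)."""
    semantics = format_local_semantics(text_boxes, icon_boxes)
    return f"Task: {task}\nScreen elements:\n{semantics}"


prompt = build_prompt(
    "Open settings",
    text_boxes={0: "Settings"},
    icon_boxes={1: "gear icon, opens settings"},
)
print(prompt)
```

Because each line carries the same numeric ID that is overlaid on the screenshot, the LLM can answer with an ID rather than with pixel coordinates.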

OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and AITW benchmarks, OmniParser, using only screenshot inputs, outperforms GPT-4V baselines that require additional information beyond the screenshot.

Example from the SeeAssign evaluation. It shows that fine-grained local semantics improve GPT-4V's ability to assign the correct label to the referenced icon.

OmniParser for Pure Vision Based GUI Agent
https://arxiv.org/abs/2408.00203
https://github.com/microsoft/OmniParser