Microsoft OmniParser Open Source: Screen Parsing Surpasses GPT-4V, Revolutionizing PC Control Agents!



Since Anthropic's late-night release of a major update introducing computer use, billed as a "super agent," the trend of "controlling computers like humans" has surged. Companies such as Zhipu have followed with AutoGLM, which automates computer and phone operations from a single command.

Microsoft OmniParser

Super Agent: Controlling Computers Like Humans!

Recently, Microsoft open-sourced OmniParser, a tool designed for parsing computer and mobile phone screen UIs. It is reported to outperform GPT-4V baselines on screen-understanding benchmarks.

OmniParser is a general-purpose screen parsing tool that converts screenshots of user interfaces (UIs) into a structured format, improving the performance of UI agents built on existing large language models (LLMs). Its training data includes:

  • An interactive icon detection dataset, collected from popular web pages and automatically labeled to highlight clickable and actionable areas.
  • An icon description dataset, aimed at associating each UI element with its corresponding function.

The model hub includes a version of YOLOv8 and a BLIP-2 model, both fine-tuned on the datasets above.

Example of a screenshot parsed by OmniParser, together with its local semantics. OmniParser takes a user task and a UI screenshot as input and produces: 1) a parsed screenshot image with overlaid bounding boxes and numeric IDs, and 2) local semantics containing extracted text and icon descriptions.
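To make the two outputs more concrete, here is a minimal Python sketch of how such a parse result might be represented and handed to an LLM. It is an illustration only, not OmniParser's actual API: the `ParsedElement` schema, the `local_semantics` and `click_point` helpers, the "Text Box ID" / "Icon Box ID" line format, and the sample elements are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParsedElement:
    """One detected UI element (hypothetical schema, not OmniParser's own)."""
    element_id: int                  # numeric ID overlaid on the screenshot
    kind: str                        # "text" (extracted text) or "icon"
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    content: str                     # extracted text or icon description


def local_semantics(elements: List[ParsedElement]) -> str:
    """Render the elements as plain text lines for an LLM prompt."""
    lines = []
    for el in elements:
        label = "Text Box" if el.kind == "text" else "Icon Box"
        lines.append(f"{label} ID {el.element_id}: {el.content}")
    return "\n".join(lines)


def click_point(elements: List[ParsedElement], element_id: int) -> Tuple[int, int]:
    """Map an ID chosen by the LLM back to the center of its bounding box."""
    for el in elements:
        if el.element_id == element_id:
            x1, y1, x2, y2 = el.bbox
            return ((x1 + x2) // 2, (y1 + y2) // 2)
    raise KeyError(f"no element with ID {element_id}")


# Made-up sample data standing in for a real parse result.
elements = [
    ParsedElement(0, "text", (10, 10, 110, 30), "Settings"),
    ParsedElement(1, "icon", (200, 40, 240, 80), "a magnifying glass search button"),
]

print(local_semantics(elements))
print(click_point(elements, 1))
```

In this sketch the LLM only has to answer with a numeric ID, and the agent code resolves that ID to screen coordinates, so the model never has to predict pixel positions directly.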

OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and AITW benchmarks, OmniParser, using only screenshot inputs, outperforms GPT-4V baselines that require additional information beyond screenshots.

Example from the SeeAssign evaluation, showing that fine-grained local semantics improve GPT-4V's ability to assign the correct labels to the referenced icons.

OmniParser for Pure Vision Based GUI Agent
https://arxiv.org/abs/2408.00203
https://github.com/microsoft/OmniParser