Microsoft OmniParser Open-Sourced: Screen Parsing Surpasses GPT-4V, Enhancing Computer Control Agents!
Since Anthropic's late-night release of a major update introducing its "super agent" capability, computer use, the trend of "controlling computers like humans" has surged. Companies such as Zhipu have followed with AutoGLM, which automates computer and phone operations from a single command.
Super Agent: Controlling Computers Like Humans!
Recently, Microsoft open-sourced OmniParser, a tool designed specifically for parsing computer and mobile phone screen UIs. It is reported to outperform GPT-4V baselines on screen understanding benchmarks.
OmniParser is a general-purpose screen parsing tool that converts user interface (UI) screenshots into a structured format, improving the performance of UI agents built on existing large language models (LLMs). Its training data includes:
- An interactive icon detection dataset, collected from popular web pages and automatically labeled to highlight clickable and actionable areas.
- An icon description dataset, aimed at associating each UI element with its corresponding function.
The model hub includes a YOLOv8 model and a BLIP-2 model, fine-tuned on the icon detection and icon description datasets above, respectively.
Example of OmniParser's parsed screenshot images and local semantics. Given a user task and a UI screenshot as input, OmniParser produces: 1) a parsed screenshot image with overlaid bounding boxes and numeric IDs, and 2) local semantics containing the extracted text and icon descriptions.
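To make the two outputs more concrete, here is a minimal Python sketch of how the parsed elements and their local semantics could be represented and turned into text for an LLM prompt. It is an illustration only: `ParsedElement`, `format_local_semantics`, `build_prompt`, and the "Text Box ID" / "Icon Box ID" line format are hypothetical names chosen for this example, not OmniParser's actual API or output schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParsedElement:
    """One detected UI element (hypothetical schema, not OmniParser's own)."""
    element_id: int                          # numeric ID overlaid on the screenshot
    kind: str                                # "text" (OCR result) or "icon"
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    content: str                             # extracted text or icon description


def format_local_semantics(elements: List[ParsedElement]) -> str:
    """Render one line per element, ordered by its numeric ID."""
    lines = []
    for el in sorted(elements, key=lambda e: e.element_id):
        label = "Text Box" if el.kind == "text" else "Icon Box"
        lines.append(f"{label} ID {el.element_id}: {el.content}")
    return "\n".join(lines)


def build_prompt(task: str, elements: List[ParsedElement]) -> str:
    """Combine the user task with the local semantics into a single prompt.

    The annotated screenshot itself would be sent alongside this text;
    that step is omitted here.
    """
    return (
        f"Task: {task}\n"
        "Elements on screen:\n"
        f"{format_local_semantics(elements)}\n"
        "Reply with the ID of the element to act on."
    )


if __name__ == "__main__":
    # Made-up elements for illustration only.
    demo = [
        ParsedElement(1, "icon", (300.0, 12.0, 324.0, 36.0), "settings gear"),
        ParsedElement(0, "text", (20.0, 10.0, 120.0, 40.0), "Sign in"),
    ]
    print(build_prompt("Open the settings page", demo))
```

With this kind of representation, the model only has to pick a numeric ID rather than predict pixel coordinates; the agent can then look up the chosen element's bounding box to perform the click.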
OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and AITW benchmarks, OmniParser with screenshot-only input outperforms GPT-4V baselines that require additional information beyond the screenshot.
Example from the SeeAssign evaluation, showing that fine-grained local semantics improve GPT-4V's ability to assign the correct label to the referenced icon.
OmniParser for Pure Vision Based GUI Agent
https://arxiv.org/abs/2408.00203
https://github.com/microsoft/OmniParser