GUI Automation Agent

An intelligent desktop automation agent based on the OSWorld architecture, combining vision-language model (VLM) reasoning with precise GUI interaction.

Table of contents

  1. System Overview
    1. Purpose
    2. Key Features
    3. Technical Highlights
  2. Architecture
    1. System Architecture Diagram
    2. OSWorld Core Concepts
  3. Module Documentation
  4. Quick Start
    1. Prerequisites
    2. Basic Usage
    3. Example Tasks
  5. Supported Models
  6. Related Resources

System Overview

Purpose

The GUI Automation Agent enables intelligent desktop task automation by combining vision-language models (VLMs) with precise GUI control. It can understand screen content, reason about tasks, and execute mouse/keyboard operations to complete complex workflows.

Key Features

  • 👀 Intelligent Observation: Automatically capture and understand screen state
  • 🧠 Visual Reasoning: VLM-based task understanding and decision making
  • 🖱️ Precise Execution: Execute mouse and keyboard operations
  • 🔄 Continuous Loop: Observe-Think-Act cycle until task completion
  • 🛡️ Safety Isolation: VM mode support for host system protection

Technical Highlights

Feature             Description
Core Technology     VLM + Environment Control
Main Functionality  Desktop task automation
Input               Task instructions + Screenshots
Output              Automated operation sequences
Use Cases           RPA, UI testing, task execution
Deployment          Local / VM isolation

Architecture

System Architecture Diagram

graph TB
    subgraph "Task Layer"
        A[User Task Instruction] --> B[Task Parser]
        B --> C[Task Configuration]
    end

    subgraph "Agent Layer"
        C --> D[SimplePromptAgent]
        D --> D1[History Management]
        D --> D2[Prompt Building]
        D --> D3[Action Parsing]
    end

    subgraph "Model Layer"
        D2 --> E[Vision-Language Model]
        E --> E1[Qwen-VL]
        E --> E2[GPT-4V]
        E --> E3[QVQ]
        E1 --> F[Thinking Process]
        E2 --> F
        E3 --> F
        F --> G[Action Sequence]
    end

    subgraph "Environment Layer"
        G --> H[SimpleDesktopEnv]
        H --> H1[Local Controller]
        H --> H2[VM Controller]
        H1 --> I[PyAutoGUI]
        H2 --> J[Docker API]
    end

    subgraph "Observation Layer"
        I --> K[Screenshot]
        J --> K
        K --> L[Image Encoding]
        L --> D
    end

    subgraph "Execution Layer"
        I --> M[Local Actions]
        J --> N[VM Actions]
        M --> O[Mouse/Keyboard]
        N --> O
    end

    O --> P[Environment State Update]
    P --> K
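
For concreteness, the action sequence produced at node G and executed through PyAutoGUI at node I consists of ordinary PyAutoGUI calls. The sequence below is a hand-written illustration of that action space, not actual model output:

import pyautogui

# Representative action sequence (hand-written example, not model output):
# click a text field, type a phrase, and save with a keyboard shortcut.
pyautogui.moveTo(512, 384, duration=0.3)           # move cursor to (512, 384) over 0.3 s
pyautogui.click()                                  # left-click to focus the field
pyautogui.typewrite("Hello World", interval=0.05)  # type with 50 ms between keystrokes
pyautogui.hotkey("ctrl", "s")                      # Ctrl+S to save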

OSWorld Core Concepts

Reference: OSWorld GitHub

GUI-Agent is built on OSWorld’s core architecture:

  1. Environment Abstraction: SimpleDesktopEnv corresponds to OSWorld’s DesktopEnv
  2. Agent Design: SimplePromptAgent corresponds to OSWorld’s PromptAgent
  3. Observe-Act Loop: Screenshot → Model inference → Action execution → Repeat (see the sketch after this list)
  4. Action Space: Uses PyAutoGUI commands (consistent with OSWorld)
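
The Observe-Act loop can be sketched as below. This is a minimal illustration: the import path and the method names screenshot(), predict(), and execute(), as well as the DONE sentinel, are assumptions for the sketch, not the project's actual API.

# Minimal Observe-Think-Act loop (illustrative sketch; method names and the
# DONE sentinel are assumed here, not taken from the actual project API).
from gui_agent import SimpleDesktopEnv, SimplePromptAgent  # hypothetical import path

env = SimpleDesktopEnv(mode="vm")                  # or mode="local"
agent = SimplePromptAgent(model="qwen3-vl-plus")

instruction = "Open calculator and calculate 123 + 456"
obs = env.screenshot()                             # Observe: capture current screen
for _ in range(15):                                # cap steps to avoid endless loops
    thought, actions = agent.predict(instruction, obs)  # Think: VLM inference
    if actions == ["DONE"]:                        # model signals task completion
        break
    for action in actions:                         # Act: PyAutoGUI command strings
        env.execute(action)
    obs = env.screenshot()                         # re-observe the updated state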

Module Documentation

  • Environment Setup: Configure VM or local desktop environment
  • VLM Integration: Vision-language models for screenshot understanding
  • Task Execution: Automated task execution workflow
  • Troubleshooting: Common issues and solutions


Quick Start

Prerequisites

  1. Environment Setup: Choose VM mode (recommended) or Local mode
  2. VLM Configuration: Set API keys for Qwen-VL or GPT-4V (see the example after this list)
  3. Permissions: Grant accessibility permissions (macOS) or start Docker (VM mode)
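
For example, API keys can be provided through environment variables before launching the app. The variable names below follow the usual DashScope and OpenAI conventions; check your deployment's configuration if it reads different keys:

import os

# Conventional environment variable names for these providers; adjust if
# your deployment expects different keys.
os.environ["DASHSCOPE_API_KEY"] = "sk-..."   # Qwen-VL / QVQ via DashScope
os.environ["OPENAI_API_KEY"] = "sk-..."      # GPT-4V / GPT-4o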

Basic Usage

  1. Navigate to “🤖 GUI-Agent” tab
  2. Choose environment (VM or Local)
  3. Configure VLM model (Qwen-VL recommended)
  4. Enter task instruction
  5. Click “▶️ Execute Task”

Example Tasks

Simple Tasks:

  • “Open browser and visit google.com”
  • “Take a screenshot and save it”
  • “Open calculator and calculate 123 + 456”

Complex Tasks:

  • “Search for ‘Python tutorial’ on Google and open the first result”
  • “Create a new document, write ‘Hello World’, and save it”
  • “Find all image files in Downloads folder”

Supported Models

  • qwen3-vl-plus: Recommended, excellent Chinese support
  • qwen3-vl-flash: Faster response
  • gpt-4o: High precision
  • qvq-max: Complex reasoning
