GUI Agent Architecture
An OSWorld-inspired architecture for desktop automation driven by a vision-language model (VLM).
System Overview
graph TB
A[Task Instruction] --> B[GUIAgent]
B --> C[Observation]
C --> D["VLM Reasoning<br/>Qwen-VL"]
D --> E[Action Decision]
E --> F{Action Type}
F -->|Mouse| G[PyAutoGUI]
F -->|Keyboard| H[PyAutoGUI]
F -->|Wait| I[Sleep]
G --> J[Execute Action]
H --> J
I --> J
J --> K[Screenshot]
K --> C
E -->|DONE| L[Task Complete]
Core Components
1. GUIAgent
- Manages observation-action loop
- Coordinates VLM and execution engine
- Tracks task progress
2. VLM Reasoning
- Model: Qwen-VL-Chat (7B)
- Input: Screenshot + task description
- Output: Structured action
3. Execution Engine
- PyAutoGUI: Mouse/keyboard control
- Platform: Ubuntu desktop (VM or native)
- Safety: Sandboxed execution
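The "structured action" output of component 2 implies a parsing step between the model's text reply and the execution engine. The following is a minimal sketch of that step, assuming the VLM is prompted to include one JSON action object in its reply. `parse_action` is a hypothetical helper name, not part of Qwen-VL or PyAutoGUI.

```python
import json


def parse_action(reply: str) -> dict:
    """Extract the first JSON object with a "type" key from a model reply.

    Hypothetical helper: assumes the model emits one JSON action,
    possibly surrounded by free-form reasoning text.
    """
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at index `start`.
            obj, _ = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "type" in obj:
            return obj
        # Not a usable action here; try the next opening brace.
        start = reply.find("{", start + 1)
    raise ValueError("no action object found in model reply")


action = parse_action('I should click the button. {"type": "CLICK", "x": 500, "y": 300}')
print(action["type"], action["x"], action["y"])
```

Raising on an unparseable reply, rather than guessing an action, lets the agent decide whether to retry the model call or stop.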
Observation-Action Loop
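history = []
while True:
    # 1. Observe
    screenshot = capture_screen()
    # 2. Reason
    action = vlm.decide_action(screenshot, task, history)
    # 3. Check completion before acting, so DONE is never sent to the executor
    if action['type'] == 'DONE':
        break
    # 4. Act, then record the step for the next reasoning call
    execute_action(action)
    history.append(action)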
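The loop has no upper bound on the number of steps, so a model that never emits `DONE` would run forever. The sketch below shows one way to add a step budget. The name `run_episode`, the `max_steps` default of 15, and the injected callables are all assumptions made for illustration; they let the loop run without a display or a model.

```python
from typing import Callable


def run_episode(
    task: str,
    capture_screen: Callable[[], object],
    decide_action: Callable[[object, str, list], dict],
    execute_action: Callable[[dict], None],
    max_steps: int = 15,  # assumed budget, not taken from the original design
) -> tuple[bool, list]:
    """Run the observe-reason-act loop until DONE or the budget runs out.

    Returns (completed, history).
    """
    history: list = []
    for _ in range(max_steps):
        screenshot = capture_screen()
        action = decide_action(screenshot, task, history)
        history.append(action)
        if action["type"] == "DONE":
            return True, history
        execute_action(action)
    return False, history


# Usage with stand-ins: a scripted "model" that clicks once, then finishes.
script = iter([{"type": "CLICK", "x": 500, "y": 300}, {"type": "DONE"}])
executed = []
completed, history = run_episode(
    task="open the settings menu",
    capture_screen=lambda: b"",
    decide_action=lambda shot, task, hist: next(script),
    execute_action=executed.append,
)
print(completed, len(history), executed)
```

Returning the completion flag separately from the history lets a caller distinguish a finished task from one that was cut off by the budget.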
Action Space
| Action | Parameters | Example |
|---|---|---|
| CLICK | x, y | {"type": "CLICK", "x": 500, "y": 300} |
| TYPE | text | {"type": "TYPE", "text": "hello"} |
| SCROLL | direction | {"type": "SCROLL", "direction": "down"} |
| WAIT | seconds | {"type": "WAIT", "seconds": 2} |
| DONE | - | {"type": "DONE"} |
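The table above can be turned into a validation and dispatch step. This is a sketch under stated assumptions: `dispatch_action` and `SCROLL_CLICKS` are hypothetical names, the scroll amount and the `"up"` direction are not specified in the table, and the GUI backend is passed in as a parameter so the code can run without a display. The `pyautogui` module provides `click`, `write`, and `scroll` functions matching the calls used here.

```python
import time

# Required parameters per action type, taken from the action table.
REQUIRED_PARAMS = {
    "CLICK": ("x", "y"),
    "TYPE": ("text",),
    "SCROLL": ("direction",),
    "WAIT": ("seconds",),
    "DONE": (),
}

SCROLL_CLICKS = 3  # assumed scroll step; the table does not give an amount


def dispatch_action(action: dict, gui, sleep=time.sleep) -> None:
    """Validate an action dict and send it to a GUI backend.

    `gui` is any object exposing click(x, y), write(text) and scroll(clicks).
    """
    kind = action.get("type")
    if kind not in REQUIRED_PARAMS:
        raise ValueError(f"unknown action type: {kind!r}")
    missing = [p for p in REQUIRED_PARAMS[kind] if p not in action]
    if missing:
        raise ValueError(f"{kind} is missing parameters: {missing}")

    if kind == "CLICK":
        gui.click(action["x"], action["y"])
    elif kind == "TYPE":
        gui.write(action["text"])
    elif kind == "SCROLL":
        direction = action["direction"]
        if direction not in ("up", "down"):
            raise ValueError(f"unsupported scroll direction: {direction!r}")
        # In PyAutoGUI, positive values scroll up and negative values scroll down.
        gui.scroll(SCROLL_CLICKS if direction == "up" else -SCROLL_CLICKS)
    elif kind == "WAIT":
        sleep(action["seconds"])
    # DONE has no side effect; the calling loop handles termination.
```

With PyAutoGUI installed and a display available, passing the module itself as the backend (`dispatch_action(action, pyautogui)`) should work, since the three functions are called with positional arguments only.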
Deployment Options
Local Mode
- Run on host machine
- Fast, no VM overhead
- ⚠️ Lower isolation
VM Mode (OSWorld)
- Ubuntu 22.04 in QEMU
- Strong isolation: agent actions are confined to the guest OS
- Screenshot via VNC