Hanzo Operative
A framework that enables multimodal AI models to operate a computer using the same inputs and outputs as a human operator, viewing the screen and executing mouse and keyboard actions to achieve objectives.
Screen Vision
Multimodal Models
Cursor Control
Keyboard Actions
Key Capabilities
Hanzo Operative gives AI models the ability to interact with computers the same way humans do
Cross-Platform
Compatible with macOS, Windows, and Linux (with an X server installed).
Self-Operating
Models can view the screen and decide on mouse and keyboard actions autonomously.
Secure Framework
Open-source implementation with transparent security practices.
Objective-Driven
Complete complex tasks based on natural language objectives.
OCR Integration
Optional OCR mode provides models with clickable element maps for enhanced accuracy.
Model Flexibility
Compatible with various multimodal models, including GPT-4o, Claude 3, and Gemini Pro Vision.
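The self-operating behavior described above can be pictured as a loop: capture the screen, ask the model for the next action, carry it out, and repeat until the objective is met. The Python sketch below is only an illustration of that loop under stated assumptions. `Action`, `run_objective`, and the three callables are hypothetical names, not part of Hanzo Operative's API, and the screenshot and model are stubbed so the example runs without any dependencies.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema, not Operative's own)."""
    kind: str                    # "click", "type", or "done"
    x: Optional[int] = None      # screen coordinates for "click"
    y: Optional[int] = None
    text: Optional[str] = None   # characters to send for "type"


def run_objective(
    objective: str,
    capture_screen: Callable[[], bytes],
    choose_action: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat observe -> decide -> act until the model reports "done"."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                             # observe
        action = choose_action(objective, screenshot, history)    # decide
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                                           # act
    return history


# Stubs standing in for a real screenshot grabber and a real multimodal model.
def fake_capture() -> bytes:
    return b"fake-screenshot"


def scripted_model(objective: str, screenshot: bytes, history: List[Action]) -> Action:
    script = [
        Action("click", x=100, y=200),
        Action("type", text="hello"),
        Action("done"),
    ]
    return script[len(history)]


performed: List[Action] = []  # records what the executor was asked to do
history = run_objective("Say hello", fake_capture, scripted_model, performed.append)
```

The `max_steps` cap is a design choice in this sketch: it keeps a model that never reports completion from looping forever.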
Getting Started with Hanzo Operative
Set up your environment in minutes and start using AI to operate your computer
Installation
1. Install Hanzo tools
2. Run the operative
3. Enter your API key when prompted
System Requirements
- macOS, Windows, or Linux (with X server)
- Python 3.8 or higher
- 8GB RAM recommended
- Internet connection for API access
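Some of the requirements above can be checked before installing. The following Python sketch is a rough pre-flight check, not something Hanzo Operative ships. `unmet_requirements` is a hypothetical helper, and treating an unset `DISPLAY` variable as "no X server" is a heuristic assumption. Memory and network access are not checked.

```python
import os
import platform
import sys
from typing import List, Mapping, Tuple


def unmet_requirements(
    version: Tuple[int, int],
    system: str,
    env: Mapping[str, str],
) -> List[str]:
    """Return a human-readable list of requirements that are not satisfied."""
    problems: List[str] = []
    if version < (3, 8):
        problems.append("Python 3.8 or higher is required")
    # platform.system() reports macOS as "Darwin".
    if system not in ("Darwin", "Windows", "Linux"):
        problems.append(f"unsupported operating system: {system}")
    # Heuristic (an assumption): a running X session normally sets DISPLAY.
    if system == "Linux" and not env.get("DISPLAY"):
        problems.append("no X server detected (DISPLAY is not set)")
    return problems


# Check the machine this script is running on.
current_problems = unmet_requirements(
    sys.version_info[:2], platform.system(), os.environ
)
for problem in current_problems:
    print(problem)
```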
Basic Usage
Running with default settings (GPT-4o)
Using voice input mode
Using OCR mode for enhanced element detection
Using Set-of-Mark (SoM) prompting
After running any of these commands, you'll be prompted to enter an objective for the AI to accomplish.
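OCR mode is described as giving the model a map of clickable elements. One way to picture such a map is a lookup from recognized text to the screen coordinates of its center, so the model can name a label instead of guessing pixel positions. The Python sketch below illustrates that idea only. `OcrBox` and `build_element_map` are hypothetical names, the boxes are hand-written rather than produced by a real OCR engine, and Operative's actual data format may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class OcrBox:
    """A piece of recognized text and its bounding box in screen pixels."""
    text: str
    left: int
    top: int
    width: int
    height: int


def build_element_map(boxes: List[OcrBox]) -> Dict[str, Tuple[int, int]]:
    """Map each recognized label (lowercased) to the center of its bounding box."""
    element_map: Dict[str, Tuple[int, int]] = {}
    for box in boxes:
        label = box.text.strip().lower()
        # Keep the first occurrence when the same label appears more than once.
        if label and label not in element_map:
            center = (box.left + box.width // 2, box.top + box.height // 2)
            element_map[label] = center
    return element_map


# Hand-written boxes standing in for real OCR output.
boxes = [
    OcrBox("File", left=10, top=5, width=40, height=20),
    OcrBox("Submit", left=300, top=400, width=80, height=30),
]
element_map = build_element_map(boxes)
click_target = element_map["submit"]  # coordinates a click action could use
```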
Supported Models
Hanzo Operative works with multiple multimodal AI models, each with different capabilities and strengths
GPT-4o
by OpenAI
- Highest accuracy
- Fast response time
- Best for complex tasks
- Excellent UI understanding
operative
Claude 3
by Anthropic
- Strong screen analysis
- Detailed reasoning
- Long context window
- Good UI navigation
operative -m claude-3
Gemini Pro Vision
by Google
- Good general performance
- Robust screen analysis
- Accessible API
- Improving rapidly
operative -m gemini-pro-vision
Qwen-VL
by Alibaba Cloud
- Strong visual capabilities
- Growing feature set
- Good for basic tasks
- Alternative API option
operative -m qwen-vl
LLaVA
via Ollama (Local)
- Runs locally
- No API costs
- Privacy-focused
- Basic capabilities
operative -m llava
o1-with-ocr
by OpenAI (Experimental)
- Advanced OCR
- Element detection
- Highest precision
- Best for complex UIs
operative -m o1-with-ocr
New models are continuously being added. Check the documentation for the latest information.
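The commands listed for each model follow one pattern: bare `operative` for the default GPT-4o, and `operative -m <model>` for the others. For scripting, that pattern can be captured in a small helper. The Python sketch below is illustrative: `operative_command` and `LISTED_MODELS` are hypothetical names, and the model list simply copies the commands shown on this page, so it may be out of date.

```python
from typing import List, Optional

# Model names exactly as they appear in the commands listed on this page.
LISTED_MODELS = ("claude-3", "gemini-pro-vision", "qwen-vl", "llava", "o1-with-ocr")


def operative_command(model: Optional[str] = None) -> List[str]:
    """Return the argument list for launching Operative with the given model.

    With no model, the bare `operative` command is returned, which this page
    lists for GPT-4o.
    """
    if model is None:
        return ["operative"]
    if model not in LISTED_MODELS:
        raise ValueError(f"model not listed on this page: {model}")
    return ["operative", "-m", model]


default_cmd = operative_command()        # default model
local_cmd = operative_command("llava")   # local model
```

The returned list can be passed to `subprocess.run(local_cmd, check=True)` on a machine where the `operative` command is installed.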
See Operative in Action
Watch as Hanzo Operative uses multimodal AI to navigate interfaces, complete tasks, and solve problems autonomously.
Demo Video
Ready to Experience Self-Operating Computing?
Join the community of developers, researchers, and enthusiasts pioneering the future of human-AI collaboration.