AI Engineering Framework

Hanzo Operative

A framework that enables multimodal AI models to operate a computer using the same inputs and outputs as a human operator, viewing the screen and executing mouse and keyboard actions to achieve objectives.
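The view-screen, decide, act cycle described above can be pictured as a simple loop. The sketch below is a hypothetical illustration, not the framework's actual code: `capture_screen`, `ask_model`, and the `Action` type are stand-ins for real screenshot capture and a multimodal model API call.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def capture_screen() -> bytes:
    # Stub: a real implementation would grab an actual screenshot.
    return b"<png bytes of the current screen>"

def ask_model(objective: str, screenshot: bytes, step: int) -> Action:
    # Stub: a real client would send the screenshot and objective
    # to a multimodal model and parse its chosen action.
    if step == 0:
        return Action(kind="click", x=640, y=400)
    return Action(kind="done")

def operate(objective: str, max_steps: int = 10) -> list[Action]:
    history = []
    for step in range(max_steps):
        action = ask_model(objective, capture_screen(), step)
        history.append(action)
        if action.kind == "done":
            break
        # A real loop would execute the mouse/keyboard action here.
    return history

actions = operate("open the browser")
```

The loop terminates either when the model reports the objective is complete or when a step budget runs out, which bounds runaway sessions.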

Screen Vision

Multimodal Models

Cursor Control

Keyboard Actions

Key Capabilities

Hanzo Operative gives AI models the ability to interact with computers the same way humans do

Cross-Platform

Compatible with macOS, Windows, and Linux with an X server installed.

Self-Operating

Models can view the screen and decide on mouse and keyboard actions autonomously.

Secure Framework

Open-source implementation with transparent security practices.

Objective-Driven

Complete complex tasks based on natural language objectives.

OCR Integration

Optional OCR mode provides models with clickable element maps for enhanced accuracy.
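One way to picture the clickable element map: OCR produces text plus bounding boxes, which get numbered so the model can answer with a label instead of raw pixel coordinates. A minimal sketch under those assumptions (the real pipeline and OCR engine are not shown here):

```python
# Hypothetical sketch: turn OCR results (text + bounding box) into a
# numbered map of clickable elements the model can reference by label.

def build_element_map(ocr_results):
    """ocr_results: list of (text, (left, top, width, height)) tuples."""
    element_map = {}
    for label, (text, (left, top, w, h)) in enumerate(ocr_results, start=1):
        element_map[label] = {
            "text": text,
            "center": (left + w // 2, top + h // 2),  # click target
        }
    return element_map

ocr = [("File", (10, 5, 40, 20)), ("Edit", (60, 5, 40, 20))]
elements = build_element_map(ocr)
# elements[1]["center"] == (30, 15) — the midpoint of the "File" box
```

Clicking box centers rather than corners makes the action robust to small OCR bounding-box errors.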

Model Flexibility

Compatible with various multimodal models, including GPT-4o, Claude 3, and Gemini Pro Vision.

Open Source on GitHub

Getting Started with Hanzo Operative

Set up your environment in minutes and start using AI to operate your computer

Installation

1. Install Hanzo tools

curl -fsSL hanzo.sh | bash

2. Run the operative

operative

3. Enter your API key when prompted

System Requirements

  • macOS, Windows, or Linux (with X server)
  • Python 3.8 or higher
  • 8GB RAM recommended
  • Internet connection for API access
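A quick way to confirm your interpreter meets the Python 3.8 requirement before installing (a generic check, not part of the installer):

```python
import sys

# Verify the interpreter satisfies the Python 3.8+ requirement.
REQUIRED = (3, 8)
ok = sys.version_info[:2] >= REQUIRED
if ok:
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: OK")
else:
    print(f"Python 3.8+ required, found "
          f"{sys.version_info.major}.{sys.version_info.minor}")
```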

Basic Usage

Running with default settings (GPT-4o)

operative

Using voice input mode

operative --voice

Using OCR mode for enhanced element detection

operative -m gpt-4-with-ocr

Using Set-of-Mark (SoM) prompting

operative -m gpt-4-with-som

After running any of these commands, you'll be prompted to enter an objective for the AI to accomplish.

Supported Models

Hanzo Operative works with multiple multimodal AI models, each with different capabilities and strengths

Recommended

GPT-4o

by OpenAI

  • Highest accuracy
  • Fast response time
  • Best for complex tasks
  • Excellent UI understanding

operative

Claude 3

by Anthropic

  • Strong screen analysis
  • Detailed reasoning
  • Long context window
  • Good UI navigation

operative -m claude-3

Gemini Pro Vision

by Google

  • Good general performance
  • Robust screen analysis
  • Accessible API
  • Improving rapidly

operative -m gemini-pro-vision

Qwen-VL

by Alibaba Cloud

  • Strong visual capabilities
  • Growing feature set
  • Good for basic tasks
  • Alternative API option

operative -m qwen-vl

LLaVA

by Ollama (Local)

  • Runs locally
  • No API costs
  • Privacy-focused
  • Basic capabilities

operative -m llava

o1-with-ocr

by OpenAI (Experimental)

  • Advanced OCR
  • Element detection
  • Highest precision
  • Best for complex UIs

operative -m o1-with-ocr

New models are continuously being added. Check the documentation for the latest information.

See Operative in Action

Watch as Hanzo Operative uses multimodal AI to navigate interfaces, complete tasks, and solve problems autonomously.

Demo Video

Ready to Experience Self-Operating Computing?

Join the community of developers, researchers, and enthusiasts pioneering the future of human-AI collaboration.