AI Engineering Framework

Hanzo Operative

A framework that enables multimodal AI models to operate a computer using the same inputs and outputs as a human operator, viewing the screen and executing mouse and keyboard actions to achieve objectives.
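The view-screen, decide, act cycle described above can be pictured as a simple loop. The sketch below is a hypothetical illustration, not the framework's actual code: `capture_screen`, `ask_model`, and the `Action` type are stand-ins for real screenshot capture and a multimodal model API call.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def capture_screen() -> bytes:
    # Stub: a real implementation would grab an actual screenshot.
    return b"<png bytes of the current screen>"

def ask_model(objective: str, screenshot: bytes, step: int) -> Action:
    # Stub: a real client would send the screenshot and objective
    # to a multimodal model and parse its chosen action.
    if step == 0:
        return Action(kind="click", x=640, y=400)
    return Action(kind="done")

def operate(objective: str, max_steps: int = 10) -> list[Action]:
    history = []
    for step in range(max_steps):
        action = ask_model(objective, capture_screen(), step)
        history.append(action)
        if action.kind == "done":
            break
        # A real loop would execute the mouse/keyboard action here.
    return history

actions = operate("open the browser")
```

The loop terminates either when the model reports the objective is complete or when a step budget runs out, which bounds runaway sessions.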

Screen Vision

Multimodal Models

Cursor Control

Keyboard Actions

Key Capabilities

Hanzo Operative gives AI models the ability to interact with computers the same way humans do

Cross-Platform

Compatible with macOS, Windows, and Linux with an X server installed.

Self-Operating

Models can view the screen and decide on mouse and keyboard actions autonomously.

Secure Framework

Open-source implementation with transparent security practices.

Objective-Driven

Complete complex tasks based on natural language objectives.

OCR Integration

Optional OCR mode provides models with clickable element maps for enhanced accuracy.
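One way to picture the clickable element map: OCR produces text plus bounding boxes, which get numbered so the model can answer with a label instead of raw pixel coordinates. A minimal sketch under those assumptions (the real pipeline and OCR engine are not shown here):

```python
# Hypothetical sketch: turn OCR results (text + bounding box) into a
# numbered map of clickable elements the model can reference by label.

def build_element_map(ocr_results):
    """ocr_results: list of (text, (left, top, width, height)) tuples."""
    element_map = {}
    for label, (text, (left, top, w, h)) in enumerate(ocr_results, start=1):
        element_map[label] = {
            "text": text,
            "center": (left + w // 2, top + h // 2),  # click target
        }
    return element_map

ocr = [("File", (10, 5, 40, 20)), ("Edit", (60, 5, 40, 20))]
elements = build_element_map(ocr)
# elements[1]["center"] == (30, 15) — the midpoint of the "File" box
```

Clicking box centers rather than corners makes the action robust to small OCR bounding-box errors.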

Model Flexibility

Compatible with various multimodal models, including GPT-4o, Claude 3, and Gemini Pro Vision.

Open Source on GitHub

Getting Started with Hanzo Operative

Set up your environment in minutes and start using AI to operate your computer

Installation

1. Install Hanzo tools

curl -fsSL hanzo.sh | bash

2. Run the operative

operative

3. Enter your API key when prompted

System Requirements

  • macOS, Windows, or Linux (with X server)
  • Python 3.8 or higher
  • 8GB RAM recommended
  • Internet connection for API access
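A quick way to confirm your interpreter meets the Python 3.8 requirement before installing (a generic check, not part of the installer):

```python
import sys

# Verify the interpreter satisfies the Python 3.8+ requirement.
REQUIRED = (3, 8)
ok = sys.version_info[:2] >= REQUIRED
if ok:
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: OK")
else:
    print(f"Python 3.8+ required, found "
          f"{sys.version_info.major}.{sys.version_info.minor}")
```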

Basic Usage

Running with default settings (GPT-4o)

operative

Using voice input mode

operative --voice

Using OCR mode for enhanced element detection

operative -m gpt-4-with-ocr

Using Set-of-Mark (SoM) prompting

operative -m gpt-4-with-som

After running any of these commands, you'll be prompted to enter an objective for the AI to accomplish.

Supported Models

Hanzo Operative works with multiple multimodal AI models, each with different capabilities and strengths

Recommended

GPT-4o

by OpenAI

  • Highest accuracy
  • Fast response time
  • Best for complex tasks
  • Excellent UI understanding

operative

Claude 3

by Anthropic

  • Strong screen analysis
  • Detailed reasoning
  • Long context window
  • Good UI navigation

operative -m claude-3

Gemini Pro Vision

by Google

  • Good general performance
  • Robust screen analysis
  • Accessible API
  • Improving rapidly

operative -m gemini-pro-vision

Qwen-VL

by Alibaba Cloud

  • Strong visual capabilities
  • Growing feature set
  • Good for basic tasks
  • Alternative API option

operative -m qwen-vl

LLaVA

by Ollama (Local)

  • Runs locally
  • No API costs
  • Privacy-focused
  • Basic capabilities

operative -m llava

o1-with-ocr

by OpenAI (Experimental)

  • Advanced OCR
  • Element detection
  • Highest precision
  • Best for complex UIs

operative -m o1-with-ocr

New models are continuously being added. Check the documentation for the latest information.

See Operative in Action

Watch as Hanzo Operative uses multimodal AI to navigate interfaces, complete tasks, and solve problems autonomously.

Demo Video

Ready to Experience Self-Operating Computing?

Join the community of developers, researchers, and enthusiasts pioneering the future of human-AI collaboration.