Hanzo Engine
Rust-native LLM and embedding inference engine
High-throughput, low-latency inference for transformer LLMs and embedding models. Built in Rust for memory safety, predictable performance, and deployment across CUDA, Metal, ROCm, and CPU targets.
Inference without the overhead
One engine for text generation and embeddings. No Python runtime, no GIL, no cold-start tax.
Rust-native
Pure Rust runtime. No Python interpreter and no Python-to-native bridge on the request path. Predictable latency under load.
High throughput
Continuous batching, paged attention, and quantized kernels. Saturate your GPU or CPU.
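To make the paged attention idea concrete, here is a minimal sketch of a block-based KV-cache allocator. It is an illustration only, not Hanzo Engine's actual code: the `BlockAllocator` type, its method names, and the block size are all hypothetical. The point is that cache memory is handed out in fixed-size blocks as sequences grow and is returned when a sequence finishes, which is what lets a continuous-batching scheduler admit new requests mid-flight.

```rust
use std::collections::HashMap;

/// Hypothetical paged KV-cache allocator (illustrative, not the engine's API).
/// Cache memory is split into fixed-size blocks; each sequence owns a table
/// of physical block indices instead of one contiguous region.
pub struct BlockAllocator {
    block_size: usize,                // tokens per block
    free_blocks: Vec<usize>,          // indices of unused physical blocks
    tables: HashMap<u64, Vec<usize>>, // sequence id -> physical blocks
    lengths: HashMap<u64, usize>,     // sequence id -> tokens stored
}

impl BlockAllocator {
    pub fn new(num_blocks: usize, block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be non-zero");
        Self {
            block_size,
            free_blocks: (0..num_blocks).rev().collect(),
            tables: HashMap::new(),
            lengths: HashMap::new(),
        }
    }

    /// Reserve room for one more token of `seq`.
    /// Returns false when a new block is needed but none is free.
    pub fn append_token(&mut self, seq: u64) -> bool {
        let len = *self.lengths.get(&seq).unwrap_or(&0);
        if len % self.block_size == 0 {
            // Every block held by this sequence is full (or it holds none yet).
            match self.free_blocks.pop() {
                Some(block) => self.tables.entry(seq).or_default().push(block),
                None => return false,
            }
        }
        self.lengths.insert(seq, len + 1);
        true
    }

    /// Return all blocks of a finished sequence to the free pool.
    pub fn release(&mut self, seq: u64) {
        if let Some(blocks) = self.tables.remove(&seq) {
            self.free_blocks.extend(blocks);
        }
        self.lengths.remove(&seq);
    }

    pub fn free_block_count(&self) -> usize {
        self.free_blocks.len()
    }

    pub fn block_count(&self, seq: u64) -> usize {
        self.tables.get(&seq).map_or(0, |blocks| blocks.len())
    }
}
```

In this sketch a sequence wastes at most one partially filled block, and a scheduler could call `release` as soon as a request completes so that waiting requests can be admitted without draining the whole batch.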
LLM + embeddings
One binary serves both generation and embedding workloads. Share memory, share weights.
Quantization built-in
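GGUF, AWQ, and GPTQ formats supported out of the box. At 4 bits per weight, a 70B model's weights shrink from about 140 GB (FP16) to about 35 GB, bringing it within reach of consumer GPU setups.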
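As a rough way to reason about what quantization buys, the sketch below estimates the memory needed for model weights alone at a given bit width. The function names are hypothetical and not part of any Hanzo Engine API, and the estimate deliberately ignores the KV cache, activations, and per-format overhead such as quantization scales, so real requirements will be higher.

```rust
/// Approximate memory for model weights alone, in GiB.
/// Assumption: every parameter is stored at `bits_per_weight` bits; KV cache,
/// activations, and quantization metadata (scales, zero points) are ignored.
pub fn weight_memory_gib(params_billions: f64, bits_per_weight: f64) -> f64 {
    let bytes = params_billions * 1e9 * bits_per_weight / 8.0;
    bytes / (1024.0 * 1024.0 * 1024.0)
}

/// Whether the weights alone would fit in `vram_gib` of device memory.
pub fn weights_fit(params_billions: f64, bits_per_weight: f64, vram_gib: f64) -> bool {
    weight_memory_gib(params_billions, bits_per_weight) <= vram_gib
}
```

For example, `weight_memory_gib(70.0, 16.0)` comes to roughly 130 GiB, while `weight_memory_gib(70.0, 4.0)` is roughly 33 GiB, so under these assumptions a 4-bit 70B model still needs more than a single 24 GiB card for its weights and would rely on multiple GPUs or partial CPU offload.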
Any hardware
CUDA, Metal, ROCm, and CPU backends. Same engine, same API, every target.
Memory safe
Rust's ownership model rules out whole classes of memory-safety bugs, such as use-after-free and data races, in safe code. Production-grade reliability.