hanzoai/jin

Jin

Unified multimodal LLM

Text, vision, audio, and 3D in a single model. Jin understands the world the way humans do — across modalities, in one shared latent space.

Text: Reasoning
Vision: Image + video
Audio: Speech + music
3D: Scene + mesh

One model, every modality

No more pipelines. No more model-per-task. Jin reasons across modalities in a single forward pass.

Text reasoning

Long-context language understanding, code, math, and tool use — competitive with text-only frontier models.

Vision-language

Image captioning, visual QA, document understanding, and video analysis at frame and clip level.

Audio in and out

Speech recognition, speech generation, music understanding, and ambient sound classification.

3D understanding

Mesh, point cloud, and scene reasoning, with native support for the input formats used in industrial and creative workflows.

Shared latent space

Every modality maps into one unified embedding space, so you can search images with an audio query or generate 3D from text.
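To make the idea of cross-modal search concrete, here is a minimal sketch of nearest-neighbour lookup in a shared embedding space. It does not call Jin: the three-dimensional vectors are hand-written stand-ins for embeddings, and the file names and the `search` helper are hypothetical. Cosine similarity is a common choice of metric, not one this page confirms Jin uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=1):
    """Return the names of the top_k index entries most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Toy vectors standing in for image embeddings (assumption: a real model
# would produce these, with far more dimensions).
image_index = {
    "dog_photo.jpg": [0.9, 0.1, 0.0],
    "piano_photo.jpg": [0.1, 0.9, 0.1],
    "street_photo.jpg": [0.0, 0.2, 0.9],
}

# Toy vector standing in for the embedding of a piano recording.
audio_query = [0.2, 0.8, 0.0]

best = search(audio_query, image_index, top_k=1)
print(best)
```

Because the audio query and the images live in the same vector space, the lookup needs no modality-specific logic: the query is compared against image vectors directly.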

Single API

Send any combination of modalities in one request. Get any combination back. No glue code required.
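As an illustration of what a mixed-modality request could look like, the sketch below assembles a JSON body that carries text and audio together and asks for text and 3D back. Everything here is an assumption made for the example: the field names (`input`, `output_modalities`), the modality labels, and the helper functions are hypothetical and are not taken from Jin's actual API.

```python
import base64
import json

# Hypothetical modality labels, mirroring the four listed on this page.
MODALITIES = {"text", "image", "audio", "3d"}

def make_part(modality, data):
    """Wrap one input as a dict; binary payloads are base64-encoded."""
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality}")
    if modality == "text":
        return {"type": "text", "text": data}
    return {"type": modality, "data": base64.b64encode(data).decode("ascii")}

def build_request(parts, outputs):
    """Serialize a request with mixed inputs and requested output modalities."""
    unknown = set(outputs) - MODALITIES
    if unknown:
        raise ValueError(f"unknown output modalities: {sorted(unknown)}")
    payload = {
        "model": "jin",                      # assumed model identifier
        "input": parts,                      # any combination of modalities
        "output_modalities": list(outputs),  # any combination back
    }
    return json.dumps(payload)

body = build_request(
    [
        make_part("text", "Describe this sound and sketch the scene."),
        make_part("audio", b"\x00\x01fake-audio-bytes"),  # placeholder bytes
    ],
    outputs=["text", "3d"],
)
print(body)
```

The point of the shape is that inputs of different types sit side by side in one list, so the caller does not route each modality to a separate model or endpoint.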

Get started with Jin

Open source

License: Apache-2.0
Repository: hanzoai/jin
