hanzoai/jin

Jin

Unified multimodal LLM

Text, vision, audio, and 3D in a single model. Jin understands the world the way humans do — across modalities, in one shared latent space.

Text

Reasoning

Vision

Image + video

Audio

Speech + music

3D

Scene + mesh

One model, every modality

No more pipelines. No more model-per-task. Jin reasons across modalities in a single forward pass.

Long-context language understanding, code, math, and tool use — competitive with text-only frontier models.

Image captioning, visual QA, document understanding, and video analysis at frame and clip level.

Speech recognition, speech generation, music understanding, and ambient sound classification.

Mesh, point cloud, and scene reasoning. Native input format support for industrial and creative workflows.

Unified embedding across modalities means you can search images with audio, or generate 3D from text.
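As a rough illustration of what a shared embedding space makes possible, the sketch below ranks images against an audio query by cosine similarity. The vectors are hand-written toy values and the file names are hypothetical. Nothing here calls Jin itself, and how Jin actually exposes embeddings is not described on this page.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=1):
    """Return the names of the top_k index entries closest to query_vec."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy stand-ins for image embeddings (assumption: in a unified model,
# every modality would be mapped into the same vector space).
image_index = {
    "dog_photo.jpg": [0.9, 0.1, 0.0],
    "piano_photo.jpg": [0.1, 0.9, 0.1],
    "street_photo.jpg": [0.0, 0.2, 0.9],
}

# Toy stand-in for the embedding of a piano recording.
audio_query = [0.2, 0.8, 0.0]

print(search(audio_query, image_index))
```

Because both the query and the indexed items live in one space, the same `search` function would serve any modality pair; only the step that produces the vectors would differ.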

Send any combination of modalities in one request. Get any combination back. No glue code required.
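A mixed-modality request could be modeled as a list of typed parts plus a list of requested output modalities. The sketch below is purely hypothetical: the field names (`input`, `output_modalities`, `type`, `mime_type`, `data`) and the helpers `part` and `build_request` are invented for illustration and are not taken from any published Jin API.

```python
import base64
import json

def part(modality, data, mime_type=None):
    """Build one input part. Hypothetical shape, not a documented schema."""
    if modality == "text":
        return {"type": "text", "text": data}
    # Binary payloads are base64-encoded so the request stays valid JSON.
    return {
        "type": modality,
        "mime_type": mime_type,
        "data": base64.b64encode(data).decode("ascii"),
    }

def build_request(parts, output_modalities):
    """Serialize input parts and requested outputs into one JSON body."""
    return json.dumps({"input": parts, "output_modalities": output_modalities})

payload = build_request(
    [
        part("text", "Describe this sound and sketch the scene as a mesh."),
        part("audio", b"\x00\x01", "audio/wav"),  # placeholder bytes, not real audio
    ],
    ["text", "3d"],
)
print(payload)
```

Keeping every modality in one ordered list is one way to let text and media interleave in a single request without per-modality endpoints.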

License: Apache-2.0