Jin
Unified multimodal LLM
Text, vision, audio, and 3D in a single model. Jin understands the world the way humans do — across modalities, in one shared latent space.
One model, every modality
No more pipelines. No more model-per-task. Jin reasons across modalities in a single forward pass.
Text reasoning
Long-context language understanding, code, math, and tool use — competitive with text-only frontier models.
Vision-language
Image captioning, visual QA, document understanding, and video analysis at frame and clip level.
Audio in and out
Speech recognition, speech generation, music understanding, and ambient sound classification.
3D understanding
Mesh, point cloud, and scene reasoning, with native support for the input formats used in industrial and creative workflows.
Shared latent space
A unified embedding across modalities means you can search images with audio or generate 3D from text.
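To make the shared-space idea concrete, here is a minimal sketch of audio-to-image search by cosine similarity. It is an illustration under stated assumptions, not Jin's actual interface: the vectors are hand-written toy values standing in for embeddings, and `cosine_similarity`, `search`, `image_index`, and `audio_query` are hypothetical names invented for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=1):
    """Return the names of the top_k index entries most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]

# Toy 3-dimensional vectors (assumed values, not real model output).
# In a shared latent space, image and audio embeddings are directly comparable.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "city.jpg": [0.1, 0.9, 0.2],
    "forest.jpg": [0.0, 0.2, 0.9],
}

# Pretend this is the embedding of an audio clip of ocean waves.
audio_query = [0.8, 0.2, 0.1]

results = search(audio_query, image_index, top_k=2)
print(results)
```

Because both modalities live in one vector space, the audio query needs no translation step before it is compared against the image vectors. A real deployment would likely swap the linear scan for an approximate nearest-neighbor index.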
Single API
Send any combination of modalities in one request. Get any combination back. No glue code required.