TwelveLabs Marengo 3.0
The most powerful embedding model for video understanding
Summary: Marengo 3.0 is a multimodal embedding model that integrates video, audio, and text to enable precise video search and retrieval. It supports long-form, multilingual, and noisy real-world content, delivering state-of-the-art results across diverse video understanding tasks while being storage-efficient and production-ready.
What it does
Marengo 3.0 creates a unified embedding space for video, audio, text, images, and composed queries, enabling action-level sports retrieval, long descriptive queries, and multilingual search across 36 languages. It processes complex, mixed-modality inputs efficiently and accurately.
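Because every modality lands in the same embedding space, retrieval reduces to nearest-neighbor search over vectors: embed the query, embed the clips, and rank by similarity. A minimal sketch of that idea, using hypothetical low-dimensional vectors in place of Marengo's actual embeddings (the real vector dimensions and API calls are not shown here):

```python
import numpy as np

def cosine_search(query_emb, clip_embs, top_k=3):
    """Rank stored clip embeddings by cosine similarity to a query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per clip
    order = np.argsort(scores)[::-1][:top_k]
    return order, scores[order]

# Hypothetical 4-dim embeddings standing in for model output vectors.
rng = np.random.default_rng(0)
clips = rng.normal(size=(5, 4))                 # 5 indexed video clips
query = clips[2] + 0.01 * rng.normal(size=4)    # a query near clip 2
idx, scores = cosine_search(query, clips)
print(idx[0])  # the near-duplicate clip should rank first
```

In production the same ranking is typically delegated to a vector database, but the core operation stays this simple: one similarity computation per candidate embedding, regardless of whether the query was text, an image, or a composed multimodal input.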
Who it's for
It is built for developers and organizations that need scalable video understanding for long, multilingual, and multimodal content in production environments.
Why it matters
Marengo 3.0 addresses where existing embedding models fall short: long videos, noisy real-world audio, and multilingual data, all handled without sacrificing accuracy or efficiency.