Ai2 says its Molmo 2 multimodal AI model can do more with less data
Ai2 said Molmo 2 improves on its earlier models despite its compact size. | Source: Ai2
The Allen Institute for AI, also known as Ai2, last week released Molmo 2, its latest multimodal model suite capable of precise spatial and temporal understanding of video, image, and multi-image sets. Building on the first Molmo platform, Molmo 2 adds advanced capabilities in video pointing, multi-frame reasoning, and object tracking.
Molmo 2 is an 8B-parameter model that surpasses last year’s 72B-parameter Molmo in accuracy, temporal understanding, and pixel-level grounding. Ai2 said it also bests proprietary models like Gemini 3 on key emerging skills like video tracking.
When it comes to image and multi-image reasoning, Ai2 claimed the Molmo 2 4B variant outperforms open models such as Qwen 3-VL-8B while using fewer parameters. Skills like these help the model, and any application or system built on top of it, to understand what is happening, where it is happening, and what it means.
Molmo 2 is also trained on far less data than similar models — 9.19 million videos compared with 72.5 million for Meta’s PerceptionLM.
“With a fraction of the data, Molmo 2 surpasses many frontier models on key video understanding tasks,” said Ali Farhadi, the CEO of Ai2. “We are excited to see the immense impact this model will have on the AI landscape, adding another piece to our fully open model ecosystem.”
Ai2 is a Seattle-based nonprofit AI research institute with the mission of building AI to solve the world’s biggest problems. Founded in 2014 by late Microsoft co-founder Paul G. Allen, Ai2 said it develops foundational AI research and new applications through large-scale open models, open data, robotics, conservation platforms, and more.
Molmo 2 offers new capabilities
Deep video understanding is key to building models that can understand and act on sensor streams for robotics. However, most models today either lack video understanding capabilities or are locked behind proprietary systems without transparency into the data. Ai2 said it is giving researchers access to advanced video grounding, tracking, and multi-frame reasoning, all with open weights and data.
Molmo 2 can identify exactly where and when events occur, track multiple objects through complex scenes, and connect actions to frame-level timelines. The company said these capabilities support safer automation, more accurate real-world systems, and open research the global community can inspect, reproduce, and build upon.
Ai2 listed key capabilities:
Frame-level spatial and temporal grounding: Molmo 2 goes beyond description. It returns precise pixel coordinates, object positions, and timestamps for events across a video (illustrated in the sketch after this list).
Robust multi-object tracking and counting: The model maintains consistent object identities across occlusions, scene changes, and long clips, enabling applications in robotics, inspection, transportation, and industry.
Dense long-form video captioning and anomaly detection: Molmo 2 produces highly detailed, searchable descriptions and flags unusual events in long sequences.
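To make the frame-level grounding idea concrete, here is a minimal, illustrative sketch of how an application might consume that kind of output, treated as a list of labeled points with pixel coordinates and timestamps. The data structure and field names are assumptions for illustration, not Molmo 2's actual output schema.

```python
from dataclasses import dataclass

@dataclass
class GroundedPoint:
    """One grounded detection: where (pixel coordinates) and when (seconds) an event occurs.
    Field names are illustrative only, not Molmo 2's actual output schema."""
    label: str
    x: float            # pixel x-coordinate within the frame
    y: float            # pixel y-coordinate within the frame
    timestamp_s: float   # time of the frame within the video, in seconds

def events_in_window(points: list[GroundedPoint], start_s: float, end_s: float) -> list[GroundedPoint]:
    """Filter grounded detections to a time window, e.g. to align them with other sensor logs."""
    return [p for p in points if start_s <= p.timestamp_s <= end_s]

# Example: three hypothetical detections returned for the query "the forklift"
detections = [
    GroundedPoint("forklift", x=412.0, y=288.5, timestamp_s=3.2),
    GroundedPoint("forklift", x=430.1, y=290.0, timestamp_s=4.0),
    GroundedPoint("forklift", x=455.7, y=295.2, timestamp_s=4.8),
]
print(events_in_window(detections, 3.5, 5.0))  # keeps the last two detections
```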
Molmo 2 delivers on major open-weight benchmarks, says Ai2
Molmo 2 delivers results on major open-weight benchmarks and is on par with leading proprietary systems on real-world video tasks. The model matches leading open-weight performance on short-video understanding benchmarks such as MVBench, MotionQA, and NextQA.
It improves video grounding accuracy, often doubling or tripling the scores of previous open models and surpassing proprietary APIs on several pointing and counting tasks, Ai2 claimed. The model also posts competitive tracking results across multi-domain benchmarks, outperforming strong open baselines and several commercial closed models.
In addition, Molmo 2 features image and multi-image reasoning that rivals or exceeds larger open-weight systems despite using fewer parameters. Ai2 asserted that human preference evaluations showed that Molmo 2 is on par with or better than multiple proprietary systems on real-world video QA and captioning tasks.
Ai2 offers open data and recipes
For transparency and reproducibility, all the training sources for Molmo 2 are provided in the technical report. Ai2 is also releasing a collection of nine new open datasets used to train Molmo 2, totaling more than 9 million multimodal examples across dense video captions, long-form QA, grounding, tracking, and multi-image reasoning.
The captioning corpus alone spans more than 100,000 videos with detailed descriptions that average more than 900 words each. The data mix covers video pointing, multi-object tracking, synthetic grounding, and long-video reasoning. Together, they form one of the most complete open video data collections available today, claimed Ai2.
Molmo 2 comes in three main variants: Molmo 2 (4B), Molmo 2 (8B), and Molmo 2-O (7B), which uses Ai2’s fully open Olmo backbone for a completely open end-to-end pipeline. Versions tuned specifically for pointing and tracking are also available.
All models, datasets, and evaluation tools are now publicly available on GitHub, Hugging Face, and the Ai2 Playground for interactive testing. The company plans to release the training code soon.
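For readers who want to try the released checkpoints, the following sketch shows one way a Molmo 2 model might be loaded with the Hugging Face transformers library. The repository ID used here is a placeholder, and the exact preprocessing and generation calls depend on the model card, so treat this as an assumption-laden starting point rather than official usage.

```python
# Minimal sketch: loading a Molmo 2 checkpoint from Hugging Face with transformers.
# NOTE: "allenai/Molmo-2-8B" is a hypothetical repository ID used for illustration;
# check Ai2's Hugging Face organization for the actual model names, and the model
# card for the exact image/video preprocessing and generation calls.
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "allenai/Molmo-2-8B"  # placeholder, not confirmed by the article

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    device_map="auto",  # place weights on available GPU(s) if present
)
```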
