Saturday, September 20, 2025

AI2’s 3D-Capable MolmoAct Challenges Nvidia and Google with 95% Success Rate in Complex Robotic Tasks: Research Breakthrough


The Rise of Physical AI

The field of Physical AI, where robotics and foundation models converge, is expanding rapidly. Companies such as Nvidia, Google, and Meta are actively researching and experimenting with integrating large language models (LLMs) into robotics. The Allen Institute for AI (Ai2) has recently introduced a new open-source model called MolmoAct 7B, designed to challenge industry giants like Nvidia and Google in the realm of Physical AI. This model enables robots to “reason in space,” enhancing how they interact with the physical world.

Understanding MolmoAct 7B

MolmoAct is built upon Ai2’s open-source Molmo framework and is capable of “thinking” in three dimensions. Alongside the model, Ai2 is also releasing its training data, with an Apache 2.0 license for the model and CC BY 4.0 licensing for the datasets. Ai2 categorizes MolmoAct as an Action Reasoning Model: a foundation model that can reason about actions in a physical, 3D environment. This capability enables MolmoAct to comprehend its surroundings, plan how it will occupy space, and execute actions accordingly.

Overcoming Challenges in Robotics

According to Ai2, MolmoAct surpasses traditional vision-language-action (VLA) models by incorporating reasoning capabilities in 3D space. Most existing robotics models rely on VLAs, which lack spatial reasoning, whereas MolmoAct’s advanced architecture enhances its performance and generalizability. Given that robots operate in the physical world, Ai2 asserts that MolmoAct empowers them to better perceive their environment and make informed decisions about interactions.

The company envisions applications for MolmoAct in various settings, particularly in homes, where irregular and constantly changing conditions present significant challenges for robotics.

Innovative Spatial Understanding

MolmoAct achieves its understanding of the physical world through the use of “spatially grounded perception tokens.” These tokens are pre-trained and extracted using a vector-quantized variational autoencoder, which transforms data inputs (like video) into tokens. Unlike those used by VLAs, these tokens are not text-based, enabling MolmoAct to develop spatial awareness and encode geometric structures. This allows the model to estimate distances between objects and predict a series of “image-space” waypoints, guiding its path planning.
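The core idea of vector quantization, as the paragraph above describes it, is mapping a continuous embedding of a visual input to the nearest entry in a learned discrete codebook, yielding a non-text token. A minimal sketch of that quantization step, with a hypothetical random codebook standing in for one learned by a variational autoencoder (real systems use far larger codebooks trained jointly with the encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 8-entry codebook of 4-dimensional embeddings.
# In a VQ-VAE this is learned; here it is random for illustration.
codebook = rng.normal(size=(8, 4))

def quantize(embedding: np.ndarray) -> int:
    """Return the index of the nearest codebook entry (L2 distance).

    That index is the discrete "token" the model consumes, in place
    of a text token.
    """
    distances = np.linalg.norm(codebook - embedding, axis=1)
    return int(np.argmin(distances))

# An encoder would map an image patch to a continuous embedding;
# a random vector stands in for it here.
embedding = rng.normal(size=4)
token = quantize(embedding)
```

A codebook entry quantizes to its own index, which is what lets a decoder invert the mapping: `quantize(codebook[3])` returns `3`.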

Once the distances are estimated, MolmoAct can output specific actions, such as moving a robotic arm into position or extending its reach. Ai2’s researchers have noted that the model can adapt to various embodiments—whether a mechanical arm or a humanoid robot—with minimal fine-tuning. Benchmark testing indicates that MolmoAct 7B achieved a task success rate of 72.1%, outperforming models from competitors like Google, Microsoft, and Nvidia.
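The waypoint-following step described above amounts to turning a predicted sequence of image-space points into per-step displacement commands. A minimal sketch, with hypothetical pixel-coordinate waypoints standing in for model output:

```python
import numpy as np

# Hypothetical image-space waypoints (pixel coordinates) of the kind
# a model such as MolmoAct might predict before acting.
waypoints = np.array([
    [10.0, 20.0],
    [30.0, 25.0],
    [50.0, 40.0],
])

def waypoints_to_deltas(points: np.ndarray) -> np.ndarray:
    """Convert a waypoint sequence into per-step displacements.

    Each output row is a (dx, dy) move command in image space; a real
    controller would then map these to joint or end-effector motions.
    """
    return np.diff(points, axis=0)

deltas = waypoints_to_deltas(waypoints)
```

Three waypoints yield two displacement commands, `[20, 5]` then `[20, 15]`, one per segment of the planned path.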

The Future of AI and Robotics

Ai2’s research represents a significant advancement in leveraging the unique advantages of LLMs and vision-language models (VLMs), particularly as innovation in generative AI accelerates. Experts view the work from Ai2 and other tech firms as foundational for future developments. Alan Fern, a professor at Oregon State University, remarked that Ai2’s research signifies a natural progression in enhancing VLMs for robotics and physical reasoning. While he considers it an important step forward, he also notes that current benchmarks do not fully capture the complexities of real-world scenarios.

Daniel Maturana, co-founder of Gather AI, commended the openness of the data, emphasizing its value for academic labs and dedicated hobbyists. The aspiration to create more intelligent and spatially aware robots has long been a goal for developers and computer scientists alike.
