Introducing Hermes 4
Nous Research, an enigmatic artificial intelligence startup, has emerged as a prominent voice in the open-source AI movement. On Monday, the company quietly launched Hermes 4, a series of large language models that it claims can rival the performance of leading proprietary systems while providing exceptional user control and minimal content restrictions. This release marks a significant escalation in the ongoing rivalry between open-source AI advocates and major tech companies regarding access to advanced AI capabilities.
A Shift in AI Design
Unlike models from OpenAI, Google, or Anthropic, Hermes 4 is designed to respond to nearly any request without the safety guardrails that have become commonplace in commercial AI systems. Nous Research describes Hermes 4 as the latest iteration of its user-aligned models, featuring enhanced test-time compute capabilities. The company emphasized that considerable attention was devoted to ensuring the models are creative and engaging while remaining free from censorship and neutral in alignment, all while maintaining top-tier performance in mathematics, coding, and reasoning among open-weight models.
Hybrid Reasoning Feature
Hermes 4 introduces a novel feature called “hybrid reasoning,” which allows users to switch between rapid responses and more in-depth, step-by-step thinking processes. When activated, the models generate their internal reasoning within special tags before delivering a final answer. This approach is reminiscent of OpenAI’s o1 reasoning models but offers complete transparency into the AI’s thought process.
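In practice, consuming output from a hybrid-reasoning model means separating the internal trace from the final answer. The sketch below assumes the reasoning is wrapped in `<think>...</think>` tags; the exact tag name and format are an assumption for illustration, not confirmed details of Hermes 4.

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Separate an internal reasoning trace from the final answer.

    Assumes the model wraps its chain of thought in <think>...</think>
    tags (a hypothetical format chosen for illustration).
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        answer = response[match.end():].strip()
        return reasoning, answer
    # Fast mode: no reasoning tags, the whole response is the answer.
    return "", response.strip()

reply = "<think>17 * 3 = 51, plus 4 is 55.</think>The answer is 55."
trace, answer = split_reasoning(reply)
print(trace)   # → 17 * 3 = 51, plus 4 is 55.
print(answer)  # → The answer is 55.
```

Because the trace is emitted in-band rather than hidden server-side, downstream code can log, audit, or discard it as needed.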
Technical Achievements
The technical accomplishments behind Hermes 4 are noteworthy. In testing, the model’s largest version, with 405 billion parameters, achieved a score of 96.3% on the MATH-500 benchmark in reasoning mode and 81.9% on the challenging AIME’24 mathematics competition, rivaling or surpassing many proprietary systems that cost millions more to develop. AI researcher Rohan Paul highlighted the challenge of making thinking traces useful and verifiable without allowing for runaway reasoning, underscoring one of the technical advancements of this release.
Performance on RefusalBench
Notably, Hermes 4 achieved the highest score among all tested models on "RefusalBench," a benchmark Nous Research created to measure how often AI systems decline to answer questions. In reasoning mode, the model scored 57.1%, far outpacing GPT-4o (17.67%) and Claude Sonnet 4 (17%). In other words, Hermes 4 answered substantially more of the benchmark's prompts than competing systems did.
Innovative Training Infrastructure
The capabilities of Hermes 4 are underpinned by a sophisticated training infrastructure that Nous Research has developed over several years. The models were trained using two innovative systems: DataForge, a graph-based synthetic data generator, and Atropos, an open-source reinforcement learning framework. DataForge creates training data through “random walks” within directed graphs, transforming basic pre-training data into complex instruction-following examples. For instance, it can convert a Wikipedia article into a rap song and subsequently generate related questions and answers.
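A random walk over a transformation graph can be sketched in a few lines. The graph below is a toy stand-in, not DataForge's actual schema: each node is a data type, each edge a transformation whose output feeds the next step, mirroring the article's Wikipedia-to-rap-to-Q&A example.

```python
import random

# Hypothetical transformation graph: node -> list of (transform, next_node).
# The names are illustrative only; DataForge's real graph is not public.
TRANSFORMS = {
    "wiki_article": [("to_rap_song", "rap_song"), ("to_summary", "summary")],
    "rap_song": [("to_qa_pairs", "qa_pairs")],
    "summary": [("to_qa_pairs", "qa_pairs")],
    "qa_pairs": [],  # terminal node: finished training example
}

def random_walk(start: str, rng: random.Random) -> list[str]:
    """Follow random outgoing edges until reaching a terminal node."""
    path, node = [start], start
    while TRANSFORMS[node]:
        _transform, node = rng.choice(TRANSFORMS[node])
        path.append(node)
    return path

print(random_walk("wiki_article", random.Random(0)))
```

Composing transformations this way turns a single source document into many distinct instruction-following examples, since each walk through the graph yields a different derivation chain.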
Atropos functions as a series of specialized training environments where AI models practice specific skills—such as mathematics, coding, tool use, and creative writing—receiving feedback only when they produce correct solutions. This “rejection sampling” method ensures that only verified, high-quality responses are included in the training data.
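Rejection sampling itself is simple to state: generate many candidate responses, verify each one, and keep only those that pass. The toy below assumes stand-in `generate` and `verify` functions; Atropos's real environments and verifiers are more elaborate.

```python
import random

def rejection_sample(generate, verify, n_attempts: int = 8) -> list:
    """Keep only candidate responses that the verifier accepts."""
    accepted = []
    for _ in range(n_attempts):
        candidate = generate()
        if verify(candidate):
            accepted.append(candidate)
    return accepted

# Toy environment: "solve" 6 * 7; the verifier checks the answer exactly.
rng = random.Random(42)
generate = lambda: rng.choice([41, 42, 43])  # noisy candidate answers
verify = lambda ans: ans == 42               # only the correct one passes
kept = rejection_sample(generate, verify)
print(kept)  # only verified-correct answers survive
```

The filtering is what makes the resulting dataset trustworthy: wrong or unverifiable candidates never enter the training mix.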
In summary, Nous Research has used these training systems to curate the Hermes 4 dataset, aiming to set a new standard for open-weight AI models.