The Rise of Small Models
Small models are gaining significant attention. On the heels of a new AI vision model from MIT spinoff Liquid AI that is compact enough to fit on a smartwatch, and a Google model that runs on a smartphone, Nvidia has introduced its own small language model (SLM), Nemotron-Nano-9B-V2. The model posts top performance in its class on selected benchmarks and lets users toggle AI "reasoning", a self-check the model performs before answering, on or off.
Performance and Specifications
With 9 billion parameters, Nemotron-Nano-9B-V2 is larger than some of the multimillion-parameter models VentureBeat has covered previously, but Nvidia notes it was pruned down from an original 12 billion parameters specifically to fit on a single Nvidia A10 GPU. As Oleksii Kuchiaev, Nvidia's Director of AI Model Post-Training, explained on X, the A10 is a popular choice for deployment, and the hybrid model can process larger batch sizes while running up to six times faster than similarly sized transformer models. For context, many leading large language models (LLMs) have over 70 billion parameters.
Multilingual Capabilities
Nemotron-Nano-9B-V2 supports multiple languages, including English, German, Spanish, French, Italian, and Japanese, with extended support for Korean, Portuguese, Russian, and Chinese. It is suitable for both instruction-following and code-generation tasks. The model and its pre-training datasets are available on Hugging Face and through Nvidia's model catalog.
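For teams that want to try it, a minimal loading sketch with the Hugging Face transformers library might look like the following. The repository ID and generation settings here are assumptions on my part; consult the model card on Hugging Face for the exact identifier and recommended usage.

```python
# Minimal sketch: loading the model from Hugging Face with transformers.
# The repo ID below is an assumption -- check the model card for the exact name.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",      # pick an appropriate dtype for the GPU
    device_map="auto",       # place layers automatically (model fits on one A10)
    trust_remote_code=True,  # hybrid architectures often ship custom model code
)

prompt = "Write a haiku about GPUs."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```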
Innovative Architecture
The model is based on Nemotron-H, a set of hybrid Mamba-Transformer models that underpin Nvidia’s latest offerings. Unlike most popular LLMs that rely solely on attention layers, which can be costly in terms of memory and computation as sequence lengths increase, the Nemotron-H models incorporate selective state space models (SSMs). These SSMs can manage very long sequences of information by maintaining state, scaling linearly with sequence length and processing contexts much longer than standard self-attention without incurring the same memory and compute overhead. By replacing much of the attention with linear-time state space layers, the hybrid Mamba-Transformer achieves 2 to 3 times higher throughput on long contexts while maintaining comparable accuracy.
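To make the scaling argument concrete, here is a toy sketch of a diagonal linear state-space recurrence. It is purely illustrative, not Nvidia's Nemotron-H or Mamba implementation; the names and shapes are invented. The point is that each token costs one constant-size state update, so a full pass is linear in sequence length, versus the quadratic pairwise scoring of self-attention.

```python
# Toy illustration of why a state-space layer scales linearly in sequence length.
# Generic diagonal SSM recurrence for exposition only; not Nvidia's implementation.
import numpy as np

def ssm_scan(x, A, B, C):
    """Run the linear recurrence h_t = A * h_{t-1} + B @ x_t, y_t = C @ h_t.

    x: (seq_len, d_in) inputs; A: (d_state,) diagonal transition;
    B: (d_state, d_in); C: (d_out, d_state).
    Cost is O(seq_len): one fixed-size state update per token, compared with
    O(seq_len^2) pairwise attention scores in standard self-attention.
    """
    seq_len, _ = x.shape
    h = np.zeros(A.shape[0])    # the entire memory of the past lives in h
    ys = []
    for t in range(seq_len):    # single pass, constant work per step
        h = A * h + B @ x[t]    # elementwise A acts as a diagonal transition
        ys.append(C @ h)        # project the state to an output
    return np.stack(ys)

# Example: a 4096-token sequence with a 16-dim state.
rng = np.random.default_rng(0)
x = rng.normal(size=(4096, 8))
A = np.full(16, 0.9)            # decay < 1 keeps the recurrence stable
B = rng.normal(size=(16, 8)) * 0.1
C = rng.normal(size=(4, 16)) * 0.1
print(ssm_scan(x, A, B, C).shape)  # (4096, 4)
```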
Features of Nemotron-Nano-9B-V2
Nemotron-Nano-9B-V2 is designed as a unified, text-only chat and reasoning model trained from scratch. By default, it generates a reasoning trace prior to delivering a final answer, although users can easily toggle this feature using control tokens like /think or /no_think. Additionally, the model introduces runtime “thinking budget” management, allowing developers to limit the number of tokens allocated to internal reasoning before the model finalizes a response. This feature aims to balance accuracy with latency, particularly in applications such as customer support or autonomous agents.
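In practice, the toggle is passed through the chat prompt. The sketch below assumes the control token is supplied as a system message and reuses the tokenizer and model from the loading sketch above; the exact chat template and Nvidia's runtime budget mechanism may differ, so treat the model card as authoritative.

```python
# Sketch of toggling the reasoning trace with the control tokens named in the
# article. Placing the token in a system message is an assumption; verify
# against the model card's chat template.
messages_with_reasoning = [
    {"role": "system", "content": "/think"},     # emit a reasoning trace first
    {"role": "user", "content": "Is 2**61 - 1 prime?"},
]
messages_direct = [
    {"role": "system", "content": "/no_think"},  # skip straight to the answer
    {"role": "user", "content": "Is 2**61 - 1 prime?"},
]

# Swap in messages_direct to suppress the trace.
inputs = tokenizer.apply_chat_template(
    messages_with_reasoning, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# A crude stand-in for a "thinking budget": capping total new tokens truncates
# long reasoning traces. Nvidia's runtime budget control targets the internal
# reasoning span specifically rather than the whole generation.
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```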
Evaluation and Accuracy
Evaluation results show competitive accuracy against other open small-scale models. In "reasoning on" mode, tested with the NeMo-Skills suite, Nemotron-Nano-9B-V2 scored 72.1% on AIME25, 97.8% on MATH500, 64.0% on GPQA, and 71.1% on LiveCodeBench. Nvidia also reported 90.3% on IFEval for instruction following and 78.9% on the RULER 128K long-context test, along with smaller but measurable gains on BFCL v3 and the HLE benchmark. Across the board, Nano-9B-V2 posted higher accuracy than Qwen3-8B, a common point of comparison.
Nvidia illustrates these results with accuracy-versus-budget curves showing how performance scales as the token allowance for reasoning grows, and suggests that careful budget management lets developers balance quality against latency in production. Both the Nano model and the Nemotron-H family were trained on a combination of curated, web-sourced, and synthetic data.