NVIDIA's latest open-source model release isn't just another entry in an increasingly crowded field. Nemotron 3 Super, announced at GTC 2026, was explicitly designed to be the brain of an AI agent, and its benchmark results suggest it might actually be good at that job.
The model scores 85.6% on PinchBench, the benchmark specifically designed to measure agent performance on tasks like multi-step tool use, browser automation, code execution, and instruction-following across long task horizons. That number puts it ahead of every other open-source model currently available, and within striking distance of proprietary models that cost significantly more to run.
What Nemotron 3 Super Actually Is
Nemotron 3 Super is a mixture-of-experts (MoE) model with 120 billion total parameters, of which only 12 billion are active per token. This is the same broad approach used by Mixtral and DeepSeek: you get the reasoning capacity of a large model while paying the compute cost of only the fraction that activates during inference.
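In rough pseudocode, the routing step behind that "active fraction" looks like the sketch below. This is a generic top-k MoE gate, not NVIDIA's actual implementation, and the expert count and logits are purely illustrative:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(gate_logits, k=2):
    """Pick the k experts with the highest gate scores; only those run
    for this token, so most expert parameters stay idle."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize the selected experts' weights so they sum to 1.
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# With 10 experts and top-2 routing, each token activates only
# a fifth of the expert parameters in this layer (numbers illustrative).
chosen = route_top_k([0.1, 2.3, -0.5, 1.8, 0.0, 0.7, -1.2, 0.4, 1.1, 0.2], k=2)
```

The token's output is then the weighted sum of just the chosen experts' outputs, which is where the 12B-active-of-120B arithmetic comes from.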
The architecture is a hybrid of two components: a Mamba state-space model for efficient sequence processing, and standard Transformer attention layers for tasks that require global context. This hybrid approach, sometimes called a Mamba-Transformer or Jamba-style architecture, is particularly good at long-context tasks because it avoids paying the quadratic attention cost at every layer, which is what slows pure Transformer models at scale.
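A back-of-envelope comparison shows why the hybrid pays off at long context. Token mixing in an attention layer scales roughly with the square of sequence length, while a state-space layer scales linearly; the layer counts below are illustrative, not Nemotron's actual configuration:

```python
def mixing_ops(seq_len, attention_layers, ssm_layers):
    """Rough token-mixing cost: ~n^2 per attention layer,
    ~n per state-space (Mamba-style) layer."""
    return attention_layers * seq_len**2 + ssm_layers * seq_len

# Pure-Transformer stack vs. a hybrid that keeps only a few attention layers.
pure = mixing_ops(32_000, attention_layers=48, ssm_layers=0)
hybrid = mixing_ops(32_000, attention_layers=8, ssm_layers=40)
ratio = pure / hybrid  # roughly 6x cheaper mixing at this length
```

The constant factors in real kernels differ, but the scaling argument is why hybrid stacks keep only a handful of attention layers for global context.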
The Benchmark That Actually Matters for Agents
Most LLM benchmarks measure things like mathematical reasoning (MATH, AIME) or factual recall (MMLU). These matter, but they don't tell you how a model behaves when you give it a browser, a file system, and a task that requires ten sequential decisions to complete.
PinchBench was designed specifically for this. It evaluates models on:
- Multi-step tool calling: does the model call tools in the right order with the right arguments?
- Error recovery: when a tool call fails, does the model adapt or get stuck?
- Context persistence: does the model correctly track state across many steps?
- Instruction fidelity: does it complete the actual task requested, or drift to something adjacent?
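Those four behaviors map directly onto the core loop of any tool-using agent. A minimal sketch of that loop (hypothetical tool registry and planner interface, not PinchBench's actual harness):

```python
def run_agent(plan_step, tools, max_steps=10, max_retries=2):
    """Minimal agent loop: the model proposes a tool call, the runtime
    executes it, and failures are fed back so the model can adapt."""
    history = []  # context persistence: the full trace of calls and results
    for _ in range(max_steps):
        action = plan_step(history)  # model decides the next call from state
        if action is None:           # model signals the task is complete
            return history
        name, args = action
        for attempt in range(max_retries + 1):
            try:
                result = tools[name](**args)
                history.append((name, args, result))
                break
            except Exception as err:
                # error recovery: record the failure instead of halting
                history.append((name, args, f"error: {err}"))
                if attempt == max_retries:
                    break
    return history

# Toy run: a two-step "search then summarize" task with fake tools.
tools = {
    "search": lambda q: f"results for {q}",
    "summarize": lambda text: text.upper(),
}

def plan_step(history):
    if not history:
        return ("search", {"q": "nemotron"})
    if len(history) == 1:
        return ("summarize", {"text": history[0][2]})
    return None  # task done

trace = run_agent(plan_step, tools)
```

A benchmark like this essentially scores how often the model's `plan_step` decisions keep that loop on track over long horizons.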
Nemotron 3 Super's 85.6% score is notably better than the next-best open-source model (Qwen 3 72B at around 79%) and competitive with Claude Sonnet 4.6, which sits above 90% on equivalent benchmarks. The gap to proprietary models is real but narrowing fast.
How NVIDIA Is Distributing It
Nemotron 3 Super is available through NVIDIA NIM, the company's hosted inference platform, at no cost for developers registered with the NVIDIA Developer Program. The API is OpenAI-compatible, which means any agent framework that can call GPT-4 can call Nemotron 3 Super with a base-URL swap and an API-key change.
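In practice that swap is just a different endpoint and key; everything else stays in the standard OpenAI chat-completions shape. A sketch of assembling such a request (the endpoint URL and model identifier below are assumptions, check NVIDIA's NIM documentation for the exact values):

```python
import json

def build_chat_request(base_url, api_key, model, messages):
    """Assemble an OpenAI-compatible chat-completions request.
    Any HTTP client (requests, httpx, or the openai SDK) can send it."""
    return {
        "url": f"{base_url}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"model": model, "messages": messages}),
    }

# Hypothetical values: swap the base URL and model name, keep the code.
req = build_chat_request(
    base_url="https://integrate.api.nvidia.com/v1",  # NIM endpoint (assumed)
    api_key="nvapi-...",                             # your NIM key placeholder
    model="nvidia/nemotron-3-super",                 # model id (assumed)
    messages=[{"role": "user", "content": "Plan the next tool call."}],
)
```

Because the request shape is identical, frameworks like LangChain or a hand-rolled agent loop need no structural changes, only configuration.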
Self-hosting the model requires significant infrastructure. At 120B total parameters, even with MoE efficiency, you're looking at multi-GPU deployment โ typically 2-4 H100s for comfortable throughput. For most teams, the NIM API route is the practical path, at least until the model gets further quantized and community-optimized versions appear on Hugging Face.
What This Means for Agent Pipelines
The most interesting implication isn't Nemotron 3 Super in isolation; it's what a model purpose-built for agentic tasks means for how pipelines get designed. Until recently, the assumption was that you'd use a large proprietary model (Claude, GPT-4o) for any task requiring multi-step reasoning and reserve open models for simpler steps. Nemotron 3 Super challenges that assumption directly.
For production agent pipelines running on tight budgets, a practical routing strategy now looks something like this: use Haiku or Mistral Small for lightweight triage and metadata tasks, route complex multi-step reasoning to Nemotron 3 Super via the free NIM tier, and keep a Claude Sonnet budget as the final fallback for tasks where maximum reliability matters. That stack is entirely free at low-to-medium volumes.
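That routing policy can be expressed as a few lines of dispatch logic. The model names and task categories below are illustrative labels for the tiers described above, not a fixed API:

```python
def pick_model(task_kind, requires_max_reliability=False):
    """Route a task to the cheapest tier likely to handle it:
    triage tier -> free agentic tier -> paid reliability fallback."""
    if requires_max_reliability:
        return "claude-sonnet"           # paid fallback for critical steps
    if task_kind in {"triage", "metadata", "classification"}:
        return "haiku-or-mistral-small"  # lightweight, cheap tier
    if task_kind in {"multi_step", "tool_use", "browser"}:
        return "nemotron-3-super"        # free NIM tier for agentic work
    return "nemotron-3-super"            # default to the free agentic model
```

The design choice worth noting: the fallback flag is a per-task override rather than a separate code path, so the expensive model is used only where the pipeline explicitly demands it.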
The Bigger Picture
Nemotron 3 Super is part of a broader NVIDIA strategic push to become indispensable not just at the hardware layer (GPUs, NIM infrastructure) but at the model layer too. By releasing capable open models and hosting them for free, NVIDIA positions itself as the infrastructure provider for the open-source AI ecosystem, a smart play that mirrors what AWS did for cloud computing. Developers build on free tiers; production deployments eventually land on NVIDIA hardware.
For the agent developer community, the practical result is a genuinely capable free model optimized for the workloads that matter. For anyone building research pipelines, automation agents, or content systems, Nemotron 3 Super is now a serious first option rather than a curiosity.