MarkTechPost · 2026-06-04

NVIDIA AI Releases Nemotron 3 Ultra: An Open 550B Mixture-of-Experts Hybrid Mamba-Transformer for Long-Running Agents

NVIDIA has released Nemotron 3 Ultra, the largest model in its Nemotron 3 family. It targets a specific problem: long-running agents that plan, call tools, and reason across many turns. As agents run... faster and cheaper.

What is Nemotron 3 Ultra

Nemotron 3 Ultra is a 550 billion total parameter Mixture-of-Experts (MoE) model. Only 55 billion parameters are active per token. The MoE design improves accuracy per active parameter.

It uses a hybrid Mamba-Attention architecture instead of a pure Transformer. Mamba layers handle long sequences with sub-quadratic scaling.

The model was pretrained on 20 trillion text tokens. Context was then extended to 1 million tokens. It was post-trained using Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD).

The NVIDIA team reports up to roughly 6x higher inference throughput than comparable open LLMs, at on-par accuracy.

https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf

The Architecture

The model has 108 layers and a model dimension of 8,192. It uses 64 query heads and only 2 key-value heads, which keeps the KV cache small. Each MoE layer holds 512 experts, with the top 22 activated per token.

Three design choices stand out:

LatentMoE routes experts more efficiently. It buys more routed experts at fixed inference cost by trading away hidden-dimension width. The NVIDIA team reports better accuracy per parameter than standard granular MoEs.

Multi-Token Prediction (MTP) predicts several future tokens in one forward pass. It enables native speculative decoding for faster generation.
Two MTP heads share parameters during training.

NVFP4 pre-training uses the E2M1 4-bit datatype with two-dimensional block quantization on weights. The NVIDIA team calls this the largest-scale demonstration of stable, accurate NVFP4 training to date.

The hybrid Mamba-Attention stack is particularly important for agents. Mamba's per-step decode cost stays constant as sequence length grows. That is why throughput gains widen on long, decode-heavy workloads.

Pretraining and the Data Release

Pretraining used a Warmup-Stable-Decay learning rate schedule over 20 trillion tokens. It was split into two phases. The first 15 trillion tokens were biased for diversity. The final 5 trillion were biased for high-quality data.

The NVIDIA team also released new domain-specific pretraining datasets. These include 173 billion refreshed GitHub code tokens. In a Nemotron 3 Nano ablation, a synthetic legal set raised a proxy LegalBench average from 64.6 to 74.7. In a similar ablation, a Wiki-based fact-seeking set raised proxy SimpleQA from 40.2 to 50.2.

The post-training release is also large. NVIDIA adds 10 million new SFT samples and 1 million new RL tasks. It adds 15 new RL environments. Cumulative Nemotron open totals reach 50M SFT samples, 2M RL tasks, and 55 RL environments.

NVIDIA documents two loss divergences during pretraining. The first occurred near 8 trillion tokens and was traced to moving output-layer gradient reduction from FP32 to BF16. The MTP gradient contribution was effectively lost in BF16's 7 mantissa bits. Reverting to FP32 gradient reduction re-stabilized training. The second divergence, near 16 trillion tokens, had no confirmed root cause. NVIDIA mitigated it by annealing the learning rate early. It then cut the total token horizon to 20 trillion tokens.

Post-Training: SFT, RLVR, and MOPD

The post-training pipeline runs SFT, then unified RLVR, then MOPD warmup, MOPD, and MTP Boosting. The whole loop can repeat for several cycles. RLVR stands for Reinforcement Learning with Verifiable Reward.
It trains across many environments at once: terminal use, software engineering, search, math, code, safety, and more. The reward in these settings is often sparse and environment-dependent.

MOPD is the main new post-training method. Mixed-environment RLVR dilutes the learning signal as the number of environments grows. To address this, the NVIDIA team trained more than ten domain-specialized teacher models. Each teacher has its own training pipeline.

During MOPD, the student model generates its own rollouts across domains. Each rollout is scored by the matching teacher with dense, token-level guidance. This is a denser signal than RLVR's sparse rewards. The process runs asynchronously, with rollout generation, teacher scoring, and student updates pipelined.

MOPD is also iterative. After one MOPD checkpoint, new teachers are initialized from the improved student. Their gains merge back into the next round. The NVIDIA team ran two MOPD iterations for Nemotron 3 Ultra.

One practical caveat is worth noting. MOPD works best when student rollouts stay within the teacher's support. ... samples.

Reasoning Effort Control

Nemotron 3 Ultra supports three reasoning modes: reasoning-off, regular, and medium-effort. The regular and medium modes also accept an inference-time budget control.

Medium-effort is the efficiency lever. The NVIDIA team reports it uses about 2.5x fewer tokens than regular mode. The cost is roughly a 7% drop in accuracy. For high-volume agent steps, that trade can lower spend meaningfully.

The Benchmark

The comparisons in NVIDIA's research report use GLM-5.1 (754B), Kimi-K2.6 (1T), and Qwen-3.5 (397B), among others. The picture is competitive rather than dominant.

On agentic tasks, Nemotron 3 Ultra posts 90.0 on PinchBench and 56.0 on ProfBench (Search). The NVIDIA team reserved both as held-out generalization gates, scored only once on the final model. It scores 71.9 on SWE-Bench Verified and 56.4 on Terminal Bench 2.1. On Terminal Bench, Kimi-K2.6 leads at 67.2.
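
The medium-effort trade-off described under Reasoning Effort Control can be put in rough numbers. The sketch below is a back-of-the-envelope calculation, not NVIDIA's methodology. It assumes the 7% drop is relative to regular-mode accuracy and that a failed task is simply retried until it succeeds. The baseline (10,000 tokens per attempt at 70% accuracy) is hypothetical and does not come from the report.

```python
def tokens_per_success(tokens_per_attempt: float, accuracy: float) -> float:
    """Expected tokens spent per successful task, assuming independent retries."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return tokens_per_attempt / accuracy


# Hypothetical regular-mode baseline (illustrative values, not from the report).
REGULAR_TOKENS = 10_000.0
REGULAR_ACCURACY = 0.70

# Figures quoted in the article.
TOKEN_REDUCTION = 2.5          # medium-effort uses about 2.5x fewer tokens
RELATIVE_ACCURACY_DROP = 0.07  # "roughly a 7% drop", treated as relative (assumption)

medium_tokens = REGULAR_TOKENS / TOKEN_REDUCTION
medium_accuracy = REGULAR_ACCURACY * (1.0 - RELATIVE_ACCURACY_DROP)

regular_cost = tokens_per_success(REGULAR_TOKENS, REGULAR_ACCURACY)
medium_cost = tokens_per_success(medium_tokens, medium_accuracy)
ratio = medium_cost / regular_cost

print(f"regular: {regular_cost:,.0f} tokens per success")
print(f"medium:  {medium_cost:,.0f} tokens per success")
print(f"medium / regular: {ratio:.2f}")
```

Under these assumptions the ratio comes out at about 0.43 whatever baseline is chosen, because the baseline cancels out. If the 7% is an absolute drop instead, the result depends on the baseline accuracy.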
On reasoning, it scores 570.0 on IOI 2025. The NVIDIA team frames this as top-3-human-level competitive programming. On AA-Omniscience, it records the highest non-hallucination score in the set at 78.7. That suggests a lower tendency to answer when uncertain.

Long context holds up at scale. The model scores 94.7 on RULER at 1 million tokens. Several larger comparison models top out at 256K context.

On an 8K input / 64K output setting at NVFP4 on GB200, Nemotron 3 Ultra reaches 5.9x the throughput of GLM-5.1. It is 4.8x faster than Kimi-K2.6 and 1.6x faster than Qwen-3.5. Note: Nemotron's numbers use TRT-LLM, while the others use vLLM.

The trade-off is visible on prefill-heavy work. On a 50K input / 2K output setting, it trails Qwen-3.5, because prefill cost tracks active parameters. The NVIDIA team also reports up to 30% lower cost to task completion, from fewer tokens per turn on SWE-Bench and Terminal Bench.

The NVIDIA team also stresses harness robustness. The model is trained under multiple agent harnesses per task type, not one. SWE-Bench Verified scores stay between 65% and 70.4% across Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent. The goal is consistent behavior regardless of deployment framework.

Quantization and Deployment

The NVIDIA team ships a single NVFP4 checkpoint. On Blackwell it runs with native FP4 math. On Hopper it runs as W4A16, since Hopper lacks native FP4 tensor cores.

The final solution operates at 5.03 bits-per-element. It mixes NVFP4 routed experts with FP8 layers for shared experts and Mamba linears. Attention layers stay in BF16. The NVIDIA team found accuracy saturated below this budget, so higher precision added no measurable gain.

The reduced weight footprint has a deployment benefit. The W4A16 path leaves room to fit MTP weights on a single 8-GPU H100 node. An FP8 checkpoint could not, without spanning two nodes.

Key Takeaways

- Nemotron 3 Ultra is a 550B open MoE (55B active) using a hybrid Mamba-Attention design for long-running agents.
- NVIDIA reports up to ~6x higher inference throughput than comparable open LLMs at on-par accuracy (5.9x vs GLM-5.1 on 8K/64K).
- It pairs a 1M-token context with the highest non-hallucination score in its comparison set (78.7 on AA-Omniscience).
- Post-training centers on Multi-teacher On-Policy Distillation (MOPD), distilling 10+ specialized teachers into one student.
- Weights, training data, and recipes ship openly under OpenMDW-1.1, with one NVFP4 checkpoint for Blackwell, Hopper, and Ampere.

Marktechpost's Visual Explainer: NVIDIA Nemotron 3 Ultra

Open Model Release
Nemotron 3 Ultra: a 550B open MoE built for long-running agents
An open Mixture-of-Experts hybrid Mamba-Transformer for agentic reasoning, tool use, and long-context tasks.
- Total / Active: 550B / 55B (sparse MoE, 55B active per token)
- Context: 1M tokens (extended after 20T-token pretraining)
- Throughput: ~6x (up to ~6x vs comparable open LLMs)
- License: OpenMDW-1.1 (open weights, data, and recipes)
Pre-trained on 20T tokens, then post-trained with SFT, RLVR, and Multi-teacher On-Policy Distillation (MOPD).

What It Is
A hybrid Mamba-Attention MoE, not a pure Transformer
- Hybrid stack: Mamba layers scale sub-quadratically; a few Attention layers preserve precise recall.
- Sparse MoE: 550B total parameters, 55B active per token, improving accuracy per active parameter.
- Long context: pretrained on 20T text tokens, then extended to a 1M-token window.
- Open release: base, post-trained, and NVFP4 checkpoints, plus training data and recipes.
Throughput gains come mainly from the hybrid Mamba-Attention design, which bounds KV-cache footprint.

Architecture
108 layers, 512 experts per layer, top-22 routing
- Layers: 108
- Model dimension: 8,192
- Attention: 64 / 2 (query heads / KV heads)
- Experts: 512 (top-22 activated per token)
- Precision: NVFP4 (E2M1, 2D block quantization)
Key techniques
- LatentMoE: more routed experts at fixed inference cost by trading hidden-dimension width.
- Multi-Token Prediction (MTP): predicts several tokens per pass; two heads share parameters.
- NVFP4 pre-training: NVIDIA's largest-scale stable, accurate FP4 training run to date.

Pretraining & Data
20T tokens in two phases, plus new open datasets
- Two-phase curriculum: 15T tokens biased for diversity, then 5T biased for quality.
- Code refresh: 173B new GitHub tokens with a September 30, 2025 cutoff.
- Domain data (Nano ablations): legal lifted proxy LegalBench 64.6 to 74.7; Wiki lifted proxy SimpleQA 40.2 to 50.2.
- Post-training data: +10M SFT samples and +1M RL tasks; totals reach 50M SFT, 2M RL tasks, 55 environments.
NVIDIA documents two loss divergences (near 8T and 16T tokens) and the fixes used to stabilize training.

Post-Training
MOPD: distilling 10+ specialized teachers into one student
Pipeline: SFT -> RLVR -> MOPD Warmup -> MOPD -> MTP Boosting
- Why MOPD: mixed-environment RLVR dilutes the signal as the number of environments grows.
- How it works: the student generates rollouts; each teacher scores them with dense token-level guidance.
- Asynchronous: rollout generation, teacher scoring, and student updates run pipelined.
- Iterative: NVIDIA ran two MOPD iterations, re-initializing teachers from the improved student.
A short SFT warmup keeps student rollouts within each teacher's support before distillation.

Benchmarks
Competitive across agentic, reasoning, and long-context tasks
- PinchBench (held-out): 90.0 (top tier of evaluated open models)
- SWE-Bench Verified: 71.9 (software engineering agents)
- IOI 2025: 570.0 (top-3-human-level, NVIDIA framing)
- RULER @ 1M: 94.7 (long-context retrieval)
- ProfBench (Search): 56.0, the second held-out generalization gate.
- AA-Omniscience: highest non-hallucination score in the set at 78.7.
- Terminal Bench 2.1: 56.4, where Kimi-K2.6 leads at 67.2.
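
The "dense token-level guidance" in the Post-Training slide can be illustrated with a toy example. The article does not state NVIDIA's exact MOPD objective, so the sketch below swaps in one common form of on-policy distillation signal: for each token the student sampled, the teacher's log-probability of that token minus the student's. The function names, the three-token vocabulary, and the logits are all hypothetical.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary row."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def token_level_signal(
    student_logits: List[List[float]],
    teacher_logits: List[List[float]],
    sampled_ids: List[int],
) -> List[float]:
    """Per-token signal on a student rollout: log p_teacher(tok) - log p_student(tok).

    Positive values mean the teacher likes the sampled token more than the
    student does; negative values mean the teacher likes it less.
    This is an assumed objective, not the one confirmed by the report.
    """
    signals = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_ids):
        signals.append(log_softmax(t_row)[tok] - log_softmax(s_row)[tok])
    return signals


# Toy rollout of three tokens over a three-token vocabulary (hypothetical).
student = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 0.0, 0.0]]
rollout = [0, 1, 1]

dense = token_level_signal(student, teacher, rollout)
for position, value in enumerate(dense):
    print(f"token {position}: signal {value:+.3f}")
```

A sparse RLVR-style reward would attach one number to this whole rollout. Here every position gets its own value: zero where student and teacher agree, positive where the teacher prefers the sampled token, negative where it does not.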
Throughput & Efficiency
Faster on decode-heavy work; budget control for cost
- vs GLM-5.1: 5.9x (8K in / 64K out, NVFP4 on GB200)
- vs Kimi-K2.6: 4.8x (same decode-heavy setting)
- vs Qwen-3.5: 1.6x (trails Qwen on prefill-heavy work)
- Cost to complete: ~30% lower, from fewer tokens per turn
- Reasoning modes: reasoning-off, regular, and medium-effort, with inference-time budget control.
- Medium-effort: about 2.5x fewer tokens for roughly a 7% accuracy drop.
Throughput is reported with TRT-LLM for Nemotron and vLLM for the other models.

Quantization, Licensing & Takeaways
One NVFP4 checkpoint, across NVIDIA GPU generations
- Single checkpoint: native FP4 on Blackwell, W4A16 on Hopper, also runs on Ampere.
- Operating point: 5.03 bits-per-element, mixing NVFP4 experts with FP8 and BF16 layers.
- Footprint win: the W4A16 path fits MTP weights on a single 8-GPU H100 node.
- Fully open: weights, data, and recipes under OpenMDW-1.1; fine-tune via LoRA, SFT, or RL.
Not the top scorer on every benchmark. The design favors throughput, long context, and reliability for agents.

Sources: NVIDIA Nemotron 3 Ultra technical report & blog · Verified Jun 4, 2026

Where to Use Nemotron 3 Ultra: Inference Providers
Verified hosting and access points for NVIDIA's open 550B-A55B model.

Nebius Token Factory (Featured)
Listed by NVIDIA as a launch cloud & inference partner for Nemotron 3 Ultra. Managed OpenAI-compatible API, dedicated GPU endpoints, and fine-tuning, with transparent per-token pricing.
nebius.com/services/token-factory

NVIDIA NIM (Official)
Try the hosted API on build.nvidia.com or deploy the NIM microservice anywhere.
build.nvidia.com

Hugging Face (Weights)
Download the BF16 and NVFP4 checkpoints, plus training data and recipes.
huggingface.co/nvidia

OpenRouter (API)
Unified API access across providers; a free tier is also listed.
openrouter.ai/nvidia

Together AI (API)
Serverless inference through an OpenAI-compatible API endpoint.
together.ai/models

Perplexity (App)
Use Nemotron 3 Ultra inside Perplexity with a Pro subscription or via API.
perplexity.ai

GitHub (NeMo) (Self-host)
Cookbooks, deployment recipes, and agent-harness getting-started instructions.
github.com/NVIDIA-NeMo

Sources: NVIDIA Nemotron 3 Ultra blog & provider listings · Verified Jun 4, 2026
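
The single-node deployment claim from the Quantization and Deployment section can be sanity-checked with simple arithmetic. This is a rough sketch, not NVIDIA's sizing method. It assumes the 5.03 bits-per-element average applies to all 550B parameters and that each H100 has 80 GB of memory; neither assumption is stated in the article. It also ignores the MTP weights, KV cache, Mamba state, and activations, which all consume the remaining headroom.

```python
TOTAL_PARAMS = 550e9            # from the article
BITS_PER_ELEMENT_MIXED = 5.03   # from the article (NVFP4 / FP8 / BF16 mix)
BITS_PER_ELEMENT_FP8 = 8.0      # a uniform FP8 checkpoint, for comparison
GPUS_PER_NODE = 8               # from the article
HBM_PER_GPU_GB = 80.0           # assumption: 80 GB H100 variant


def weight_footprint_gb(params: float, bits_per_element: float) -> float:
    """Weight storage in decimal gigabytes for a given average bit width."""
    return params * bits_per_element / 8 / 1e9


node_gb = GPUS_PER_NODE * HBM_PER_GPU_GB
mixed_gb = weight_footprint_gb(TOTAL_PARAMS, BITS_PER_ELEMENT_MIXED)
fp8_gb = weight_footprint_gb(TOTAL_PARAMS, BITS_PER_ELEMENT_FP8)

print(f"node memory:       {node_gb:.1f} GB")
print(f"mixed-precision:   {mixed_gb:.1f} GB weights, {node_gb - mixed_gb:.1f} GB left")
print(f"uniform FP8:       {fp8_gb:.1f} GB weights, {node_gb - fp8_gb:.1f} GB left")
```

Under these assumptions the mixed-precision weights take roughly 346 GB of a 640 GB node, while uniform FP8 weights take 550 GB. The much smaller FP8 headroom is consistent with the article's statement that an FP8 checkpoint plus MTP weights would need two nodes, though the sketch does not prove it.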
