What the Rubin platform is
Rubin is not a chip; it is a platform. It comprises six interconnected components: the Vera CPU + Rubin GPU (the main compute unit), the NVLink 6 Switch, the ConnectX-9 SuperNIC, the BlueField-4 DPU, the Spectrum-6 Ethernet Switch, and, starting in March 2026, the integrated Groq 3 LPU.
The bundle is sold as Vera Rubin NVL72: a server with 72 Rubin GPUs that behaves like a single compute unit, connected by NVLink 6 at 1.8 TB/s between GPUs.
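A quick back-of-envelope calculation, using only the figures quoted above, gives a sense of the rack's aggregate interconnect capacity (a rough upper bound, since the text does not specify the link topology):

```python
# Rough aggregate NVLink bandwidth for an NVL72 rack, using only the
# numbers in the text: 72 GPUs, 1.8 TB/s NVLink 6 per GPU.
NUM_GPUS = 72
PER_GPU_NVLINK_TBPS = 1.8  # TB/s GPU-to-GPU, per the announcement

aggregate_tbps = NUM_GPUS * PER_GPU_NVLINK_TBPS
print(f"Aggregate NVLink bandwidth: {aggregate_tbps:.1f} TB/s")
```

This is what lets 72 GPUs behave as a single compute unit: expert and activation traffic crosses the rack at memory-like speeds rather than network speeds.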
Comparative vs. Blackwell
(Comparison figures not recovered; the table's rows covered cost per inference token, time to train MoE models, and the NVL72 rack operating as one unit.)
The Hopper → Blackwell jump was already dramatic (~2.5× inference). Blackwell → Rubin is of the same order, but in a different dimension: cost per output unit rather than raw FLOPS.
The design is optimized for Mixture of Experts (MoE) models, the architecture of choice at frontier labs (GPT-5.5, Claude Opus 4.7, and Gemini 4 all use MoE). Rubin specifically accelerates the dispatch of tokens to experts.
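To make the dispatch step concrete, here is a minimal sketch of top-k token-to-expert routing, the operation the text says Rubin accelerates. All names and shapes are illustrative, not NVIDIA's API:

```python
import numpy as np

# Illustrative top-k token-to-expert routing for a MoE layer.
rng = np.random.default_rng(0)

num_tokens, num_experts, top_k = 8, 4, 2
router_logits = rng.normal(size=(num_tokens, num_experts))

# Each token picks its top_k experts by router score.
topk_experts = np.argsort(router_logits, axis=1)[:, -top_k:]

# Softmax over the selected logits gives per-token mixing weights.
selected = np.take_along_axis(router_logits, topk_experts, axis=1)
weights = np.exp(selected) / np.exp(selected).sum(axis=1, keepdims=True)

# Group token indices by expert. This gather/scatter is the
# bandwidth-heavy "dispatch" step that benefits from fast interconnect.
dispatch = {e: np.where((topk_experts == e).any(axis=1))[0]
            for e in range(num_experts)}
print(dispatch)
```

In a real rack-scale deployment, each expert lives on a different GPU, so the dispatch dictionary above turns into an all-to-all exchange over NVLink, which is why interconnect bandwidth dominates MoE serving cost.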
Groq 3 LPU: the surprise switch
In March 2026, NVIDIA integrated the Groq 3 LPU into the platform. Groq, previously a startup competing with NVIDIA in inference, was acquired in a deal that surprised the market. The Groq 3 LPU is an accelerator specialized for ultra-low-latency inference, delivering significantly higher tokens per second than general-purpose GPUs.
For use cases like real-time voice, interactive agents, or algorithmic trading, the Rubin + Groq 3 combination brings TTFB (time-to-first-byte) latency below 100 ms on large models.
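Measuring TTFB against a streaming endpoint is straightforward: time the gap between sending the request and receiving the first chunk. A minimal sketch, where `stream_tokens` is a stand-in for any streaming client, not a real NVIDIA or Groq API:

```python
import time

def stream_tokens():
    """Hypothetical streaming client; sleeps simulate network + prefill."""
    time.sleep(0.05)   # queueing + prefill before the first chunk
    yield "Hello"
    for _ in range(3):
        time.sleep(0.01)
        yield " world"

stream = stream_tokens()
start = time.perf_counter()
first_chunk = next(stream)                       # block until first chunk
ttfb_ms = (time.perf_counter() - start) * 1000
print(f"TTFB: {ttfb_ms:.0f} ms")
```

The same pattern works against any SSE or gRPC streaming API: only the time to the first chunk matters for perceived responsiveness, which is the metric the Groq 3 integration targets.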
Ising: the quantum nod
NVIDIA used GTC 2026 to announce Ising, the first open AI models aimed at accelerating the path to useful quantum computers. It is not a commercial product yet, but it marks where NVIDIA is looking: the next bottleneck after AI will be modeling quantum systems.
Availability and partners
Rubin is in full production. Rubin-based products will be available via partners in the second half of 2026. AWS, Google Cloud, Microsoft Azure and Oracle Cloud Infrastructure will be the first cloud providers to deploy Vera Rubin instances.
For companies: cloud prices will be lower than Blackwell's per unit of throughput, but initial availability will be limited. Anthropic and OpenAI are among the first guaranteed customers.
What it means for companies
The main impact is not on frontier labs but on companies running their own models or 24/7 agents at scale. Inference typically accounts for 60-80% of the total operating cost of an AI product in production. Cutting it by 10× changes business models.
Concrete case: a chatbot processing 1M messages/day with a large model costs ~$15K/month on Blackwell. On Rubin, that drops to ~$1.5K/month, enabling use cases that previously didn't make economic sense.
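The arithmetic behind that example is worth making explicit. The per-message rates below are back-solved from the article's quoted monthly totals, not published pricing:

```python
# Worked version of the article's example: monthly inference cost at
# 1M messages/day, comparing the quoted Blackwell vs. Rubin figures.
MESSAGES_PER_DAY = 1_000_000
DAYS = 30

blackwell_monthly = 15_000   # USD/month, per the article
rubin_monthly = 1_500        # USD/month, ~10x cheaper per the article

for name, monthly in [("Blackwell", blackwell_monthly),
                      ("Rubin", rubin_monthly)]:
    per_msg = monthly / (MESSAGES_PER_DAY * DAYS)
    print(f"{name}: ${monthly:,}/month -> ${per_msg:.6f} per message")
```

At ~$0.0005 per message on Blackwell versus ~$0.00005 on Rubin, features like per-message summarization or proactive agent turns move from cost centers to rounding errors.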
Conclusion
Rubin marks the transition from the "train bigger" era to the "serve more efficiently" era. For companies already operating AI in production, the 2026 question isn't whether to migrate to Rubin but when and to which provider. The preferential-pricing window will be short: the largest consumers will move first.