A technical and investment analysis of Nvidia’s most architecturally significant product launch since the H100
The Thesis in One Sentence
Nvidia just vertically integrated the inference stack — and Wall Street hasn’t fully priced it in yet.
Why Inference Is a Structurally Different Market
To understand the investment case, you need to understand what makes inference fundamentally different from training — not just technically, but economically.
Training is a capital event. You buy GPUs, burn power for weeks or months, and produce a model. Inference is an operating expense that never ends. Every user prompt, every agentic task, every API call is a billable inference event. As AI moves from R&D curiosity to enterprise utility, inference becomes the dominant workload — running 24/7 at massive concurrency against trillion-parameter models.
The core challenge is that inference is really two workloads in one: prefill and decode. Prefill — processing the input prompt — is massively parallel and GPU-friendly. Decode — the autoregressive generation of each output token — is inherently serial: token N+1 cannot be computed until token N exists. GPUs were never designed for the latter, and as context windows grow into the millions of tokens, the inefficiency compounds. This is the gap the LPX is engineered to close.
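The split can be made concrete with a toy sketch. This is illustrative NumPy only, not any vendor's API: prefill scores every prompt position in one batched matrix multiply, while decode has to loop because each step consumes the previous step's output.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, prompt_len = 64, 100, 32
W = rng.standard_normal((d, vocab))     # toy "model": a single projection matrix

def prefill(prompt_embeddings):
    # All prompt positions are independent: one large, parallel matmul.
    return prompt_embeddings @ W        # shape (prompt_len, vocab)

def decode(state, n_tokens):
    # Each output token depends on the previous one: an inherently serial loop.
    tokens = []
    for _ in range(n_tokens):
        logits = state @ W              # one small matmul per token
        tokens.append(int(np.argmax(logits)))
        state = rng.standard_normal(d)  # stand-in for feeding the token back
    return tokens

prompt = rng.standard_normal((prompt_len, d))
logits = prefill(prompt)                     # GPU-friendly: big parallel work
out = decode(rng.standard_normal(d), 8)      # serial: 8 dependent steps
```

The GPU utilization gap falls directly out of this shape: prefill is one big operation, decode is many tiny dependent ones.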
The Architecture: Why SRAM Changes the Economics
The Groq 3 LPU’s key insight is that the memory bottleneck — not compute — is what limits inference speed and cost.
Traditional GPU inference relies on HBM (High Bandwidth Memory) stacked next to the die. Nvidia’s Rubin GPU carries 288 GB of HBM4 with 22 TB/s of bandwidth. That’s impressive for training. But for decode, the problem is that every output token requires retrieving the full weight set from off-chip HBM and writing results back — an expensive round trip that accumulates at scale.
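The scale of that round trip can be sanity-checked with a rough roofline-style estimate. The model size and quantization below are my assumptions, not Nvidia's figures; only the two bandwidth numbers come from this article. If every token of a single request streams the full weight set from memory, the token rate is capped by bandwidth divided by model bytes:

```python
# Roofline-style estimate: single-request decode is memory-bandwidth-bound,
# so tokens/s is at most bandwidth / bytes-read-per-token.
params = 1e12              # 1-trillion-parameter model (assumption)
bytes_per_weight = 1       # 8-bit quantization (assumption)
weight_bytes = params * bytes_per_weight

hbm_bw = 22e12             # 22 TB/s HBM4 bandwidth (from the article)
sram_bw = 150e12           # 150 TB/s on-chip SRAM bandwidth (from the article)

tok_s_hbm = hbm_bw / weight_bytes    # ceiling if weights stream from HBM
tok_s_sram = sram_bw / weight_bytes  # ceiling if weights stream from SRAM
ratio = tok_s_sram / tok_s_hbm       # ~6.8x, matching the article's ~7x figure
```

Real systems batch many requests to amortize each weight read, so absolute numbers differ; the point is the ratio, which is set by memory bandwidth alone.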
The LPU inverts this. On the LPU, weights are already resident in SRAM at each processing station; only the activation tensors move between chips. During inference, the only data traveling between chip groups is the intermediate activation output from the previous stage — flowing chip to chip like a product moving along a conveyor belt, each station performing its assigned computation and passing the result forward.
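A minimal sketch of that conveyor-belt pattern — weight-stationary pipeline parallelism — looks like this. The stage count, shapes, and activation function are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_stages = 16, 4

# Weights are "resident" at each station: allocated once, never moved.
stages = [rng.standard_normal((d, d)) for _ in range(n_stages)]

def run_pipeline(activation):
    # Only the activation tensor travels; each station applies its own
    # fixed weights and passes the result to the next station.
    for W in stages:
        activation = np.tanh(activation @ W)
    return activation

x = rng.standard_normal(d)
y = run_pipeline(x)
```

In hardware terms, each list entry stands in for one chip group's SRAM-resident weights, and the `activation` variable is the only inter-chip traffic.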
The resulting bandwidth numbers are striking:
- 150 TB/s of on-chip SRAM bandwidth per LPU — roughly 7× the Rubin GPU’s 22 TB/s of HBM4 bandwidth
- 40 petabytes per second of on-chip SRAM bandwidth at full rack scale
- 640 TB/s of rack-scale chip-to-chip communication
The tradeoff: each LPX rack holds 256 LPUs with 128 GB total SRAM — a far smaller memory footprint than HBM-equipped GPUs. That’s why the system is heterogeneous by design: Rubin GPUs handle prefill (where large memory and parallelism win), and Groq LPUs handle latency-sensitive decode (where serialized token generation demands SRAM bandwidth over raw capacity).
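The rack-level figures above are internally consistent, and working through them makes the tradeoff explicit: 256 LPUs at 150 TB/s each give roughly the quoted 40 PB/s aggregate, while 128 GB of total SRAM works out to just 0.5 GB per chip.

```python
lpus_per_rack = 256
sram_bw_per_lpu = 150e12     # 150 TB/s per LPU (from the article)
rack_sram_gb = 128           # 128 GB total SRAM per rack (from the article)

agg_bw = lpus_per_rack * sram_bw_per_lpu   # 3.84e16 B/s ~ 38.4 PB/s ~ "40 PB/s"
sram_per_lpu = rack_sram_gb / lpus_per_rack  # 0.5 GB of SRAM per LPU
```

Half a gigabyte per chip is why the heterogeneous split exists at all: the weights of a trillion-parameter model must be sharded across the whole rack, while capacity-hungry prefill stays on HBM-equipped GPUs.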
The Disaggregated Inference Architecture
The full Vera Rubin + LPX system operationalizes this split through Nvidia Dynamo, the orchestration layer. Dynamo classifies incoming requests and orchestrates disaggregated serving via an AFD (Attention-FFN-Decode) loop: prefill and attention operations go to Rubin GPUs, while latency-sensitive FFN and MoE decode is directed to LPUs. The result is high AI factory throughput alongside the low tail latency essential for agentic and premium AI services.
This isn’t just a hardware story. The software integration is what makes the platform defensible. Nvidia confirmed that the LPU operates as an accelerator within the existing CUDA stack, with computation offloaded transparently on a per-token basis. Developers using PyTorch, TensorFlow, or JAX don’t need to rewrite anything — the compiler handles LPU offloading automatically.
The Performance Claims: What Do the Numbers Actually Mean?
Nvidia claims 35× higher inference throughput per megawatt and 10× more revenue opportunity versus Blackwell NVL72 for trillion-parameter models. Let’s unpack those carefully.
The 35× throughput claim is specifically for the decode phase of trillion-parameter models at high concurrency. The target throughput for agentic workloads is up to 1,500 tokens per second — compared with typical GPU inference speeds an order of magnitude lower. At that rate, a single LPX rack can serve real-time multi-agent workflows that current infrastructure simply can’t sustain at viable cost.
The revenue opportunity metric is arguably the more important investor signal. When paired with Vera Rubin, Nvidia claims AI factories can produce premium tokens at scale, unlocking 10× more revenue per watt. For a hyperscaler or cloud provider charging per token, this translates directly to margin expansion without proportional capex growth.
At a listed price of $45 per million tokens at 300 tokens/second/megawatt, the LPX is positioned as a premium inference product — not a commodity cost-cutter. That’s a deliberate strategic choice that deserves scrutiny (see risk factors below).
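Taken at face value, those two listed numbers imply a concrete revenue rate per megawatt. The inputs below are the article's figures; the per-hour framing is my own:

```python
price_per_m_tokens = 45      # $45 per million tokens (listed)
tok_s_per_mw = 300           # 300 tokens/second/megawatt (listed)

tokens_per_hour_per_mw = tok_s_per_mw * 3600      # 1.08M tokens per MW-hour
revenue_per_mw_hour = tokens_per_hour_per_mw / 1e6 * price_per_m_tokens
# revenue_per_mw_hour ~ $48.60 per megawatt-hour of capacity
```

Whether $48.60 per megawatt-hour clears a premium margin depends entirely on power and amortized hardware costs per megawatt, which is exactly where the pricing-compression risk below bites.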
The Competitive Moat: Why This Matters Beyond the Specs
The deeper story here is market structure, not just hardware. Nvidia’s move is a vertical integration play that mirrors what CUDA did for training — creating a platform lock-in that competitors can’t easily dislodge.
No competitor currently offers a complete training-to-inference platform. AMD, Intel, Cerebras, and SambaNova are all building inference chips, but none pairs GPU training dominance with inference dominance at data center scale.
The acquisition timing also reveals Nvidia’s strategic thinking. Groq was valued at $2.8 billion prior to the deal — Nvidia paid a roughly 7× premium. Post-GTC, with 35× throughput improvements demonstrated, the acquisition looks increasingly like Nvidia buying the inference market before anyone else realized it was for sale.
Importantly, the market validated the architecture before Nvidia even launched. AWS and Cerebras separately introduced a parallel disaggregated inference approach days before GTC 2026, suggesting this architectural pattern is becoming industry consensus. Nvidia is not inventing a niche — it is racing to own a category that the broader industry is already converging on.
What the Market Is Missing
Nvidia’s share price jumped roughly 2% in after-hours trading following the GTC announcements. But a 2% move for a product with this kind of structural implication suggests the market is still treating this as an incremental upgrade cycle rather than a platform shift.
Consider the revenue math. Nvidia posted a record $215.9 billion in fiscal 2026 revenue. The training hardware cycle that produced those numbers is already maturing — hyperscaler capex growth is decelerating. The inference market, by contrast, is in early innings. If the Groq 3 delivers on its claims, cheaper inference creates a flywheel: lower costs mean more AI-powered products, which means more inference demand, which means more LPU sales.
The $20B licensing cost is also worth contextualizing. At $215.9B in annual revenue, Nvidia can absorb that in under five weeks of sales. If the LPX captures even a fraction of the inference workload running on Blackwell today, the economics close very quickly.
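The "under five weeks" framing checks out against the article's own figures:

```python
annual_revenue = 215.9e9   # fiscal 2026 revenue (from the article)
deal_cost = 20e9           # $20B licensing cost (from the article)

weeks_to_absorb = deal_cost / annual_revenue * 52
# weeks_to_absorb ~ 4.8 weeks of sales
```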
Risk Factors
1. Memory capacity constraints. The SRAM-per-rack ceiling is real. Very large models or extremely long context windows may require stacking many LPX racks, adding cost and complexity. HBM-based competitors have more flexibility here.
2. Premium pricing exposure. At $45/million tokens, the LPX targets the high end of the inference market. If commodity inference providers (AWS Trainium, Google TPUs) compress pricing in the mid-market, the LPX’s addressable market could narrow.
3. Competitive response timelines. AMD is expected to respond at Computex in June, and Intel’s Gaudi 4 is in development. Neither has Nvidia’s software ecosystem advantage, but large hyperscalers have strong incentives to support alternatives.
4. Execution risk on the rollout. The LPX is scheduled for delivery through cloud service providers and OEMs in the second half of 2026. Yield issues, supply chain constraints, or integration delays could push meaningful revenue into FY2028.
The Bottom Line
The Groq 3 LPX is not a GPU upgrade. It’s a structural expansion of Nvidia’s addressable market into the most rapidly growing segment of AI infrastructure. The architecture is technically sound, the competitive moat is real, and the timing — as inference displaces training as the dominant AI workload — is near-perfect.
Jensen Huang called the inflection point of inference “arriving.” He’s not wrong. The question for investors is whether the current valuation reflects a company that just sells GPUs, or one that is building a closed-loop AI factory platform that no one else can fully replicate.
The gap between those two stories is where the opportunity lives.
Disclosure: This post is for informational purposes only and does not constitute financial advice. Always do your own research before making investment decisions.