NVIDIA Unveils Groq 3 LPX Rack System for Ultra-Low Latency AI Inference
Timothy Morano
Mar 16, 2026 21:19
NVIDIA’s new Groq 3 LPX delivers 315 PFLOPS and 35x better inference throughput per megawatt, targeting agentic AI workloads on the Vera Rubin platform.
NVIDIA has pulled back the curtain on the Groq 3 LPX, a rack-scale inference accelerator built around 256 interconnected Language Processing Units (LPUs) that the company claims delivers up to 35x higher throughput per megawatt for trillion-parameter models. The system arrives as the seventh chip in full production for the Vera Rubin platform, following NVIDIA’s $20 billion acquisition of Groq’s intellectual property.
The timing matters. As AI workloads shift from batch processing toward real-time agentic systems—where multiple AI agents coordinate continuously—the bottleneck isn’t raw compute anymore. It’s latency. NVIDIA is betting that the future demands infrastructure capable of generating tokens at speeds approaching 1,000 per second per user, fast enough to enable what the company calls “speed of thought computing.”
What’s Actually Inside the Box
The LPX rack houses 32 liquid-cooled compute trays, each packing eight LP30 LPU chips. At full scale, the system delivers 315 PFLOPS of inference compute with 128 GB of on-chip SRAM and 40 PB/s of memory bandwidth. Scale-up bandwidth hits 640 TB/s across the 256-chip configuration.
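For a rough sense of scale, dividing those rack-level figures evenly across the 256 chips gives the per-chip numbers below. This is a back-of-envelope sketch based only on the published totals; NVIDIA has not broken the specs down per chip.

```python
# Back-of-envelope per-chip figures, assuming the rack-level specs are
# spread evenly over all 256 LPUs (an assumption, not a published breakdown).
chips = 32 * 8                       # 32 trays x 8 LP30 chips = 256 LPUs
print(chips)                         # 256
print(round(315 / chips, 2))         # ~1.23 PFLOPS of inference compute per chip
print(round(128 * 1024 / chips))     # 512 MB of on-chip SRAM per chip
print(round(40_000 / chips))         # ~156 TB/s of SRAM bandwidth per chip
```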
The architecture diverges sharply from traditional GPU approaches. Where GPUs rely on massive parallel throughput and external High Bandwidth Memory, LPUs keep their working set—weights, activations, KV cache state—entirely in on-chip SRAM. The compiler controls data movement explicitly rather than depending on hardware cache heuristics. NVIDIA claims this produces more deterministic execution with reduced latency jitter.
Each LPU connects through 96 chip-to-chip links running at 112 Gbps, enabling 2.5 TB/s of bidirectional bandwidth per chip. A plesiochronous protocol keeps hundreds of accelerators aligned so they operate as a single coordinated system.
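Those two figures are roughly consistent: counting 112 Gbps in each direction across all 96 links gives about 2.7 TB/s raw, which lands near the quoted 2.5 TB/s once some protocol overhead is assumed. A quick sanity check:

```python
# Rough check on the per-chip interconnect figure: 96 links at 112 Gb/s,
# counted in both directions, lands near the quoted 2.5 TB/s once some
# encoding/protocol overhead is assumed (the overhead factor is a guess).
links, gbps = 96, 112
raw_bidir_tb_s = links * gbps * 2 / 8 / 1000   # Gb/s -> TB/s, both directions
print(round(raw_bidir_tb_s, 2))                # ~2.69 TB/s raw
print(round(raw_bidir_tb_s * 0.93, 2))         # ~2.5 TB/s with ~7% overhead assumed
```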
The Heterogeneous Inference Play
NVIDIA isn’t positioning LPX as a GPU replacement. Instead, it’s designed to work alongside Vera Rubin NVL72 systems in what the company calls “attention-FFN disaggregation.” GPUs handle the heavy lifting—long-context prefill, decode attention over accumulated KV caches—while LPUs accelerate the latency-sensitive feed-forward network execution within the decode loop.
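The sketch below shows what that split looks like for a single decode step. It is a toy illustration in plain NumPy, not Dynamo's API: the "GPU" and "LPU" sides are ordinary Python functions, and the dimensions and weights are invented.

```python
# Minimal sketch of attention-FFN disaggregation for one decode step.
# Hypothetical: devices are simulated with plain NumPy; in the real system
# NVIDIA Dynamo would manage the split across GPU and LPU hardware.
import numpy as np

D_MODEL, D_FF, N_CTX = 512, 2048, 128
rng = np.random.default_rng(0)

# "GPU-side" state: accumulated KV cache from prefill (HBM-resident, latency-tolerant).
k_cache = rng.standard_normal((N_CTX, D_MODEL))
v_cache = rng.standard_normal((N_CTX, D_MODEL))

# "LPU-side" state: FFN weights pinned in on-chip SRAM (latency-critical path).
w_in = rng.standard_normal((D_MODEL, D_FF)) * 0.02
w_out = rng.standard_normal((D_FF, D_MODEL)) * 0.02

def gpu_decode_attention(q: np.ndarray) -> np.ndarray:
    """Attention over the accumulated KV cache; runs on the GPU side."""
    scores = k_cache @ q / np.sqrt(D_MODEL)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

def lpu_ffn(x: np.ndarray) -> np.ndarray:
    """Feed-forward block; runs on the LPU side with weights held in SRAM."""
    return np.maximum(x @ w_in, 0.0) @ w_out

# One decode step: only the per-token activation crosses the GPU->LPU boundary.
q = rng.standard_normal(D_MODEL)
attn_out = gpu_decode_attention(q)      # GPU: attention over the KV cache
hidden = attn_out + lpu_ffn(attn_out)   # LPU: latency-sensitive FFN plus residual
print(hidden.shape)                     # (512,)
```

The point of the split is visible even in the toy: the bulky KV cache never leaves the GPU side, while only a single per-token activation vector has to cross to the LPU, which is the traffic Dynamo would shuttle between processors.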
The NVIDIA Dynamo orchestration layer manages this split, routing work based on latency targets and shuffling intermediate activations between processors. For speculative decoding, LPX can serve as the draft-generation engine while GPUs handle verification.
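A stripped-down view of that draft/verify pattern is below, using a greedy variant of speculative decoding. The `draft_next` and `target_next` functions are dummy stand-ins for the LPX draft engine and the GPU verifier, not NVIDIA software.

```python
# Toy sketch of speculative decoding with a draft/verify split (greedy variant).
# draft_next() plays the role of the fast LPX draft engine, target_next() the
# GPU verifier; both are hypothetical placeholder functions here.
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    """Draft k tokens cheaply, then accept the prefix the target model agrees with."""
    draft, ctx = [], list(prefix)
    for _ in range(k):                       # fast, low-latency draft pass
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in draft:                        # target verifies each drafted token
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    accepted.append(target_next(ctx))        # always emit one target-approved token
    return accepted

# Dummy models: the draft guesses the next integer, the target mostly agrees.
print(speculative_step([1, 2, 3],
                       draft_next=lambda c: c[-1] + 1,
                       target_next=lambda c: c[-1] + 1 if c[-1] < 6 else 0))
```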
NVIDIA claims this heterogeneous approach unlocks up to 10x more revenue per megawatt compared to GB200 NVL72 systems for premium interactive workloads. The math assumes operators can charge meaningfully more for responsive AI services than for throughput-optimized batch processing.
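As a purely illustrative version of that calculation, with invented token rates and prices rather than anything NVIDIA has published:

```python
# Illustration of the revenue-per-megawatt framing with made-up numbers;
# none of these prices or serving rates come from NVIDIA.
def revenue_per_mw(tokens_per_sec_per_mw: float, usd_per_million_tokens: float) -> float:
    """Hourly revenue per megawatt for a given serving rate and token price."""
    return tokens_per_sec_per_mw * 3600 * usd_per_million_tokens / 1e6

batch = revenue_per_mw(5_000_000, 0.50)        # hypothetical throughput-optimized tier
interactive = revenue_per_mw(2_000_000, 6.00)  # hypothetical low-latency premium tier
print(round(interactive / batch, 1))           # premium pricing drives the multiple
```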
Market Implications
The LPX rack is slated for availability in the second half of 2026. For data center operators weighing infrastructure investments, the announcement signals NVIDIA’s conviction that inference economics will increasingly favor specialized hardware as agentic AI scales.
Whether the 35x efficiency gains hold up under real-world production loads remains to be seen. But for anyone building multi-agent systems or interactive AI products where response latency directly impacts user experience, the architectural shift toward heterogeneous inference is worth tracking closely.