Tokens per watt is the new CEO metric. Here's where your OS fits.

March 19, 2026
At GTC 2026, Jensen Huang introduced tokens per watt, a new AI efficiency metric, as the way to measure the value of AI infrastructure.

Huang’s framing is simple and pointed: data centers are power-constrained, and every watt has a cost, a physical limit, and an opportunity cost. Put another way, an organization that generates more tokens from the same power envelope also generates more revenue, processes more requests, and operates with better economics than one that does not.
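The arithmetic behind that framing is easy to sketch. The Python snippet below uses purely hypothetical throughput, power, and pricing figures (they are illustrative assumptions, not measurements of any platform or product mentioned here) to show how a fixed power envelope turns a throughput gain directly into extra token capacity:

```python
# Hypothetical tokens-per-watt arithmetic. Every number below is an
# assumption chosen for illustration, not a benchmark result.

def tokens_per_watt(tokens_per_second: float, power_watts: float) -> float:
    """Sustained inference throughput divided by sustained power draw."""
    return tokens_per_second / power_watts

def monthly_tokens(tokens_per_second: float) -> float:
    """Tokens produced over a 30-day month at a constant rate."""
    return tokens_per_second * 60 * 60 * 24 * 30

POWER_W = 1_000_000          # assumed fixed 1 MW power envelope
baseline_tps = 5_000_000     # assumed fleet throughput, tokens/sec
optimized_tps = 5_500_000    # same power envelope, 10% more throughput

for name, tps in [("baseline", baseline_tps), ("optimized", optimized_tps)]:
    print(f"{name}: {tokens_per_watt(tps, POWER_W):.2f} tokens/sec per watt")

# Same watts, more tokens: the delta is pure additional capacity.
extra = monthly_tokens(optimized_tps - baseline_tps)
print(f"extra tokens per month: {extra:,.0f}")
```

Under these assumed numbers, a 10% throughput gain at constant power yields roughly 1.3 trillion additional tokens per month, which is why the metric scales straight into revenue.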

Tokens per watt is the new measure of whether your AI infrastructure investment is working.

Huang backed the claim with a number that is difficult to ignore. The Vera Rubin platform, a next-generation AI computing architecture designed for agentic AI and large-scale AI factories, generates 700 million tokens per second, whereas Blackwell, the prior accelerated computing architecture built for trillion-parameter AI, delivers 22 million in the same power envelope. That is a roughly 32x improvement in two years. Moore's Law, over the same period, would have delivered about a 1.5x gain, so the gap between hardware-driven progress and full-stack optimization is now more than an order of magnitude.

That number represents what happens when every layer of the stack is designed to work together: silicon, software, networking, and storage, all optimized for the same output metric.

Now consider what the OS layer contributes to that equation.

Every unresolved CVE in the kernel is a potential patch cycle that takes engineering time and introduces the risk of instability. Every driver installed manually is a compatibility variable that may not survive the next kernel update. Every PyTorch configuration that was not validated against the rest of the stack is a ceiling on throughput that the hardware never imposed. In this respect, the OS is not a neutral layer; it either contributes to inference throughput per watt or erodes it.

RLC Pro AI reduces OS tuning and update cycles from hours to minutes, keeping GPU fleets productive, and delivers up to 10% more throughput for text inference on identical hardware, without touching the model or the application. That is what the OS layer delivers when it is built for inference rather than assembled for it.

This is the argument CIQ has been making about AI infrastructure for some time, and GTC 2026 gave it the macro framing necessary to land in boardroom conversations about AI spend.

The tech industry has spent a decade innovating around the OS. Developers have added containers, orchestration layers, and integrations on top of a base that was never rethought. The result is that your workloads get designed around your infrastructure rather than your infrastructure being designed for your workloads. That approach was acceptable when inference was secondary; it is not acceptable when tokens per watt determines your unit economics.

RLC Pro AI is built on a different premise. Every kernel parameter, every component in the stack, every configuration decision was made with one metric in mind: more AI output per dollar of infrastructure investment. The NVIDIA CUDA and DOCA-OFED stack ships pre-validated, tested as a complete system, so the OS layer adds to throughput rather than limiting it, and nothing needs to be assembled or maintained by hand.

Tokens per watt is now a business metric. The OS under your GPU fleet is part of that number, and RLC Pro AI is Enterprise Linux built to make it a better one.

Learn more about RLC Pro AI and get started → https://ciq.com/products/rocky-linux/pro/ai/

