The infrastructure AI actually requires

Spoiler alert: we've been doing it for decades!
Every era of technological revolution follows a single, predictable pattern: the tools outpace the infrastructure. A critical gap appears, often invisible to most, at the precise boundary between what new technology is capable of and what the underlying foundation can actually sustain. This infrastructure gap is exactly where the future of production AI is being decided today.
This is the second time I'm watching this pattern play out. The first time was in 2015, when researchers at national laboratories had started using containers to manage their software environments, defining their work once and moving it between systems without rebuilding from scratch. Docker had made that possible, and the scientific community recognized immediately what it could mean for reproducibility and collaboration. The problem was that Docker was not built for HPC. It ran privileged daemons, circumvented the resource manager, and operated on security assumptions designed for enterprise workloads, not shared research systems where a single misconfiguration affects thousands of users. Every national laboratory and supercomputing center faced the same binary choice: open the door to Docker and accept the risks, or keep it closed and leave researchers fighting the same software environment problems indefinitely.
So I built Singularity at Lawrence Berkeley National Laboratory to solve the actual problem. Define a software environment once; run it anywhere, without privilege escalation, without a daemon, without asking a system administrator to trade security for portability. Adoption was fast. Within months, Singularity spread from zero to virtually the entire ecosystem of national labs and supercomputing centers. Years later, I moved Singularity to the Linux Foundation and renamed it to Apptainer, and that decade of work established something I have come to think of as a reliable pattern: the infrastructure gap always appears at the boundary between what the tools can do and what the foundation under them can hold.
That pattern is repeating now, and the boundary is in the same place it always was.
As I wrote last month, the AI Engineer era has an infrastructure problem. The tools to build and test AI became accessible before the infrastructure to run AI in production became sound. But the infrastructure gap in AI is not a new kind of problem. It is the same kind of problem the HPC community spent thirty years learning to solve, and the organizations that recognize that connection will build on the right foundation from the start.
AI training is supercomputing. The industry just took a decade to recognize the same pattern.
The connection between AI and HPC is architectural. Training a modern large language model requires the same nervous system that defined national laboratory supercomputers decades ago. Applications running across many systems and GPUs are synchronized over ultra-high-speed interconnects like InfiniBand and RDMA fabrics, and behave as if they were a single processor. The latency and bandwidth requirements match scientific supercomputing. The research community recognized this convergence years before enterprise infrastructure caught up to it. A 2020 paper from the National Science Foundation put it plainly: GPU systems originally developed for HPC became essential for machine learning, and those systems were further optimized for ML, with improvements that then traveled back into HPC simulations. The convergence runs in both directions because the underlying architecture is the same. ISC High Performance 2026, Europe's largest gathering of the HPC and AI communities, has made this convergence the organizing principle of its annual program, with the explicit acknowledgment that the convergence of HPC and AI is actively reshaping enterprise IT.
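To make the architectural point concrete, here is a minimal sketch of the synchronization pattern a distributed training job depends on, using PyTorch's DistributedDataParallel over the NCCL backend, which rides InfiniBand and RDMA where the fabric is available. The model, sizes, and loop are illustrative placeholders, not a production training stack.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # One process per GPU, typically launched with torchrun across many nodes.
    # The NCCL backend carries the gradient exchange over InfiniBand / RDMA
    # when the fabric is available.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in for a real model; only the communication pattern matters here.
    model = DDP(torch.nn.Linear(4096, 4096).cuda(local_rank),
                device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(100):
        batch = torch.randn(32, 4096, device=local_rank)
        loss = model(batch).square().mean()
        # backward() triggers an all-reduce of gradients across every GPU in
        # the job, so each step advances as if on one giant processor.
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with torchrun, the same script scales from one node to thousands; whether each step's gradient exchange keeps pace is determined by the interconnect, which is exactly the property HPC fabrics were built to guarantee.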
HPC was always defined by its architecture rather than its applications. Nuclear physics, weather simulation, drug discovery: decades of different workloads ran on the same fundamental requirement for coordinated parallel computing at massive scale. AI training is the latest workload to run on that architecture, and it changed the application without changing the requirement. This means that organizations building AI infrastructure today are building supercomputing infrastructure, and the procurement decisions, operational posture, and orchestration layer they choose will carry the same long-term consequences those decisions carried for national laboratories over the past thirty years.
Training established the architecture requirement. Inference is where that requirement now lives at production scale. Serving a model in production demands the same low-latency interconnects, the same coordinated scheduling across GPU clusters, and the same memory bandwidth as training: continuously, at request volume, with latency SLAs that training never had to meet. Training accounts for roughly ten to fifteen percent of AI infrastructure demand today. The remaining eighty-five to ninety percent is inference: real requests, real users, real consequences when the infrastructure cannot hold the load. Organizations building production AI are building for inference at HPC scale, whether they have named it that way or not, and the operational posture that handles that load is the same one HPC operators spent thirty years developing.
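The same dependency shows up at serving time. As a rough illustration, not any particular serving stack, here is what a decode loop looks like when a model is sharded across GPUs with tensor parallelism: every generated token requires a collective across the shard group, so per-token latency, and therefore the SLA, is bounded by the interconnect as much as by the GPUs. The layer width, shapes, and loop below are assumed for illustration only.

```python
import os

import torch
import torch.distributed as dist


def decode_step(activation, weight_shard, world_size):
    # Each GPU computes its slice of the layer (a column-parallel split)...
    partial = activation @ weight_shard
    # ...and the slices are recombined with a collective. This runs for every
    # token of every request, so it sits on the latency-critical path.
    gathered = [torch.empty_like(partial) for _ in range(world_size)]
    dist.all_gather(gathered, partial)
    return torch.cat(gathered, dim=-1)


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    world_size = dist.get_world_size()

    hidden = 8192  # illustrative layer width
    weight_shard = torch.randn(hidden, hidden // world_size, device=local_rank)
    activation = torch.randn(1, hidden, device=local_rank)

    for _ in range(128):  # one pass per generated token
        activation = decode_step(activation, weight_shard, world_size)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A real serving stack layers batching, KV caching, and routing on top, but the collective inside the loop is why the fabric requirements of training carry straight over to inference.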
The cloud providers built supercomputers and called them AI infrastructure
The major cloud providers understood this convergence early and built accordingly. Google's AI Hypercomputer is a vertically co-designed system: custom silicon in TPUs, custom networking fabric in the Virgo Network announced at Google Cloud Next 2026, and software optimized for AI training and inference from the hardware layer up. Amazon followed a parallel path with Trainium. These are purpose-built HPC systems designed for the demanding architecture that AI training at scale requires, the same architecture national laboratories have operated for decades. HPE and NVIDIA announced this spring that research institutions including Argonne National Laboratory have adopted their full-stack AI infrastructure, built explicitly for at-scale and sovereign environments. The framing has changed. The architecture is the same.
The organizations that grasp this are building AI infrastructure with the operational maturity that HPC pioneered: careful hardware selection, portable orchestration layers, and security postures that survive the next generation of hardware. The ones that treat AI compute as a general-purpose workload will hit the ceiling that every HPC operator hit before them, when the hardware evolves faster than the orchestration layer was designed to follow.
The orchestration lesson HPC already learned
The HPC community spent more than thirty years learning one lesson about orchestration the hard way: the workload scheduler and resource manager that ties you to a single vendor's infrastructure becomes the bottleneck the moment that vendor's roadmap diverges from your operational reality. HPC hardware generations changed on timelines no procurement cycle could fully anticipate, and the organizations that built portable orchestration layers were the ones with the operational flexibility to absorb those changes. The ones that built to a single provider's primitives paid the re-platforming cost every time the hardware moved.
The AI hardware landscape is moving faster than HPC hardware ever did. A team training on a TPU superpod today may need to move to on-premises NVIDIA hardware next year, and teams running inference on one cloud provider's accelerators may need to shift workloads based on cost, compliance, or capacity at a moment's notice. In that environment, portability in the orchestration layer is the property that determines whether infrastructure investments accelerate value or depreciate every time the hardware generation changes.
Fuzzball and RLC Pro AI: the foundation for the new HPC
Fuzzball was built from the institutional knowledge the HPC community accumulated over those thirty years, applied to the infrastructure reality that AI teams are operating in today. It provides HPC resource management, provisioning, and scheduling across on-premises and cloud resources, with heterogeneous compute treated as a unified pool. The same workflows that run on a bare-metal cluster in a national laboratory run on a Google TPU cluster or a private NVIDIA Blackwell pod. The workflow stays portable across environments when the hardware changes. Fuzzball Service Endpoints extend that further, unifying model training, fine-tuning, and inference in a single portable workflow so organizations can develop and serve AI models entirely from on-premises or hybrid environments, with complete control over proprietary and sensitive data.
RLC Pro AI is the OS layer that makes that foundation operationally sound. It ships pre-validated for the latest GPU and accelerator hardware, so engineering teams run production jobs from day one rather than spending weeks on integration work before the first training job executes. Together, Fuzzball and RLC Pro AI give production AI teams the same operational model that national laboratories use to run the world's most demanding workloads: an orchestration layer that travels with the workload, an OS that treats AI compute as a first-class concern, and a security posture that holds when the environment changes.
The infrastructure once available only to organizations with the largest capex budgets and the deepest infrastructure teams is now accessible to any team willing to build on a portable, operationally sound foundation. That is the same shift Singularity represented for the research community in 2015. The ecosystem moved to it fast because the problem was pervasive and the solution was right. The problem is pervasive again.
The infrastructure decision is happening now
The organizations planning AI infrastructure investments in 2026 are making decisions that will define their operational flexibility for the next several years, in a hardware environment that will change faster than any single procurement cycle can anticipate. The teams that build on a portable, HPC-grade foundation now will not be the ones re-platforming their orchestration layer when the next generation of hardware arrives. They will be the ones whose infrastructure compounds in value as the hardware evolves, because the operational model was built to move with it.
That is the pattern Singularity established. It is the pattern the AI infrastructure era is about to repeat.