CIQ

Unlock the Power of Turnkey HPC Interconnects

April 27, 2023

High Performance Computing (HPC) systems are essential for researchers and businesses that require processing power for data-intensive tasks. However, configuring and managing HPC systems can be daunting, especially for those without specialized technical knowledge.

In this webinar, we will discuss the importance of interconnects in HPC systems and how they can impact the overall performance of your workloads. We will explore the benefits of turnkey HPC solutions, which offer pre-configured, fully integrated systems that can be deployed quickly and easily, allowing you to focus on your research or business needs.

By attending this webinar, you will better understand the role interconnects play in HPC systems and how turnkey solutions can help you accelerate your research or business workflows. We welcome your questions and look forward to seeing you there!

Webinar Synopsis:

  • Introductions

  • What are Interconnects

  • Types of Interconnects

  • What is a Turnkey HPC Interconnect

  • Bandwidth for Medium and Small Clusters

  • Optimal Topology

  • The Role of Software

  • Considering Costs

  • Cosmic Ray Interference

  • Building the Perfect System

  • Measuring Performance

  • State of the Art Interconnect

  • Where is it All Heading

Speakers:

  • Justin Burdine, Director of Solution Engineering, CIQ

  • Rose Stein, Sales Operations Administrator, CIQ

  • Gary Jung, HPC General Manager at LBNL and UC Berkeley

  • Jonathon Anderson, Senior HPC System Engineer, CIQ

  • Alan Sill, Managing Director, High Performance Computing Center at TTU

  • David DeBonis, Computer Scientist, CIQ

  • Matthew Dosanjh, Senior Member of Technical Staff, Sandia National Laboratories


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Full Webinar Transcript:

Narrator:

Good morning, good afternoon, and good evening wherever you are. Thank you for joining. At CIQ, we're focused on powering the next generation of software infrastructure, leveraging the capabilities of cloud, hyperscale, and HPC. From research to the enterprise, our customers rely on us for the ultimate Rocky Linux, Warewulf, and Apptainer support escalation. We provide deep development capabilities and solutions, all delivered in the collaborative spirit of open source.

Justin Burdine:

Well, hey everyone. Welcome to another CIQ webinar. Thanks for joining us. Rose, I was all excited to... oh, first of all, I'm joined by my partner in crime, Rose Stein here, and we are excited to be taking over the kingdom. I guess Zane's on a plane, so he couldn't make it this round, and I'm sitting in for him. We'll see how it goes. I don't know what they're thinking, Rose.

Rose Stein:

I imagine that he's watching though, and we'll probably get a couple of side messages of what are you guys talking about? What are you doing?

Justin Burdine:

Exactly. Exactly. Well, we'll see. We'll see. We're looking forward to having him back next week, but this week, what are we talking about? What are we looking to discuss?

Rose Stein:

This is actually interesting and I am really excited to hear different people's perspectives on this. So it's unlocking the power of Turnkey HPC interconnects to accelerate workloads.

Justin Burdine:

Okay. Well, please, please tell me we have some smart people to talk about this, because I am very new to this. This will be new stuff for me. I know what an interconnect is, but it's been 20 years since I've even thought about that from an HPC perspective. So who do we have on? Let's unleash the... we got Gary. Gary, fantastic.

Rose Stein:

Awesome. Hey Gary,

Justin Burdine:

Welcome Gary.

Gary Jung:

Hi. Hi.

Rose Stein:

Nice to see you.

Justin Burdine:

Oh, look at that. I knew we had other people in the wings. All right. Well, let's go ahead and intro. Gary, we'll go ahead and start out with you.

Introductions [6:49]

Gary Jung:

Hi. My name is Gary Jung. I run the institutional high performance computing for Lawrence Berkeley Laboratory, and I also manage the UC Berkeley institutional high performance computing program.

Justin Burdine:

Awesome, awesome. Jonathan?

Jonathon Anderson:

My name's Jonathon Anderson. I'm a solutions architect here with CIQ, and I have a background in academic high performance computing.

Justin Burdine:

All right, Alan?

Alan Sill:

Alan Sill. Like Gary, I wear two hats. I run the High Performance Computing Center here at Texas Tech University, and I also am one of several co-directors of a multi university industry University Cooperative Research Center in Cloud and Autonomic Computing with funding from the National Science Foundation.

Justin Burdine:

All right. Dave?

David DeBonis:

Hi, I'm David DeBonis. I'm a computer scientist over here at CIQ, with background in HPC embedded systems, mostly focused on system software.

Justin Burdine:

All right. And we got a new face here, Matthew.

Matthew Dosanjh:

Hi, I'm Matthew Dosanjh, senior member of technical staff at Sandia National Laboratories. I work on HPC middleware and research around that, particularly MPI, Open MPI, and SmartNICs.

Justin Burdine:

Awesome. Awesome. Well, thank you guys for joining us. I really appreciate it. Hopefully you'll be able to fill us in on all sorts of questions that we've got here about HPC interconnects. So, Rose, you got the first question?

Rose Stein:

You gotta break it down for me, though, and maybe, Matthew, we'll start with you. So you are a new face, hi, it's nice to meet you. Thanks for being here. Can you break down what interconnects are, and why is this important to the HPC system?

What are Interconnects [8:43]

Matthew Dosanjh:

So, interconnects are our form of networking, in a sense. We run on trusted systems with very low latency needs for networks for scientific applications, so that we can run hundreds of iterations across thousands of nodes a second. And so having the overhead of ethernet and that network stack doesn't really work for our use case. So there's been a long history of different companies making interconnects, ranging from Cray to the InfiniBand stack. And now we have a new generation coming out. But the idea of having a low latency networking solution for a trusted user base is the big thing.

Rose Stein:

Awesome. I appreciate that. Thank you. So what are some of the different types of interconnects? Gary, you want to jump in there?

Types of Interconnects [10:01]

Gary Jung:

Boy, there's the standard ethernet interconnect that Matthew was talking about that everybody uses to connect all the compute nodes. But usually when we refer to an interconnect or a fabric for a high performance computing system, we're looking for something that can be more performant than an ethernet connection. So the one that most people have converged on these days is InfiniBand, and there's different generations of that. But the current generation that is on the market is 400 gigabits per second. So that is quite a bit faster than most people would deploy with ethernet. And in addition to the higher performance, you would also get low latency. So it'd be very low latency for tightly coupled jobs where there's communication between the compute nodes.
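
To make the latency point above concrete, here is a minimal ping-pong sketch using mpi4py. It is illustrative only: it assumes an MPI library and mpi4py are installed, and the script name, message size, and iteration count are arbitrary choices, not something discussed in the webinar.

```python
# pingpong.py -- minimal MPI latency sketch (run: mpirun -np 2 python pingpong.py)
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

msg = bytearray(8)      # tiny message, so the timing reflects latency, not bandwidth
iters = 10000

comm.Barrier()
start = time.perf_counter()
for _ in range(iters):
    if rank == 0:
        comm.Send(msg, dest=1, tag=0)
        comm.Recv(msg, source=1, tag=0)
    elif rank == 1:
        comm.Recv(msg, source=0, tag=0)
        comm.Send(msg, dest=0, tag=0)
elapsed = time.perf_counter() - start

if rank == 0:
    # Half of the average round-trip time approximates one-way latency.
    print(f"one-way latency ~ {elapsed / iters / 2 * 1e6:.2f} microseconds")
```

On a low-latency fabric such as InfiniBand this number typically lands in the low single-digit microseconds, versus tens of microseconds over a conventional ethernet TCP stack, which is the gap the panel is describing.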

Justin Burdine:

Hey, Gary. What, what are those? What are the actual physical connections? Or is that all optical?

Gary Jung:

It can be, if they're within the rack, you can use passive copper cables and they use a QSFP connection. And if you go out of the rack, then you're going to be required to use optical connections. InfiniBand can go quite a distance. You can probably go to other areas within your campus with just optics.

Alan Sill:

At some sacrifice of latency, that's correct, yes. So, I think we understand that the fundamental purpose of interconnects is to provide parallelism. The recent talk by Torsten Hoefler at the HPC Advisory Council, I'll see if I can find a link, talked about the three types of parallelism: data parallelism, pipeline parallelism, and operator parallelism. And that's all pretty abstract. But the idea of a high speed interconnect is to allow more than one node to be used at once, really a simple fundamental concept. We have to understand that there've been a lot of complications lately. And certainly the field has always explored alternatives. We've seen people try to produce switchless fabrics and all-optical interconnects; Rockport has popular work in that area.

There's always been alternative technologies. Cray has its Slingshot, which is derived from ethernet. Omni-Path is still on a path to coming back; Cornelis has been pushing it. But I think Gary's right that InfiniBand is what people are familiar with in the sort of generic, plain vanilla HPC cluster. Other complications have to do with the ways we're putting accelerators into the mix. So GPUs are often deployed in sets with dedicated interconnects that bypass the conventional ones. And also we have to unveil a hidden secret, which is that many clusters, most clusters, use the interconnect fabric not just to allow computations to take place across multiple nodes, but to get high speed access to storage. So there's a hidden overloading of function there that sometimes trips us up a little bit.

Gary Jung:

I wanted to toss this in just for a comparison. When I used to look at the Top500 list, one of the things that was interesting is that you could look at the different rankings of the clusters, and if you compared an InfiniBand cluster to an ethernet cluster, it took approximately twice the compute power of the ethernet cluster to come up with the same score on HPL (High Performance Linpack) as an InfiniBand cluster, or at that time, even a Myrinet cluster. That's because of the latency; HPL depends on the parallelism that is supported by the low latency that Alan was mentioning.

David DeBonis:

Alan, you mentioned Torsten, and I think he said during SC this year, Torsten of course is going to say this, but networks are the future of HPC. And it is because they're moving so much data around, and usually a lot of jobs, I would say, are mostly communication bound. Some of the aspects of RDMA and other sorts of technologies, the separate channels, help to make it more efficient at the fabric layer and make it more parallelizable.

Rose Stein:

So there's another thing, guys, as we're talking about this topic, another word that is fun, because every time I ask somebody they have a different answer. It's Turnkey HPC, right? So, unlocking the power of Turnkey HPC interconnects to accelerate workloads. What does that mean to you?

What is a Turnkey HPC Interconnect [16:02]

Alan Sill:

I have to confess that that left me a bit baffled. I have never found a Turnkey interconnect. They're highly cantankerous herds of thoroughbreds if you want to make a very stretched analogy.

David DeBonis:

Can we say CIQ is your Turnkey?

Alan Sill:

Well, you're welcome to come over. We have several interconnect problems I'd be happy to point you towards. But it's a dark art. Well, maybe that's too extreme. It's an art. There's the interplay between driver versions for the fabrics, different generations of fabrics and fabric interfaces, and the care and feeding of the cables themselves. People don't realize that there's firmware in those cables. There's a little computer in each end of one. So it's not just stringing cables together.

David DeBonis:

Scalability...

Gary Jung:

Oh, sorry,

David DeBonis:

Go ahead. No, go ahead, Gary.

Gary Jung:

I just gotta make a comment. We used to say that every cluster is serial number one.

David DeBonis:

I think that's what I was going to say. It's very dependent on your system, the needs of your system, and the workloads that come through it. So it is customized, just like any system would be customized to, let's say, a video gamer who plays games on a high performance workstation. Who knows? So you have to worry about scalability, fault tolerance, a number of areas, and cost, and that will be very particular to an installation.

Alan Sill:

Cost is a big factor. So I'm at a university, we build modest scale clusters, and I'm willing to make some controversial statements. I actually never understood people who build clusters at our scale, or even a little larger, that don't max out the bandwidth on their fabrics these days to the limit of what they can afford. I'll give you an example: two years ago we put in the first university-based AMD Rome system in the country, modeled after one that had just been built for the Hawk supercomputer in Germany. And shortly thereafter, a lot of these started to pop up. And I was mystified by people putting in HDR100 connections to dual 64 core nodes, taking giant leaps backwards in bandwidth per core. Anyway, I could go on like this for some time, but Matthew hasn't jumped in yet. I wanted to see if I...

Matthew Dosanjh:

I was going to say that the one thing that is also difficult about Turnkey, in a sense, is that these NICs have been getting progressively more and more complex. And that's one of the things we struggle with. You look at NVIDIA's Mellanox or the new Cray Slingshot, and they're providing all this additional technology on the NICs, and we're still trying to figure out, from a software perspective, how do we utilize this? The more components you add, the harder it is to have a true Turnkey solution.

David DeBonis:

And tuning those components is an art in itself, right. Not just the sizing and deciding, I'm going to have these particular pieces of hardware and this topology. It's even just getting them performance tuned.

Jonathon Anderson:

I think any exercise in creating a Turnkey HPC solution, let alone just the network specifically, is going to be an exercise in tuning something for a specific use case. And one set of selections is selecting the components, DPUs and things like that, that might be useful in your application or in your suite of applications. But once you have a software stack and a hardware stack that work together, and we can just set aside that the firmware has bugs and the drivers have bugs, all of this needs to be maintained and kept up to date. Once you have all that, you still have questions, like Alan mentioned, around cost: maybe one deployment doesn't need dual 64 core CPUs and the bandwidth to feed that. And understanding what an application can actually take advantage of, and then how broad your network needs to be in order to fit that. We'll talk about topologies a little bit later, I think. But not everyone needs a full bisection bandwidth, non-blocking fabric across all of their nodes, depending on how far their application can scale across multiple nodes and multiple cores. Can't hear you, Alan.

Rose Stein:

Oh, Alan, you're on mute or something.

Alan Sill:

Hello? Can you hear me?

Jonathon Anderson:

Yes.

Bandwidth for Medium and Small Clusters [21:38]

Alan Sill:

Yes. I hear that statement a lot, and I think it's objectively true at a rational level, but it's also... I'm willing to argue, not true at all for medium and small scale clusters. If you are going to go subdividing the bandwidth of your cluster up into trees and islands, you're going to introduce scheduling complications that are especially severe in modest scale deployments. You're going to spend your time waiting for large jobs to schedule to fit into whatever islands of good connectivity you have. And the dual 64 core is old, right? It's 96 now, and the dual nodes are hard to build, the small half-U nodes. I don't think we're going to see any dual nodes for a while. There will be full U units, but nonetheless you're only going to put one HDR 400 interconnect into that, and you will just have the bandwidth that I have out of my dual 64 core HDR 200s. Right? It's an arms race that's taking place in your backplane. And in the earlier discussion, I forgot to mention that there are other things competing for that backplane bandwidth. Things like hyper-converged systems... what do you call it? Composable computing, right? The PCIe networks and everything.

The weakness I often point out is that that's yet another fabric, but those things have to talk to your motherboards and use up bandwidth on them too. So you have to watch the bandwidth per core ratio, the fabric NUMA assignments, and a bunch of things. Now, this is the point you made about topology, Jonathon. I think you're right. But I've been building full bisection bandwidth, modest scale clusters for a long time, and I've never regretted it. When you get big, and I put a link in our internal chat that maybe someone could drop into the external chat, the talk that Torsten gave talks about what he... What I like about him is he always makes these declarative statements that you're left to puzzle out. The future is specified, he says, and then he has lots of slides about how you really need all the bandwidth you can get locally. So I'm not sure what it all means, but I will say that the topic you're raising, that of topology, is very important for large scale clusters. But I'm willing to advocate that for all medium and small scale clusters, anything less than a few hundred nodes these days, given the core densities, you should spend money to get good connectivity. So, Gary, you raised the question of ethernet topology, and ethernet is, of course, the basis of the Cray fabrics, and the Cray fabrics have, I think, caught up in terms of bandwidth. Do you have Cray deployments that you can look at?

Gary Jung:

Not one that I manage, but the site has a Slingshot. And you're right. I was going to add to what you were saying: we're just starting to see the emergence of these multi-GPU nodes, and for the people who are training large models, the communication is a bottleneck. And so even a single 400 gigabit connection sounds a little light for an eight-way GPU system if you're doing training.

David DeBonis:

That brings up what Alan said earlier too: your local node and its interconnects, its NUMA nodes, all of the things that are separate memory domains but share a bus, at least the PCI bus. Those are the trade-offs and the trading points when you're on a small system. And probably the focal point is that you really want to maximize your node, I think, even more than your interconnectivity. I'm not an expert on that, but it feels like you want to do as much locally as possible to avoid that data movement as much as possible.

Alan Sill:

We can look at why dual socket nodes became so popular in HPC. It's almost the default. It's certainly not quite universal, but it was partly because of the cost of the interconnect. They were trying to get the most bang for the buck building clusters out of nodes, and at that time they thought they could just put more sockets on to share the same interconnect and the rest of the infrastructure. Now, I think it's time to question that. When you have 128 cores in a socket, maybe you need a dedicated card per socket. Maybe we should be looking for other ways to optimize cost. And that leads right back into Jonathon's question of topology.

David DeBonis:

I think that's a cue, Jonathon.

Justin Burdine:

So how do you choose an optimal topology? I mean, are there steps you take to dissect the work you're working on? Or how does that work?

Optimal Topology [27:25]

Alan Sill:

So, again, I have to refer to Torsten's talk because I was just really impressed with it. It was only a few days ago that it came out. And there's certainly been other summaries in the past, but it talks about these three types of parallelism: data, pipeline, and operator. And it also talks about the topic that Gary raised, which is the interconnects of accelerators. And it depends on your scale and what you're trying to do. He introduces, at least for me, something new that probably has been around a while, something called a HammingMesh, which, he says, has many configurations. But there are already alternatives to fat trees. There's a popular one that's the butterfly, and then there's various types of torus interconnects.

To me, again, they make the most sense when you have a big enough cluster to make the difference. But for small things, like a few dozen GPUs, you may already benefit. For example, if you've just got a few, you can use NVIDIA's NVLink and get far faster connections between just a few. So the question becomes, how do you connect a few dozen to a few hundred? And I don't have enough experience with GPUs to answer that question. For CPUs, I think if you need less than a few hundred, you should just do a fat tree.

David DeBonis:

It's interesting because Matt's in the center that I used to be a part of as well, and his father, I think it was his father, coined the phrase co-design. The whole center's concept, and it was a computer science research institute at the time, or computational science, the whole reason for it, was to get the application developers, the middleware developers, the people doing the linear algebra packages, the people doing the system software, and the people doing or investigating some of the underlying hardware to understand their neighbors really well. And that was the way, at least for supercomputers, that you get a highly tuned system that can get onto the Top500 for a small system. And maybe there is a more blanket Turnkey solution.

The Role of Software [29:56]

Alan Sill:

Yes. So, David, you've raised an interesting point in passing here, which is the role of software. We've managed to gloss over it up to now, largely, I think, because of the work done by a number of people to develop the standards that we rely on now, so that we can ignore the complications, so that we have InfiniBand verbs and MPI. That's a very complex topic that usually is, and historically has been, in different states for different types of interconnects, I should say it that way. And I think the community has done a great job of coming together and trying to hide that level of complexity from the average user. It's still there, but I think we don't spend a lot of our time thinking about the details of getting the hardware to communicate, because of the work of the committees.

Considering Costs [31:01]

Rose Stein:

Can you guys actually expand a little bit more on what you mean by price? Is it literally the price of the hardware? Gary, you mentioned energy. You said it takes double the amount of energy. So when we're talking about how much things cost, what are we considering?

Gary Jung:

I was saying that it took twice as much compute on an ethernet cluster, as compared to an InfiniBand cluster, to achieve the same result on the Linpack test. But if you're talking about price, usually when you're designing clusters there's the compute budget, there's a storage budget, and then there's an interconnect budget. And the more parallelism you're going to do on the system, the higher the percentage of the budget that should go to the interconnect. So it can go anywhere from an interconnect with a blocking factor, say two to one, three to one, even four to one, to what Alan called full bisection bandwidth, where essentially all the nodes on one half of the cluster could talk to all the nodes on the other half of the cluster without any kind of blocking. So that's where the cost comes in.
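
To show how the blocking factor Gary mentions translates into available bandwidth, here is a small back-of-the-envelope sketch. The node count and link speed are hypothetical, chosen only to illustrate the arithmetic.

```python
# Back-of-the-envelope bisection bandwidth for a cluster with an oversubscribed fabric.
def bisection_bandwidth_gbps(nodes: int, link_gbps: float, blocking: float) -> float:
    """Bandwidth available if one half of the cluster talks to the other half.
    blocking = 1 means full bisection (non-blocking); 2 means 2:1 oversubscription."""
    return (nodes / 2) * link_gbps / blocking

nodes, link = 128, 400.0          # e.g. 128 nodes with 400 Gb/s links (illustrative)
for blocking in (1, 2, 4):
    bw = bisection_bandwidth_gbps(nodes, link, blocking)
    print(f"{blocking}:1 fabric -> {bw / 1000:.1f} Tb/s across the bisection")
```

Doubling the blocking factor halves the bandwidth available to jobs that span both halves of the machine, which is the trade-off against switch and cable cost discussed next.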

Matthew Dosanjh:

It's also probably worth noting that, when we're talking about networking and InfiniBand and stuff like that, the cables themselves are even more expensive than what you'd expect from a more traditional ethernet network, right? The cost for the network and the compute resources, especially for large systems, which is where I spend my time doing research, can be astronomical, in a sense. And then you have to worry about power costs and everything like that just to operate the machine.

Alan Sill:

Correct. So let's go back to a point that Gary raised: typically you'll make connections within a rack on copper, and then between racks, or certainly over longer distances, with optical fiber. A copper cable could cost a few hundred bucks; an optical cable can cost $1,200, depending on what it is. So when I say a modestly sized cluster is easier to build with a full fat tree, a full non-blocking folded Clos network if you want to be a computer scientist, that's one of the reasons. The cables just cost you less, and there are fewer switches involved. The bigger the system, the more... there's roughly, I mean basically, a two to one ratio between the leaf switches and the core switches, but when you fold in storage, the number of switches grows. The cluster cost of switches is non-trivial; it can be tens of thousands of dollars. And back to another point that Gary raised: ethernet, sort of pre-Slingshot, classic ethernet, can be used to build clusters. And the reason people were willing to put up with lower performance, more chattiness, more latency and so forth with ethernet was that the switches and cables are so much cheaper. So that's why a lot of the early Beowulf clusters especially were built on ethernet.

David DeBonis:

One more thing on cost: I remember when I first started in the group over at Sandia, we were really thinking about fault tolerance and power. It was a really big thing. Resilience from a fault was important, because we were seeing that systems were failing as the chip counts got up to the point where we were scaling by thousands of flops every cycle. I don't want to get into the whole when-are-we-going-to-get-to-exascale thing, it was always pushed out a couple of years. But it did seem like the fault rate of the actual devices was going to be so substantial that the system would never make progress, because it would always have to fall back on a checkpoint to get back up to speed and then go forward. So, when you have more chips, when you have more devices, that means you'll have more failures. And that's something to keep in mind too: the maintenance cost and the uptime or downtime cost of your system, no matter what the size, really.

Matthew Dosanjh:

As you add nodes, your mean time to failure just shrinks drastically. And then you start looking at the positioning of your cluster, right? We're at 6,000 feet here; you go up to someplace in the mountains like Los Alamos, and it becomes actually significant compared to having something closer to Berkeley, which is closer to sea level.

Cosmic Ray Interference [36:44]

David DeBonis:

It's interesting because cosmic rays, believe it or not, really are a thing. Matter of fact, one of Matt's colleagues did his dissertation on just that. I think it was a study with a comparison between Los Alamos and, it might have been, some Sandia stuff, but I can't recall.

Justin Burdine:

So how does that impact things? I'm really curious. Is it just adding more errors, or what is it?

Matthew Dosanjh:

Flips

David DeBonis:

Bit flips. Yep. Wow.

Alan Sill:

And there is, of course, error correction and retry. But overall MPI is still remarkably fragile. So, Geoffrey Fox did a lot of work looking at the comparisons between, for example, MapReduce and classic MPI, and looking for ways that the MPI standards could be made more robust and resilient to failure. But pretty much if you lose a node during a calculation, you lose the whole calculation, no matter how many nodes are in it. Still, even in spite of this kind of work, I think we need more such work as we build exascale systems with very large numbers of nodes, to the point where these error rates become very important. And we still hear, though they've done a lot to tamp it down, we still hear a lot about the reliability of the largest-scale deployments.

Matthew Dosanjh:

I will just hop in here for a second and say that this has been an ongoing topic in the MPI forum. We're finally at a point where errors aren't necessarily fatal, which we got to maybe two years ago or something like that. But even an out of memory error would just kill your MPI application which seems a little insane in retrospect...

David DeBonis:

I should think, I believe Matt is on the MPI committee, right?

Matthew Dosanjh:

I am the Sandia representative for the MPI Forum and in the open MPI developers community.

David DeBonis:

And I want to plug your conference. Oh, he's also co-chair, with another former colleague and friend, of Hot Interconnects this year, which is an IEEE symposium, I think in its 29th year...?

Matthew Dosanjh:

I believe so. I'm the vice chair. The chair is Taylor Groves out at Lawrence Berkeley National Lab, working in NERSC. And if anyone has papers they want to send over, we run a free virtual conference. So there's a lot of good talks and there's no cost to attend.

David DeBonis:

Cool.

Rose Stein:

Nice. Thank you.

Alan Sill:

So, I'll just say on behalf of the community, thank you. Because this is incredibly important work and it does require detailed understanding.

Rose Stein:

So, I have a question, guys, and I would like to hear from every single one of you, if you don't mind. And definitely it's going to change based on what the use case is. But say, for example, you're going to build a cluster for a university like Alan's: what would be the perfect system? How big would it be? What interconnects would you use?

Building the Perfect System [40:44]

David DeBonis:

I'll go first, because I have no idea how to address that, other than to say my focus has always been on using the power you have available; a large cluster that is not utilizing its power cap is wasting money. It's throwing money on the floor. So the research I did focused on power efficiency: how well did an application or a system use its power envelope, and how well could you reconfigure it into a lower power state, a smaller power window, or a longer execution time to help load balance a system? You can do that on any system, and you can make any system somewhat performant. I guess that's a subjective term, but I think that instead of asking how much money we can spend, how much fabric we can have, how much of the new fancy stuff is out there, how complex we can make it, we should start to look at our workloads and how we can tune them better.

And how can we really get that understanding of how well an application or a workload is utilizing the compartment it's running in. That would be my approach, because I don't know how to answer it otherwise.

Alan Sill:

Rose, let me back up a little from your question and define the terms, right? When you say university, let's just describe what that environment is. Gary's familiar with this as well, and you guys do have a lot of customers in this area. University workloads are typically highly mixed, in that they don't have just one dominant kind of application, or if they do, it's maybe due to the special funding circumstances of a group. But you have a mix of workloads from very small core counts up to very large. And if you ask yourself, I should have prepared a graph, I have a graph where I show this, the distribution of jobs versus core count is highly peaked at the low end, at small core count jobs.

But if you ask yourself, at any given time, how many of my cores are occupied by small core count jobs versus large core count jobs, after the scheduler has done its work of balancing things out, you find that it's sort of the ten-to-a-hundred and hundred-to-a-thousand core bins that get the most populated. So yes, there are a lot of very small core count jobs, but each larger core count job takes more cores. So if you ask how many of my cores are taken up, it turns out that the medium-scale jobs, in a lot of these environments, will dominate. So then you have to ask yourself, what can I do in designing my fabric, storage, and CPU to get the most productivity out? That's where the questions I raised earlier come into play.

So, I find, we have about 750 nodes here, and with a cluster on that scale, having non-blocking topologies just helps the scheduler do its job more efficiently, stuffing all the little jobs into the corners where it can between the big jobs. That's not the only way to do it. Look at Bridges; it's a couple thousand nodes, I think. They have an oversubscribed fabric with non-blocking in, I think it's not quite per rack, maybe it's a couple of racks' expanse. They do non-blocking per rack and then oversubscribe between racks. But that means your scheduler has to fit those bigger jobs into a rack, right? So, now I say rack, but the densities are such that you really are talking about a switch pretty soon.

And say you have a 40-port switch with 20 ports going to worker nodes and 20 going upstream, or maybe even oversubscribed. Then 20 nodes at 256 cores a node is a pretty big job. So these are the kinds of things that you have to factor in. And I think if you phrase it that way, in terms of workload, then you can get out of the area of just talking about a university and look at other similar workloads. This might apply to a small biomedical firm or to a number of different settings. When you get to the really large scale topologies, that's where the fancy fireworks come out, right?
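
Alan's switch arithmetic can be written out as a quick sketch. The port counts and cores per node mirror his example; the helper function itself is just an illustration.

```python
# How big a job fits under a single non-blocking leaf switch?
def leaf_capacity(ports: int, uplinks: int, cores_per_node: int):
    node_ports = ports - uplinks              # ports left for worker nodes
    return node_ports, node_ports * cores_per_node

nodes, cores = leaf_capacity(ports=40, uplinks=20, cores_per_node=256)
print(f"{nodes} nodes, {cores} cores under one leaf switch")   # 20 nodes, 5120 cores
```

In other words, with today's core densities a single leaf switch already holds a job in the thousands of cores, which is why per-rack or per-switch oversubscription pushes so much work onto the scheduler.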

Justin Burdine:

So it sounds like what's really driving this is understanding your workloads and understanding what you're really trying to solve. Not coming at it saying, how much can I buy, how can I build this out? It really is reverse engineering it. Okay. Interesting.

David DeBonis:

And especially with the rise of so much compute power in a GPU and the need for data movement, I think that changes the whole equation. I think Alan was mentioning that earlier. It's a really important aspect: you have to understand your community and your workloads, and what the hardware coming out next is going to do. Because what some of these supercomputers do is a first initial install, and then they plan out an upgrade within a number of years, a couple of years or so, so that they don't pay that full upfront cost every single time. They have a longer term path so that they can get reuse out of them. Knowing what the next generation of hardware is going to do is an important aspect of understanding the system you're building today and the longevity of that system for the future.

Matthew Dosanjh:

I think the one thing I can add here is that it also really depends on your user base. I was running a panel at ExaMPI last year on how we are going to use MPI from GPUs, and we had a very impassioned member of the audience saying, why does this matter? In industry, when we're using HPC, we only program for the CPU, because that's what's most cost effective. It takes too much developer time for a small company to port everything to the GPU and do all that other stuff. And there's a cost balance there, right? How much time am I going to spend writing my code versus how much money am I going to spend on my cluster, in core hours or whatever. So for a university, you really want to tailor it to your expected workloads. And if your users need something more than that, there are resources like NERSC and TACC where they can get allocations.

David DeBonis:

That's a good point.

Alan Sill:

We've also been talking about this just in terms of on-premises clusters, but the same considerations play out in the cloud. And the thing that distinguishes them is that, of course, the cloud tries to make its money by selling on demand. Fundamentally that means they have to deploy at a scale that's sometimes much larger, but still be able to partition it up for individual workloads. So there's recently been a paper, pointed out by Glenn Lockwood, which talks about the InfiniBand structure, the RDMA structure, of Azure storage, for example. And it's really a tour de force of doing exactly the things you need to do to have a very large scale fabric deployment, but still allow you to do things like take care of security considerations, keeping an enclave of it available only to one user. I'll see if I can find a link to that.

Rose Stein:

So guys, how is it measured? How do you measure the performance of your HPC interconnect?

Measuring Performance [50:00]

Justin Burdine:

And actually, I'd love to add on to that if we could. Because, Alan, you were talking about how there are HPC systems that are connected differently, right? In-rack we're doing it at one speed. So I'm always curious, how would you figure that out? How do you monitor it? How would you know where your bottleneck is? That's for everybody.

Matthew Dosanjh:

So we have a lot of different ways of looking at this, that I'm aware of at least. There's link-level benchmarks, and I know NERSC just collaborated on studies of noise in these large scale networks. What we focus on at Sandia is proxy applications, which are small applications that do something that looks like science, communicate like science, and use the RAM like science, but are easier to understand and let us test the core function or performance.

David DeBonis:

Matter of fact, the Mantevo mini-apps are very much focused towards different aspects of computing, different computing workloads. So certain workloads could be compute bound, memory bound, communication bound. They could be doing a lot of point-to-point exchanges, all of the different sorts of ways that you would... It's almost like the different behaviors of scientific codes condensed down into a small kernel, so that you can get these sorts of benchmarks. And some of the work when I was over at Sandia, I did some multi-generational studies, and they have a beautiful set of test beds there. A number of the clusters are different, some heterogeneous and some homogeneous, and they're outfitted with some of the more interesting newer generation stuff that's coming out. And we're able to see, at least at a small scale, we'll still do a scaling study at a small scale, 4, 32, 96 nodes, how an application or one of these mini-app kernels performs as you start to scale. And usually you can see how an application is going to scale just by doubling your size a few times.

Alan Sill:

With some caveats: many people don't handle the statistics of such benchmarking properly, and that's another whole webinar we could do. But the standard thing to do is to study your scaling. In practical terms, let me point out, if you're going to ask for an allocation on a national scale computer, you're going to need to demonstrate that your application is able to scale. Unfortunately, I have a little constraint on what we're able to say; we're actually doing some work within my center, with one of the member companies, on improving the state of the art for benchmarking MPI code. But to go back to what I started with, it's easy to fool yourself with statistics when you're benchmarking. You need to be careful, more careful than most people are when they're doing these scaling studies.

Ultimately, what the scientist cares about is how many units of X workload they can get through. But even simple tools like TotalView or debuggers can help you look at how you're using your memory and how you're using I/O. One of the things we have published from my center, and we've talked about in previous webinars, is that we have open source tools at github.com/nsfcac that you can use on pretty much any modern system that supports Redfish, to gather out-of-band statistics on things like memory power and CPU power. And we're working to fold in bandwidth statistics, so that you can do a touchless overview benchmarking of your cluster, where it is spending time and effort, without having to instrument code.
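
The out-of-band approach Alan describes can be as simple as polling a node's BMC over Redfish. Here is a minimal sketch; the BMC address, credentials, and chassis ID are placeholders, and the endpoint follows the standard DMTF Redfish Power schema rather than any specific tool from the nsfcac repository.

```python
# Poll a node's power draw out-of-band via the BMC's Redfish API.
import requests

BMC = "https://10.0.0.42"                 # placeholder BMC address
AUTH = ("monitor", "secret")              # placeholder read-only account

resp = requests.get(f"{BMC}/redfish/v1/Chassis/1/Power",
                    auth=AUTH, verify=False, timeout=5)
resp.raise_for_status()
watts = resp.json()["PowerControl"][0]["PowerConsumedWatts"]
print(f"node power draw: {watts} W")
```

Because the query goes to the BMC rather than the host OS, it does not perturb the running workload, which is the point David picks up next.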

David DeBonis:

I think Redfish actually utilized the Power API, which was a Sandia product that I worked on when I was there. And it's very important to get that out-of-band information and not interfere with the workload, or else you're perturbing the system. But it goes back to cost, like you were asking earlier, Rose. It isn't really dollar cost, it's science. Like Alan alluded to, it's the cost in science: how much science can I get done on this system, and how do I get it done efficiently, so I can get more science done? It's as simple as that, I think.

Alan Sill:

So Gary, how often do you get researchers coming to you and saying, can you speed up my code versus just, I can't get this to run?

Gary Jung:

It does happen. Most of the issues these days that I can recall, say in the last couple of months, have been I/O related. So it hasn't been so much the interconnect as the I/O. But it does happen. They do, they do.

Justin Burdine:

So, I guess we're getting towards the end. One of the things I was curious about, because I talk to a bunch of customers who are looking for direction: what is the state of the art right now if you were to recommend an interconnect? I know it obviously depends on the workloads, but what's the state of the art? And Alan, I know you may not be able to dive into what you were talking about, but if we were to go off the shelf, what would that be?

State of the Art Interconnect [56:31]

Alan Sill:

Listen, I think it's more competitive than people realize. Like Gary said, most people's starting point is InfiniBand. But let's face it, InfiniBand is expensive and touchy. I've got both an InfiniBand and an Omni-Path deployment, and the Omni-Path never gives me any trouble. With InfiniBand, I have to be careful with every version update, and there seem to be lots of ways that small communication errors can interrupt performance. I'd mentioned the switchless fabrics, the Rockport style ones, where you just put in a different PCIe card. You don't have either InfiniBand or Omni-Path, but you are limited in the number of nodes you can connect. The great advantage of that is there's very little latency. I mean, essentially no latency. Well, not none, but you're at PCIe backplane latency at that point.

Then we also have the PCIe interconnects, the composable computing fabrics. So you can buy a box of GPUs and shove them in. And depending on your workload, again, if you don't need those GPUs to communicate, you could save a lot of money on interconnects by just having a box of GPUs and running them separately. So, all of the above. Okay. But I think most people do start out, as Gary said, looking at InfiniBand fabrics, and that's a good place to get your feet wet.

Matthew Dosanjh:

I think, what was it, with Mellanox now being owned by NVIDIA, and Omni-Path spun out of Intel, there's a hardware dependency that is a little bit baked in to some of these.

David DeBonis:

I would say affinity. I worried about the same sort of thing when Xilinx and Altera left the marketplace and got consumed by AMD and Intel, same sort of thing. They're smart; they know that's going to be the optimization path, and they also know that the ecosystem and the people out there who can do those things are very limited, even though Xilinx was trying to create the tools for an ecosystem of design engineers to make their own fabrics. So in a way you're steered, you're always a slave to the industry and the movement in the industry. So your decisions about what you get could have to do with who owns what, which is a sad reality, I guess.

Alan Sill:

So, we're running out of things that Nvidia hasn't bought, right?

David DeBonis:

Right.

Justin Burdine:

Well, going down that path, do we see GPUs as being where this is all heading? Or is there still a need for CPU based compute, just for pricing or cost?

Where is it All Heading [59:49]

David DeBonis:

I think it depends on the workload. And yes, we definitely do see a need for that, with the amount of small compute and large data movement that you see in some of the machine learning codes and training.

Matthew Dosanjh:

Well, and it also depends on your users, right? I alluded to this earlier. We have a bunch of people working on GPU codes, but there's a lot of people who don't want to deal with that, I don't want to deal with it: the added complexity to their science of having both a CPU code to manage everything and then moving stuff over to the GPU and moving it back, doing your communication, or trying to figure out how to do MPI from the GPU. But that's in its early stages and has not been standardized. We're working on that in the MPI Forum, but it's not a reality yet.

Alan Sill:

I think it's a little informed by the topologies of the big clusters too, right? Let's see, Frontier has four GPUs per CPU, if I recall correctly. And I think the new AMD CPU-plus-GPU systems are just being deployed, where you have essentially everything on a socket, or all within a small portion of the backplane. That's great. So my controversial take, which I can never get hardware vendors to agree to, is that when they build a national scale, really big system, they should sell little units of that system to a bunch of universities, just identical, because then we can get the most out of the software development. But I can never get any of the hardware vendors to agree to that, because they lose their shirts on these big systems.

Justin Burdine:

Well, we are coming up on the top of the hour. Any final thoughts from anybody?

David DeBonis:

Well, I want to thank Matt for doing this at the last minute. I just called him an hour before this to see if he could make it. I'm glad he could. Matt's a good friend of mine here in Albuquerque, and it was good to see him.

Justin Burdine:

Absolutely.

Matthew Dosanjh:

Thanks for having me.

Justin Burdine:

Sure. Love to have you again. This is great.

Rose Stein:

Thanks Matt.

Justin Burdine:

Any other final thoughts? Everybody else? Good? We talked it out.

Rose Stein:

Talked it out. Well, you guys, make sure that you like and subscribe. We are here every single week, same time, same place, same channel. We will be here. If you have suggestions on topics for webinars, things that you want to dive into, please leave those in the comments. We want to read them, and we want to serve our community the best that we can. So thank you, everyone on the panel, for being here. Thank you for watching. You are very amazing. And we'll see you next week, same time. See you guys.