Achieving Scalability in HPC for Complex Workloads and Beyond
Scalability is a fundamental factor in High Performance Computing (HPC) systems, enabling organizations to handle massive and intricate workloads while remaining adaptable to changing requirements.
This webinar will examine how HPC systems empower organizations to rapidly process workloads. We will also highlight their limitations when facing workloads beyond their initial scope.
Webinar Synopsis:
- What Is The Difference Between Cloud Scalability And Physical Scalability?
- What Is Infrastructure Scalability In HPC?
- How Do You Plan For Infrastructure HPC Scaling In The Future?
- What Information Goes Into Making HPC Plans?
- What Does HPC Scaling Offer An Institution?
- Does Using Containers Affect Scalability?
- What Software Can Help You Analyze HPC Utilization?
- What Are The Limits That Impact Scalability In HPC?
- What Are The Best Practices For HPC Scalability?
- Is Cloud HPC Scaling Being Used In The Industry?
- How Can Fuzzball Help You Plan And Analyze For HPC Scaling?
- How Does Automation Help Maintain An HPC System?
- What Makes System Administration So Hard In HPC?
- Adapting Warewulf To Suit Specific HPC Scaling Needs
- How Do You Balance Price To Performance When Building An HPC System?
- How Long Does HPC Infrastructure Last?
- How Do You Identify Bottlenecks That Affect Scalability?
- Can You Automate Bottleneck Discovery?
- How Do You Account For Cost Of Storage As Data Grows?
- How Do You Know What Resources An Application Needs?
- What Tools Are Used To Build An HPC Cluster And Scale Up?
- Does The Use Of Containers Affect Scalability?
- Is Disk I/O Less Of A Limitation Today?
- What Can You Use For Data Collection And Monitoring At Scale?
Speakers:
- Rose Stein, Sales Operations Administrator, CIQ
- Brian Phan, Solutions Architect, CIQ
- Jonathon Anderson, Senior HPC System Engineer, CIQ
- Gary Jung, HPC General Manager, LBNL and UC Berkeley
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Full Webinar Transcript:
Narrator:
Good morning, good afternoon, and good evening wherever you are. Thank you for joining. At CIQ, we're focused on powering the next generation of software infrastructure, leveraging the capabilities of cloud, hyperscale and HPC. From research to the enterprise, our customers rely on us for the ultimate Rocky Linux, Warewulf, and Apptainer support escalation. We provide deep development capabilities and solutions, all delivered in the collaborative spirit of open source.
Rose Stein:
All right, cool. Hey everybody, thanks for joining us live on CIQ's YouTube page. We are so excited to be here, wherever you're watching. Welcome, welcome. My name is Rose Stein. I work at CIQ, and we are here to talk about scalable computing. And what's actually really fun, and this is probably my favorite part of the job, is when a topic will seem like, oh, what exactly is that? Is it this or is it that? And then I go and I ask any of the engineers in our Slack channel, and there are differing views even on the same topic. And that is probably the most exciting part to me, because then we get to define what it is that we're actually talking about here and really dive into the importance of this particular topic. So Jonathon, thank you so much for showing up. It's good to see you.
Jonathon Anderson:
Good to see you too, Rose.
Rose Stein:
Awesome. And Brian, welcome.
Brian Phan:
Hey, good to see you. Good to be back on the webinar, Rose.
Rose Stein:
Yeah, it's really good to have you. Okay, so the first thing we might want to just define, because when we're talking about scalable computing, there seems to be just a couple different ideas out there on what exactly we're talking about. So let's hone in and really define what it is that we're going to be talking about today on this webinar.
Jonathon Anderson:
Yeah, so the discussions that we had in prep were divided into two different contexts. There's scalability of the infrastructure that you're running on. Oh, oh yeah. Hey Gary.
Rose Stein:
Hey Gary. Welcome.
Gary Jung:
Hey. Hi.
Rose Stein:
Okay, cool.
Jonathon Anderson:
Yeah, there's scalability of the infrastructure that you're running on. Gary, I think we're getting echoes from you. There we go.
Rose Stein:
Maybe I'll mute too while you're talking.
Gary Jung:
Hey,
Rose Stein:
I'll mute too while you're... it's you, Gary. Gary, you're echoing. You're doing like a...
Gary Jung:
Yeah, I hear a lot of echo. I'm not sure why. Sorry about that.
Rose Stein:
When you talk, it actually sounds fine, but then when we're talking it sounds weird, so, yeah.
Jonathon Anderson:
Well.
Rose Stein:
I'll just have you on mute until we get to you, because you sound fine when you're talking. So hang on a second here, Gary. I'm glad you're here though.
What Is The Difference Between Cloud Scalability And Physical Scalability?
Jonathon Anderson:
Yeah. So you can talk about the scalability of the infrastructure you're running on, and there are two areas that we were discussing that in. There's scalability from an on-prem infrastructure out into the cloud, let's say. That's one way to look at it. Another is the scalability of your physical infrastructure and how easy it is to expand that infrastructure. Both of those are predicated on the assumption that there's this imagined infinite pool of resources that you can grow into, either by deploying new equipment locally or utilizing public cloud infrastructure. The other major part of the conversation that we had was about applications being able to scale out and use the infrastructure that you have available. And so within HPC we usually see this through the use of MPI codes to do distributed memory computing. But there are others: in the high throughput computing space, you can just do embarrassingly parallel algorithmic work, or there are alternatives to MPI. One that we've been working with recently is PGAS and the GASNet protocol, as part of the Chapel programming language, and others.
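As an illustration of the distributed-memory model Jonathon describes, here is a minimal sketch using mpi4py (the example, including the package choice, is ours, not from the webinar): each MPI rank works on its own slice of the problem, and the partial results are combined over the interconnect.

```python
# Minimal distributed-memory sketch: each rank sums a disjoint slice of
# a range, then the partial sums are combined. Requires mpi4py and an
# MPI launcher such as mpirun.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID
size = comm.Get_size()   # total number of processes

N = 100_000_000
# Cyclic distribution: no rank ever holds the whole problem.
local_sum = sum(range(rank, N, size))

# Combine the partial results across all ranks.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print(f"total = {total} computed across {size} ranks")
```

Launched with, for example, `mpirun -np 4 python sum.py`, the same script runs unchanged on one node or many.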
Rose Stein:
Awesome. Thank you for that, Jonathan. So what is it, Brian, that we are going to be focused on today in the topic?
What Is Infrastructure Scalability In HPC?
Brian Phan:
Today we're going to be focusing on, I guess in terms of scalability, scalability from an infrastructure perspective. And I'd like to add one more point to Jonathon's point. In the CFD world, there's also the scalability of the model that you're running. So if you have an extremely coarsely meshed model and you distribute it across a lot of nodes, the performance will taper off the more cores you throw at it. If you have a bigger problem, you basically have a finer mesh with a lot of points that you need to run simulations on, and throwing more cores at that finer-meshed model will yield better scalability. But for today's topic we will talk about scalability of the system and software. Yeah.
Rose Stein:
Did you have something you wanted to add to that, Jonathon?
Jonathon Anderson:
One of the things I was thinking as Brian was discussing the scalability of a model is that, and forgive me Brian if I'm just reiterating part of what you said, when you talk about how scalable a model is, you will often see a point of diminishing or limited returns on it. And so when you're assessing how scalable a given code or a given model or a given algorithm is, often you're talking about how far you can scale it before the communication, let's say the work of doing the scaling, becomes dominant in the computation, and you're not able to scale to a larger system at that point. So both of these are important in the general conversation. There's no point in having a larger system if your code can't take advantage of it. And there's no point in having a scalable code if you don't have a system large enough to accommodate it.
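To make that point of diminishing returns concrete, here is a toy strong-scaling model (our illustration; the constants are made up, not measurements from any system discussed here):

```python
# Toy strong-scaling model: compute time shrinks as you add nodes, but
# communication time grows, so speedup tops out and then degrades.
def runtime(nodes, t_compute=1000.0, t_comm_per_node=2.0):
    """Estimated wall time: perfectly divisible compute plus a
    communication cost that grows with the node count."""
    return t_compute / nodes + t_comm_per_node * nodes

base = runtime(1)
for n in (1, 2, 4, 8, 16, 32, 64):
    print(f"{n:3d} nodes: speedup {base / runtime(n):6.2f}x")
# Speedup climbs toward roughly 11x around 16-32 nodes, then
# communication dominates and adding nodes makes the job slower.
```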
Rose Stein:
Thank you. So it works together. So it sounds like a bit of planning, like forethought of knowing where you're going might also be helpful and what it is that you want to do with your system and your code.
Jonathon Anderson:
Yeah, absolutely. The environments that I've worked in in the past have been general-purpose university systems or similar, where there is one system and it's used by a variety of different researchers and research areas. And that sometimes makes it difficult to do that kind of planning. But in the best case, when you know the kinds of applications that your researchers are going to run, you can see how scalable those applications are and design a system directly around that. And then if you have software developers working on that algorithm, working on that application, they can take specific advantage of the characteristics of the system that they're on. So it's always a back and forth and a planning exercise, in the best case.
Rose Stein:
Gary is back.
Gary Jung:
Yeah. Yeah. Can you hear me okay?
Rose Stein:
Yeah. All better.
Gary Jung:
That's great. Yeah. Okay. Sorry about that. Yeah, for some reason I just needed to reboot my machine.
Rose Stein:
You say for some reason, but it happens often with all of us. Like, just gotta work out the kinks. I mean, the brain is the same, right, as a person? Sometimes I just need to take a step away. So, anyway, glad that you're back, Gary. Do you want to introduce yourself and talk about how scalable computing relates to what you're doing at Berkeley?
Who Is Gary Jung?
Gary Jung:
Sure. Yeah. My name is Gary Jung. I manage the institutional high performance computing effort over at Lawrence Berkeley National Laboratory. And I also manage the HPC program down at UC Berkeley.
Rose Stein:
Wow. Never a dull moment in your world, huh?
Gary Jung:
No, no.
How Do You Plan For Infrastructure HPC Scaling In The Future?
Rose Stein:
So we were actually just talking about thinking ahead, right? When you're thinking about scalability, not only your infrastructure but your system, your software, all these things need to work together, and there's the idea that we need to think ahead. So I imagine that LBNL and UC Berkeley are similar but have some differences. When you're thinking about their compute systems, what are some of the similarities and differences in terms of their forethought in scaling, and maybe even some of the trouble that they've run into by not thinking ahead enough?
Gary Jung:
So for us, I usually think of scalable systems as building blocks. And I don't want to repeat anything; I don't know if anybody has said this already...
Jonathon Anderson:
But I think you're going in a new direction. It sounds great.
Gary Jung:
Yeah. Yeah. We think of it like an infrastructure that you can scale out using building blocks. And so usually there's a lot of planning in the beginning in terms of how big you might eventually want the overall structure to be, because it really depends how you're going to set up the backbone for the InfiniBand fabric, or how much you're going to scale. So usually the questions are: ultimately, how far do you want this to scale, or how much do you want to plan for scaling it out, so that you don't limit yourself in terms of how you set up your infrastructure. So that's usually the way we set this up. And what we do is we tend to build two different environments.
Between Berkeley Lab and UC Berkeley, the workloads are different. Maybe one way I can generalize it is to say that Berkeley Laboratory does large team science. So the projects tend to be bigger and the codes tend to be more, let's say, capability-based, because they're larger projects and sometimes have larger compute requirements. But the way we think about building the systems is still similar; the systems may just differ in terms of whether you scale more network into it or how much performance you build into the system. So I'll leave it at that, but when I talk about scalability, we're talking about building something with building blocks, and ultimately how you design the infrastructure for that.
What Information Goes Into Making HPC Plans?
Jonathon Anderson:
Rose, if I may ask a question here. So when you're doing that, it helps that you have a known, maybe not well-characterized, but at least some characterization of a differentiation between those two environments. So that speaks to being able to look at that workload and plan the environment around it. But what variables go into your expectation of how far you'll want to scale a given environment before it's replaced by a new one? Is that a look at historical trends and how the compute load has grown over time? Or is that an individual exercise with the researchers that are on it? What does that look like in your environment?
Gary Jung:
You know, we've done it mostly by historical trend, because a lot of times you're thinking about how much capacity you're going to have. So we do gather research input, but we support such a diversity of research that it's hard to build for any one workload. So we tend to build more generalized infrastructure, which then has to be pretty good in most dimensions if you're going to build for a diversity of research and applications. So the things that are taken into consideration are budget, and physical limitations such as data center space and power; those might be other considerations for scaling. They tend to be more of those types of things. They're more the constraints than, let's say, the system performance.
Jonathon Anderson:
Fair enough.
What Does HPC Scaling Offer An Institution?
Rose Stein:
So this is really cool, Gary. I'm so glad that you were able to show up, because it's like you are here with this real-world example of the things that you can do when your system can scale. And then at CIQ, we empower people, right? I've been on calls with Jonathon talking to different universities and different companies, and it's like, okay, well, what is it that you want to do? And okay, let's structure it this way so that we know where we are going and what it is that you'll need. And Brian and Jonathon are helping build some structure for people so that they can scale it the way that they want to. So I'm so glad that you're here, Gary. And Jonathon, Brian, could you give some examples of things that are possible because we have been able to scale at such an incredible rate?
Brian Phan:
I think, scaling from a capacity standpoint: for organizations that are in the automotive or aerospace industry, a lot of the simulations they run typically run for weeks to maybe even a month, and their on-premises system will usually fill up. So with the ability to scale to the cloud, they're able to run their simulations there and get their results. And as a result of that, it's possible that they can build and manufacture their product and actually get to market faster by leveraging the cloud.
Jonathon Anderson:
Yeah. The best case scenario is that there are two main things: you either get your result faster, or you get more work done at the same time. Right? And the part of that that I am most interested in personally is getting your work done faster, but that's from a research background, I think. In a production mindset, if you're in manufacturing or something like that where you're doing a lot of the same work, it's about throughput; that's one interest. But what I like to see is that the computation is only one part of the research cycle, right? You put some work into a system and you get some data out, and then humans think about what happened and what they saw out of the computation, or they learn something from it. And the faster you can get that iteration, the more you get to have human innovation be part of the equation. The longer it takes for you to get your result, the more time the researchers spend waiting and wishing that they knew what the next step was going to be. And so that's where my interest in this is: driving that time to solution for the next iteration of innovation and interest down.
Does Using Containers Affect Scalability?
Rose Stein:
Gary, I imagine too, and this is one of the things I see at specific universities, that's what's popping into my mind, but I'm sure it's other places as well: it's like, well, most of our professors are wanting to really utilize the resources of our compute system and be able to have their own containers and do things. So when I'm thinking of scalability in that sense, it's like there are more and more and more people wanting to utilize the resources and run jobs. Have you seen that as something that you've had to work on too, Gary?
Gary Jung:
So you're saying using containers then?
Rose Stein:
That's the, yeah, specific example that I am thinking of right now in terms of scalability. How does that relate to it, where there are more and more people utilizing the resources of the system, being able to tap in and run jobs?
Gary Jung:
Yeah, you know, we recently built a large system, a hundred-node system, for the Joint Genome Institute, which is based out of Lawrence Berkeley National Laboratory. They used to run their workloads at NERSC, the National Energy Research Scientific Computing Center, and recently moved a lot of their work over to our institutional systems, and we built them a specific system for this. One of the things about supporting life sciences codes such as at the Joint Genome Institute is that there are so many different codes, and there's just so much work in building the environments to support each one. And so we took a different approach when we built this cluster, in that we have it running containers.
So most of the people are building containers and submitting their jobs to run as containers on this cluster. And it makes scaling a lot easier, because not only can they run on our system, but they can essentially run on any system that runs containers. We're essentially in the process of getting all the users to convert all their applications over. So now they're going to run on a new cluster, and then, because it's all container-based, it can also run other places too. And so they're going to be running at other national laboratories and other computational resources they have access to. So they are scaling, in effect, outside the institution, because they have an easy way to do that.
Jonathon Anderson:
Rose, I really like this idea of thinking of scalability as a factor in computing accessibility. You can think of the negative, degenerate case: I have this machine right here in front of me and I can do a certain amount of work on it, and the other computers that the rest of you are using are inaccessible to me, so I'm not able to scale my work out to them. One of the things that I really enjoyed doing at a previous site was we built what we called a condo cluster. Historically at this institution, which is not atypical in universities, you'll have a researcher come in with some initial seed funding or grant funding or something like that, and they'll buy a small cluster, maybe a couple of nodes, maybe let's say 16; that's not an abnormal size for that kind of environment.
And their researchers will be able to use those instruments, those resources, and they can scale up to that size, and that's the limit. But our goal in building this condo cluster was to try and co-locate that as a social exercise. So we would help them identify what equipment to purchase and make sure it was all part of one scalable infrastructure. And it was incredibly heterogeneous. We often think of scalability in the terms Gary was using, where you have a uniform building block, and whenever you scale up you have more of that exact same kind of infrastructure. This was the opposite of that. Each cluster was almost bespoke, but then it was all part of one shared infrastructure. And the best case scenario there was, we had one customer that had real money to buy real equipment, but they didn't have as much experience standing up a cluster.
So they brought it to us and they said, hey, this is what we want to buy; is this something you could help us manage? And we did, and we did it as part of this single infrastructure, and we got it up and running two months before they were actually ready to take advantage of it. In that time, that system was something like 93% utilized for those two months, even though the original customer hadn't even touched it, because the people that were already part of the infrastructure were able to scale out into the other resources that had been added to it. And so I think it's an interesting point that scalability isn't always just about the performance that you can get with the equipment you purchase, or what you have made available to yourself within your enclave, within your group. It's about making as much of the resources available to as many of the people as possible. Computers that sit idle aren't useful to anyone, and the more you can utilize them, the better.
What Software Can Help You Analyze HPC Utilization?
Rose Stein:
Yeah, I like that. Thanks for that real-world story. And it wraps back into Apptainer and other software that helps people scale in the various different ways that we're talking about. What is some other software that you really like that is helpful for that as well? Everyone's kind of looking up thinking. Brian, you got one?
Brian Phan:
I like Fuzzball.
Rose Stein:
Well that's a good one. I think you're familiar with that. So Fuzzball actually uses containers as well, correct?
Brian Phan:
Yes, yes, that is correct.
Rose Stein:
So how could that help companies scale and what part of scaling are we talking about that Fuzzball can help them do?
Brian Phan:
I think with Fuzzball Federate, from a capacity standpoint, it could help orchestrate an organization's workflows on, I guess, the optimal hardware, or at least where there is capacity. And I guess it would depend on what the organization's doing. The particular use case I'm thinking of is genomic sequencing, where you're just trying to get the throughput of sequencing as many samples as possible, if you're looking for something like a VCF or something. Yeah.
Jonathon Anderson:
Yeah. And part of that is that it's just the next step in this entire conversation of making the scalability available and accessible to people. Gary was talking about the benefits that he's seen with people around him in having their workload containerized: running it that way locally meant that they were able to scale portably beyond the local environment, because they can take the work that they did in preparation and run it in other environments as well. And that's true pervasively with containerization, and in particular with containers in the HPC space. But that's the exciting part about Fuzzball to me: it makes it even easier to take that work that you containerized. It's not a whole new way of doing things. It's building on that experience that we've had, and the benefits of containerizing it, making it portable to a number of environments, including Fuzzball, but not even exclusively Fuzzball.
Rose Stein:
Awesome. Awesome. So it sounds like there are a lot of different things that we can do, and feel free to correct my terminology and my words, but the idea is that you can scale up, so you have more compute power within the infrastructure that you already have, or you can scale out, where you actually add more computers, servers, nodes, clusters, whatever it is. There are also different software and applications that you can use to speed things up, just as you guys were saying. So the data that you get back, instead of it taking a month to process, it only takes a week, an hour? Okay, maybe I'm going a little wild here, but right, shorter and shorter time to get the results that you are looking for, and you can do more. And then there is also the scaling in terms of allowing more people to have access to the resources that are available. Does that kind of sum it up?
Jonathon Anderson:
Yeah, I think so.
What Are The Limits That Impact Scalability In HPC?
Rose Stein:
That's the basic gist. Awesome. So what are some of the limits that impact scalability in HPC? I mean, we touched on this, but just specifically.
Jonathon Anderson:
Yeah, and my experience mirrors what Gary was talking about: often it's about power, cooling, and budget more than anything. But my experience is similar in that it's in that kind of general-purpose environment, where you're trying to build something for the common case that makes it available to the most people. In theory there are also limits on what your algorithm or your code is able to do, but power, cooling, and budget tend to be where you actually find your limit first these days.
Rose Stein:
Yeah. So you can do anything with the right amount of money, huh? That's good to know. Brian, did you want to add to that?
Brian Phan:
No, I think those are the main limits that I've run into in my experience as well.
Gary Jung:
Yeah.
Jonathon Anderson:
Although what we've also talked about is just access to the equipment, too. When things are partitioned out, when you don't have the ability to get your code onto it, from the user's perspective that may be an even bigger thing. We would run into people all the time that didn't even know we had an HPC infrastructure and were running things just locally on a machine. So the same conversation we've been having applies there as well. Sorry, Gary, for speaking over you.
Gary Jung:
Oh, no, no, I was just agreeing. The thing that people run out of is usually money. But there are times where certain codes won't scale beyond a certain number of nodes or something, and it really has to do with the type of code it is. So there are limitations with certain applications.
What Are The Best Practices For HPC Scalability?
Rose Stein:
All right. What are some best practices, Gary? You've got a lot of experience in different fields. You do a lot. So what can you share with us about best practices in terms of scaling in the variety of ways we've talked about?
Gary Jung:
Yeah. You know, just planning for it; planning for scalability is important. Think way ahead about how you buy your first system. A lot of times people are thinking ahead like, okay, I'm going to start with this system and it's going to get bigger when I get more funds. And so there's always a decision point about how much should go into the infrastructure and how much should go into the resource itself. And you have to make sure you don't limit yourself in terms of what you choose for the infrastructure, so you don't make decisions that are going to limit you later in terms of how large the system is ultimately going to scale. So there are a lot of architectural decisions that have to be made when you make a scalable system.
The other thing that people don't pay as much attention to is the performance of a scalable system. A lot of times people will start off with a small system, and maybe they'll benchmark the small system and say, okay, great, I'll just buy more of these. They'll assume that they can just buy more of these and put them together and it's all going to work, but it is important to go back and benchmark and make sure that you are getting the performance that you had intended out of the large system. That's a step that a lot of people don't pay attention to. We'll get large systems in and we'll benchmark them, and the vendor will say, well, we have tons of these in the field. And I'm like, well, how come nobody's seeing these problems? And they are real problems that we find. It's just that people don't do the due diligence to check the performance of the overall system. And that's worthwhile to do, because you want to make sure you get your money's worth and can deliver the performance that you thought you were going to get.
Rose Stein:
Right. Yeah, that totally makes sense. Say you've got all the money in the world, but if you just keep adding and adding and buying more and buying more, and it's not working optimally or efficiently or the way that you expect or need or want it to, you can pile things on and it's not going to do you any good.
Gary Jung:
No, no. Yeah.
Rose Stein:
Yeah, that makes sense. I do want to say, because we are live, so if you guys have questions, feel free to pop them in and we'll have one of our people put them up on the screen here so that we can answer your questions. I mean, we are here, we are definitely live and we are ready to be helpful.
Gary Jung:
I was just going to say, Brian mentioned something about scaling out to the cloud, and I do want to make a pitch for that. Depending on the application, we've found that for a lot of applications that need to scale, the demand is just at certain times, and then the cloud works out really well. You have simple tools, like the ability to auto-scale cloud instances, that make it perfect for, say, doing a release of new data or something like that, where everybody's coming to the site and they'll just overload your on-prem system. It's hard to design an on-prem system for something like that, where you'd have to design for the peak load. So that's a perfect use case for the cloud: you can build an infrastructure to do that on the cloud. And I would bet a lot of the people doing streaming video, movies, or whatever it is online, where it's going to be busy around the holidays and quieter other times when people are off, it makes sense for people like that to use the cloud.
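Gary's peak-versus-baseline argument can be sketched with some back-of-the-envelope arithmetic (all figures below are hypothetical, purely to illustrate the shape of the trade-off):

```python
# Hypothetical cost sketch: spiky demand makes an on-prem system sized
# for the peak expensive compared to bursting the spike to the cloud.
baseline_nodes = 100      # steady load, all year
peak_nodes = 1000         # load during a data-release event
peak_weeks = 2            # how long the spike lasts each year

onprem_cost_per_node_year = 3000.0   # amortized hardware + power (made up)
cloud_cost_per_node_week = 250.0     # on-demand instances (made up)

# Option A: size on-prem for the peak.
size_for_peak = peak_nodes * onprem_cost_per_node_year

# Option B: size on-prem for the baseline, auto-scale the spike out.
burst = (baseline_nodes * onprem_cost_per_node_year
         + (peak_nodes - baseline_nodes) * cloud_cost_per_node_week * peak_weeks)

print(f"peak-sized on-prem:     ${size_for_peak:,.0f}/year")
print(f"baseline + cloud burst: ${burst:,.0f}/year")
```

The numbers are invented, but the shape matches the point: a brief spike is expensive to own year-round and cheap to rent, while a sustained load, as Gary notes next, is usually cheaper back on-prem.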
Rose Stein:
Yeah, thank you.
Is Cloud HPC Scaling Being Used In The Industry?
Jonathon Anderson:
Do you see much of your computation scaling out to the cloud, Gary? Or are you seeing that more as a service offering when you have things that are consumed more by a general public audience?
Gary Jung:
Right now a lot of stuff on the cloud is not used for scaling out; it's really more for getting things up quickly. And it just has to do with the fact that when you start scaling out, the cost of the resources gets expensive. And we don't have as many use cases where people just have a short project and only need it for a short amount of time. So then it's more cost effective for us to bring it back on-prem. Yeah, we'll see that people will start to scale out, and then they'll see the cost, and then we'll bring it back on-prem.
How Can Fuzzball Help You Plan And Analyze For HPC Scaling?
Rose Stein:
Hmm. So it would be really cool if there was some software that could know your on-prem resources and know your cloud resources and delegate these things appropriately based on what the job is.
Jonathon Anderson:
That would be cool, wouldn't it?
Rose Stein:
I think that would be cool. Brian. Yeah. What do you think?
Brian Phan:
I think Fuzzball Federate should be able to do that. Yeah. Reach out to us if you're interested.
Rose Stein:
That is definitely something that we come across, right? In just total organic conversation it's like, aha, there's a problem, let's solve it. So yeah, that's definitely one of the things that Fuzzball can do and is doing. Yeah.
Jonathon Anderson:
It goes back to the accessibility conversation. One of the biggest impediments there is that often running in the cloud is operationally very different than running on local infrastructure. They're culturally different, they're structurally different, managing them is different. And so one of the goals of Fuzzball is to make that more similar: if you have a user workload that's already containerized, in the same way you would do with the standard Apptainer environment, you can take that workload and run it in the cloud or on-prem and it feels exactly the same. You don't have this portability problem of taking your workload that works in one environment and now having to convert it to another because the cost-benefit analysis shifted a different way.
Gary Jung:
Yeah. I'll add one more thing to that. One thing that's nice about Fuzzball Federate is that if you're using resources at multiple institutions, for example, then you don't have to worry as much about the points of integration between the infrastructures of the two institutions, because Fuzzball Federate just sits at a level that's a little bit higher. It doesn't require all the low-level integration that you would need when you're sharing data and such between two institutions.
How Does Automation Help Maintain An HPC System?
Rose Stein:
Yeah. All right. You guys can keep going on this topic if you want to, but I'd love to hear some more best practices or strategies that you've come across that you'd like to share.
Jonathon Anderson:
I was going to say automation, because the least scalable thing in your infrastructure is the people. So the more that you can start out anticipating that future scaling, and using your smaller system to iterate on getting things automated and replicable, the better; if you don't do that at the beginning, then as things start to scale up you're going to have a mess of one-offs and by-hand configuration. We've had a lot of benefit there historically through the configuration management practice, with a number of products there within CIQ, and I personally tend to propose stateless management of your infrastructure whenever possible. That's how the Warewulf system helps manage this scalability: you have a single node image that gets deployed out to all of your compute infrastructure. And then there's work going on to try and unify those node images that you might deploy locally with what you might see in a cloud environment as well. So keep as much as you can as simple as possible, so that it can be automated, replicated, and scaled out and up whenever necessary.
Brian Phan:
I think a lot of times users hit the limit of not knowing what they are able to do on a system. To address that, have good documentation so that the users can help themselves. That lightens the load on the administration team, so the administration team can actually focus on building the users a better system.
Rose Stein:
Yeah, I like it. And this is all really good, and I imagine that any admins out there are thinking, like, thanks for the suggestions, guys; that takes time.
Jonathon Anderson:
Yeah. But honestly, we're talking about products that we have here at CIQ that people might be interested in; there's also just consulting and services that we're able to offer. And if any of this sounds interesting but you still need help, get in touch, because whether it's with one of our products or not, we have a lot of HPC experience here, and we would love to work with you, to talk with you about your infrastructure, what problems you're having, and how we might be able to help you with a solution.
What Makes System Administration So Hard In HPC?
Rose Stein:
Yeah. Agreed. All right, guys, is there anything that you want to add? I think we covered most of the things that we wanted to talk about in terms of scalability. I don't believe that there are many questions from you guys. We must have made you speechless with all this amazing information. So I just really want to thank all of you for showing up. Do you have any closing remarks, Gary?
Gary Jung:
No, but I did appreciate that Jonathon brought up the fact that system administration is hard to scale, so it is worthwhile putting in the time to make sure that you use practices that make the best of your labor. That was originally the motivation for building Warewulf: the original project it was used on, at Berkeley Laboratory, had 10 standalone systems, and we said there is just no way that we're going to do 10 standalone clusters and manage them all as individual compute nodes. And so Warewulf was used for that, back in 2003.
Rose Stein:
Hmm. That was a little while ago, man. I think Warewulf has changed a bit since then.
Gary Jung:
Yeah. Yeah, that was the original one.
Rose Stein:
That was the OG?
Gary Jung:
Yeah. Original.
Rose Stein:
Yeah. So what version of Warewulf are you guys using at this point?
Gary Jung:
I guess we're deploying the current version on our newer infrastructure. We had Warewulf version 3, but it was hand-modified to have a lot of the features of version 4. So I'd say we have an unofficial version 3.5 on some of our infrastructure.
Jonathon Anderson:
Sure.
Adapting Warewulf To Suit Specific HPC Scaling Needs
Rose Stein:
Yeah, I hear that. Jonathon, you probably hear that a lot as well.
Jonathon Anderson:
It's the nature of working on Warewulf while you're using it that you're always on an unofficial fork of a little bit more than what's currently available. One of the things I've really enjoyed about it is that it's a relatively accessible code base. So we're interested in getting people invested and involved in the community around Warewulf, for sure. But it also is a really flexible tool just in your environment. It's easy to modify, easy to manage.
How Do You Balance Price To Performance When Building An HPC System?
Rose Stein:
Yeah. Yeah. Love it. Okay, it's definitely question time, so put them up. We got off on the Warewulf tangent while we were talking about other things; that's definitely something we can help you guys with too. All right, Igor, what's up, Igor? Thanks so much for being here. Do you always build the best and fastest servers, and how do you know what the capacity will be in a few years?
Jonathon Anderson:
I'll say that when I was last doing this calculation, what we were looking for was performance per watt. Generally speaking, when selecting a CPU, it was either budget-dominated, and then you were looking for performance per dollar, and maybe that would mean you spread out to more systems; or it was dominated by power and thermal constraints, and then you're looking for performance per watt. I expect, Gary, you have thoughts on this as well, and probably more than I do on predicting capacity in the next few years.
Gary Jung:
A lot of times you look at performance per dollar, but one of the things I do pay attention to is that we're not necessarily picking the fastest processors. For the servers, a lot of times the limiting factor is memory bandwidth. So we might get a little more conservative on the processor, but we make sure we get the best memory bandwidth, because once you start taking the bandwidth and dividing it by the number of cores on a node, you can see how much bandwidth you're getting per core. And that's a really important number for a lot of people, so that's something to look for when you do that. Go ahead. Go ahead, go ahead. Okay.
Jonathon Anderson:
On a more subjective metric, the question was about the best servers, and often you'll look at a given vendor and there will be different tiers of SKU. In HPC, typically what I've seen is that you don't deploy the top-tier enterprise big-iron server. You're looking for where the economies of scale are, and these days, those tend to be the servers that are targeted at the hyperscalers, at cloud scale. So what's pretty typical these days is a form factor with four nodes crammed into a 2U chassis. Pretty much every major hardware manufacturer has those, and I don't think anyone would consider that the best server in a given manufacturer's product line, but it is the most cost effective, both from a power and cooling perspective and a literal cost perspective. And that's where most HPC, the general middle of HPC, tends to target, in my experience.
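As a back-of-the-envelope illustration of the bandwidth-per-core check Gary describes above (the hardware figures below are made up for the sketch, not vendor numbers):

```python
# Gary's rule of thumb, sketched with hypothetical numbers: divide a
# node's aggregate memory bandwidth by its core count to see what each
# core actually gets. A "slower" CPU with fewer cores can come out ahead.
candidates = {
    # name: (cores per node, memory bandwidth in GB/s), all made up
    "top-bin CPU, 96 cores":     (96, 460.0),
    "mid-bin CPU, 64 cores":     (64, 460.0),
    "half-populated DIMM slots": (64, 230.0),  # half the channels active
}
for name, (cores, bw) in candidates.items():
    print(f"{name}: {bw / cores:.1f} GB/s per core")
```

The half-populated row anticipates the DIMM trade-off Gary returns to later in the Q&A: fewer populated slots means fewer active memory channels and less bandwidth per core.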
Gary Jung:
Yeah. Yeah. One thing I will say is that when you're designing infrastructure, sometimes it does make sense to purchase at the higher end for the infrastructure, because the infrastructure tends to outlast the compute nodes. So you have to think about making sure that it's going to work not only for this generation of processors, but maybe the next two generations of processors as you add them into the system.
Rose Stein:
Hmm. He said a few years, so.
Jonathon Anderson:
Yeah, the last question had a bit about how you plan for what will be available or required in the next few years, and we talked a little bit about that. There are historical trends in terms of what's available, and you maintain pretty close relationships with your vendors and their suppliers to understand what's coming down the pike.
How Long Does HPC Infrastructure Last?
Rose Stein:
But Gary, did you say that the infrastructure would outlive the nodes? Is that what you said?
Gary Jung:
Yeah.
Rose Stein:
What does that mean?
Gary Jung:
Usually, in a lot of settings, but this is especially true in an academic research environment, people have a limited amount of budget to buy a system, and so they will buy it with the intention that they will buy more when they get more money. And so if you're going to do that and scale it out, it's going to last for a long time; you're going to use it for years, and you're not planning to retire the whole system. You'll have an infrastructure, like an InfiniBand backbone and storage, that's going to go across your compute. For those things, it's worthwhile paying for higher performance, because for one, it's going to have to scale, and the other thing is, it's going to have to last more than the life of the original compute nodes. Otherwise it'll be really expensive to buy compute nodes and infrastructure every five, six years, or whatever lifespan you use for your compute.
Rose Stein:
So is that typically how long a compute node lasts, five or six years?
Gary Jung:
Yeah.
Jonathon Anderson:
That tends to be the longest that manufacturers will warranty and service them for without special exception.
How Do You Identify Bottlenecks That Affect Scalability?
Rose Stein:
Cool. All right. We're ready for George's question. Thank you for being patient, George. Okay. How do you identify the bottlenecks that affect scalability?
Jonathon Anderson:
Run an example application and observe its behavior, really, and run it at different scales and see at what point that performance curve tops out. If you run it on twice as many nodes, do you get twice as much performance? If not, which things are filling up? Gary, what's your experience here?
Gary Jung:
Yeah, that's the way we do it, and we monitor it. You just have to have a lot of monitoring in place so you can see what is running out of resources.
Jonathon Anderson:
Do you do that from a systems perspective typically, or do you do it with application-level profiling?
Gary Jung:
We do it at a systems level, not at the application level, unless we're having a problem with something specific; then we'll start digging in.
Jonathon Anderson:
I feel like I'm always looking for better tools on this front, and I've never been super happy with my process here. It always feels a little bit up in the air, just a research project every time.
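A minimal sketch of the scaling test described above: run the same job at several node counts, then compute speedup and parallel efficiency from the wall times (the timings here are hypothetical):

```python
# Strong-scaling check: compare measured wall times against the ideal
# of linear speedup. Timings are hypothetical, for illustration only.
timings = {1: 3600.0, 2: 1850.0, 4: 980.0, 8: 560.0, 16: 390.0}  # seconds

base = timings[1]
for nodes, t in sorted(timings.items()):
    speedup = base / t
    efficiency = speedup / nodes
    print(f"{nodes:2d} nodes: {speedup:5.2f}x speedup, "
          f"{efficiency:5.1%} efficiency")
# Efficiency falling well below ~70-80% (here, by 16 nodes) is the
# signal to go look for the bottleneck: interconnect, I/O, or memory.
```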
Can You Automate Bottleneck Discovery?
Rose Stein:
Hmm. Is that something that you can automate as well?
Jonathon Anderson:
The collection of it? Yes. But the determination of it tends to be a manual process. At least that's my experience.
How Do You Account For Cost Of Storage As Data Grows?
Rose Stein:
Yeah. Okay. Oh, Dave Dickerson, what's up, man? How do you account for tiers of storage, offline storage, tape, or low-cost S3 storage providers as data grows? What has been the best-case realistic scenario that you've seen?
Jonathon Anderson:
The best thing that I've seen that incorporated tape as a data archive for users was one where that process was transparent, because people are bad at managing their data between different tiers, in my experience. There are some exceptions to that where it's very explicit: you're working on a project, and that project is done, but the data must be retained, so the project gets archived off. If you have really strong policies around that kind of thing, you can do that. But historically, what I've seen is that tiered storage really only works when it happens in the background and the user doesn't have to think about it. That works from a cost perspective, where less recently used data goes off to tape and people don't have to worry about it but can get it back transparently, or from a performance standpoint, where the application just uses the data in a certain way and the system decides, perhaps I should move this to fast storage, and does that for you as well. A certain person will take advantage of the resources that you give them very specifically and intentionally, but for the general case, it needs to be something that's automated.
How Do You Know What Resources An Application Needs?
Rose Stein:
Sylvie, hey! Hi. I've been looking forward to a webinar about this. Yay, perfect topic. How do you know if an application needs a fast, low-latency interconnect like InfiniBand, or more memory, or more cores? This is a good question. I think, Gary, you know the answer to this, because you've probably come up against it.
Gary Jung:
Yeah, but I could probably use help from other people too. I mean, you have to just run benchmarks and test it. Usually, if it's a core or memory thing, that's something you can figure out on a single-node run. This is a very simplistic answer, but if you're running multi-node and there's a lot of communication, that's going to show up in the benchmark when it doesn't perform as well. And it could be due to the fact that you have to go outside the node, and that's when a low-latency interconnect will make a big difference. But it's really a matter of benchmarking with the code. What's usually nice is if you can run the code somewhere where you know it runs well, and then you can try to see where the performance differences are between what you currently have and where it runs well.
Jonathon Anderson:
Yeah, needing more memory is relatively easy to discern. But Gary, you mentioned earlier that, maybe more recently, you've seen memory bandwidth being a bottleneck. How do you discern that when you're benchmarking a code? What are you looking for there?
Gary Jung:
People can try running on different nodes with a different number of cores, different speeds, and different memory bandwidth. For example, we were able to figure one out because we had two different nodes with different memory bandwidth. What we would do, and this is thinking about scalability, is we're thinking: okay, we don't have enough money, and we want to buy as many compute nodes as we can. Maybe one way we can do that is to buy nodes with the memory half populated, using larger DIMMs. So the DIMMs are going to be twice as big to get the same amount of memory, but we won't populate all the DIMM slots.
But by not populating all the DIMM slots, we're not going to get the same memory bandwidth out of that node. And so we had somebody that said, oh, my code seems to run slower on your new nodes than the older ones. And it was because the memory bandwidth was essentially about 60% of what you could get out of a node with all the DIMM slots populated. So what we did was populate the DIMM slots, which bumped up the memory bandwidth, and then the person reran the code and said, oh, that runs fine now. But it was one of these trade-offs, where we wanted to essentially scale out as many nodes as we could, with the possibility of adding more memory later. And then someone noticed, oh, that doesn't perform as well on their particular code, until we actually got to the full configuration.
Rose Stein:
Awesome. Great answers, you guys. Thank you. All right. I have been notified we've got three questions left, so let's do it. Let's get all these questions answered. Okay. Oh, I'm going to do my best to pronounce your name. Peter?
Gary Jung:
Peter.
What Tools Are Used To Build An HPC Cluster And Scale Up?
Rose Stein:
Peter. I like it. Okay, good. What tools are used to build a small HPC cluster and scale up?
Jonathon Anderson:
In various capacities, we would point you to Warewulf for that in most cases. Yeah, there's a lot of history behind the project, a lot of experience in building it up. We think it's a pretty good example of best practices right now.
Rose Stein:
Yeah, I love that. Maybe a little Apptainer in there as well.
Jonathon Anderson:
Yeah, but that's not so much for building the environment; that's how you would then run things on it, so, absolutely. In terms of how to build up that environment, start with Warewulf. You can run things without a containerization solution, but there are a lot of benefits to starting out of the gate with containerized applications, for sure.
Does The Use Of Containers Affect Scalability?
Rose Stein:
Yeah. Awesome. Okay. Sylvie, great. Also, how does the use of containers affect scalability?
Jonathon Anderson:
So the only thing I'd say here is there are some challenges to be overcome. Again, it's more of a cultural issue around doing distributed memory computing in containers. We've had a lot of success with that recently, but there's a documentation issue of getting that information out there and helping people to run MPI applications with Apptainer. That's something that's made easier in Fuzzball, but it is possible even without it. But from a performance perspective, there shouldn't be anything. Gary, have you seen any limitations or concerns with trying to scale containerized applications out, through any overhead or anything like that, or other issues that you've had?
Gary Jung:
No, you know, we've had people move to more Python-based codes, and with the initialization of all the files and everything, that was hitting our file system really hard. For a while we were telling people they should containerize, because then all the files are contained in the container. And that actually eased up the load on our I/O system quite a bit. So it allowed us to scale up more by having people move to containers.
Is Disk I/O Less Of A Limitation Today?
Rose Stein:
Awesome. Oh, mb, what's up? Are you seeing disk I/O being less of a limitation these days, with the flash disk arrays and these faster interconnects?
Jonathon Anderson:
Gary, have you got Flash I/O at your site?
Gary Jung:
We put in a VAST system; we put in a few of these, and they're all flash, and they work really well. It does solve a lot of problems. I don't think that just the fact that it's flash is by itself necessarily the thing that makes the biggest difference; the system matters also, because we've had other flash systems, and it's a combination of things for the I/O that makes it faster. Now everybody's interested in doing these large language models, and they're going to require a lot of I/O, and the path between the I/O and the interconnect is all going to be really critical. So not just the flash itself: you're going to want to look at the system, and the interconnect supporting things like GPUDirect. Those are all going to be really critical coming up.
Jonathon Anderson:
Yeah.
What Can You Use For Data Collection And Monitoring At Scale?
Rose Stein:
Awesome. All right. I think we're pretty much up. Oh, still one more. Okay. What do others suggest for data collection and monitoring at scale?
Jonathon Anderson:
I don't know. I've never been happy with my monitoring solution.
Rose Stein:
Brian, you got a good one?
Jonathon Anderson:
What do you guys use?
Gary Jung:
Datadog, I guess, was one that I've used in the past. Yeah, that's just from my experience.
Rose Stein:
Okay. Gary, so you keep saying get the data. How?
Gary Jung:
Yeah. Grafana, Prometheus, yeah. But probably more important is that you do something with the data, rather than just gathering it. Okay? A lot of people collect lots of data and they don't do anything with it. They never look at it. So make sure you do something with it.
Rose Stein:
Do something. So do those programs that you were just naming analyze it at all for you as well? Or do they just gather it for you?
Gary Jung:
It's mostly gathering. You have to go back and look at it, and that's where the real work is, I think.
Jonathon Anderson:
Grafana gives you the tools to look at it, but it's still a human process to derive meaning from all that data that you collected.
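For a concrete picture of that last step, here is a minimal sketch of pulling collected metrics back out for analysis via the Prometheus HTTP query API (the server URL and the idle threshold are assumptions for the example, and it presumes standard node_exporter metrics are being scraped):

```python
# Query Prometheus for per-node CPU usage and flag nodes that look
# idle: doing something with the data, per Gary's point, rather than
# just filing it away.
import requests

PROM_URL = "http://monitor.example.com:9090"  # hypothetical server
# Busy (non-idle) CPU cores per node, averaged over 5 minutes.
query = 'sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    instance = series["metric"]["instance"]
    busy_cores = float(series["value"][1])
    if busy_cores < 0.05:  # arbitrary threshold for "nearly idle"
        print(f"{instance}: nearly idle ({busy_cores:.2f} cores busy)")
```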
Gary Jung:
Yeah.
Rose Stein:
Yeah.
Gary Jung:
It doesn't do any good if you just gather it all up, file it away on some disk, and nobody looks at it. Might as well not do it. Yeah.
Rose Stein:
Yeah. I think that's what I do with my budget.
Jonathon Anderson:
The thing to say here is to take the problem seriously. As your system scales up, so does the data generated from it, just as an operational concern. And you have to treat that with the same weight and the same value that you do the applications running on the system. There's work to do there.
Rose Stein:
There is work to be done, and as Jonathon was saying earlier, we are happy to help you, right?
Jonathon Anderson:
People at CIQ with better monitoring ideas than I have can help you.
Rose Stein:
Not Jonathon.
Jonathon Anderson:
And I'll be learning right along there with you.
Rose Stein:
Okay, awesome. So yeah, if you go to our website, CIQ.com, you can get lots of information. It's easy to contact us; just send a little note through, and I will get it, and I will reach out, and we'll figure out whatever it is that you need and how we can help. A lot of the stuff that we talked about today is open source, so you can go play with it. Warewulf is open source; you can access it through our website, or you can Google search it. So is Apptainer, and same thing, we provide the professional support behind that. We talked a little bit about Fuzzball today, so we got some little ears tingling, like, what is this? So definitely reach out to us. We would love to see you, help you, support you, and hear what you're working on. And if you have ideas about future webinars, if there's a topic you really want us to address, let us know in the comments; we read all of them. So like, subscribe, share with your friends. We love you so much. Thank you so much for being here. Have a great day. Bye.