CIQ

Warewulf: Deep Dive, Use Cases, and Examples

July 7, 2022

Webinar Synopsis:

Speakers:

  • Jonathon Anderson, HPC Engineer at CIQ

  • Dave Godlove, Solutions Architect at CIQ

  • Zane Hamilton, Director of Sales Engineering at CIQ

  • Gregory Kurtzer, Founder of Rocky Linux, Singularity/Apptainer, Warewulf, CentOS, and CEO of CIQ


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Full Webinar Transcript:

Zane Hamilton:

Good morning. Good afternoon. Good evening. And welcome back to another webinar with CIQ. We appreciate you joining this week. Today we're going to talk about Warewulf. We have a new release of Warewulf that just came out and we wanted to talk about some use cases, do a little bit of a deep dive. So we have Jonathon and Dave here today. Welcome.

Jonathon Anderson:

Thanks.

Zane Hamilton:

I know most people have seen you guys on here quite often, but why don't you go ahead and just reintroduce yourselves. Start with you, Jonathon.

Jonathon Anderson:

Yeah. Thanks, Dave. My name's Jonathon Anderson. I'm an HPC engineer with CIQ and have a background in HPC and cluster sysadmin.

Zane Hamilton:

Dave.

Dave Godlove:

Hey everybody. I'm a solutions architect at CIQ. My background is first in science, neuroscience at the National Institutes of Health, and after that HPC admin, also at the National Institutes of Health. I've also got some background with Apptainer.

Zane Hamilton:

Excellent. Thank you, Dave. And I think we also have Greg. There he is. Welcome Greg. Would you like to introduce yourself?

Dave Godlove:

The special surprise guest.

Zane Hamilton:

Surprise guest.

Gregory Kurtzer:

I'm not sure how special or surprising.

Zane Hamilton:

Call it a cameo, whatever you want to call it.

Gregory Kurtzer:

Hi everybody. I'm Greg.

Zane Hamilton:

Welcome. Thanks for joining Greg. So Jonathon, I know we've spent some time talking about this and you have some pretty interesting stuff to show us. We got Warewulf 4.3 that's just been released, which is really exciting. I know there's some stuff in there that is going to be new for those Warewulf users on version three. So why don't we dive right into it and give us a little bit of an overview of what Warewulf 4 is.

What is Warewulf? [1:39]

Jonathon Anderson:

Yeah, great. And as I go through an overview of any of the stuff that we're going to show today, Greg, feel free to jump in and correct anything that I say. I'm still relatively new to Warewulf, but I'm really liking it as I get familiar with it, especially the latest version, version 4. And I've been playing around with 4.3. So there's some new stuff there as well that I will highlight. So, if you haven't used it before, Warewulf is a stateless cluster management and provisioning system. At its lowest level, it manages and automates the standard suite of Unix stateless provisioning services that you might use. So it does DHCP for you. It does NFS. It'll manage your SSH keys for your host keys and your administrative login. And then TFTP for initializing your PXE remote stateless boot.

It provides a set of commands for managing stateless images, either one for your whole cluster, or you can target different images for different groups of nodes, or single nodes, and then has a system of overlays that allow you to customize the configuration that each node receives and then periodically update that over the life of the node. Some things that, to me, differentiate version four from what I understand of version three–and I was using what was called Perceus before… I think that that was effectively Warewulf 2. Is that the right way to think of it, Greg? Yes or no?

Gregory Kurtzer:

Kind of. Perceus was a–I wouldn't even call it a fork–it was another provisioning system, which we developed originally when IBM asked if we could create a different provisioning system for use with xCAT. So we started developing Perceus under a different license, a non-GPL license. So Perceus existed for a little while. IBM ended up doing their own stateless provisioning inside of xCAT, and we then just rolled that right back into Warewulf.

Jonathon Anderson:

Fair enough. So yeah, things that are new in Warewulf 4, where previous iterations of Warewulf and its like were based around this standard idiom of a chroot that you build up from a base image, and then you log into, you chroot into it through commands and then modify it to be what you want. And then when you exit out of it, it'll build a little image file and ship that off to nodes when they boot. Warewulf turns its head a little bit sideways in version four and says, wait a minute, that looks a lot like a container, and rebases that mindset onto the existing ecosystem of OCI containers. On disk, it's not that much different, but it lets you take advantage of a lot of the developments that have happened in that space since then. Warewulf 4 also does all of its configuration, I think, as static YAML files on the file system, as opposed to a database–this is my preference. So it's a development I'm glad to see. But it also makes it just a little bit more transparent what's going on. And if you want to script and edit those configuration files directly, you can. We'll show that a little bit, not editing them directly, but where the data goes.
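For readers who want to see what that looks like in practice, importing an OCI image as a node image and finding the flat-file configuration is roughly the following; the image path and tag are illustrative (the demo later uses the warewulf/rocky images on Docker Hub), so substitute whatever bootable base image you use.

```bash
# Pull a bootable OCI image from a registry into Warewulf as a node image
# (image path/tag are illustrative).
sudo wwctl container import docker://warewulf/rocky:8 rocky-8

# List the node images Warewulf now knows about
sudo wwctl container list

# All of Warewulf's own state is plain YAML on the file system, not a database
ls /etc/warewulf/
```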

One thing that people have been asking about, and maybe we should leave it for later, but we'll acknowledge it: if you have an existing Warewulf 3 environment, there isn't really a clean upgrade path from Warewulf 3 to Warewulf 4, just because they are so different. My personal recommendation is to deploy another Warewulf alongside it. And at that point, since everything is stateless, moving things from one to the other should be relatively straightforward. That's something that we'd love to work with you on, if that is an environment that you're in, something that is a concern for you. The action is not that big of a deal, but trying to script it in a way that is correct for all people is a different matter. That is an administrative action, at least for now, but shouldn't be too onerous once you get familiar with Warewulf 4, because in my opinion, from everything that I've seen, it's been very easy to get up and running with, and it just works the way you expect it to, which has been nice.

Zane Hamilton:

Very nice. So I guess I'll ask this question to Greg, since he loves this topic of conversation. Are there things that we should take into consideration as we go through an upgrade from three to four, and what are those things? What should we be looking at? Oh, I'll stop. I'll ask the other parts of that question in a minute.

What to Expect with Upgrades [6:26]

Gregory Kurtzer:

Yes, absolutely. Going from Warewulf 3 has a very different… well, let me back up. To provision a system, it's more than just blasting a base operating system onto a bunch of compute nodes. You actually have to manage some amount of configurations and files that differ between nodes. So for example, network configuration is an obvious one, but also service configurations and whatnot. So Warewulf, in addition to being a provisioning system, also has a very rudimentary but specific type of configuration management system associated with it as well. Now, couldn't we just use Puppet or Ansible or something else there? And the answer's generally yes, but there's an exception, which is when we provision a system, some of these configurations actually have to be written to the base operating system. They have to be written as a configuration before init is called.

So this means we have a chicken and egg problem: before we actually are provisioning and actually booting the system, we already have to have it configured, which means Ansible and Puppet and Chef and other things would actually come later, after the system has been booted. So we have a little bit of a race condition we have to deal with. So we have to be able to do some amount of configuration management initially before the system boots or as it's being provisioned. So Warewulf does this between version three and version four differently, and that's something that would have to be considered. So if you are currently using version three and you are provisioning out a lot of files that are custom, for each one of your nodes booting, that system is a little different in three versus four. In four, we use something called overlays and in three they're just individual files within the database, within the data store. As you're going from version three to version four, you do need to consider what that infrastructure looks like, what those configurations look like, and move those over. And there's not an automated way of doing that, at least not today. Now the second piece of this, or rather two pieces, is the operating system layer and the kernel. Both of those are generally pretty easy to go from three to four. So files, provisioning of files, is the tricky part.

Zane Hamilton:

So is that something that, whenever we talk about tricky and difficult, is that something that we would consider probably the most difficult part of this process? How do you deal with that?

Difficulties with Updating [9:07]

Gregory Kurtzer:

It just depends on how much people have relied on that. I'm going to pick on my previous employer, Berkeley Lab. They have used Warewulf version three to provision out, at some point, I think the number was 30 separate clusters, sub-clusters under one giant management paradigm. So you can imagine with 30 separate clusters, how many separate files they had to manage and whatnot. And I think at some point it was well over a hundred separate files they were managing with Warewulf and provisioning. Converting that to Warewulf 4 is going to be a lot more difficult than if it was more of just a base install. A base install is actually going to be really easy to convert over, but it really depends on how much customization you've put into the system.

Zane Hamilton:

Excellent. Thank you. So I know, Jonathon, you wanted to give a shout out real quick before we went too much further.

Jonathon Anderson:

Yeah. So we're going to be doing some demoing today. We've got a Warewulf install that we're using for some internal testing. But I wanted to acknowledge and thank that the hardware that we're using today is something that's been provided to us by Dell, at their Dell HPC and AI Innovation Lab. We're doing some work together to demonstrate and evaluate a bunch of CIQ projects and efforts there, but one of them is some Warewulf and OpenHPC work. And so we'll be using that today. So I just wanted to say thank you to Dell for providing the hardware that we're using.

Zane Hamilton:

Great. Thank you, Dell. Yeah. I'll let you dive into it. 

Explaining Warewulf Usage [10:52]

Jonathon Anderson:

Okay. So we'll start from here. I’m just on a head node here; what we have is a head node that we have Warewulf installed on, and then it's on a private network with four compute nodes. I've torn out part of the cluster for now, because we're going to rebuild it as part of the demo. You won't see four nodes right up front, but we'll get started with this. So Warewulf's base configuration file is where you would, I think, expect it, at this warewulf.conf file. And with this, we can see those initial services that I mentioned at the front. So we have some really initial stuff about Warewulf itself and the network that it's on, but then we can see that it has stanzas for each of the primary services that it manages.

It can optionally run a DHCP server, we see it's enabled here, and then we tell it what IP address range to use. One thing that I didn't understand when I first came to Warewulf, because this is different than the last cluster that I worked at, is that DHCP is really only used for that initial node provisioning and node, I say discovery, but discovery means something specific in Warewulf, when it's first finding a node and telling it what config it should have. The idea is right now, as I have this, it can provision up to 10 nodes simultaneously, but once they're booted, those 10 slots become free for more nodes to come up. So, this is how many nodes that you can be booting at one time. We run a TFTP service for handing off PXE, and then we also run NFS.

This is probably what I would consider the most optional part of this. The Warewulf server here is acting also as an NFS server. And so the home file system and the opt file system here are being exported out to the compute nodes. And because it knows that it's serving those out, it can also optionally mount them automatically on compute nodes. All of that could be handled in a different way, if you were running an NFS server some other place, and for most clusters, I expect that would be the case. You wouldn't want all of your NFS to be dependent on this one system, but I expect for many clusters and many different sizes, this would be great. And it's certainly a good way to get started. All of these services are configured with... so I'm going to go ahead and become root here so I don't have to sudo everything. All Warewulf operations go through this wwctl command. The configure command manages these services that Warewulf delegates responsibility out to. So, it's not an NFS server itself. It's not a TFTP server itself, but it understands that these services exist and can generate the configs for them and manage those service restarts, depending on how we've configured warewulf.conf. So you can update the config for each of those services individually, or you can just do all of them, which is what I would tend to do. So pretty quick, low impact operation. The biggest one is probably restarting NFS if you've changed exports and that kind of thing. But again, NFS, it tends to be stateless, so that should be a relatively low impact thing, even if you've got a cluster going in most instances.
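As a rough sketch of what that warewulf.conf stanza layout and the configure step look like; the key names follow the Warewulf 4.3 layout but may differ slightly between releases, so treat the excerpt as illustrative rather than authoritative.

```bash
# Excerpt of /etc/warewulf/warewulf.conf (illustrative values and key names)
cat /etc/warewulf/warewulf.conf
# ipaddr: 10.0.0.1
# netmask: 255.255.255.0
# dhcp:
#   enabled: true
#   range start: 10.0.0.100   # size of this range = nodes that can PXE at once
#   range end: 10.0.0.110
# tftp:
#   enabled: true
# nfs:
#   enabled: true
#   export paths:
#     - path: /home
#       export options: rw,sync
#       mount: true
#     - path: /opt
#       export options: ro,sync,no_root_squash
#       mount: true

# Regenerate configs for the delegated services (DHCP, TFTP, NFS, SSH keys)
# and restart them as needed
sudo wwctl configure --all
```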

Zane Hamilton:

Jon, does that NFS, is it version dependent? Does it have to be three, have to be four? Does it matter?

Is Warewulf NFS Version Dependent? [15:07]

Jonathon Anderson:

I think, and right now the one that I'm using is just exporting using the default settings for the operating system; you could specify what version to export using the standard NFS semantics here and these export options. I think by default, it's exporting both. And then, on your mount options, you then decide whether you're mounting with version three or version four, whatever your client will do by default.

Zane Hamilton:

Thank you.

Jonathon Anderson:

The next thing and the most obvious thing is the nodes that you are managing with your Warewulf cluster. So we can see the list of nodes that are currently there with node list. I have three nodes here because I removed the fourth one so that we can add it back in for the purposes of our demo here. But, like I mentioned, all of these commands are really just touching config files and editing them. We can also see those nodes in /etc/warewulf/nodes.conf, which is just the YAML file that shows our nodes and what config exists for each of them. We see that they have an IP address specified and a MAC address for DHCP, and nodes one through three are here. We can add...
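A minimal sketch of those two views of the same data; node names and addresses here are from the demo environment, and the exact YAML key names can vary slightly by Warewulf release.

```bash
# List the configured nodes, and show all attributes for one of them
sudo wwctl node list
sudo wwctl node list -a c1

# The same data lives in plain YAML on the head node
cat /etc/warewulf/nodes.conf
# nodes:
#   c1:
#     network devices:
#       default:
#         hwaddr: aa:bb:cc:dd:ee:01
#         ipaddr: 10.0.0.11
```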

Zane Hamilton:

We saw one time last week, Jonathon, I don't know if it really matters right now, but the hardware address, the MAC address, is case sensitive.

Jonathon Anderson:

Oh yeah. I think that's actually fixed in version 4.3, but I don't know for a fact. Now, we recently had a customer issue come up where they had specified the MAC addresses as all caps and the DHCP server wasn't noticing that. I think that in the latest version of Warewulf, it lowercases this when you insert it. These are correct because I used discovery to populate them. I didn't add them in myself. So yes, a good thing to be aware of, but hopefully also not needing to be aware of anymore. So, the next thing to do is to add our fourth compute node that I happen to know exists, and we'll go ahead and set some basic things about it. This is setting an IPMI address that is already configured on the node. Actually Dell configured those for us before they handed them to us, so that was nice to just go ahead and have those already, and then we're specifying what IP address we want it to have. It's just the next one in the range. We're going to mark it discoverable, again because I dislike typing in MAC addresses, and we'll show off that feature. And then this is C4, and it's asking us if we're sure and we are. So now if we say node list, we'll see that C4 is there. And if we look at our nodes.conf, we see that we have a C4 node here now, but because we did not specify it, we don't have a hardware address like we did before, but we do see that it's discoverable. And so that means that when Warewulf sees a node come up and ask for a configuration, when it sees something come up over DHCP and try to PXE, if it's not a node that it knows exists, that is, it will look for the next node in the config that is marked discoverable and configure it as that node.

We'll go ahead and turn that node on. Because we have IPMI configured for this node already, one of the things that I have not mentioned is that Warewulf also provides a front end to IPMI. So we can just turn this node on. So, I guess this is going to take a second...
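The sequence Jonathon just walked through maps to commands roughly like these; the IP and IPMI addresses are made up for illustration, and flag spellings can vary slightly between Warewulf 4 releases.

```bash
# Add a fourth compute node and mark it discoverable so Warewulf fills in
# the MAC address the first time the node PXE boots
sudo wwctl node add c4 --ipaddr 10.0.0.14 --ipmiaddr 10.0.1.14 --discoverable

# Warewulf also fronts IPMI, so the node can be powered on from here
sudo wwctl power on c4

# Watch nodes check in and pick up the latest overlays
sudo wwctl node status
```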

Zane Hamilton:

Trying to decide if it's actually on or not...

Jonathon Anderson:

It's not unusual. There we go. So now it's on, and we can also see–we're not going to watch this because it can take a while–we'll go ahead with the demo, but we can get the status of our nodes. I mentioned earlier that nodes can be updated periodically based on what's in their overlays, and we'll go into that in more detail in a moment. We see that C1 through C3 have been seen within the last minute, and they got the most recent version of their overlay, but C4 has not been seen yet ever because we just created it. And so a thing that I will not infrequently do is watch this and wait for it to come up. But these are recent AMD CPU nodes, and they take a little while to boot. We're going to leave these behind for now, and look at other things while we wait for that node to boot.

Something else that we can see with that node list command is the properties of it. So that same command that we use to get the list of nodes can also see things that are defined for the node. And one thing that's different here from what we saw in the YAML is we see things that are defined that weren't in that YAML, like the cluster has been defined here and, let's see some other things. What type of IPMI interface we have, or the username for IPMI, those are defined here, and where those are coming from is this default profile that's been defined for our cluster. The profiles are like a genericized or templated node, and we can list them very similarly to nodes. 

We will see which ones…We have two, I'll go into the Fuzzball one in a moment, but we can also list the attributes for our profile similarly to nodes. With profile list -a, you can look at our default profile and with this, you can group nodes into sets of common configuration. Like I said, we have the same netmask for all of these nodes, because they're in the same network and the same IPMI interface type and username. We can override this local to the node, but for things that are the same, we can put them here. We also have the container that these nodes will be using to boot as their image, but first I want to go into the idea of these overlays, and this is what Greg was talking about earlier as Warewulf's configuration management system. The system overlay is the collection of things that get overlaid onto the node before wwinit runs, so during boot, and then the runtime overlay is what will be applied to the node periodically. I think every 60 seconds is how often the nodes will check in. This is done as efficiently as possible. So these are small to begin with. It's usually just a relatively small collection of text files, but then the set of overlays that are for each node are compiled together as a single tar.gz, so that it's as efficient to transmit across the wire to a collection of nodes as possible. So we can see those overlays and the one that I've most recently made is the Slurm one, because this is a Slurm cluster.

We can look at the overlay list, the overlays that are defined. We have a few here for setting up NTP, for setting up Slurm, and wwinit, which sets up Warewulf itself and comes with the system, but then we can also list the files that are in our overlay. And so, if you do Slurm management, it's what you'd expect. We have a slurm.conf and a munge.key for authentication between nodes in the Slurm cluster. This is pretty straightforward from a Slurm admin perspective, if you've done that. You just dump these files in here and they get populated to the rest of the nodes in your cluster. We can also edit them directly here.
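For reference, the overlay workflow being described corresponds to commands along these lines; the slurm overlay name and file paths are from this demo.

```bash
# List overlays and the files inside one of them
sudo wwctl overlay list -a
sudo wwctl overlay list -a slurm

# Drop cluster-wide config files into an overlay
sudo wwctl overlay import slurm /etc/slurm/slurm.conf etc/slurm/slurm.conf
sudo wwctl overlay import slurm /etc/munge/munge.key etc/munge/munge.key

# Edit a file in place, then rebuild the compiled per-node overlay images
sudo wwctl overlay edit slurm etc/slurm/slurm.conf
sudo wwctl overlay build
```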

Zane Hamilton:

What you're doing now leads into another question that Mark asked, real quick, Jonathon. It said, "Can or should an admin actually touch that node.conf file or should they use the Warewulf control command to configure those nodes?" Does it matter?

Configuring Nodes in Warewulf [23:57]

Jonathon Anderson:

It matters in the sense that if you touch it directly there's a chance of malformation. Maybe you do it wrong and you put data in there that isn't what Warewulf would expect. I think that there's a Warewulf command for parsing the config file and seeing that it's in a good state, but I haven't used it if it does exist. Let's see real quick. I don't see it right now, so that's a danger, but that's always a danger when you modify something by hand. The other thing to know is that Warewulf will take care of some operations on the back end when you use its commands to change things. I mentioned that it will compile the overlays when there's been a modification. If you add a node, it will build the overlay like that compiled version of the overlay for you when you add it. But if you edit it by hand, then it won't know to do that. You have to keep that in mind and do those things yourself. So you can say, for example, “wwctl overlay build,” and it will build all of the overlays for the nodes that you have. You can do that manually if you've made changes that Warewulf doesn't know about, but if you do it through the command, then it knows. And so it knows to do those things.

Zane Hamilton:

So technically you can. It just might not be the best idea.

Dave Godlove:

Quick follow up on that, I guess, either for you, Jonathon, or for Greg. Is there any functionality that is not exposed through the CLI that you would need to get into the conf files and edit directly?

Jonathon Anderson:

Not for the conf files that I've seen. There are some things that are certainly more straightforward to do with overlays by editing the files directly, but that's more of a stylistic question of whether we want to try and support absolutely everything that you might want to do to an overlay through that command line or whether we should be supporting doing those things directly on the file system. So overlays, for example, are in /var/lib/warewulf/overlays and they're just directories on the file system. And so there's that same Slurm overlay right there. You can adjust this as you would expect, but where it gets hairy is UID and GID mapping, because the UID and GID in your container and on this head node and on the node that boots, you want to try and keep those in sync. So understanding what UID and GID your overlays should have is a thing you have to keep in mind when you're creating them. You don't just want them to be root or any random user that you happen to be on the head node.

Dave Godlove:

And you mentioned compiling these, and maybe I missed it, but do these bare directories end up being converted into an image format?

Warewulf Directory Conversions [27:02]

Jonathon Anderson:

It's just a tar, or I guess it says image. I was thinking it was a tar, maybe it's something else, but it's ultimately just a file system of all of your overlays squashed together and then gzipped for efficient transfer over the wire. Here we can see that my overlays have been compiled, I'm using that word, but they've been packaged up and turned into an image for my C1 node and in the sets of kinds of things that they might want to be. Some of these are left over from previous configs. These are the different sets of overlays that get applied to this node as a collection and in this order.

Dave Godlove:

And then I guess it's really just OverlayFS that handles that on the nodes.

Jonathon Anderson:

I don't think so. I think it just copies the files. Am I wrong about that, Greg? Or is it using something like OverlayFS to bring these in or is it just copying them in?

Gregory Kurtzer:

So it's actually layering initial RAM disks; that's how we're bringing it in. So if you look at /etc/warewulf/ipxe, and I think it's just a default iPXE file, you'll see how each one of these different layers is basically just coming in as an initial RAM disk.

Dave Godlove:

So this is strictly additive. You can't use this to delete files. Okay.

Jonathon Anderson:

Yeah. That's correct.

Gregory Kurtzer:

That's actually an interesting question, because they are cpio archives, and as they layer, you could remove an entire directory tree by replacing it with a file that's empty. I wouldn't suggest it, but you could totally break things that way. As a matter of fact, the amount of times in debugging that I have broken the system because, in my overlay, I put a binary at /bin/foo. Think of a shell script at /bin/foo. Well, Jonathon, if you don't mind real quick, do an “ls -l /”? Yeah, there you go. You'll notice /bin is not actually a file or directory. It's a link. So by doing that, I completely nuked that link. I basically replaced that link with a directory that had one file in it, and I broke all my paths and nothing booted. Nothing worked. The number of times that I've done that, and the number of times you'd think I would be quicker at debugging that, and not pull my hair out for hours at a time, trying to figure out what just went wrong and why my system isn't booting, you'd think I'd learn faster than that.

Jonathon Anderson:

That is really interesting to think about, especially knowing that it is not doing what I thought it was doing in copying files in, knowing it's overlaying something. That makes sense now in retrospect, but as you've described, it would be a very easy thing to not think about.

Dave Godlove:

Good to know how the overlays actually work. Yeah.

Using Slurm with Config [30:08]

Jonathon Anderson:

I mentioned my basic slurm.conf here. This is what's in my Slurm overlay. Honestly, SchedMD has a slurm.conf generator on their website where you fill in your cluster bits; just so I didn't have to think about it, I did that and then dumped that config in here. So you can do something as simple as that. If you want to do a little bit more complicated things, the Warewulf overlay system also supports templates. If we look at the generic overlay, which came with the system, we see several files in here that have this .ww suffix, and those are our Warewulf template files, and we can see what's in them, and they have some simple templating semantics in here. So here, this etc/passwd is going to include the etc/passwd that came with the container and then also the etc/passwd that was on the head node. Something else that you might encounter is the hosts file, which will be automatically populated based on the nodes that Warewulf knows about, and it uses the same system actually to configure the head node itself, as nodes are added to the system, so that you can manage nodes’ presence in all of the hosts files everywhere without necessarily needing a fully featured DNS system for all of that.
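As a purely illustrative sketch of the kind of template being described: the authoritative examples are the .ww files shipped in the generic and wwinit overlays, and the template functions and variables shown here are assumptions that vary by Warewulf version, so check your installed overlays rather than copying this.

```bash
# View a shipped template; .ww files are Go text/template files
sudo wwctl overlay show generic etc/hosts.ww

# Hypothetical passwd.ww along the lines Jonathon describes: merge the
# container's passwd with the head node's. Function names are illustrative.
# {{/* etc/passwd.ww */}}
# {{ IncludeFrom $.Container "/etc/passwd" }}
# {{ Include "/etc/passwd" }}
```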

So we can go back to our node that we are booting and see that it is now running. We saw it 23 seconds ago, and presumably several times before that, and it is now in the same state as the rest of them. So we've added a node to our Warewulf cluster and configured it and booted it. And it has booted exactly as we would expect. We can also see, if we go back to look at that node's configuration, that ‘discoverable’ has been flipped to false because the node has been discovered, and its hardware address has been populated because when that node appeared on the network and asked for a configuration, Warewulf assigned it to the next discoverable node.

I kind of highlighted for a moment that we have multiple profiles on the system now. We have this default one that's booting the nodes that are configured for it into a pretty typical OpenHPC compute environment, but I've been working on this Fuzzball profile so that we could easily transition nodes between an OpenHPC deployment and a Fuzzball deployment, Fuzzball being CIQ's vision of the future for HPC 2.0, a fully containerized orchestrated multi-cloud and on-prem federated do-everything HPC and otherwise compute system. I was hoping to have more of this available for demo today, but I don't have my internal Fuzzball cluster done yet, but I can demonstrate what we would do to attach it, and it's this: all you would need to do is set the node to be a part of the Fuzzball profile and then power cycle it.

We're not going to do that today because it's not ready for it, but we can show, for example, that of the containers that we have, there's this OpenHPC Slurm compute container, that is the image that nodes will use when they're booting up and becoming part of the OpenHPC environment and this Fuzzball substrate container. And we have one node over there because we switched it over to that profile. And just like the default profile, we can set whatever attributes in the Fuzzball profile that we want. Where before I had a Slurm overlay attached to it, this also has a Fuzzball substrate overlay, which provides the configuration for the Fuzzball agent on those nodes. I'm pretty happy with what we are expecting to be able to do with this and be able to recharacterize nodes at will with nothing more than a profile change and reboot. We've just got to get that cluster provisioned internally. So hopefully we'll be able to show that in the future. Oh, we have a question.

Zane Hamilton:

Again, Mystic Knight. Welcome back. Thank you for the questions. So, "any discussion around security of head nodes, spoofing head nodes, and audit/alert if managed config change on a node would be appreciated." I don't know anything about security on head nodes. 

Security of Head Nodes [35:31]

Jonathon Anderson:

Ultimately, as is relatively typical in an HPC environment, I would consider the head nodes something that must be trusted. TFTP is not a particularly secure service. Neither is NFS in most configurations. These are our private networks in this environment and you certainly should be deploying your Warewulf communication across a secure, isolated, dedicated, and trusted network. The closest thing I can think of is Warewulf setting up SSH, and then you would see those keys change if something was spoofed, but nothing around the kind of TFTP DHCP, anything like that, that I've encountered. You would just need that to be on a trusted network and that would be part of your security posture. Any other thoughts about that, Greg?

Gregory Kurtzer:

No.

Node Containers [36:36]

Jonathon Anderson:

Alright. I did want to go into... so the higher-level thought here is that one of the big wins for Warewulf 4 is this: it is not the profiles, but the containers. Warewulf node images are, or can be, they don't have to be built this way, but they can start their life as a standard OCI container. And so, I wanted to show how simple that is and how getting used to working in that realm simplifies things even down to the user level. So we've seen our two containers, Fuzzball substrate and OpenHPC Slurm compute. Those are just, like, look at how simple these are.

I started with our upstream Warewulf Rocky 8 container. These are a little bit different from standard OCI containers in that we want them to be more like a full operating system rather than a minimal set of services. Among other things, this installs wwinit and installs a kernel, because Warewulf 4 will detect, or at least more recent versions of Warewulf 4 will detect, the presence of a kernel inside your container and use that to boot the node, rather than having to specify that out of band. But, as with any OCI container, I can start from there and then just do stuff to it. So these are things that you could shell into your container to do, but if you do them in a container file like this, you can manage it. You can put it in configuration management; you can version it. You can keep track of what it meant to boot your node image and do it predictably, maybe as part of a CI/CD pipeline, anything like that. So for my Slurm compute container, I added kernel modules, because I want to be able to use Apptainer, which needs SquashFS, and I needed kernel modules because it's a module in this kernel. I installed the OHPC repository, just like you would otherwise, enabled the PowerTools repository, because it depends on it, installed some packages like Slurm out of there and Lmod and the NTP service, and then started those services. Enabled Slurm to start on boot and MUNGE to start on boot. That's all there is to it. Similarly, my Fuzzball substrate container file looks even simpler than that.

I copied a Fuzzball substrate RPM into my container and installed it and enabled it and that's all there is to it. And that's your node image. You don't have this state that evolves over time that you don't remember what happened in the past. My recommendation, what I would prefer to do, is just build your node images this way. If you need to make changes, update this file and then build a new one. That first image is up–it's on Docker hub, frankly. And so pulling down those is as simple as well...
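A hedged sketch of that full loop, writing a container file, building and pushing it, then pulling it into Warewulf and attaching it to a profile. The OpenHPC package names, the ohpc-release URL placeholder, the registry path, and the --container flag spelling are assumptions based on this demo and a typical OpenHPC 2.x install, so adjust them for your site.

```bash
# Illustrative container file for an OpenHPC-style Slurm compute node image
cat > Containerfile <<'EOF'
FROM docker.io/warewulf/rocky:8

# Substitute the real ohpc-release RPM URL from the OpenHPC install guide
ARG OHPC_RELEASE=<ohpc-release-rpm-url>
RUN dnf -y install dnf-plugins-core kernel-modules "$OHPC_RELEASE" && \
    dnf config-manager --set-enabled powertools && \
    dnf -y install slurm-slurmd-ohpc lmod-ohpc chrony && \
    systemctl enable slurmd munge chronyd
EOF

podman build -t registry.example.com/hpc/slurm-compute:latest .
podman push registry.example.com/hpc/slurm-compute:latest

# Pull the image into Warewulf and point the default profile at it
sudo wwctl container import docker://registry.example.com/hpc/slurm-compute:latest slurm-compute
sudo wwctl profile set default --container slurm-compute
```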

Zane Hamilton:

You started generating questions here, Jonathon.

And talking about containers has sparked some questions. Mark asked, "Can one use a signed or encrypted Apptainer in Warewulf 4?"

Encrypted Apptainer with Warewulf 4 [40:34]

Gregory Kurtzer:

Not directly. Right now, Warewulf is basically using OCI, and the Apptainer support and Singularity support for signed and encrypted containers is not directly OCI. Warewulf right now is predominantly using OCI to move all these containers around, and what Jonathon's talking about in terms of how easy it is to leverage containers, leverage work that's already been done, do additional things on top of that, and prescriptively manage and define all of that in your recipe files is incredibly valuable. And the goal of this was to integrate with that existing container community. There are some container CNCF and OCI capabilities coming out that are going to be more around sigstore, and I believe Notary v2 is still on the horizon, among other things. We're hopefully going to be able to leverage those and then hook directly into CI pipelines to create those containers and then import them into Warewulf. So you can, again, have your whole CI pipeline go and create these containers, these operating system images, and Warewulf will pull them in and then automatically deploy them for you. So again, this is like bringing CI/CD into scalable cluster management.

Jonathon Anderson:

If you do have an existing Apptainer or Singularity container, though, where you have a workflow that's based on encrypted and/or signed containers, there is a conversion mechanism. Apptainer supports sandboxing for containers. So you could rebuild that existing container alongside it as a sandbox directory, which is just an extracted directory of the contents of that. And then that can be easily imported as an image into Warewulf. There is a path to get that into Warewulf as an image, but you'll break the consistency of it being signed and encrypted. It will no longer be signed or encrypted when it is resident within Warewulf.
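A minimal sketch of that conversion path, assuming the SIF is available locally (and decryptable) and Apptainer is installed on the Warewulf server.

```bash
# Unpack the SIF into a plain directory tree (this is where signing and
# encryption guarantees stop applying)
apptainer build --sandbox ./node-rootfs/ mynode.sif

# Import the directory as a Warewulf node image
sudo wwctl container import ./node-rootfs/ mynode
```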

Dave Godlove:

You'd have to be careful to do that rebuilding process in a trusted environment.

Jonathon Anderson:

Yeah, exactly.

Gregory Kurtzer:

You know, it would actually be broken in either case because, even if Warewulf could handle that encryption or decryption of that image, it would have to decrypt it to disk because the part of the PXE process that would be basically taking or ingesting these images would not be able to handle that same level of encryption anyway. We're going to have to create intermediary data in order to do that anyway. So an encrypted container, especially, we're not going to be able to bring that encrypted image all the way down to the booted node.

Zane Hamilton:

And then we have one more question on this. 

Zane Hamilton:

It was like, "Can you use any Docker container with Warewulf?"

Greg likes this one. I know you've talked about this one before. It's a good question.

Using Docker Containers With Warewulf [43:42]

Gregory Kurtzer:

I'm going to jump in on this one because we talked a little bit about the amount of headaches that I've had with having to deal with debugging and whatnot. This is another one. And if you notice in the Dockerfile that Jonathon has up on the screen right now, we're actually pulling from warewulf/rocky. And the reason why we're not pulling just from Rocky is because we've had to actually make some changes to the default upstream containers. And part of this is because almost all of the default containers that everybody uses, like if you go and grab the Ubuntu container or the CentOS container, Fedora, RHEL, Rocky, whatever container, almost every single one of them is unable to boot. They can't actually boot your system. So inside of the Warewulf source code, in the containers directory, you'll see Docker, and then you'll see the Rocky file. I don't know if we can pull that up real quick, but I can show you what needs to happen to a default container, as an example, to make it bootable.

Yeah, just hit the Rocky 8 one. There are quite a few additional packages we need, but we also have to switch out the coreutils that it comes with. It comes with a single-binary version of coreutils. And as you watch this process, you'll actually see that coreutils come out, and then the one that I'm specifically installing there will actually go in, but there's also some additional things. All of this unmasking of systemd services, for example; you have to get in there. You have to make sure you're enabling the network and whatnot, but it's not impossible. My point in showing all of this and talking through all of this is: it's just not going to work right out of the box. You have to make some changes to it because those containers are specifically built in such a way that they are not bootable. We have to make some changes to them, and then once we've made those changes and we've pushed those changes back up, you can now use the Warewulf Rocky 8 image for anything you want to do, and it'll be bootable.
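The gist of those changes, sketched very loosely; the authoritative version is the container file in the Warewulf source tree under the containers directory, so every package and systemd unit named here should be treated as an assumption rather than the actual list.

```bash
cat > Containerfile <<'EOF'
FROM docker.io/rockylinux:8

# Swap the minimal single-binary coreutils for the full package, then add
# the pieces a bootable node needs (kernel, networking, ssh); the exact
# package set in the upstream Warewulf container file is longer than this.
RUN dnf -y install --allowerasing coreutils && \
    dnf -y install kernel NetworkManager openssh-server nfs-utils && \
    systemctl enable NetworkManager sshd

# Unmask the systemd units that stock container images mask so the node
# can actually boot to a normal multi-user target (unit list varies).
RUN systemctl unmask console-getty.service getty.target
EOF
```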

Jonathon Anderson:

And the other benefit being, like, we talk about all that work and you can see it here. This represents blood, sweat, and tears, figuring out all of these bits, but not only is it done, and it's easy enough for me now to just build a container that says, hey, from that thing over there, but it's also all just documented right here, how to do that. So, if you want to build a similar image from a different operating system, there's a lot of the work done for you here, to be able to see this container file and see how it was done and repeat it for a later version of Rocky or for a different enterprise Linux or anything. The more you drift from that initial start, the harder it will be, but it's not impossible.

Zane Hamilton:

Thank you. So we're getting kind of close on time. I know there was one or two more things you wanted to talk about on this topic, Jonathon.

Warewulf SSH [46:54]

Jonathon Anderson:

Oh yeah. I skipped over Warewulf ssh. We should show that. This is new in Warewulf 4.3. If you've used ClusterShell or pdsh, this is very similar, but since all that node information is in Warewulf, we have that capability here as well. So say wwctl ssh c[1-4] hostname, for example, or cat the Rocky release, actually. There we go. So we, very simply, have access to commands like that. So that's good to know: even if you've used earlier versions of Warewulf 4, this is a new feature that's pretty nice to have. I saw a question pop up there. Is there another one?
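In command form, that looks roughly like this, using the same node range syntax as the demo.

```bash
# Run a command across a range of nodes, pdsh/clush style (new in 4.3)
sudo wwctl ssh "c[1-4]" hostname
sudo wwctl ssh "c[1-4]" cat /etc/rocky-release
```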

Zane Hamilton:

There was another question, and it actually leads into our next major topic. I was trying to hold off until we talked about ssh. So it is, "How do you use Warewulf 4 and OpenHPC without Apptainer?" 

Using Warewulf without Apptainer [48:01]

Jonathon Anderson:

I see. Without Apptainer. Interesting. The cool thing that I philosophically have prepared to talk about is doing so with Apptainer. Without Apptainer, the way OpenHPC expects you to work… so OpenHPC, among other things, is a pile of RPMs in a yum repo that are pre-compiled HPC utilities and libraries and applications, and the packages are built to deploy in /opt, which confused me at first because that implied to me that you would be installing all these on all of your different compute nodes, which seemed like a pain. It's meant to work with a system like Warewulf, where you install those packages on your head node. They go into /opt, into a subdirectory, actually it's /opt/ohpc/pub. And then you're meant to NFS export that to all your compute nodes, and then those applications become available on all your compute nodes. And then you probably install something like Lmod to be able to load in different libraries and different compilers and different MPI implementations without having to know where they were installed or what versions are installed. That's all good. That's the traditional way to do it. But, once you start going down this container path on the system side, I started asking myself why we need to have all of that software provided as a systems concern, having all of those packages available and pre-compiled. The reason we've done software as a systems concern in the past is that building all that software and providing it and managing the dependencies and getting all of your configure options correct is a huge pain, but OHPC has done a lot of that work now. And with it done, and then coupled together with the ability to build an OCI container and run it with Apptainer, there's really no limit to the amount of customization that the end user can have at run time. If we have a few minutes left, I'd like to show a couple of those containers.

For example, here is a container that I built to do a ‘hello world.’ So this is a very simple ‘hello world’ application that I wrote and then I compile it with MPICH and run it.

This is all done within the container. I bring in my source. I install the same dependencies I needed before to get OHPC working. You might notice that I'm not using the Warewulf Rocky image anymore, because this is not a whole operating system. It's just a minimal container for this application. I bring in the dependencies that I need. That's MPICH, and then this brings in the GNU compiler and GLib for that version that it's compiled against, and Lmod to make it easy to load those into my path. And then in my entry point for this container, I just say load up Lmod, load those modules that I need, and exec my hello world. With this work done, which, if you haven't built an OCI container before, maybe it looks a little intimidating, but it is just shell scripting at the end of the day.
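A rough reconstruction of that kind of container file follows. It assumes a hello.c MPI source in the build context, and the OpenHPC package and module names (gnu9, mpich) plus the registry path are assumptions based on a typical OpenHPC 2.x toolchain rather than the exact file from the demo.

```bash
cat > Containerfile <<'EOF'
FROM docker.io/rockylinux:8

# OpenHPC toolchain; substitute the real ohpc-release RPM URL from the guide
ARG OHPC_RELEASE=<ohpc-release-rpm-url>
RUN dnf -y install dnf-plugins-core "$OHPC_RELEASE" && \
    dnf config-manager --set-enabled powertools && \
    dnf -y install lmod-ohpc gnu9-compilers-ohpc mpich-ofi-gnu9-ohpc

# Compile the MPI hello world inside the container
COPY hello.c /opt/src/hello.c
RUN . /etc/profile.d/lmod.sh && module load gnu9 mpich && \
    mpicc -o /usr/local/bin/hello /opt/src/hello.c

# Entry point: load the same modules, then exec the binary
RUN printf '#!/bin/bash\n. /etc/profile.d/lmod.sh\nmodule load gnu9 mpich\nexec /usr/local/bin/hello "$@"\n' > /entrypoint.sh && \
    chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
EOF

podman build -t docker.io/<your-user>/hello-mpich:latest .
podman push docker.io/<your-user>/hello-mpich:latest

# Pull it down as a SIF on the cluster
apptainer pull hello-mpich.sif docker://<your-user>/hello-mpich:latest
```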

Once this is done, I don't have to worry about MPI ever again. If you look at the Apptainer and the Singularity websites, they talk about the different ways to do MPI, and there's the hybrid model and there's the bind model, where you either have MPI both outside your container and inside, and you have to be very careful that they're the same version, or you bind mount your MPI libraries into your container. You don't have to do any of that. I don't have MPI installed on my head node, and my head node is also where I'm submitting from. All I have is Slurm, and Slurm has knowledge of the PMI-2 interface. And so, if I just srun, like this, I tell srun that I'm starting an MPI application with the PMI-2 interface. I want to run it on four nodes, and these are 32 core nodes. So 32 tasks per node. Let's just do one task per node. Now it's not going to work. It just worked earlier. Why isn't it going to work? It’s because it's live. So it's going to be because node four needs to be resumed or it's down. Why is it down? I'm just going to do it as a three node job.

Dave Godlove:

I think earlier in the demo you'd answered yes and actually deployed the Fuzzball image to it.

Jonathon Anderson:

So I did, but I switched it to that profile, but I switched it back to default again.

Jonathon Anderson:

I'm just going to restrict it to three nodes for now, because that'll be easier. I have this three node MPI job running. We can see that it's got different ranks. Each one is on different CPUs. I can change this to 32 CPUs, if I want, here. Now I'm running a 96 rank MPI job. There's no MPI on my head node, no MPI anywhere except for in the container, and I don't have to worry about getting the environment right for it. I don't have to worry about making sure LD_LIBRARY_PATH is set up, or mpirun and how mpirun works. It's just all in the container. So it's totally portable. There's nothing custom to this cluster in this container. Everything was installed out of OpenHPC, and at the very most, you might want to see what version of OpenHPC your cluster is on. But I can't actually imagine that that would matter either. You just have to have a compatible version of Slurm, which should be standardized through PMI-1, PMI-2, or PMIx, whichever one of those your MPI happens to use.
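The launch itself is just standard Slurm, assuming the SIF is in the working directory and Slurm was built with PMI-2 support.

```bash
# 3 nodes x 32 tasks per node = 96 ranks; no MPI installed outside the container
srun --mpi=pmi2 --nodes=3 --ntasks-per-node=32 ./hello-mpich.sif
```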

A more complex example is this LAMMPS application that I built up. So similarly, this is still an OpenHPC container because I'm using MPI to build it. But, I downloaded a copy of LAMMPS into this container, unpacked it, built it, just using the instructions from their website, and then threw away the installation artifacts. And then I have a small wrapper here, which is different from the ‘hello world.’ This is only here because I wanted to be able to accept command line arguments. So if you just say, like, /bin/sh -c for your entry point, then it won't pass command line arguments through to what you're running.

It's the same stuff that was there before: source Lmod to load it in, load your modules, and then run LAMMPS with whatever arguments got passed to it. So with that, not only does it handle command line arguments, but it also handles standard in. So LAMMPS by default will read its job file, or whatever it calls it, the instructions that it's going to process, from standard in. So I can curl that directly off of Sandia's website, pass it to my srun, change my ntasks to three, so I only need three nodes, and it just runs. And again, I didn't have to do anything on the system side, but also the user was not restricted in what MPI they could run, what compiler they could run. They could bring whatever they wanted and it just worked. And again, it would work exactly the same, no matter what cluster they would run it on.
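That pipeline looks something like the following; the URL is a placeholder for whichever LAMMPS example input you fetch.

```bash
# Stream a LAMMPS input file straight from the web into a containerized MPI run
curl -sL <url-to-lammps-input> | srun --mpi=pmi2 --nodes=3 --ntasks-per-node=32 ./lammps.sif
```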

Dave Godlove:

Super slick. 

Gregory Kurtzer:

How did you go from that Docker file to the SIF? Did you actually build it in Docker and then pull it in via OCI or the Docker daemon?

Jonathon Anderson:

I did it with Podman, but yes. Actually, a good way to show that: when I built Slurm jobs for it, I don't assume that you have the SIF. This is the Slurm job that I built for it. It looks to see if we have the SIF yet, and if we don't, it will pull it down with Apptainer from where I pushed it on Docker Hub.
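A sketch of that self-bootstrapping batch script, with the Docker Hub path and input URL as placeholders.

```bash
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=32

# Pull the container on first use, so the job runs on any cluster with Apptainer
SIF=lammps.sif
if [ ! -f "$SIF" ]; then
    apptainer pull "$SIF" docker://<your-user>/lammps:latest
fi

# Fetch the input file, then hand it to the containerized LAMMPS via srun
curl -sL <url-to-lammps-input> -o in.lj
srun --mpi=pmi2 "$SIF" -in in.lj
```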

Gregory Kurtzer:

Oh, very cool.

Jonathon Anderson:

You could take this Slurm job and run it on any cluster that has Apptainer, and you don't need to have anything and it will just run.

Gregory Kurtzer:

And another thing that you're doing here, which is really interesting, is that part of your  srun or your MPI exec or MPI run, you're actually just calling the SIF file directly as an executable program. You could also do that with an Apptainer exec line. Instead of writing that wrapper that you had inside the container, you can actually just exec the commands that you want, but this is really clean.

Jonathon Anderson:

So if you do it with Apptainer exec, though, you have to worry about loading Lmod and loading those modules. So doing it this way means that you don't have to do that at all.

Gregory Kurtzer:

Oh, interesting.

Dave Godlove:

I want to point out that the use of the pipe to pipe the curl into essentially an Apptainer command, that's a cool feature that's been around for a long time that actually, I don't see used that often.

Jonathon Anderson:

I was really glad when it worked. It was very cool to see that it worked.

You can see here that in my Slurm job, I don't do that. In the Slurm job, I pull down the file and then run it. This is where you need to be able to process command line arguments. That entry point using the wrapper lets this -n work. You can do it either way, whatever you would do with this LAMMPS binary will work with this container.

Zane Hamilton:

That is really cool.

Jonathon Anderson:

Just to go back to my high-minded philosophical argument here: because OpenHPC exists, it can be used in both of these environments. You can use it on the back end to build your cluster, or it enables end users to build their container environments. Because Warewulf supports OCI semantics, you can speak that same language of… one of the things that makes it difficult to adopt something like containers in an HPC environment is when the sysadmins don't know anything about it. When there are systems tools that use these same technologies and these same interfaces and these same languages, it develops this common paradigm that you can use to communicate with your end users. You're building OCI containers on the back end. You can more easily support and train your end users to build OCI containers on the front end. And nothing about the availability of this means you can't do it the classic way on the back end either. You can still install whatever packages you want to make a traditional OpenHPC environment on the back end. Nothing precludes you from doing that, but having these layered approaches to how you provide the services means that whatever layer the user wants to jump in on, they can do that and have as much flexibility and as much, or as little, hand holding as they want or need.

Dave Godlove:

So back to Sylvie's question, how do you use Warewulf without Apptainer? And the answer is, you don't have to use Apptainer to use Warewulf, but the two of them are beautiful together.

Jonathon Anderson:

Yeah. Without Apptainer, it just means it's the sysadmin’s job to do it, rather than the end user's job to do it.

Zane Hamilton:

That's a great call out, Dave. Thank you.

That puts us up at the end of time. I will open it up for any closing remarks you guys want to make. Jonathon, I really appreciate you putting all this together. It was a lot of really good information. It's exciting to see and thank you for spending the time to learn all this. I'll let you close out, Jonathon.

Closing Remarks [1:01:22]

Jonathon Anderson:

Yeah. One of the main things we didn't have time to get into is what it looks like to load a local container into your environment. That was a learning experience for me yesterday. You can use both private repositories and local containers that haven't been pushed up into any registry at all. We talked a little bit about how you would do that with Apptainer with sandboxes, but if you would like more information about that, feel free to get in touch with us. It's pretty straightforward once you know how to do it and we'll be working on documentation for that. But otherwise, I think you can tell I'm excited about all this. I think that there's a lot of good stuff here, as we see how to use the different parts of the toolbox, and I'm looking forward to getting more people on board with it.

Zane Hamilton:

That's fantastic. Thank you. Dave, any closing remarks?

Dave Godlove:

Just that was an awesome demonstration, Jonathon. Thanks for putting it together. It's amazing that you've learned a lot of this stuff as quickly as you have and are now ready to turn around and show it to others.

Zane Hamilton:

Absolutely. Thanks, Dave. Greg.

Gregory Kurtzer:

I guess my closing thought would just be, there was a lot that Jonathon just went through, and he went fast, but he also crammed a lot of really amazing information in there. But the thing that I think I'd really just like to double click on is that the simplicity of what was shown is also incredibly great. Jonathon showed a lot of back end stuff, what's actually happening on the back end and how to go deeper, but there are a lot of people that are using Warewulf almost as a turnkey solution. Like, you can go and install Warewulf on top of a base system, and within five to 10 minutes, if you take all of the standard defaults, you can have a running cluster. And there are quick start guides up on the website which will help you get to that point, to where it's just super easy. You pull the containers you want that are already preconfigured and you just start deploying. And it's a tool developed for system administrators. So granted, power users, as I'll call them, but if you are a system administrator or you have some amount of system administration experience, being able to follow and then extend those quickstarts and immediately have a running system, I think, is extremely powerful, and it scales very well. So this could be a running system of 10 nodes, or this could be a running system of many thousands of nodes. And you'll have a system, a cluster, up and running, whether you want to run OpenHPC, you want to run a custom variant, or something else, very, very quickly.

Zane Hamilton:

That's fantastic. Thanks, Greg. I would just like to close by saying thank you guys for joining us again. Go ahead and like and subscribe. And anytime you want to talk to us more, or you want some help, just reach out to us on our website. We'd be happy to spend more time with you. Thank you for joining. See you next week.