Warewulf: HPC Cluster Management and Provisioning Platform
An introduction to Warewulf: an open-source HPC cluster management and provisioning platform by Gregory Kurtzer
Join us as we dig deep into HPC cluster management and provisioning with CIQ's webinar and podcast series, helping you get the answers you need and pointing you in the right direction.
Webinar Synopsis:
Speakers: Zane Hamilton (CIQ) and Gregory Kurtzer (CIQ)
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Full Webinar Transcript:
Zane Hamilton:
Good morning. Good afternoon. And good evening. Thank you for joining. I am Zane Hamilton with the solutions architecture team for CIQ. For those of you who have been with us before we appreciate you coming back. For those of you that are new, welcome. We would ask that you like and subscribe so that you can stay up to date with CIQ. Today we are joined by Gregory Kurtzer, the founder of Rocky Linux, CentOS, and many other open source projects, to talk about provisioning.
What is Provisioning? [00:41]
Thank you for coming, Greg. Today we wanted to talk about provisioning. Provisioning can mean a lot of different things in this industry depending on your background. Would you give us an idea of what provisioning is from your point of view?
Gregory Kurtzer:
Great question. My background is in high performance computing; I spent about 20 years of my career working in it. When I think of provisioning, I think of provisioning the HPC resource. Typically we're talking about provisioning and management of the operating system. To me, provisioning is the fastest, most effective, and most flexible way to manage the operating system on anything from a small cluster of nodes to a large cluster with thousands of nodes. Provisioning is all about the high performance compute cluster and the management of those nodes.
Provisioning From an HPC Perspective [01:53]
Zane Hamilton:
Excellent. From an HPC perspective, does provisioning involve more than server provisioning? Does it also involve user provisioning, network provisioning, and the entire stack?
Gregory Kurtzer:
Yeah, we're starting from bare metal. For example, when somebody buys a high performance computing cluster, they roll in racks of compute nodes. In some cases it could be thousands of compute nodes. They roll it into the data center and will provision everything from that bare metal up the stack. Starting with bare metal, you have to build your operating system and get it onto the nodes in an efficient and scalable manner. From there you also have to account for any custom configurations on particular nodes. Does that include network configurations? Does that include service configurations? Does that include user configurations? Some of these can be provisioned or managed after the node has been provisioned. Some of them have to be provisioned or managed before the system runs /sbin/init.
/sbin/init is the parent of all processes, running as PID 1. In some cases we have to do customizations to each node that's being provisioned before we even start PID 1. So there are some tricks that we have to be thinking about in terms of provisioning. I usually think of provisioning as the bare metal base operating system. That base operating system includes a number of different facets so that the cluster can function: network configurations, service configurations, and user configurations.
Configuration [03:36]
Zane Hamilton:
You actually used an interesting word there: configuration. Whenever you're talking about configuration management and provisioning, are they different? Are they the same thing? Are they paired together?
Gregory Kurtzer:
That has come up a bunch of times in high performance computing, especially in the context of provisioning an operating system. Warewulf and other cluster provisioning systems will typically do some basic, rudimentary configuration management. This rudimentary configuration management occurs in many cases before the operating system has booted. For example, when you are making a network configuration, you will want to have your network configuration in place before you call /sbin/init. /sbin/init will then go through systemd and start all services and devices, so you need to have that network configuration already there. It is the same thing with things like fstab: you need to have your file systems already configured. You need to have service configurations in place before systemd runs those services.
There is a side of configuration management that has to be lightweight and baked into the provisioning system so all of the nodes get appropriately configured, whether that's something as simple as a network configuration and a hostname, or something as complicated as a whole different operating system. You have to be able to manage that at the provisioning layer. Warewulf does all of this for you, but it does not replace a more typical, fully fledged configuration management solution. If you need something that's going to do more complicated configuration management tasks, provision your base system with Warewulf and do your base configuration with Warewulf. Once that node comes up, run your configuration management services; then your CM platform, or whatever you're choosing, can take over.
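To make that concrete, these are the kinds of files that have to already be in place before /sbin/init runs, because systemd consumes them at boot. The paths and values below are illustrative placeholders for an EL-style image, not the exact files from the demo:

    # /etc/sysconfig/network-scripts/ifcfg-eth0 -- must exist before systemd brings up networking
    DEVICE=eth0
    BOOTPROTO=static
    IPADDR=10.0.2.1          # a per-node value the provisioner fills in
    NETMASK=255.255.252.0
    ONBOOT=yes

    # /etc/fstab -- file systems must already be defined before they are mounted
    10.0.2.254:/home  /home  nfs  defaults  0 0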
Stateful or Stateless [05:51]
Zane Hamilton:
Excellent. Thank you. Something that I hear often is either “Stateful or stateless?”. So when we start talking about Warewulf, how does that come into play?
Gregory Kurtzer:
This question harkens back to the foundation of Warewulf, its basic tenets, and what has differentiated Warewulf since its inception. It was created around 2001. At that time high performance computing systems were starting to bridge over into Linux and Beowulf. Also during this time we were trying to figure out how to manage operating systems across thousands of nodes. To put this into perspective, if the systems are highly loaded and complicated, a system administrator may be able to handle somewhere over 100 separate servers. If the systems are really uniform or well automated, you might be able to get to a few hundred. An HPC system could have thousands, so we needed to come up with a better way of scaling the system administration. There were various approaches to this. One was an important toolkit called OSCAR that used a provisioning tool called SystemImager. SystemImager allowed you to make a template of your operating system; it would blast that template out to all the nodes when you booted them and write it out to the hard drives. From that point on, all of your nodes are provisioned, and you manage them as a cluster of workstations. There are other tools that leverage things such as Kickstart, which uses Anaconda, the Red Hat installer mechanism, for automating deployments of resources.
Then there is the Rocks cluster distribution (not to be confused with the similarly named Rocky Linux). Rocks is a turnkey system that uses Kickstart to provision out all the nodes. Warewulf approached this from a completely different perspective. Instead of facilitating the installation of a massive number of compute nodes, let's completely remove the necessity to install any compute nodes and boot directly into a runtime operating system. That shifts the paradigm, right? Our model dumped a runtime operating system onto the nodes every time they booted. We're never doing an install. The node runs directly from what it booted; it runs like a live OS or a live CD every single time.
When Warewulf boots a node, that node gets the operating system byte for byte, as it was intended for that node. The same thing occurs for all of the nodes above and below it. It makes it very scalable and easy to maintain. When we talk about stateless in high performance computing provisioning, we are really saying that the operating system is not written in a persistent way to storage. This means that you could reboot that node at any given point, and as you reboot it, it's always getting a fresh version of the operating system. Stateless is commonly confused with diskless. Diskless nodes must run stateless; if you have a diskful node, then you can choose to either run it stateless or stateful.
Do you want to write that operating system to memory? Do you want to run it directly out of system memory, or do you want to run that operating system off of that spinning disk? There are a lot of different options in terms of how you can choose to provision that resource and that operating system to that node. The thing that has differentiated Warewulf over the years is the fact that it has focused on stateless. There were other systems out there that were focusing on stateful, such as Rocks and OSCAR, but Warewulf really focused on stateless operating system management. There are a lot of benefits in doing that when you're dealing with a large system at scale.
SegFault Story [10:58]
A story that I like to tell is about a compute node in a very large cluster. We had one compute node that would always segfault on one application. As we would run this application across all the nodes, all the nodes ran it perfectly except for one; that node would always segfault. When you see a segfault you are usually going to assume that there's something weird going on with the software stack or something software related. We were using Warewulf, and I knew for a fact that the nodes above and the nodes below were running perfectly. It was only this one node that was having a problem.
We knew it was a hardware issue. I packed up this node and sent it over to the hardware vendor, and they asked, “What's wrong with it?” I said, “I don't know, it's segfaulting the software.” They responded, “And that's the hardware? You sure?” I said, “Yeah, I'm pretty sure it's the hardware.” They looked through it and didn't see anything wrong with it. They wanted to send it back and I said, “Nope, nope, there's something wrong with it.” Finally, they got around to checking the power supply under load. It turned out that the power supply under load was not providing enough juice for the motherboard in the system. This shortage of power manifested to the kernel as a segfault rather than an obvious hardware failure.
They put a new power supply in the system and it ran perfectly from that point on. It changed the paradigm in terms of how we maintain, and how we think about maintaining, these large systems. Especially when you consider that mean time between failures is actually not that bad until you're looking at a thousand-plus nodes; then failures happen much more often. So how do you manage things like this? How do you mitigate that? With a stateless operating system provisioning system, you can take out that node, put a new node in, change the configuration to say the hardware address of this system has now changed, and turn it on. It has now completely assumed all of the responsibilities and roles of the previous node that was there. It was very easy to deal with.
How is Stateful and Stateless Related to Containerization? [13:22]
Zane Hamilton:
This sounds a lot like a popular topic a lot of people are talking about. This sounds a lot like containerization. How is this related to containerization?
Gregory Kurtzer:
So that's a great question. When Warewulf was first conceived, similarly to OSCAR using SystemImager, it had the notion of a golden image. It was this base template image that you were using for provisioning out all of your compute resources and all of your compute nodes. In Warewulf we did it a little bit differently. We had a directory called a chroot. A chroot allows you to have another root file system in a directory location. Warewulf leveraged that as a template to make the bootable operating system for all of the compute nodes. We didn't know it at the time, but what we were actually using was basically a container.
It was very similar to a chroot-based container; that was essentially what we were using. We just called it something different: the virtual node file system, or VNFS, and we managed this virtual node file system using tools that were very similar to containers today. One of the big switches that we made with Warewulf v4 is that we embraced the container ecosystem directly. Instead of having a separate version of this that we were using as a VNFS, we said, “let's just start using containers.” This has had the effect that all existing tooling people are using to build, secure, and validate containers can now be leveraged in Warewulf.
One of the biggest changes that we made with Warewulf v4 is that it is now an OCI-compliant provisioner. This means that you can use any of your Dockerfiles and recipes; you can also use Singularity and Apptainer. You create your container, and you point Warewulf at wherever that container exists, whether that's in an OCI registry, on your local file system, or in the locally running daemon. You say, “I just want to import that into Warewulf.” Once you've imported it, you can say, “I want these thousand nodes to boot from this container.” What Warewulf does is not just blast the operating system out; it is actually blasting a container out to all of the compute nodes. You're running this container on bare metal. As a result, we get to leverage all of the existing tools and capabilities that people probably already have today for managing their containers. All of that can be directly imported into Warewulf so you can provision out those containers.
How Was Warewulf Started? [16:48]
Zane Hamilton:
That's awesome. I know we already talked about this today, but I think a lot of people will be interested in how Warewulf came into existence. How did you decide this product was the best way to solve these problems?
Gregory Kurtzer:
My background is actually in biochemistry. I decided to make the leap into Linux and open source as I was trying to solve computational problems for biochemistry. I got so excited over being able to do this in open source. I was so enamored with what we were able to do in Linux and open source that it pivoted my whole career. I ended up going through some startups, eventually landing at Berkeley Lab at the Department of Energy. I had a joint appointment to UC Berkeley, and I also did projects for UCOP. Around the year 2000, when I was first starting, we went from scientific computing to HPC and Linux cluster computing.
One of my first tasks when I joined Berkeley Lab was to start building up some big HPC clusters. During this time, as soon as I walked in the door, I was tasked with an existing load of projects. I was completely taxed out; I was at one hundred percent utilization, if not more. They said, “we're going to add some big clusters to the mix and we are tasking you with their maintenance.” I thought that there had to be some way to help me with this, because I am not going to physically install a hundred computers and then never sleep. As a government employee, I was already overworked. So I was trying to figure out the best way to manage, install, and maintain these systems over time. I was looking at different toolkits and asking myself, “what are people using?” None of them really spoke to me in terms of the way I wanted to manage this. I decided to develop a prototype. It was very similar to many of the things that I've developed over my career: it was so close to what I needed, but it was completely wrong at the same time. Singularity started the same way. Singularity version one started off very similar to Ubuntu Snaps, then ended up morphing into Singularity and Apptainer as we know it today.
I previously worked at an organization called Linuxcare. We were very well known for creating these little bootable business cards that you could put into a system. It was a little rescue CD, and I left there thinking all of the world's problems could be solved with a little bootable business card. If everybody carried these around, you could have everything you ever wanted computing wise. I started Warewulf by creating ISO images for all the compute nodes. They were stateless; the main difference is I wasn't using the network to transmit them. In the year 2000, I went to LinuxWorld and presented this to the people there.
The feedback I received went as follows: “We really like the management paradigm of what you're doing, but good God, what are you thinking with these ISOs? That is just such a horrible idea.” Also present at the conference were the people responsible for Etherboot. They advised me to go over to their booth and talk to them, so I went over and spoke with them. I learned everything I could about Etherboot in a few days, and it took me four days to port Warewulf to be able to use it. Etherboot was a little piece of firmware that you could either flash to your network card or boot from a floppy. It set your network card up as a bootable device by loading an option ROM into the BIOS. Then the network card would actually do PXE booting. In actuality it wasn't PXE, because that didn't exist at the time, but it was very similar. That was when Warewulf really took off. Once I got that built up and integrated, it seemed like organizations everywhere started to adopt Warewulf as they moved into supercomputing.
Demo [22:02]
Zane Hamilton:
That's awesome. I'm not going to say that I've never used a Linux CD because I have. I had to use one to rescue some stuff. Can you show us what you prepared for us today?
Gregory Kurtzer:
I put together some slides to show you that, as a company, we do more than just Warewulf. If you're interested in these things, let us know. Warewulf is the big thing that we're going to be talking about today. The name Warewulf originated as a software implementation of Beowulf; that is why Warewulf is spelled wonky. In fact, at this point it doesn't feel right for me to spell it the normal “werewolf” way, like the monster-human-dog thing. It just doesn't feel right to me.
Warewulf is a software implementation of Beowulf. Beowulf is a specification for how to build a cluster using commodity parts, Linux, and open source. It's an architecture. This architecture is roughly drafted as you see below. You got your user workstations and users SSH across the network. They get into a control system that may have a file system attached to it directly and it's sharing it out. This could also be some sort of network attached storage and it's sharing it out. Then you have a switch. You've got a private network where all of your compute nodes reside and these compute nodes may actually even be on another network that may be a high performance data network such as InfiniBand.
Beowulf and very early versions of Warewulf were designed to solve this. As the system got more complex, we added more features. We started adding to it in ways where users would be SSHing in, and we would direct them through a reverse NAT in our firewall to land on our interactive nodes. The compute resources are typically in groups now. They are in groups because you may have different vintages, or you may have different people that own different sections of them, but you have these different groups of compute nodes. You may have separate networks dedicated to each one of these groups. Warewulf, provisioning, and scheduling are usually done on some sort of redundant resource.
Typically you will have your shared storage, like a home directory storage, and a parallel scratch storage that's very efficient. Warewulf needs to be looking at this not only as one flat system, but as multiple flat systems. Warewulf was designed around the idea of maintaining not just simple flat systems, but also systems that have additional depth and complexity. When we built these systems at Berkeley Lab, as well as UC Berkeley, there was no state in the entire system except for the Warewulf master. Even the interactive node ran stateless. The entire architecture was stateless.
This meant that any system could blow up and we would replace it without losing anything. There was no context that was ever lost. This was how we were thinking about building systems, and this is how Warewulf is envisioned today. You're going to see in just a moment how you can have node configurations and then node profile configurations that are completely cascading. It really facilitates anything from a more simple, turnkey model going all the way up to something that needs a lot more flexibility.

I've already set up Warewulf and built a node on this. Right now it's a virtual node, so we can hack around at this a little bit and take a look. I installed it to /usr/local because I didn't install it via package, and /usr/local is what I do. Inside of /usr/local/etc/warewulf, you'll see you've got a few files. If I take a look at nodes.conf, you'll see it is YAML with two major sections: node profiles and nodes. Nodes are pretty easy to imagine: each entry is the configuration for a single node, with some information that is specific to that node. If you have hundreds of nodes, you may want to consolidate some of that information and configuration into profiles. Every node by default gets a profile called “default,” which means you can add your configurations to the default profile.
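For reference, a trimmed sketch of what that nodes.conf layout looks like; the key names and values here approximate what's shown in the demo and vary between Warewulf 4 releases, so treat them as illustrative:

    nodeprofiles:
      default:
        container name: rocky-8
        kernel version: 4.18.0-305.el8.x86_64     # illustrative version string
        network devices:
          default:
            netmask: 255.255.252.0                # shared settings live in the profile
    nodes:
      n0000:
        network devices:
          default:
            device: eth0
            hwaddr: 08:00:27:05:3a:2a             # per-node values live on the node
            ipaddr: 10.0.2.1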
For example, in this default profile we have a container name, rocky-8, and we have a kernel version. I will show you how I got these into the system. Because node zero is part of this default profile, and this profile has these values, node zero inherits those values, and I can supersede them. I can have multiple profiles that supersede other configurations, and it's extraordinarily flexible. You can see in terms of network devices, we have a default network. I have the netmask being globally defined, but I have the IP address and the MAC address being defined specifically on just this one node. I don't want to put my IP address or my hardware address, my MAC address, in a profile, because then all nodes would have the same IP address and the same network configuration.
Those I put specifically into the node's own configuration. This encapsulates what an entire system architecture would look like, and it gives us the ability to do very cool things at a high level. I'm showing you this nodes.conf; you can edit and work with nodes.conf directly, or you can use configuration management, but most people use the tool wwctl. wwctl is the primary interface for Warewulf. You can do wwctl node list -a, which will show all attributes and how many nodes are configured. You can see that node zero has all of these different fields, or keys, associated with it.
We can see all the different values for them, and we can see where we got that configuration. If it says default under profile, that means we got it from the default profile. For example, the container name and kernel version came from the default profile; that's exactly what you see here. This is listed because some configurations may come from any of several node profiles, and in some cases it's best to see where a particular value is coming from. Say we have a default profile on some nodes and I want to test a new kernel: I can have another profile for the new testing kernel and then add that profile right here, and we'll be able to have a certain group of nodes testing that newer kernel. It's extraordinarily featureful. I don't want to get too bogged down on this, but I really wanted to explain it because it is enormously capable with regard to configuring nodes. As we're looking at building systems that may be more complicated like this, it makes it very easy to group these nodes together with a profile and then define all the attributes for that profile. Warewulf gives you that functionality very easily.
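A sketch of that test-kernel pattern, assuming a second profile is layered on top of default; the flag spellings and the kernel version are illustrative and may differ between Warewulf 4 releases, so check wwctl's help output:

    sudo wwctl profile add kernel-test
    sudo wwctl profile set kernel-test --kernelversion 4.18.0-348.el8.x86_64   # hypothetical test kernel
    sudo wwctl node set --profile "default,kernel-test" n0004 n0005            # only these nodes pick it up
    sudo wwctl node list -a n0004    # attribute listing now shows kernel-test as the source of the kernel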
What Is Required To Install The Warewulf Controller? [30:40]
Zane Hamilton:
Real quick. Since you said you recently installed this, what is required to install the Warewulf controller?
What does it look for? How complicated is that?
Gregory Kurtzer:
Good question. Thank you for bringing me back to this; I meant to show it, but I forgot because I just wanted to jump into the tech. If you go to warewulf.org you can click on Docs, and from Docs you can click on Quickstart, where we have an Enterprise Linux 8 (EL8) quick start. I will quickly run through this so you can see how it operates. The first section is installing it from git (you can also install it from packages and releases). Basically you install the dependencies, you do a git clone, you do a make all, and then sudo make install; a super easy install. Once you've done that, you run these commands to configure the firewall appropriately.
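Condensed, that source install looks roughly like the following. Treat it as a sketch of the EL8 quick start of that era; the repository location, dependency list, and firewall details have changed over time, so check the current docs at warewulf.org:

    sudo dnf install -y git golang make gcc            # approximate build dependencies
    git clone https://github.com/hpcng/warewulf.git    # repository location at the time of this webinar
    cd warewulf
    make all
    sudo make install

    # open the provisioning services through firewalld (service and port names approximate)
    sudo firewall-cmd --permanent --add-service=dhcp
    sudo firewall-cmd --permanent --add-service=tftp
    sudo firewall-cmd --permanent --add-service=nfs
    sudo firewall-cmd --permanent --add-port=9873/tcp  # warewulfd's provisioning port
    sudo firewall-cmd --reload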
This is the warewulf.conf file. You're going to edit this and make the appropriate changes. You are probably going to need to edit your IP address and maybe your netmask, and then you probably also want to take a look at your DHCP range start and range end. For the most part, everything else is going to be pretty standard. Then what we're going to do is create a systemd service to run this: you create the users, then start Warewulf from systemd. At this point we now have Warewulf installed. Next we're going to run wwctl configure --all, which will go through all of the system services that Warewulf needs in order to boot nodes.
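The handful of warewulf.conf settings he's describing, plus the service start, look roughly like this; the addresses are placeholders for the demo network and the YAML key names are approximate for Warewulf 4 of that era:

    # warewulf.conf (YAML) -- the settings that usually need editing
    ipaddr: 10.0.2.254          # the control server's address on the cluster network
    netmask: 255.255.252.0
    dhcp:
      range start: 10.0.2.50    # lease range handed to booting nodes
      range end: 10.0.2.99

    # then enable the daemon and let it configure its underlying services
    sudo systemctl enable --now warewulfd
    sudo wwctl configure --all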
For example, it's going to need things like PXE, and PXE has certain requirements like DHCP and TFTP. It also needs NFS running, via this configuration file. It'll set up DHCP, make sure TFTP is running properly, and drop the appropriate files in there. This will configure your underlying system. It has been tested and is known to work on Enterprise Linux variants, SUSE variants, Debian, and Ubuntu. Then we're going to start working with Warewulf directly. I showed you that I had a container called rocky-8; this is how I got it. I just did wwctl container import and pointed it at a Docker registry. You can reach this from anywhere — it is public on Docker Hub — and you can then import your Rocky 8 image.
We're going to name it rocky-8 and we're going to set it as default; setting it as default adds it to the default profile. Then we do the same thing with the kernel: wwctl kernel import, and we give it the version of the kernel we're currently running, and we set that as default too. This takes a snapshot from your host. You can get the kernel from a variety of other locations — take a look at the help output for kernel import — but in its simplest form this takes the kernel running on your control server. If you're running GPUs or InfiniBand, make sure those drivers are installed in your host kernel, or whatever kernel you're importing from. Warewulf will automatically pull all of those kernel drivers and support for that kernel and import them into Warewulf. The set default will put it right here.
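Those two imports, roughly as they appeared in the EL8 quick start at the time (the registry path is the public image used then; substitute your own, and expect flag names to drift between releases):

    sudo wwctl container import docker://warewulf/rocky-8 rocky-8 --setdefault
    sudo wwctl kernel import $(uname -r) --setdefault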
If you didn't use --setdefault, this is how you would set your profile to be exactly as I previously demonstrated: wwctl profile set, with --yes (assume yes), on the default profile, and we set the kernel and the container. The next thing we do is set a basic network configuration, again on the profile. As you can see, this is exactly what has already been done. I didn't set the gateway; I guess I was slacking on that. Slack has a whole new meaning now, doesn't it? I can't use that one anymore. Lastly, we can now add a node: you do wwctl node add, give it a node name, specify some network information, and make the node discoverable. This means that when the node boots, it will automatically be added to the Warewulf database and automatically configured: it will put its MAC address onto the network device you have set as the default.
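Put together, the profile and node commands he's walking through look something like this; the addresses and MAC are placeholders, and the exact flag spellings may differ slightly between Warewulf 4 releases:

    sudo wwctl profile set --yes default --container rocky-8 --kernelversion $(uname -r)
    sudo wwctl profile set --yes default --netdev eth0 --netmask 255.255.252.0 --gateway 10.0.2.254
    sudo wwctl node add n0000 --netdev eth0 --hwaddr 08:00:27:05:3a:2a \
         --ipaddr 10.0.2.1 --netmask 255.255.252.0 --discoverable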
Warewulf Overlays [35:42]
The next thing to talk about is Warewulf overlays. So far we have taken a container out of Docker Hub and now we are going to blast it out to “n” number of hosts. This would actually work perfectly if every host were identical, but we do have some node-specific things to take care of: everything from network, to service configuration, to user configuration. Some small amounts of data and configuration may need to be changed on a per-node basis. That is done via overlays, and I'll show some of these overlay features in just a moment. Overlays are a way of adding information to a provisioning system using text templates.
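As an illustration of what such a text template carries, here is a trimmed sketch in the style of the network-interface template shown later in the demo; the macro names are approximate and vary by Warewulf version, so treat them as placeholders:

    # ifcfg.ww -- rendered once per node at provision time; the .ww suffix is
    # dropped and the macros are replaced with that node's values
    DEVICE={{.NetDevs.default.Device}}
    HWADDR={{.NetDevs.default.Hwaddr}}
    IPADDR={{.NetDevs.default.Ipaddr}}
    NETMASK={{.NetDevs.default.Netmask}}
    BOOTPROTO=static
    ONBOOT=yes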
That is what I did to get this system up to speed. I'm going to start off by showing an overlay. If I do wwctl overlay, you will see all of the subcommands for overlay (if I just type wwctl, you'll see all the primary commands). We're going to work with wwctl overlay from here; let's do a list to see what exists on this system. I've got two overlays: generic and wwinit. To show how this relates to the node, I'm going to do a node list -a again, and you're going to see we have two overlays here: a system overlay and a runtime overlay. The system overlay we're using is wwinit, and the runtime overlay is generic.
These are in parentheses because they are not a configured attribute in the node configuration; this means they are defaults. If you set them in the node configuration, the parentheses go away and it looks like everything else. It'll be whatever you set it to, but if they're not set, those are the defaults. So the wwinit and generic overlays can be found here. If you want to see what's inside of them, you'll see, for example, that the generic overlay has seven files and the wwinit overlay has 46. We are going to do a list. This shows us all of the different files, their permissions, and their ownership in the wwinit overlay. You can see all the different files that are going to be provisioned with this wwinit overlay. Notice some of them have the suffix .ww. This indicates that it is a template file, and just like autoconf with .in, as soon as it gets resolved the .ww disappears and the template macros inside are replaced. I'm going to show one.
If I do a wwctl overlay edit, the first thing I need to do is tell it which overlay I'm editing a file in, and then do that. When I do, this looks like a regular ifcfg file, except for the fact that it has these template macros in it. These template macros get replaced for every node that is getting provisioned. This gives me the ability, with a single template or a single overlay, to have any number of nodes booting with unique configurations for each one of them. The templating language is actually very featureful: if there's anything you want to include such as conditionals, loops, or ranges, it supports all of that. It's very easy to customize this with a lot of granularity. There is so much you can do.

There are a few other things I can demonstrate. I'll show you this one involving containers: wwctl container, and we're going to do a list. I've got one container imported, called rocky-8, built and ready to be utilized, and it's configured on one node. Do wwctl container shell rocky-8 and you can see I'm now sitting inside of that environment. If I do cat /etc/os-release, I'm actually sitting inside of the operating system — the VNFS, or the container — that I imported out of Docker Hub. If I exit this, it will tell me it's going to rebuild the container automatically. Here it skipped the rebuild because it's current.
It quickly did some timestamp comparisons and realized it doesn't need to rebuild anything. If you want to install something into this, you can also go through the container's mount directory: if I list it, you'll see I've got all the files there. So if you want to install something like the Mellanox OFED user space components, you can get that into the container by using a bind and then doing an install, just as you would with any other container; or you can have it built via whatever CI pipeline you're using to manage and build all of your containers. I've walked through the general interface of wwctl; I don't believe there is anything else specific that needs to be addressed.
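A minimal sketch of that customization workflow from the control server; the package name is an arbitrary example, and as he notes, a CI pipeline that builds the container image works just as well:

    sudo wwctl container list               # confirm rocky-8 is imported and which nodes use it
    sudo wwctl container shell rocky-8      # drop into the image's root file system
    # ...inside the shell, install whatever the nodes need, for example:
    #   dnf install -y infiniband-diags
    # exiting the shell triggers a rebuild of the bootable image if anything changed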
The goal of Warewulf is to be simple, lightweight, and very flexible. The commands that you see here should be fairly self-documenting. The configure command manages the system services, as we did at the very beginning. Container and kernel we have walked through, and I demonstrated node; I can show that in a little bit more detail. We can add nodes, we can use IPMI, and at some point we're actually working on some Redfish integration. You can get a console on a node and you can check its status. When you do node status, you're going to see we have this node, but it hasn't booted yet.
So there's been no “last seen.” When we boot the node, we're going to actually watch this and see what happens. We also have profile, which is the interface for managing different profiles. We've got power, which is basically for power management: when you do things like rebooting the entire cluster, it will automatically manage a parallelized fan-out with an appropriate number of concurrent tasks so it does not completely overload your network. Waiting half a second between nodes as you go through that loop actually has a profound effect on booting efficiently; DHCP is also not very scalable. The server command is basically the Warewulf daemon: the daemon that is running actually comes from this binary. You can interact with it here, but since we're running it through systemd, it's better just to leave it be and let systemd manage it. Do a systemctl status and you can see it's in there and properly running.

We're going to test boot a node, and we're going to add the watch option and watch this update. If I come over here, I just so happen to have a virtual machine that we are going to boot. This has been configured to PXE boot. Doing PXE booting on VMware Fusion, as I'm doing here, has gotten incredibly difficult because now everything is using macOS's DHCP services. If anybody wants to know how to do that and gets bogged down with DHCP issues, send me a message; I wrote a quick script that blocks macOS's ability to do this.
When you boot this up, you'll see we're doing a UEFI boot, and it's basically going to start with PXE. You've got to watch fast because it moves pretty quickly. You can see now we're sending it the container, which is happening down here at the same time. Next we're going to send it the different overlays: here is the kmods overlay, the system overlay, and the runtime overlay. Once it's got those, it manages a little bit on the back end, then it does some booting through the Warewulf init, and we're done. It was super fast. There was no operating system installed on this node; everything came over the network. When you're booting on real hardware, it's actually faster. It should be noted that I'm using a fairly small and basic container.
If you have a bigger container it will take more time, but it is a fast provision. It is designed to be very scalable for lots of nodes, parallel booting, and ongoing updates. I also want to mention there are multiple overlays; I can even show you the different overlays. We're going to get into the weeds a little bit right here, but if you look at the default PXE script, you're going to see that we've got four pieces that we're pulling into an initrd: we've got our base container, we've got our kernel modules, we've got our system overlay, and we've got our runtime overlay. What we're actually doing is overlaying all of these overlays on top of the container.
This gives us the ability to start with a base and then add in the appropriate bits that we need for this node to operate properly. The system overlay is only provisioned once, at provision time. The runtime overlay is recurring. For things like user accounts, we can look at what's inside of it. Remember the generic overlay from the overlay list: if we do wwctl node list -a, you'll see that the runtime overlay is set to generic. Let's take a look at what's in there: overlay list generic, with the long option. You'll see we have user information in here: we've got our group file, hosts file, and passwd file, as well as root's SSH authorized_keys. This gives us the ability for root to drop into that node easily.
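A quick sketch of checking that runtime overlay and of the user-propagation behavior described next; the long-listing flag is the one used in the demo, and the one-minute figure is the default update interval he mentions:

    sudo wwctl overlay list generic --long   # passwd, group, hosts, root's authorized_keys, ...
    sudo useradd testuser                    # add a user on the control node...
    # ...within the runtime update interval (about a minute by default), every
    # booted node pulls the regenerated passwd/group files from this overlay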
What we have now is things like the passwd file, and that passwd file is going to be automatically updated as users are added to the host system. If you add a user to this host system, then because of this default overlay configuration, within one minute all your nodes will automatically be configured with that user account, because that's the update interval for the runtime overlay. You can see, as soon as this hits 59, you're going to see another message, which is another request from this node to get the update. You can also sort this output based on which nodes have not checked in. You saw it turn yellow there for a split second; that's because it hit the maximum time for the overlay request.
If we don't see a node within the update interval we're supposed to see it in, that's a warning, so it turns yellow. You can search for all of those sorts of things as you are managing your nodes. If you have a thousand nodes, it can be hard to find the one node that didn't provision all the way, or the one node that didn't reboot via its IPMI command. There could also be a DHCP error; DHCP seems to be one of the most limiting factors for the scalability of parallel booting hundreds of compute nodes at once, and you may end up with DHCP conflicts. You can quickly see which nodes have been provisioned and which have not using this tool. I think this is everything I planned to go through. Are there any questions?
BMCs and Pre-OS polling [48:55]
Zane Hamilton:
We have a question here: “I will have *many* questions that relate to tighter integration with the baseboard management controllers, as well as data center layout, switch configuration, and related topics. Let's start with BMC. Isn't it possible to do everything that Warewulf wants to do through pre-OS polling and configuration file preparation by interrogating the BMCs for each node? (We have open-sourced software we developed to do this using IPMI.)”
Gregory Kurtzer:
The problem with using BMCs is that there's not a standard interface across vendors. Everybody's done their BMC differently, and many vendors have even done IPMI differently depending on what additional features they're trying to add. At this point IPMI has stabilized, but the BMC controllers are very vendor specific. We do not have a direct integration into the BMC controller, with one major exception, and that is to help secure the provisioning process. You probably noticed when we looked at the generic overlay and we looked at the files, you saw files like the passwd and group files. They are there; granted, there are no passwords in the passwd file — we stopped putting passwords in the passwd file 20 years ago.
There's still user account information in that passwd file, and you may have other things that you put into these overlays which you may not want users to find, because we're using TFTP to download the first stage and then HTTP via PXE. That doesn't give you the option of having a whole lot of security. So what we've done is, if we look at the default PXE file, you're going to see that we're using things like the asset key and the asset tag. The only things we interface with on the BMC today are things like the hardware's UUID and the asset tag. If you specify an asset tag on that hardware, it's interesting, because users cannot get to it.
It requires root to be able to get to that asset tag, using something like dmidecode. That gives us the ability to leverage the asset tag almost like a key or a token. Unless that token is properly passed to Warewulf, Warewulf will not hand out any of this information to anybody requesting it. It's actually a way that we can gain some security through this booting process. Whenever you're talking about security through provisioning, you have to have a root of trust somewhere, and this gives us a very easy way of managing that root of trust directly on the hardware via the BMC. Sorry that I didn't directly answer your question.
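For reference, a sketch of what that looks like: reading the asset tag on a node requires root, and the wwctl flag for recording it against a node is shown approximately here (the tag value is a made-up placeholder), so check your version's help output:

    sudo dmidecode -s chassis-asset-tag                       # read the tag set on the hardware (root required)
    sudo wwctl node set n0000 --assetkey "EXAMPLE-TAG-1234"   # record it so Warewulf only answers a node presenting it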
What is Not Possible to Provision Through Warewulf? [52:32]
Zane Hamilton:
We have another question: “I agree the ability to set up different profiles is very powerful; I've used it to provision nodes for diagnostics. What type of functionality would not be possible to provision through Warewulf?”
Gregory Kurtzer:
Warewulf will not do Windows. There are also a few types of operating system images that are difficult with Warewulf. For example, very large operating system images can be difficult to provision; it is more optimal to have small operating system images for a large, scalable system. If you have more complicated requirements, containers are a better way to manage large dependency stacks than putting them directly into the base operating system image. If you have a very large operating system image, it is difficult to get it out to all of those nodes in an efficient way.
The other thing is that we are also consuming memory for those operating system images. If you have a big operating system image, hopefully you also have a hard drive that you can set up swap on. If you want to set up swap space on your hard drive, you simply put that into your fstab, and the kernel will automatically swap out the space we're using in memory as it's needed by applications. The whole operating system image in memory will end up being swappable. As long as you do have a physical disk on your system, just set up a swap space and that mitigates the issue.
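A single fstab line in the node image is enough for that; the device name below is illustrative, so point it at whatever local partition you've set aside for swap:

    /dev/sda2   swap   swap   defaults   0 0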
Ansible Playbooks and Redfish IPMI [54:42]
Zane Hamilton:
Thank you for your question, Herman, but I think this is related to what Alan posted. His question is: “Would you need Ansible playbooks or the Redfish IPMI? Or would you need to write a playbook?”
Gregory Kurtzer:
There are two ways you can use Ansible playbooks, or any configuration management toolkit, with Warewulf. The first way is after provisioning: you use it to take over the configuration management instead of using Warewulf for that. To be blunt, Warewulf is very basic in regards to how it does configuration management. It's very simple and you can't do a lot with it, but it gives you that core pre-/sbin/init functionality that you need. The second way people use Warewulf and Ansible is for managing the Warewulf control node itself, or multiple Warewulf control nodes. There's a jab at HPC people like me here: I've heard a number of people from enterprises and clouds saying, “what do you mean you use SSH for nodes?”
We don't even want our administrators using SSH anymore; everything should go through something like Ansible. We've heard that a number of times. You actually can manage your entire Warewulf control server using Ansible or other configuration management and never actually touch it.
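As a rough illustration of that second pattern, here is a minimal, hypothetical Ansible play for keeping a Warewulf control node under configuration management; it assumes a packaged EL-style install (package warewulf, service warewulfd) and your own warewulf.conf.j2 template, so adjust for a source build:

    - hosts: warewulf_control
      become: true
      tasks:
        - name: Install Warewulf from a package repository
          ansible.builtin.dnf:
            name: warewulf
            state: present
        - name: Deploy warewulf.conf from our template
          ansible.builtin.template:
            src: warewulf.conf.j2
            dest: /etc/warewulf/warewulf.conf
          notify: Restart warewulfd
      handlers:
        - name: Restart warewulfd
          ansible.builtin.service:
            name: warewulfd
            state: restarted
            enabled: true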
Is There a Tool to Make Configuration Easier? [56:29]
Zane Hamilton:
Here is a question that you are going to love: is there a tool that will let you lay out groups and nodes in a visual/GUI mode to make configuration easier?
Gregory Kurtzer:
Not at this point. It is something we've talked about quite a bit. We're dealing with straight YAML, and building a graphical environment around managing that configuration file is actually not difficult to develop; it's just something that has to be developed. We have considered it and we are planning on it. We just haven't done it yet.
How Many Nodes Per Warewulf Control Server? [57:16]
Zane Hamilton:
That is all the questions we have. If there are any questions, post them quickly.
Gregory Kurtzer:
One of the frequently asked questions is how many nodes you can boot from a single Warewulf control server. The number that I recommend is 1,000 to 1,500 nodes. I don't recommend this, but I have heard of large systems with 3,500 nodes running without any redundancy or load balancing. The nice thing about the provisioning system is that it does not have to be highly available: once a node has been provisioned, you can take the entire provisioner offline.
If it goes down, all of those nodes are still provisioned; you're not actually affecting anything. It does not have to be highly available, though in some cases you may want it to be load balanced if you're trying to boot thousands of nodes at once. I have seen RFPs that dictate that hundred- and thousand-plus node systems need to be booted and provisioned within 5 minutes from scratch. Warewulf is pretty good at that in a general sense. Granted, you may have to build up your control server with a very fast Ethernet controller and/or split and bisect it between two separate broadcast domains. There are different ways of thinking about that and different ways of handling it, but you can definitely boot very large systems with Warewulf quickly.
Zane Hamilton:
That's fantastic. I don’t see any more questions. We are also over our allotted time. Thanks for your time, Greg. It's always great to hear the stories of how this stuff started and to be able to see it in action. I really appreciate it. I am looking forward to further conversation in the future.