CIQ

RCR: Provisioning: Stateless Vs. Stateful

October 20, 2022

Our panel of HPC experts discusses Provisioning: Stateless Vs. Stateful.

Webinar Synopsis:

Speakers:

  • Zane Hamilton, VP Sales, Engineering, CIQ

  • John Hanks (Griznog), HPC Principal Engineer, Chan Zuckerberg Biohub

  • Alan Sill, Managing Director HPC, TTU

  • Gregory Kurtzer, CEO, CIQ

  • Misha Ahmadian, Research Associate, TTU

  • Chris Stackpole, Advanced Clustering Technologies

  • Glen Otero, Director of Scientific Computing Genomics, CIQ


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Full Webinar Transcript:

Zane Hamilton:

Good morning, good afternoon, and good evening, wherever you are. Welcome to another CIQ Research Computing Roundtable. My name is Zane Hamilton. I am the Vice President of Sales Engineering here at CIQ. At CIQ, we are focused on empowering the next generation of software infrastructure, leveraging the capabilities of cloud, hyperscale and HPC. From research to the enterprise, our customers rely on us for the ultimate Rocky Linux, Warewulf, and Apptainer support escalation. We provide deep development capabilities and solutions all delivered in the collaborative spirit of open source. Today's topic, we are actually going to talk about provisioning in HPC, and we are going to talk about stateless versus stateful. And I know this can be a really exciting topic. I know there are a lot of very passionate views on both sides. Let's bring in our panel. Welcome everyone.

Gregory Kurtzer:

Hello everybody.

Zane Hamilton:

How is it going, fellas? I am hoping for an exciting topic. I know there are some passionate views on this, on both sides. I am excited to have this conversation. I am going to let everybody introduce themselves. John, it has been a while. Why don't you introduce yourself?

John Hanks:

I am John Hanks, Griznog, and I have been doing HPC systems for a long time. I don't know how much I should introduce myself since I have been here a lot of times.

Zane Hamilton:

It's true.

John Hanks on Stateless

John Hanks:

My stance on stateless is that in my environments, I target only having three OS installs to disk. Those are two Warewulf controller and DNS/DHCP servers, and then one server that acts as a backup should one of those two fail; it is basically just matching hardware. Then I shoot for everything else in my environment being provisioned stateless, with the OS in RAM.

Zane Hamilton:

Fantastic. Glad I know what side you are on. Alan, if you're still having a little bit of trouble with audio, I am going to go ahead and skip down to Greg. Tell us who you are. We might not know. 

Gregory Kurtzer on Stateless

Gregory Kurtzer:

I have been having this debate for 20 some years now. At this point, Warewulf has done a lot to really help with stateless installations and stateless cluster management. That was always the first thing that people would harp on: is it stateless? Is it stateful? How does it work? There was always a lot of convincing that seemed to happen in the early days. It is okay, it is going to be stable, but the operating system is not written to disk. How is it going to be stable? Because you have memory, and memory is good. Those were the conversations we would have. I am looking forward to this conversation. I think there are a lot of really great insights as well as pros and cons to doing both. This is going to be a fun one.

Zane Hamilton:

Excellent. Thank you, Greg. Chris, it has been a while. How are you? Welcome back.

Chris Stackpole on Stateless

Chris Stackpole:

Thank you. Staying busy. I am Chris Stackpole, and I work for Advanced Clustering Technologies. We are a system provider, so we will build you out a cluster and we provide a lot of utilities to help manage that cluster. We do both stateful and stateless systems. Personally, I am going to be on the stateful side.

Zane Hamilton:

All right. Keeping track over here, making sure I know who is on what side. We shuffled around. Glen, I think I know what side you are on too, but introduce yourself and give me a hint.

Glen Otero on Stateless

Glen Otero:

Hi, Glen Otero, Director of Scientific Computing, Genomics, AI, and Machine Learning at CIQ. I remember when Warewulf first went stateless; I was really excited about it. I am not going to take a side, because I think it just depends on what you are trying to do. In my last role, with my last employer, I found stateless to be great when I was in test and dev, right? I was trying different OSes and different OS versions and things like that on compute nodes, and applications, and testing. But things in production just did not seem to change all that much. That is where I am at now. It is really great when you have a lot of churn going on and different things you want to try, but in production I do not see as much of an advantage, so it depends.

Zane Hamilton:

All right, Misha, welcome back. It is good to see you.

Misha Ahmadian on Stateless

Misha Ahmadian:

Thank you very much. Good to see you all. Misha Ahmadian from the High Performance Computing Center at Texas Tech University. Currently we have Warewulf 3 running on the production cluster, and we just happen to have a new test cluster, a small cluster we put together, and I was able to install Warewulf 4.3 on it. I am still playing with Warewulf 4. On the Warewulf 3 side, you can have both stateful and stateless. We barely use the stateful; most of the time we just shoot the nodes in the stateless state. But I do not know, I still feel like if we go fully stateless, then I am missing something. Now, I know that Greg was able to convince me before that stateless is just enough for you, which I would like to hear more about today.

Zane Hamilton:

Excellent. Thank you, Misha. Alan, welcome back. I am pretty sure I know what side you fall onto.

Alan Sill on Stateless

Alan Sill:

Well, okay. I will carefully try to split the gap as Misha did. Misha is really leading our efforts here at Texas Tech. We also have some clusters that we run for the other side of my slash here, the NSF-funded Cloud and Autonomic Computing Center, where we are running some test clusters aimed at renewable energy settings, trying to explore that parameter space of how to run hardware when it is in the middle of cotton fields. Winding back a little of the history, we actually do run the production cluster in a stateful mode. As Misha said, if we reboot nodes, we will almost always choose to reshoot them to fix problems. I am a fence sitter on this. A year ago, I probably would have argued strenuously in favor of stateful, just because I do not like having to dump all the software over the network.

In our old system, we reinstalled all software after every reshoot. In the new scheme, we install all the user software on centralized resources. We have shifted the burden to those NFS servers, and we have been engaged in a long catch-up game with respect to how to provision a cluster when all the software resides on a central resource. To summarize, in the early days we used stateful because there was so much software to install on each node. Now that is not the case, but that has really shifted the load elsewhere.

Zane Hamilton:

Excellent. Thank you, Alan. I am going to start off and go back to John. John, I am going to ask you to tell us: what is stateless? What does that even mean?

What is Stateless

John Hanks:

It is a terrible term. It does not make sense anymore to say stateful and stateless. What you are really saying is, regardless of where you put your OS, you are going to reinstall it on every boot, so that after every reboot that node is a fresh install. The idea that you would put the OS in memory is super appealing to me because I like the OS to go really fast. But if you were to have a system that wrote that onto a disk partition, where every node had a disk partition, it would not matter. It is the fact that you are reinstalling from scratch on every boot that makes it stateless for me. Then the other side of it being a terrible term is that I run all my storage servers stateless. Clearly, I am not reinstalling all the data on every boot. Those nodes keep the data alive continuously; the data is always going to be there. That is where the disks are. It is just the OS that we are talking about when we say we do it stateless.

Zane Hamilton:

Thank you. Chris. I am going to let you tell us about stateful. What stateful means.

What is Stateful

Chris Stackpole:

Well, it is interesting because, using that last definition, I am almost on board with that one. The way I would usually define stateful and stateless is whether or not there is a disk drive to have the OS on. I still highly recommend that your systems are set up such that when you intentionally reboot, you get a fresh image, because that fresh image would be your golden image, which you always know is correct. You can update the image, roll out whatever changes, and make sure it is correct; then every node in the cluster has that golden image. The difference for me on stateful versus stateless is being able to boot back into the environment when there was an unintentional reboot. Being able to go back to that OS when things are going wrong, being able to go back into that environment as it was.

Zane Hamilton:

Excellent. Thank you, Chris. Coming in from an enterprise world and hearing stateful and stateless, I had a little bit of a different understanding. I am going to ask this question of Greg. How are stateful and stateless, when we talk about HPC provisioning, different from stateful and stateless when we talk about applications?

Stateful and Stateless HPC Provisioning vs. Stateful and Stateless In Applications

Gregory Kurtzer:

That is a good question. When we are talking about stateless in a high performance computing context, usually, not always, but usually, what we are talking about specifically is the operating system, and whether that operating system is written to some form of stateful media. Now, as John articulated, there are a lot of different ways to pull this all together. I have seen stateless implemented on disk-full systems such that even though the operating system was written to disk, and written persistently, on every single boot it was rewritten. It did not keep that state, or that state was not typically persistent, if that makes sense. I have seen it both ways. When we are talking about microservices, when we are talking about applications and application state, it is basically about dealing with microservices or services in a way that each one can scale independently.

To do that, you cannot maintain a persistent state for each one of those applications; each one of those applications has to be able to scale. If you are running an application, let's say in Kubernetes, and all of a sudden Kubernetes needs to scale up 10 more of these, if each one was maintaining state in memory, and that state was required to continue the lineage of that program, well, you cannot actually just start spinning up more. You now have to deal with other ways of maintaining that state. In high performance computing, we typically limit the conversation of stateful versus stateless to operating system persistence. In many cases, even though the well recognized debate is stateless versus stateful, I tend to go back more towards disk-full versus diskless with regards to the operating system. And I will give one more point on this: you can have a stateless operating system on a disk-full system that even manages disk state. You can do that with data; you do not have to do that with the operating system. The operating system could be volatile and rewritten on every boot, but each boot has an entry in fstab to mount up this local file system and put it in the same location. One boot you could be running CentOS; the next boot you could be running Rocky, or SUSE, or Ubuntu, but that data is still exactly the same because it was written persistently to the disk.
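
To make Greg's fstab point concrete, here is a minimal sketch of the pattern, a volatile OS image in RAM plus a persistent local data partition; the device name, label, and mount point are hypothetical and site-specific:

    # One time, on the node's local disk: create and label a data filesystem.
    # (Device name and label are placeholders; adjust for your hardware.)
    mkfs.xfs -L scratch /dev/sda2

    # In the stateless image (or a provisioning overlay), /etc/fstab carries:
    #   LABEL=scratch  /data  xfs  defaults,nofail  0 0
    # Every boot, whichever OS image is provisioned mounts the same data at
    # /data, so the OS stays volatile while the data stays persistent.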

Zane Hamilton:

Great. Thank you, Greg. All right, Alan, let's talk about benefits first and then we will come back and talk about the drawbacks of each. What do you see as the benefits of a stateful system in HPC?

Benefits Of A Stateful System

Alan Sill:

All right, so I am going to go back to the late Pleistocene, when we still had a one gigabit control network. At that point we had close to a thousand nodes, and shooting each one took a lot of time. The close to six year old cluster we have has a 10 gigabit control network. The two year old addition, which is most of our computational power, has a quarter of the nodes of the original setup and a 25 gig control network. While the time consideration for shooting those has gone down, we have also, as I said, shifted the pattern: the bulk of the load on each of those nodes was actually user software, and that has moved to central resources. The time to boot a node, or the whole cluster, is drastically lower than it used to be, but the original motivation was that it just took a lot of time to reprovision each node the way we were doing it before, and we did not want to have to do that every time we rebooted for some especially trivial reason, like changing the image or something like that.

Well, if you change the image, you do have to change it, but I am thinking of some more corrective action that you take on each node where you do not necessarily want to reinstall all the software. We just wanted everything that was on disk to stay there. The other thing we did on the newer cluster is we actually cut the disk space in half; we actually do not even have space for all the user software. Logically, you would say these two arguments go against each other: if you have shortened the amount of time, the amount of things you have to load to each node, why don't you go stateless? What I am trying to point out is that the items we were trying to preserve on disk had very little to do with the operating system.

If we were clever, we could have done things with partitions and so forth so that we could have had the best of both worlds. Also, I have to say, there is a certain amount of spelunking you have to do in the user support threads. People often had trouble getting one or the other to work, and I have actually never been able to discern the pattern. Pretty regularly you find things in the forums: I can make it work stateless, but I cannot make it work stateful, or vice versa. To some degree, it is a random flip of the coin which way you get it to work, and then you stick with that.

Zane Hamilton:

Great. We already have our first question: in my experience, bare metal compute nodes do not get a reboot very often. A statement, not a question. Very true. All right, I am going to open it up and let anybody else add to the benefits of stateful before we talk about drawbacks.

Gregory Kurtzer:

I do have a point on that. I have seen certain sites that have done reboots after every job that runs. As a matter of fact, I have seen some sites that have actually even reprovisioned stateless after each job that runs. Some of these are on classified networks, and as a result they really have to manage all of the state; they want a complete clearing of everything in that operating system before running another job on that system. More typically, I completely agree with the point in terms of how most people run their systems. We actually had stateless systems running at the Department of Energy that, in some cases, we could not reboot just because of very long running jobs and user requirements. Sometimes these things were up for, I hate to say, a year. Everyone is going to wince at the kernel bugs and vulnerabilities that we might have had, but there have been times in which we have approached that. It does happen under some circumstances.

Zane Hamilton:

Great. Chris, I think you had something.

Chris Stackpole:

Yes. Coming from my background here, where we are dealing with customers calling us, it is "my cluster is having a problem," and there is a huge difference between a cluster having a problem where the node just disappears and we have no idea what is going on, versus being able to reboot back into the OS as it was and look at the log files. I can see a lot more information there. We have so many examples of software that really expects the OS to be in a disk form to be able to manage those files, versus coming off of an NFS share, which we see a lot for the stateless systems that have one image being mounted from somewhere. One of the more recent issues that we saw was with Relic-6; they changed something in NetworkManager and how it works with dracut.

It was just puking because it expected to be able to write files in /etc, which was being mounted read-only from the file share. We see this a lot with different things; systemd has a lot of those issues as well. One of the benefits of having the OS on disk is that it does not matter if every node is writing changes to /etc with systemd, or you want to capture all the log files in every detail. Especially when we have seen a job that was causing a broadcast storm across the network, it did not really impact the nodes that had a local OS, in terms of interrupting the OS's ability to function. There is a lot of value in that localization of having it all on the system so that it can just keep doing what it needs to do.

Now, some of those obviously are going to change because it depends on how you are doing stateless. I have heard people mention where they are moving the entire OS into memory, so you are giving up a chunk of your memory for it, however big that image is. Some people run really slim, a gig or two, just the bare minimum, while other people pack it with every driver that they need and all of their software configuration. Then you are chewing up, what, 10 or 16 gig of memory. That can add up depending on what kind of jobs you are running, but you solve those problems of having a localized OS too, with the exception of saving off log files. That can be gotten around by doing remote kdumps and having a remote log server, but now your complexity is going way up too, because now you have to have servers to capture all that data.

There is simplicity in just having an OS that you provision on your own time schedule, for security patches or whatever, and then you treat it as cattle; it is still disposable. You can destroy it and build it out again on a whim, but it is still isolated for any type of troubleshooting. That really helps somebody like me who is coming in after the fact. I do not know the system; I am just trying to troubleshoot what is going wrong. That tends to be why I like to have those stateful systems.

Zane Hamilton:

John, I think you've had something you wanted to add.

Workarounds for Stateless Problems

John Hanks:

Yes. I would say a lot of those points that Chris just made where stateless is a problem can be worked around fairly straightforwardly. The one about memory, though, that is a case where I think Greg will back me up on this: I push this about as extreme as anybody could possibly hope to push it. I boot a 16 gigabyte image; a slim OS image in memory, to me, would be adorable. I boot an image that, once it unpacks, is at least 32, and I set aside at least 64 gigabytes of memory for that OS to unpack into. All my nodes, though, have local NVMe drives, which get configured as swap. The minute a job needs any of that OS memory, it swaps out to that NVMe and we do not worry about it anymore.

The memory impact is effectively negligible at runtime because the OS will swap out to the swap space pretty rapidly if there is memory pressure. Because we build our nodes for life sciences, I think my smallest node probably has 256 gigs of RAM, so this is not a huge problem for us. We do not have very many of those; usually nodes are 512 gigabytes or one terabyte, which is the minimum spec we would have for nodes. That problem goes away. Logging and stuff, I realized as he was describing it that I also cheat. I almost never do root cause analysis of a failure. I reboot, upgrade, oh, that fixed it, and I move on. I never ask why something failed. I get to avoid that kind of complexity by just ignoring that root cause analysis is an actual thing that people do.
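
For anyone who wants to copy John's arrangement, here is a minimal sketch of backing an in-memory OS with local NVMe swap; the partition name and swappiness value are placeholders, not something the panel specified:

    # Turn a spare local NVMe partition into swap so the tmpfs-resident OS
    # image can page out under memory pressure (device name is hypothetical).
    mkswap /dev/nvme0n1p2
    swapon --priority 10 /dev/nvme0n1p2

    # To make this survive stateless reboots, put the equivalent line in the
    # image's /etc/fstab or a provisioning overlay:
    #   /dev/nvme0n1p2  none  swap  sw,pri=10  0 0

    # Optionally tune how eagerly cold OS pages are pushed out to swap.
    sysctl vm.swappiness=60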

Gregory Kurtzer:

I love that.

Chris Stackpole:

It is harder when you are on a system that has a smaller number of nodes and not a huge budget, where all those nodes really matter, and you have one that just keeps rebooting suddenly without any explanation, and you are trying to actually figure out: is it a hardware problem, is it your job? Those types of things.

Gregory Kurtzer On The Benefits of Stateless

Gregory Kurtzer:

One of the reasons why I really became a fan of stateless early on was because I was maintaining fairly large clusters with a fairly small crew. It used to be that a single system administrator was expected to be able to maintain 50 to 100 servers, a hundred if you are really good, right? And these were all kind of pets, back in the pet era. Now everything is more towards cattle. When we started doing high performance computing, all of a sudden this was going up to 500 to 750 nodes for one engineer. The scale just got massive. Now, how do you deal with this? John, your point makes me laugh. It is accurate, right? In many cases, we just do not care about the root cause conditions of something.

By the same token, to get back to my point, one of the things I love about stateless is that all the nodes are the same, or at least all the nodes within a group are the same. They are not only the same, they are identical from the software stack; there is no variation, zero. If you have a problem, node 300 is bombing out, it is rebooting, it is seg faulting, kernel panicking, whatever, well, if the neighbors are working, it is hardware, period. Take that node out, send it back to the vendor. I have done this, and this is how I was able to survive, and not lose my mind any more than people may already say I have, in a highly oversubscribed situation where there were just so many nodes I could not possibly maintain them all.

I have told this story before. Chris, I do not know if you have heard this one; John, I think you definitely have. There was a time in which we ran an application across the cluster. It was a very performant application, tightly coupled. Every time we ran it, it always failed. It always failed on one node; we would get an application layer seg fault on one node in a stateless cluster. Now, this sounds almost unbelievable, that it could be a hardware issue, but there was no other possibility, right? No other node exhibited this exact symptom. It was just that one node, and it was always that one node. If we left that one node out, the job ran to completion. But you add that node back in, and that was the only application that would do it. We were able to at least find the node and we sent it back to the manufacturer; they sent it back to us and said it is working just fine.

We sent it back to them again. They sent it back to us. The third time we sent it over to them, I think it was the third time, they said, well, we put everything through its paces. The only thing we found that was off was that the power supply was not giving quite enough current at full load. We did replace the power supply, but there is no way that can cause an application layer error. That was it; that was the problem. The power supply was not giving quite enough juice to that one node. Now, if this was not stateless and I was doing the normal process for debugging an operating system, I may never have figured that out, just because I am stubborn. I probably would have just sat there and continued to think there was some sort of weird software error or a memory error or something.

I never would have guessed the power supply. Because I knew all of its neighbors worked perfectly, I knew it had to be hardware, something associated with that hardware. We did try network ports, by the way, and other things as well, because we were stumped on this one. That is an example. It is an extreme example, but it is an example of how having the exact same operating system with no state and no differences between any of the nodes is really, really beneficial. It makes it easy to maintain and to scale.

Misha Ahmadian:

I would like to add on top of that. It is a good point that you made about using stateless: if we know that all the nodes in the cluster are the same, you start provisioning them all into the same situation. However, we are running into this type of problem here, where some of the nodes just go down. They just hang, and you do not have access to their logs after you bring them back up in the stateless form. For those types of situations, we just do not know what to do; we do not know where the problem is. Anytime we bring them back up stateless, you just see fresh logs again once you reinstall the operating system. Now, I know that there are some workarounds; you can make syslog write to an NFS shared location or maybe send it over the network.

But I believe in a data center with thousands of nodes writing to a central location, each node keeps writing messages, almost every second. It could saturate the network somehow, or that central location and its hard drives could be an issue. That is one of the ways stateful is helpful: when we want to debug a node, we put it in stateful mode and then reboot it. If the issue happens again, we can look into the logs from that machine.

Gregory Kurtzer:

We have solved this using kdump and whatnot going to a remote system; we can catch kernel cores and so on. We also do remote syslog. Your point is really good: if you have thousands of nodes with a lot of syslog messaging coming through the pipe, you are adding interrupts, you are adding context switches and whatnot. That does affect performance. Absolutely, there are ways of managing that and doing it more efficiently, but the benefit you get from it, I think, is well worth the cost in network bandwidth, and even on a stateful system I would actually suggest doing that. As a matter of fact, at the Department of Energy, we had to do that because we had to collect syslog for all systems and all nodes and hand that not only to our control server, but actually hand that to the security team. Security got crash logs for every system on the DOE network that we were on.
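
A minimal sketch of the remote syslog setup Greg and Misha are talking around, assuming rsyslog on an Enterprise Linux style image and a collector called loghost.example.com (a placeholder name); rate limiting, TLS, and remote kdump targets are left as site decisions:

    # /etc/rsyslog.d/forward.conf on every stateless node image:
    # forward all syslog to the collector over TCP (@@ = TCP, @ = UDP).
    *.* @@loghost.example.com:514

    # /etc/rsyslog.d/collector.conf on the central log host:
    module(load="imtcp")
    input(type="imtcp" port="514")
    template(name="PerHost" type="string" string="/var/log/remote/%HOSTNAME%.log")
    *.* action(type="omfile" dynaFile="PerHost")

    # Restart rsyslog on both sides after dropping the files in place:
    #   systemctl restart rsyslog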

Chris Stackpole:

I think some of the points you are making are absolutely good, and it is what I would do with my systems. The economy of scale here is what I think is being missed, because the customers that I talk to most often are a researcher with his graduate student of the semester, not admins, and they have 20 nodes, where every year they buy five new nodes and dump the old five. There is no standard between the nodes; it is just however much they can get with the budget they have been given that semester or that year. When you are talking about that, where you now have to stand up a whole bunch of different services for the remote logging and all those other aspects, you are asking for a much higher burden when they do not have a dedicated sysadmin. That is not an uncommon use of clusters in HPC; you see sub-50-node systems being used at smaller scales all the time. When you start scaling out to hundreds, oh, absolutely, that is what I have done on most of my systems, even the ones that did not get to a hundred. But I was also the dedicated admin with the skills to build that. I think stateful makes it a whole lot easier for those users who do not have that skill set or dedicated staff.

Gregory Kurtzer:

Two questions. First, would it help if a lot of this came preconfigured from a vendor? Second, if they do not have the skills necessary to set up remote syslog, do they typically have the skills necessary to debug those syslogs?

Chris Stackpole:

That is a great setup to plug my company, Advanced Clustering, thank you very much for doing that. Yes, I mean, that is what we do. We have an application, ClusterVisor. If you are at Supercomputing this year, we are going to have a big demo of it. We pre-build and configure all of it so that when you designate a restart, you get a fresh new image based off of a golden image; we pre-prep almost all of that. Then we use a Cockpit interface to give them a web interface to do most of their administration. When they just need to add a user, it is a couple of button clicks. We do a lot to make sure that it is very simple for our users. I know we are not the only ones that do it, but I think that is where companies like the one I work for really shine, because we are not targeting the shops that have thousands of nodes; those shops are going to have dedicated admin staff. We are targeting more the people who just have the graduate student of the year.

Alan Sill:

We used to do that, and about three years ago we just declared an end to it. If you have money, you can buy resources out of our current system, or you can join a purchasing pool for the next upgrade. We have a standard configuration. We have a few cases where people have noisily lobbied for special circumstances, high memory nodes or long run times, stuff like that. We always find that when we accommodate those, the result is wasted CPU cycles, so we just do not allow it anymore. As a result, going back to Greg's point, the nodes are identical. I mean, we shoot them with the same image; whether we keep it on disk or not is our choice. You have almost convinced me not to. We will see how Misha does with his new test cluster.

John Hanks:

I would point out, I manage extremely non-homogeneous clusters, to the extent that I rarely have more than about six nodes in any given group that are the same hardware phenotype. I do not think varying hardware is a blocker to doing things statelessly. At least in my case, the benefits I get from it, which are not necessarily intuitive but which I take great advantage of, would outweigh a lot of additional pain if I had to go through a lot of additional pain. Probably the biggest one is that, with some introspection, I realized early on I do not like to write documentation. I loathe writing documentation. I do not take notes. I write nothing down. The fact that everything in my environment has to install at boot means that somewhere I have a script that is literal, executable documentation for installing that thing, and that is checked into a revision control system somewhere. Anybody can come and follow in my footsteps and see, there is the script that does that thing. Man, what a good document writer he was. That alone is worth the price of entry to doing everything statelessly.

Gregory Kurtzer:

Can anyone guess what industry Griznog works in based on his language choices? I have never heard phenotype being used for hardware before. That was cool.

John Hanks:

I paid a lot of money for those biology degrees and I am going to use them.

Gregory Kurtzer:

Awesome. I have a story about Glen. When I first met Glen, this was about 2000, 2001. I mean, this was ages ago. I reached out to Glen because he wrote an article in Linux Magazine or Linux Journal, and it was in print; you could go to the bookstore and get a copy of it. It was talking about Rocks clusters, if everybody remembers Rocks. He wrote this whole article on it, and I said, I just created this thing called Warewulf and I would really love your take on it. I reached out to Glen about that. I do not know the exact order of operations, but shortly thereafter Glen comes over to the lab, we sit down, we meet, and we talk about all this stuff. I have to say, for 20 some years, I thought I had convinced him. I did not know he was still on the fence. I thought he was an advocate. I am just looking at Glen going, oh my gosh, dude, you had me going for 20 years.

Glen Otero:

I have not been on the fence all 20 years. 

Gregory Kurtzer:

Oh, something recent.

Glen Otero:

Something recent. Yeah, yeah.

Gregory Kurtzer:

Oh, okay, let's have it.

Glen Otero on Stateless vs Stateful

Glen Otero:

No, like I said, it was just because a lot of the stuff I did, whether it was using Rocks or OSCAR or openMosix or, you know, name your tool, was always testing and validating things, benchmarking stuff for customers, or getting them up and running quickly. It was always just high churn, and stateless was just a godsend at that point. Because otherwise, like everyone was saying, I was building clusters on hundred megabit networks 20 something years ago. Stateless sounded like a great idea. The golden image was the currency of the kingdom, but, like everyone has said, it would just take forever to reinstall even a dozen nodes. Like I said, more recently, particularly with heterogeneous hardware with multiple phenotypes at my last job, stateless meant being able to move quickly, not having multiple images that need upgrading, and just being lighter weight.

Stateless was great then, but once I tested everything and threw it over the fence, they would fossilize it, put it into something stateful, and just run it in production. That is why I say it depends. I prefer stateless for the OS in memory and local disk for scratch. And, like Chris said, what you do with your logs depends on your size, or on whether you care, like John, about root causes at all or just move on. I have not been deceiving you this whole time, Greg. I just want you to know that.

John Hanks:

You actually reminded me of another thing that I take a shortcut on. I realize I get away with a lot because I would never deceive my users into believing I am running anything worthy of the name production. Everything I run is a test system. That is all I run are test systems. I do not have to deal with this. Maybe that is why I can avoid it. I never have to deal with production systems.

Chris Stackpole:

At my previous job, we were researching and we claimed that we did not have five nines, but we had nine fives.

Zane Hamilton:

There you go, Alan.

Alan Sill:

I want to go back to something Chris said. I will preface this by saying, when you watch a computer boot, it is like watching the history of computer science unfold. You watch all these layers, it is, oh, am I enough of a biologist to do this? Probably not. Was it ontogeny recapitulates phylogeny?

Gregory Kurtzer:

One of my favorite sayings ever.

Alan Sill:

Yes, you watch a machine boot and you are just watching lizard forms of computing evolve into chicken forms and into something resembling the current system. A lot of this is being reexamined with open source firmware, a lot of attempts to make this whole process easier, and all of it goes away in the cloud, except it does not, because you watch the boot process there and it is still doing this crap. For those of us who have had to flip the toggle switches on a PDP-11 to get it to the point where it would then pick up and read some medium, and I can go back further than that if you want me to, you get the idea that booting is something to be avoided, right?

It is a risky endeavor. You never know if it is going to hit something. That is, of course, a completely obsolete line of thinking; it is a question of what you mean when you say you are booting. Now, I bring this up because Warewulf specifically has features in the newest versions, and here I am serving this up to you, Greg, where you can isolate things that you used to think of as part of the operating system install process, right after the boot process gets to the point where it can do that. Maybe we should talk about those for a bit. What are some of the features you would like to stress that make Warewulf easier to use? I am thinking of the layers; I have forgotten your terminology, actually.

Gregory Kurtzer:

The overlays.

Alan Sill:

 Overlays. Yeah.

Gregory Kurtzer:

There are a few points that were brought up that I think this ties into. We have talked about heterogeneous systems, but in many cases you will have heterogeneous hardware while trying to present similar interfaces. Again, to take this back into biology, it is like different alleles with similarly expressed traits and characteristics in terms of the interfaces and usage of what these HPC systems are doing. Was that good? Hopefully that was good. With Warewulf you can do this, right? You can have different kernels, different kernel features, different boot options and whatnot, as well as different images or different image versions going out to different nodes, and, to Alan's point, even just different files or groups of files that are changed per node via templating. You can create templates that basically say, okay, this node is going to do this, or it is going to have this service active, or in this configuration file it will have this section.

However, on this other node, via a template, you can have a different configuration. You can very specifically express the different features or pieces that you want for a particular piece of hardware or a particular role. You could have I/O nodes, you could have different forms of storage nodes, you can have parallel storage nodes. You can have big memory nodes, you can have GPU nodes, different types of GPU nodes. It is very good at doing all of this. At the same time as scaling out a single image to thousands of nodes, you can also subtly change that image or use completely different images for different subsets of nodes. It is very flexible from that perspective.
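
For concreteness, here is a minimal sketch of the per-node templating Greg is describing, assuming a Warewulf 4.x wwctl command line (flag and subcommand names have shifted a bit between releases, and the overlay, container, and node names here are hypothetical):

    # Create a site overlay and add a templated file to it.
    wwctl overlay create sitecfg
    wwctl overlay import sitecfg ./exports.ww /etc/exports.ww

    # Files ending in .ww are rendered per node with Go template variables,
    # so one overlay can produce slightly different files on each node, e.g.:
    #   # rendered for node {{ .Id }}
    #   /scratch 10.0.0.0/16(rw,no_root_squash)

    # Hand different images and overlays to different groups of nodes.
    wwctl profile set gpu --container rocky8-gpu
    wwctl node set io01 --container rocky8-nfs
    wwctl overlay build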

The other thing I just wanted to mention real quick, to come back to this, is that there are multiple ways of doing stateless. One is something like an NFS root; Chris, I think you were alluding to that earlier. Then there is another one that you alluded to as well, which is basically using RAM, using memory to store and run your operating system. That is the way I always prefer to do it. It does have a negative side effect, which is that you are consuming expensive RAM where normally you want your applications to use that expensive RAM. The fix, or the workaround, however you want to look at it, is that if you have local storage, which I am assuming we do, because otherwise we would not be having that particular debate for that hardware, you just set up swap. The kernel is actually incredibly good at paging tmpfs file systems out to swap.

Over time, of course, you are going to have a little bit of, not bottlenecks, but some contention, as the memory holding your operating system gets swapped out. Once that happens, what you find over time is that all of the critical pieces of your operating system that get touched often end up staying in RAM in a small footprint. All the other stuff, which is 99% of your operating system, ends up heading over to swap, basically residing on your spinning or flash disk. At the end of the day, you are not actually using that tmpfs space. Of course, as soon as you reboot that system, all that swap space and memory is gone.

You are reprovisioning, and that is a load on the network. At this point, and I think Misha brought this up as well regarding the network, a lot of this just has to do with the system architecture that you are building. A lot of people I know who are building these have an ethernet boot and management network, and then they have InfiniBand or some other form of network, and they do the same sort of thing with storage, right? They may have local disks or they may not, but they may also have NFS for home and Lustre for parallel storage and data storage, and the Lustre may actually be communicating over InfiniBand. There are a lot of different ways to slice and dice this, but if you have more than one network, that ethernet fabric really becomes the management plane.

In many cases, you are not actually consuming your data plane to do any sort of file system transfer or anything. It all depends, again, on the system usage. In many cases, if you are thinking about doing a stateless system, you want to make sure you are physically building your system in a way that supports that in an optimized fashion. And if you have not, stateless systems are pretty much the same architecture as what most people are building anyway; maybe some subtle differences, but not much.

Chris Stackpole:

I think that design actually makes a lot of sense for stateless, where you are doing the caching and you have a large, fast swap. That makes sense. I mean, that is very much the way Griznog described his setup, with the NVMes, and then having that dedicated network for that traffic. Where we have seen that really not go so well is in those classified environments where everything that touches a disk has to have encryption at rest and everything else, so they just do not put disks in their nodes at all. Those ones are fun for a whole different reason.

Gregory Kurtzer:

There is a way, Chris, of doing stateful with no disks. Have you gone that route?

Chris Stackpole:

No. I did look into that a while back. There was just way more network traffic when we were trying to deal with that than what we really had the bandwidth for at the time. It just was not practical for that particular setup.

Gregory Kurtzer:

Gotcha, gotcha.

Chris Stackpole:

But yes, I have looked into that before.

Gregory Kurtzer:

Usually that boggles people's mind. I am surprised nobody stared at me blankly or yelled at me over that one.

Zane Hamilton:

No. It brings up an interesting point, Greg, because looking back through enterprise, I have built thousands of web servers. My next question is, where else could this be used? Where else could stateless be used outside of HPC? Because I think of all of the servers I have installed, and they are all identical, and the only thing I am doing is putting a different config for a web server on them. Why in the world am I spending the time to install an operating system on those things every time? Is there somewhere else in the enterprise where that would be useful?

Where Can Stateless Be Used Outside of HPC?

Chris Stackpole:

I think we are seeing that more, actually. I think we are seeing it because people are spinning up the same node type for a Kubernetes cluster, or the same node type for an Apache web cluster, or however they are doing it. It does not necessarily make sense to do pet-type installations when you treat them more like cattle, where you do not really care, and whether you are provisioning an OS via kickstart or something on the fly, pushing out a golden image, or doing Warewulf stateless really does not matter as much. I also think in the enterprise you are more set up to have remote syslogging and remote kdump, with all these tools already built and managed by either somebody else or at least a team of somebody elses.

The other thing is that it makes it a whole lot easier to manage a group of systems like that when your security team comes to you and says, we need this type of stuff patched, we need to hit this level of whatever your security benchmark is. Creating that one image, making sure it is deployed everywhere, and then having something like Ansible, Puppet, or Chef make the alterations to put the right config on the right system, all of that. I think we are seeing that happen a lot more in the medium to larger sizes, especially as people who went to cloud realized, oh man, that is so expensive, we have to bring some of this back in house. They are already used to this DevOps type of thing where they were treating those VMs just like cattle. Can we do the same thing locally? I think, as we see more of that shift back into localized data centers, some of those DevOps practices are going to pull in a stateless OS.
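
A rough sketch of the pattern Chris describes, one golden image everywhere with a configuration tool layering per-role differences on top; the inventory group, template, and service names are placeholders:

    # After identical nodes boot the shared golden image, push role-specific
    # config with an ad-hoc Ansible run (group and file names are placeholders).
    ansible webservers --become -m ansible.builtin.template \
        -a "src=templates/httpd.conf.j2 dest=/etc/httpd/conf/httpd.conf"
    ansible webservers --become -m ansible.builtin.service \
        -a "name=httpd state=restarted"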

Zane Hamilton:

John, I know you had something,

John Hanks:

I am debating whether I should say it or not. Anyone who knows me knows I despise large IT orgs for many, many reasons, but one of the things I always like to point out is that a CIO, when they are looking at their next job, is not incentivized to do things efficiently. They are incentivized to have a large staff and a large budget so that their next job has a larger staff and a larger budget. A lot of IT orgs evolve into this situation where they are not hiring people who necessarily know what they are doing; they are getting as many bodies on the org chart as they possibly can. Doing something stateless like what I do, where my entire infrastructure is stateless and I try to literally provision everything stateless, requires, for better or worse, knowing what I am doing.

The idea that I could somehow have this system with 30 or 40 IT staff with their hands in it is completely ludicrous. There is no way this would work in a large org. A lot of the people I talk to who run stateless setups do it because of short staffing, because they do not have a lot of bodies to do all that IT org enterprise stuff. You can even do DevOps on a stateless HPC cluster with a couple of people, or you can do DevOps with a giant DevOps team and Kubernetes. Which one is the CIO going to pick? They are going to go with the large team, because that makes them a more famous CIO. I think that is why you do not see stateless, the way we do it, in the enterprise.

Zane Hamilton:

That is interesting, and I had not thought of it that way. To Chris's point, when you start pulling things back in from the cloud, I still see enterprises having to put tools on them. That generates more of what you are talking about, John. You have another tools team that is putting on all the monitoring and the security pieces. So while you may be treating them like cattle, you are still doing so many things to treat them like pets.

John Hanks:

The number one way to make cloud look cost effective is to mismanage your local on-prem resources. Nobody does that better than a large IT org.

Zane Hamilton:

That is fantastic.

Alan Sill:

No cattle or pets; that is all decade-old terminology. You know, my background is in grids, and we treated worker nodes like insects, right? Provisioning them was out of scope. Now, with cloud, it is in scope and you do have to think about it. I guess I want to mention a couple of special cases, and maybe Misha can chime in with some more. Typically on an on-prem cluster, anyway, and also probably in some cloud situations, you have to have a few special nodes. They may be ones that you would like the ability to reshoot, but they do have some special conditions. Login nodes are good examples, because they will have definite IP addresses that are unique to each node, even if they are part of a login pool.

Then there could be, well, we have not gotten around to provisioning our BeeGFS nodes in our new setup identically that way. There are nodes that you basically do not want reshot. Even in a situation where your worker nodes are identical, you may have some special purpose nodes that you nonetheless want to manage through Warewulf. Misha, can you think of any other considerations? What is going through your head as you try to make this decision? Because I am not going to make it; I am going to make you make it.

A Stateful Feature In A Future Warewulf Release

Misha Ahmadian:

I think the new form of stateful is going to change the design anyway. I do not think we would use Warewulf to provision NFS storage or any type of storage node, or any node that is required to stay the same with the same data all the time. I think in the future Warewulf will just target a certain type of node, which is going to be the login nodes or the worker nodes. I still think stateless makes sense in our case, unless we run into some issue where we really need to move to stateful, since there are some workarounds to keep the log files and maybe make the disk partitioning stable across the nodes. These are the things we would always like to target: make sure the nodes always have the same partitioning layout and the same size of swap space, so we do not have to deal too much with the memory. Then I think we should be fine, unless we run into some other problems.

Now, my question for Greg, maybe, would be: since we had this conversation here, would you consider having this stateful feature in a future release of Warewulf or not?

Gregory Kurtzer:

Believe it or not, this is one of the most common questions that we get, one of the FAQs about Warewulf: Warewulf is really good at, and built around the idea of, provisioning stateless, but can we do stateful? We are thinking about how best to do it, and whether it fits. There is a whole Pandora's box that we open as soon as we decide to do that, because there is doing it, which actually is not that hard, and then there is doing it right, which is really hard. We are trying to balance and figure out the best way of approaching something like that. Most people that we have spoken with really like the idea of doing stateless, for a couple of reasons.

One is something you said, which made me think of this: you mentioned you probably would not do stateless for NFS servers, or, I will be more general, file servers or some sort of I/O. What if the operating system itself, the container that you are using to provision out, was a preconfigured Lustre OSS, right? Let's say we had the entire Lustre stack already pre-created in containers for Warewulf, such that you can now go provision, whether it be NFS I/O, Lustre I/O, GPFS, WEKA, you can choose whatever file system you want, and expose that through Warewulf to the rest of the cluster. You can do stateless across the entire thing. To Griznog's point, you can basically take your entire cluster of thousands of nodes and only have one system actually installed in the entire cluster.

On the entire thing, there is only one hard drive being used for operating system state and management. That is actually exactly how we did it at Berkeley. We had 3,500-ish nodes, I think it was about 3,500. Our Lustre I/O nodes, NFS I/O nodes, everything was 100% stateless. That was not today's Warewulf, so it was not containers, but we had VNFS images for every piece of our stack, and we could take those VNFS images and hand them out to any nodes dynamically, even reshuffling how the cluster works and operates.
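
As a rough illustration of the idea Greg floats here, provisioning even I/O servers from pre-built images, a minimal sketch using Warewulf 4's container workflow; the registry path, image name, and node name are hypothetical:

    # Pull a pre-built OS image (for example, one with a Lustre or NFS server
    # stack baked in) and assign it to a storage node. Names are placeholders.
    wwctl container import docker://registry.example.com/site/rocky8-lustre-oss lustre-oss
    wwctl node set oss01 --container lustre-oss
    wwctl overlay build
    # Repeat (or use a node range) for the remaining storage nodes, then power
    # cycle them through your BMC tooling so they pick up the new image.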

One point, and I know we are getting near time, so Zane, I am going to use this as my closing notes; do not call on me for closing notes. Stack was responding to a question about other areas in which this type of cluster could be used, whether it be stateless or just clustering in general. We have seen Kubernetes now, which is a big one. I remember the first time Guitar Center and Musician's Friend, anyone who plays instruments knows them, reached out to me and told me that their entire web infrastructure was running on Warewulf. This was almost a decade and a half, two decades ago; this was a while ago, but that was super cool to hear. The one thing that was really surprising to me, really surprising, was this: I always thought that when we think of provisioning stateless, what we are really doing is imaging, right? We are taking an image of a drive or an image of an operating system, we are pushing that out to compute nodes, and we are running that image. What was interesting to me was I thought HPC was the only industry that did this. It turns out we are not. Clouds do this underneath the interfaces that we see. Most clouds I have now spoken to, in terms of underlying architecture, as well as very large IT infrastructures, are doing imaging. They are doing it because it is easier and it makes more sense. That may be a follow-up conversation as well: imaging versus doing something like config management.

John Hanks:

I would like to just drop in a data point for doing stateless management of ZFS NAS servers. At my previous job we had a little more than 60 petabytes of ZFS NAS servers scattered between the US, the UK, and Europe, all managed stateless from a single image. It worked fantastically well, and there is absolutely no way we could have done that with any other method, because we only had three admins for the entire setup, and that was on top of the cluster and all the other things we had to do, GPU nodes and everything else for all those sites. I would say there is no reason to be afraid to run storage servers stateless; it works fantastically well.

Zane Hamilton:

Thank you, John. And we are up on time. I know we had one comment, kind of a question, that could lead into a whole other topic, so we will do it very quickly: bandwidth as a constraint when doing a hybrid, like having an image server in a rack and controlling all the network flow within the rack so that you are copying from a server in the same rack. Thoughts on this?

Chris Stackpole:

I think it depends on how you want to do that. I think the easiest way is to have one image that is just broadcast out, but you are going to hit a lot of resource constraints because everybody is going to be hitting that one pipe for different reasons. The tool that we have built uses multicast, so I can blast out to as many nodes as I want, all the same image, and it does not matter if it is over a one gig pipe or not, because everybody is getting the same bits. Then Rocks used to do it with BitTorrent: every node would start up a BitTorrent client and serve its own packages, so nobody was ever really impacted by waiting on the head node to give it information. I think that goes to cluster design, and how you want to move your images around is going to be a whole different topic, because it depends on whether you are pushing out a kickstart and doing a fresh install every time, pushing out a single container image, or mounting it over NFS. All of those are different ways of doing this, and the bandwidth constraints are going to be different for each of those scenarios.

Gregory Kurtzer:

I want to add to something Chris said; I agree completely, by the way. In response to this, bandwidth is a constraint, but typically there is just not enough data, even when we are provisioning large images, to require doing that segregation by rack. As a matter of fact, I would suggest doing it at a larger scale than just one rack. I know from a management perspective it is always nice to have a top of rack provisioner, or to provision out a provisioner on demand, or something along those lines. But typically a rack is, at most, what, like a hundred nodes if you are super dense. That is nothing for one provisioner. As a matter of fact, you do not really start seeing constraints until you start hitting about 500 nodes.

Those constraints are typically not bandwidth; instead, you get DHCP issues and UDP issues on the TFTP streams. That is where you actually see failures first. The first thing to do, at least in my experience, is to manage your broadcast domains, and managing your broadcast domains feeds into the architecture of your system. I would actually say 500 is only a problem if you are booting all the nodes at exactly the same time, literally exactly the same time. If you stagger them a little bit, you can actually get up to thousands of nodes with one provisioner without any problem. Your networking people would probably balk at that, again in terms of broadcast domains. An easy rule of thumb is 1,000 to 1,500 nodes per control server, and you can easily go more than that, but that is the number I usually tell people, and they usually double it just because we are HPC people.
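
A minimal sketch of the staggered power-on Greg mentions, assuming nodes with BMCs reachable over IPMI; the host list, credentials, and delay are placeholders:

    #!/bin/bash
    # Stagger node power-on so thousands of nodes do not hit DHCP and TFTP at once.
    # bmc-list.txt holds one BMC hostname or IP per line (placeholder file).
    while read -r bmc; do
        ipmitool -I lanplus -H "$bmc" -U admin -P "$IPMI_PASSWORD" chassis power on
        sleep 2   # a couple of seconds per node keeps the provisioner comfortable
    done < bmc-list.txt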

Zane Hamilton:

Great. Thank you, Greg. We are actually over on time, so I am not even going to ask for closing remarks, guys, sorry. I just want to make sure I give you your time back. We have asked a lot of you to be here, and we really appreciate it. For those of you watching, please like and subscribe, and we will see you next week. Thank you for coming.