CIQ

Working Rocky Linux Migration Best Practices

February 3, 2022

Webinar Synopsis:

Speakers:

  • Zane Hamilton, Director of Sales Engineering, CIQ

  • Neil Hanlon, Solutions Architect, CIQ

  • Forrest Burt, High Performance Computing Systems Engineer, CIQ


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Full Webinar Transcript:

Zane Hamilton:

Good morning, good afternoon, good evening, wherever you are. Thank you for joining us for the second in our series of webcasts from CIQ. We appreciate you spending your time with us. Today we will continue our discussion about migrations, and we have some other things to update you on. Forrest, I appreciate you joining me today. Neil will be here shortly. Last time, we went through and talked about a migration script and what that looked like.

This time, we want to focus on what a migration looks like for a small business or a medium business: how migrations take place and why you would do a migration. A lot of times, we find that people aren't just migrating from one operating system to another; they may be migrating from one environment to another. They may migrate from on-prem to the cloud or on-prem to on-prem. Maybe you have a new site you are trying to build. We will dive into some of those topics here in a little bit. We also want to let you know who we are: we are CIQ. We are focused on the HPC community and on an entire stack for HPC 2.0, and we have several projects: Warewulf, Rocky Linux, Singularity, and some others. We have a lot of enterprise customers who are interested in Rocky as well. We appreciate you guys joining. I'll let Forrest introduce himself.

Forrest Burt:

Welcome everyone, my name is Forrest Burt. I'm a high-performance computing systems engineer here at CIQ. I work a lot with the user side of some of our high-performance computing stacks and I work with some of our high-performance computing operations. I'm excited to be here and discuss migrations as they relate to HPC.

Migrating HPC [01:27]

Zane Hamilton:

Let's talk about migrating HPC. I know you have been doing a lot of work on that lately and on how that migration goes. A lot of the people you have been talking to are CentOS users. They have the desire to move off of CentOS. They are looking at Rocky, and a lot of them have actually been moving to Rocky. I think from the compute node perspective, it's a lot easier. I think there are a lot of challenges and a lot of things that need to be taken into account and looked at on the head nodes. How is that going? What does that look like? Can you help us from a best practices standpoint?

Forrest Burt:

Essentially, in any high-performance computing cluster, you are going to have more or less two different types of nodes. There are other things that you can look at; some places have dedicated nodes for data movement or dedicated nodes for things like Grafana. Primarily, though, the two things you are going to be looking at in most HPC clusters are a head node, or multiple head nodes, and the compute nodes, which actually run your computations. Zane, as you alluded to, the compute nodes are typically pretty simple to migrate. Compute nodes are typically just nodes that receive their configuration through something like iPXE booting.

In some cases, your head node will have some type of provisioner, which essentially allows you to all at once affect what is going on with the configurations of those compute nodes. Essentially, because those compute nodes don't have anything usually stored on them, they are basically just for providing resources for computational jobs. When you are migrating those from CentOS to Rocky Linux, it is typically a pretty simple operation. A lot of places probably already have a system image or something like that, which is spooled up. It is their generic compute node image, which is served out to compute nodes when those need to be reset. Migrating your compute nodes on a cluster is basically as simple as changing those images you are using for different compute nodes there.
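
To make that concrete, here is a minimal sketch of what that image swap can look like if the provisioner happens to be Warewulf 4; the image source and profile name are illustrative, and other provisioners have their own equivalents.

    # Import a Rocky Linux 8 image alongside the existing CentOS compute image
    # (the registry path here is just an example).
    wwctl container import docker://docker.io/rockylinux/rockylinux:8 rocky-8-compute

    # Point the default profile, and therefore its compute nodes, at the new image.
    wwctl profile set default --container rocky-8-compute

    # Rebuild the overlays, then reboot the compute nodes so they provision from it.
    wwctl overlay build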

I know a lot of places have different types of compute nodes: potentially compute nodes that are just CPUs, or that have GPUs on them. Some may even have a lot of RAM for high-memory jobs. There are obviously different considerations there. You will have to make sure that your GPU drivers and things like that are all set up correctly for Rocky. In general, though, the compute nodes themselves are pretty easy to reconfigure and to serve a new operating system out to. Zane, as you alluded to, it is when you start to get into head nodes that things get a little bit more specific. For anyone not aware, the head node of a cluster is essentially the controller node for it.

Head Node Migrations Q&A [04:39]

A lot of clusters will essentially have one head node where you are running things like your job scheduler, your database services, that kind of thing. Sometimes, you will see configurations where there are two head nodes, one acting as a failover for the other. If the first head node goes down, it can fail over and immediately pick back up on the other head node. Now and then, you will see two head nodes running side by side. Take the scheduler Slurm as an example: from time to time, you have one head node that is running the Slurm controller service on it and another head node that has your Slurm database server and things like that on it. Head nodes can be an interesting configuration. They are very specific to each site, depending upon how you have configured things.

There are a lot of different considerations if you are migrating your head node from CentOS to Rocky. First off, you will want to look at backups: making sure that your Slurm databases are backed up, and in general that your monitoring stacks and other databases are backed up. It is a pretty site-specific configuration. It is something that is best planned out in advance, because it is a drastic change to move a head node from one operating system to the other. In general, my advice is to plan that well in advance. Make sure that all the data you absolutely need in order to rebuild that server is ready and stored away.
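
As a hedged illustration of the kind of backups being described here, assuming the Slurm accounting database lives in MariaDB/MySQL on the head node (paths and database names are the common defaults, not anything specific to a given site):

    # Quiesce the Slurm database daemon so the dump is consistent.
    systemctl stop slurmdbd

    # Dump the accounting database (slurm_acct_db is the usual default name).
    mysqldump --single-transaction slurm_acct_db > /root/backups/slurm_acct_db.sql

    # Save the scheduler state directory and configuration files as well.
    tar czf /root/backups/slurm-state.tar.gz /var/spool/slurmctld /etc/slurm

    systemctl start slurmdbd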

Make sure that you fully understand where things are stored: where Slurm is putting things, where your monitoring stacks could be putting things, where your databases are putting things. There is a lot of sensitive information that needs to be preserved through the migration; we would not want anything to happen to it during the migration. As head nodes go, they can be a much more complex case, which can come down to the individual server itself. Every cluster is a little different. There are some commonalities in the software, but as far as head nodes go, that is something that is really best planned well in advance. It is best to make sure that everything the head node is doing is understood and will continue once it is moved over, so there is not any kind of interruption in service because of the migration.

Challenges with Air-Gapped Environments [07:28]

Zane Hamilton:

Have you seen challenges in air-gapped environments where you are trying to install Rocky? How are you dealing with those cases where you have to pull down an ISO or get packages from outside? Has that become a problem, or is that something you have worked around pretty easily?

Forrest Burt:

Do you mean an air-gapped environment in general or an air-gapped high-performance computing environment?

Zane Hamilton:

Either one. High performance computing seems to be the place we see it most often. It seems enterprises have a little more flexibility in when they can connect things to the internet, versus the HPC environment, which seems to be a lot more secure. Has that become a problem? Is it something you are running into?

Forrest Burt:

I, myself, have not seen that too much. I know that, for example, in clusters, it's a pretty common configuration for your compute nodes to not have internet access at all. Obviously, from the compute node standpoint, serving out new images and things like that to them will pretty much involve an offline operation anyway. But with your head node, potentially, an air-gapped environment becomes a little bit more complex. While I haven't migrated something in that environment myself, I would imagine there are pretty easy ways to get the files you need, the ISO and so on, from Rocky's mirrors, put those onto a standard boot drive or something like that, and move them over. My best practice is probably just the standard of what you would do: move that material over and make sure you have offline copies of the repos and related files.
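
One way to stage that content, sketched here under the assumption that dnf reposync is available on an internet-connected staging machine (repo IDs and paths are assumptions, not something covered in the webinar):

    # On a connected machine, mirror the Rocky 8 BaseOS and AppStream repositories.
    dnf reposync --repoid=baseos --repoid=appstream \
        --download-metadata --newest-only -p /srv/rocky-8-mirror

    # Copy /srv/rocky-8-mirror (plus the install ISO) to removable media, then
    # point a local .repo file on the air-gapped systems at the copied directories.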

Zane Hamilton:

Perfect. Thank you. I think Neil has been able to join as well, so welcome Neil.

Neil Hanlon:

Hey, thanks for having me.

Zane Hamilton:

Absolutely.

Neil Hanlon:

Just to add on to what Forrest was saying, Rocky's MirrorManager instance that we're using supports private mirrors. If any organization wants to have a private mirror that only they use, with all of their hosts redirected to it automatically, that is something they can go and sign up for. They can use MirrorManager to subscribe to a private mirror or two.
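
For organizations that run their own mirror rather than registering one with MirrorManager, the repo definition on each host can also be pointed straight at it; a hedged illustration, with a placeholder hostname:

    sudo tee /etc/yum.repos.d/rocky-internal.repo > /dev/null <<'EOF'
    [baseos-internal]
    name=Rocky Linux 8 BaseOS (internal mirror)
    baseurl=http://mirror.example.internal/rocky/8/BaseOS/x86_64/os/
    gpgcheck=1
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-rockyofficial
    enabled=1
    EOF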

Zane Hamilton:

Thank you, Neil.

Migration Script Creation [09:14]

Zane Hamilton:

I think there was one thing you brought up to me yesterday that you wanted to update everybody on, around the script, the writing migration script.

Neil Hanlon:

Yes. On January 31st, CentOS, in accordance with their end-of-life policy, removed all of the CentOS 8 content from their mirrors. I've been seeing it cause some people problems in their CI environments and elsewhere; they are expecting those images, files, and repos to be available. It actually affected the Rocky migration script as well, because we were making an assumption there. However, we did have a little bit of foresight and pulled down a copy of the CentOS vault to some of our servers, and we have switched the script over to update your system to the latest CentOS version using the copy of CentOS 8.5.2111 that we pulled down, and then orchestrate the whole update and migration process from there. It is working pretty well so far. That was a great addition one of our community members made to the script.
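
For anyone following along, fetching and running the updated script looks roughly like this; the URL is the rocky-tools repository path as it stood around the time of this webinar, so verify it against the current Rocky Linux documentation before use.

    curl -O https://raw.githubusercontent.com/rocky-linux/rocky-tools/main/migrate2rocky/migrate2rocky.sh
    chmod +x migrate2rocky.sh

    # -r performs the conversion to Rocky Linux; on a CentOS 8 host the updated
    # script first brings the system to 8.5 from the vault copy described above.
    sudo ./migrate2rocky.sh -r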

Zane Hamilton:

If you downloaded that script before yesterday, make sure you go get a new copy of it because there are changes in it. You can go back and look and see what the differences are and apply them yourself, or just go download the new one and make it work for you.

Neil, we were talking about migrations, and I've done a lot of cloud migrations. I have also done a lot of data center migrations over the past 20 years. It seems to be something most of the enterprises, in the last 10 years, have figured out how to go about doing. They have their operating systems, and they have their standards for when it comes time to move. Enterprises are a little bit further along, have done it longer, and are more mature.

If you look at small and medium businesses, a lot of times they just don't have the resources. They don't have the time. Obviously, you run into constraints. You have an operating system that you have built out. Take small businesses, for example: they have a closet full of gear, they don't really have a full-time IT person, they are downsizing their environment, and that rack has to go somewhere. It is old. They are looking at moving those CentOS servers to something, and they are trying to figure out what that looks like. If they look at moving to the cloud, they want to move to Rocky. They see that it is not something they've built before; it is actually built by someone else, and they may wonder: what do we do? How do we go about doing this? I think there are a lot of things that just get missed over time as small and medium businesses grow. They don't have enough time. The configuration management is missing. Automated deployments of applications are missing. Maybe they have fallen into tech debt; things have gotten really old and it has become a problem. As a Rocky community, what can we do to help with that? Or what does it look like if you are going to do something like that?

Neil Hanlon:

That is a great question. I think the proliferation of the cloud over the past 10 years is something that has really affected everyone. To your point, with a small number of servers, you do need a critical mass for it to make sense for your business to move over to configuration management, where you are not just configuring things by hand. It requires specific knowledge of the different tools being used. Whether it's Ansible, Puppet, Chef, or any of those types of tools, you have to have the knowledge there to make it work. Often, there is a pretty big learning curve on making these tools work for you instead of against you. I faced that myself while learning a lot of them. It can be learning the best practices on how to use them, and how to ensure that if your server goes down and you need to rebuild it, you can just do that.
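
As a small sketch of that "rebuild it from code" idea, assuming Ansible is the tool in use (the inventory and playbook names are made up for illustration):

    # Dry-run the playbook against the rebuilt host to see what would change.
    ansible-playbook -i inventory/production.ini site.yml --limit web01 --check

    # Apply it for real once the diff looks right.
    ansible-playbook -i inventory/production.ini site.yml --limit web01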

It is tough. It takes a lot of time too. Especially for small and medium businesses, there is likely not a lot of time a person has, unless it is a full-time IT person who can be redeploying VMs and making sure they are working, and also checking their whole disaster recovery plan, if that is even something they have. I think the cloud is something that is very useful, but you have to be careful with it too, especially transitioning between the CapEx and OpEx models; you can talk to your finance team and see if they believe you that the cloud is OpEx. The whole idea of using the cloud is that it will save you money. Unfortunately, in a lot of the clouds, there are just ways that you can spend way more money than you have to, because it's easier to click those boxes.

In some ways, the problems you have to solve in the cloud are different versions of the same problems: for example, making sure that your architecture and your cloud environment are working properly, and not overspending on cloud, not spending 50k when you could be spending 5k a month. Those are not easy things for small and medium businesses to do. Configuration management isn't easy for small and medium businesses to do either. These things require domain knowledge of the specific platform; whether you are talking about AWS, Google Cloud Platform, or Azure, they just have to know the ins and outs of them. It can be scary to try to look at it, especially when you don't know how much you are going to be spending and everything else. The good news is that Rocky is available on pretty much every major cloud provider as a free image. You can just go pull and deploy that out onto AWS, into a VPC, if you have a web application or something running that you need your users to connect to.
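
A hedged sketch of finding and launching such an image on AWS with the CLI; the filter values, region, and resource IDs are placeholders, so check the Rocky Linux documentation for the current image names and owners.

    # Find the newest matching Rocky Linux 8 image in a region.
    aws ec2 describe-images --region us-east-1 \
        --filters "Name=name,Values=Rocky-8-*" "Name=architecture,Values=x86_64" \
        --query 'Images | sort_by(@, &CreationDate)[-1].{Id:ImageId,Name:Name}'

    # Launch an instance from the chosen AMI into an existing VPC subnet.
    aws ec2 run-instances --image-id ami-0123456789abcdef0 \
        --instance-type t3.micro --subnet-id subnet-0123456789abcdef0 \
        --key-name my-keypair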

A lot of the clouds really are adopting this zero trust network access model for accessing all your services remotely in a secure fashion. It also comes down to the architecture of the programs and applications you are running. To their credit, AWS, GCP, and Azure have a lot of great documentation on how to architect those things in a cloud environment. Again, you have to take that with a bit of a grain of salt, because they may be the best recommendations, but they might not be the cheapest recommendations. In some cases, you may need to go and look at different options. You may have to architect the high availability or disaster recovery yourself, which may save a lot of money in the end. That is where these types of tools really come into play.

Zane Hamilton:

I think one of the other things that becomes a challenge and a problem is when you license an operating system to run on-prem, and it doesn't necessarily transfer into the cloud. You can have some licensing issues, or just because you have a license or a subscription with someone does not mean you can run in the cloud and still get support. I think that is something we have done a little bit differently: being able to support the community, it does not matter where you run it or where you installed it from. It really gives us a little bit of a different view. I'd love to hear your feedback on that, Neil and Forrest.

Neil Hanlon:  

Licensing is a constant problem for everyone. I have seen it myself in CI environments: for example, not being able to use RHEL because it is hard to subscribe systems in a convenient manner when you are running jobs multiple times a day. Red Hat has added some things, like simple content access, to help with that. Really, what I like about Rocky is being able to use it as a parallel or an analog, stand up and run those jobs in my CI environments when I need to, and not have to worry about the licensing in that way.

Zane Hamilton:

Whenever you deploy from the marketplace in any of the big providers, Amazon, Azure, and Google, that image you deploy from the marketplace, we will support. It doesn't matter.

Neil Hanlon:

Right. It is not like you have to go and subscribe to a system or whatever else. 

Zane Hamilton:

Perfect. From my consulting days, I know it was always a challenge to find the time to do those configuration management pieces, to go through and make the effort of automating deployment so you could move. It always came down to having to do something now, and now is the time to invest the time. However, getting someone to actually spend the time to do that automation was always hard. I think it became a conversation of: if you don't do it now, you are going to run into the same thing in two to three years, and then you are going to have to invest a ton of time and money. If you are going to make this move, now is the time to start doing configuration management. It is important to realize that many of the big tools out there today are supported on Rocky. Whether it is Puppet, Chef, Salt, or Ansible, all of those work on Rocky today.

Neil Hanlon:

Not only do they work on Rocky today, but it should be an experience where you can simply redeploy the server using Rocky and use that same configuration management, because it is a one-to-one compatible build of Red Hat Enterprise Linux, which is what CentOS was as well. There may be a couple of tweaks that you need to do, and there are definitely some pitfalls; we should put an article together on that, for example making sure the Ansible versions are updated. The Rocky community has done a great job of upstreaming those changes into the different community configuration management tools that Rocky users are applying. Those projects have a vested interest as well in making sure that Chef, Puppet, and Ansible are working properly on the OS. A lot of it is fixed at the tooling level, and thankfully a lot of it has already been fixed at this point.
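
One concrete example of the kind of tweak being described: playbooks that key on the distribution name need to accept "Rocky" alongside "CentOS". A quick way to see what a migrated host now reports, assuming Ansible (host and inventory names are illustrative):

    ansible web01 -i inventory/production.ini -m setup -a 'filter=ansible_distribution*'

    # Conditions such as `when: ansible_distribution == "CentOS"` then need to be
    # broadened, for example to `when: ansible_os_family == "RedHat"`.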

Zane Hamilton:

Excellent. One of the other questions we have seen quite a bit up to today has been: what is going to be different, and what is going to keep Rocky from going away, similar to what happened with CentOS? Neil and I ran into that topic on a call the day before yesterday. It was a very interesting question, and there was a lot of passion behind it. I think it is important we cover the substance of that call.

Neil Hanlon:

Absolutely. I think it is still a very sensitive subject because of the change last year. Looking back at it a year later, personally, I feel it was not the best thing to do. I think there are a lot of other people out there, including those from Red Hat, who admit the change midstream from CentOS 8 to CentOS Stream was not the best move; perhaps they wish it had been done differently. I believe that is how I see it too. However, I am really excited about the changes that Stream has brought, which will enable us to help the enterprise community. The change will allow us to submit and contribute upstream the fixes we are working on.

In terms of it going away or being bought, I really like the structure that we as a community have come up with for Rocky Linux. It is very much led by a team of good people who are providing guidance and expertise. They are involved in setting up the governance for the project and want to make sure something like what happened with Red Hat and CentOS does not happen to Rocky. It is something all of us on the team are very committed to.

Zane Hamilton:

Neil, you bring up an interesting point when you talk about CentOS Stream and what it is. I don't know if everybody has a great sense of what it actually is and how it gets contributed back to. I think there's a lot of misunderstanding. Is it something you contribute to? Is it something you can't? Does it go to Fedora first? What does that flow look like and how do you actually contribute back to the stream?

Neil Hanlon:

Historically, the flow has been to spray something at Fedora and hope it makes it into a RHEL release. It was never really something that was helpful for enterprises or people who wanted bugs fixed or who wanted to get their code contributed up into things. It was often a multi-year process to have those changes integrated. What Stream has done is flip that around on its head. There is a little bit of nuance here with CentOS Stream 8 and CentOS Stream 9. Because of the change mid-stream, the contribution model is a little bit different. The way I can describe it, there is a great tweet by Carl George from Red Hat explaining this: CentOS Stream 8 is technically a rebuild of RHEL. Technically. CentOS Stream 9 flips that around, and it becomes that Red Hat Enterprise Linux is technically a rebuild of CentOS Stream 9.

The contribution model is a little bit awkward for 8, but for 9 there is a very clear path for making contributions on the CentOS GitHub or GitLab, creating tickets or merge requests, and having those merged back in. In CentOS Stream 8, it's a little bit different; you have to open some Bugzilla requests, but generally the contribution process is the same. It allows us, as engineers who are working on problems we see Rocky Linux customers hitting, to upstream those changes into CentOS Stream, where they become packages or fixes that are available by default in the upcoming release of that operating system. In terms of the contribution, it is a little bit finicky and you have to wrap your head around it a little, but I am genuinely excited about being able to shape the future of Enterprise Linux.

Fedora Q&A [22:43]

Zane Hamilton:

Excellent. What do you see happening with Fedora? Where does that live now? If you do not have to go all the way to the top anymore, do you see changes going back up and coming down? How is that going to look?

Neil Hanlon:

That is a great point. You have to upstream those changes again from CentOS Stream into Fedora, because CentOS Stream itself is a cut of Fedora. It is the release cycle; I am not even going to recite it because I will say it wrong, but the release cycle of Fedora, CentOS Stream, RHEL, and then the derivatives thereof is very clearly defined now, where before it was maybe not as public or as visible to most people. There is a clear path for knowing which Fedora release a CentOS Stream or RHEL build is cut from, whether it is Fedora 28, which was the basis for CentOS 8, or Fedora 34, which is upstream of Red Hat Enterprise Linux 9. Those paths you can contribute to are fairly well established.

Zane Hamilton:

Excellent. The Rocky community is already massive for its age; there are already 15,000-plus people in that community, which blows my mind. Enabling them to quickly give back to CentOS Stream at an enterprise level, without having to go all the way up through Fedora, what kind of impact do you think that is going to make on Enterprise Linux? How is that going to shape Rocky?

Neil Hanlon:

That is a really good question. In a lot of ways, CentOS Stream was shaped by hyperscalers like Facebook, who were looking to get their changes integrated faster into the Red Hat Enterprise Linux they were using. They did not want to, or have time to, wait for those changes to come down from Fedora into RHEL. It is exciting that there is a lot of work happening in CentOS Stream, not only by community members but also by developers. I think it is a really awesome opportunity for Rocky people to contribute if they want to, and also to have an area inside Rocky Linux to contribute to that is specific to them. I think it is something we will be seeing more of this year, as we deploy a new build system over there.

Zane Hamilton:

Excellent.

Hyperscalers [25:18]

Forrest Burt:

You touched on hyperscalers for a moment; I will jump in really quickly about the HPC side of things. One of the big things with Rocky Linux at the moment is that we are starting to see supercomputing sites adopting it. It is very well established in HPC that one of the standard operating systems used across clusters for a very long time was CentOS 7. There were a lot of clusters out there running it for their head nodes and compute nodes; a cursory glance over the TOP500 will show you tons and tons of systems on there using it. In a high-performance computing environment, typically, you have hundreds of users on at any given time trying to share the resources that are available across the cluster.

Job schedulers and databases already make for a sensitive enough environment. It is difficult enough to get those environments ready for a maintenance period that moving from one OS to another across your cluster is not desirable; it takes a lot of time and effort, and a lot of planning, to do something that major with all of your high-performance computing architecture. The move from CentOS 7 to CentOS 8 was already something that I don't think a lot of sites were really starting to think about, because CentOS 7 is pretty well established as a reliable and solid HPC operating system. With its end of life still a way out, there was not really a lot of reason to switch over.

I think what is great about Rocky Linux is that it restores a clear upgrade path for some of these high-performance computing sites, a path that is no longer so clear with CentOS 8 being end of life. One of the great things about Rocky is that it provides a necessary alternative for a lot of these high-performance computing sites that don't want to add the complexity of having to manage Red Hat licenses across potentially dozens, hundreds, or thousands of nodes at once. What Rocky provides is not only a free alternative, but a way to avoid having to go into all of that license management that a lot of places have not had to do before, which would add a lot of complexity for their operations to scale out and manage.

Rocky really provides a great upgrade path for anyone in high-performance computing looking to move off of CentOS, or even off of Red Hat if you are looking for something that does not have that type of license management. I know managing software licenses for all the specific pieces of high-performance computing software is already difficult enough. One of the big draws of Rocky for the high-performance computing community, I think, is that it provides an alternative to the operating system a lot of places have relied on for a very long time to run their high-performance computing clusters.

Zane Hamilton:

Forrest, in those types of environments, a lot of that stuff has been around for a while. They get very solid, very stable performance, so they just don't touch those systems; they leave them alone for long periods of time. That is similar to some of the smaller enterprises, or even larger enterprises, where you have legacy tech and things that you just don't know how to rebuild anymore. Are you seeing people start to adopt more configuration management and automated deployment type infrastructure or processes into that environment? Or is it still trying to catch up?

Forrest Burt:

Honestly, it depends upon the site that you are at. We talk about small to medium businesses and how they are working with configuration management and automation, and the first thing I think of is small to medium high performance computing sites, because it is a very similar thing. There are a lot of places out there that are basically just running a couple of clusters. There are universities, for example, that are running a couple of smaller clusters meant to serve out some resources to the users at their institution. They do not have a lot of people, in a lot of cases, or a lot of time to be able to look into that kind of configuration and automation.

Especially as we see these things move into the cloud, we see a greater proliferation of these types of tools across HPC. High performance computing has traditionally not been a field that the absolute latest technology gets into very quickly. A good example is containers: for a long time, until Singularity (now Apptainer) was built, there wasn't a good way to deploy containers in a high performance computing environment; the existing options suffered from a number of security pitfalls and other issues specific to multi-user environments. We see a pattern in high performance computing of these technologies slowly making their way in. Especially with the move of a lot of sites to manage more cloud resources, as opposed to very large on-prem racks of HPC servers, we are seeing a wider proliferation of those tools coming about, as a lot of HPC people start to realize that these provisioning tools, and being able to scale up automatically and load balance, become needed as our science advances and our simulations become more intense.

We want to run more, for example running more GPUs for our machine learning models. As the scales get greater and greater, we will start to see more people move to the cloud and look into automation and configuration management tools. The necessity comes from where the technology and the science are going; it will necessitate larger clusters and more power that can easily be put to the task at once, even just for smaller sites, due to the nature of how things are advancing. We will definitely see more of that type of usage as it starts to pick up over the next couple of years. There are some places out there that do not have that infrastructure yet. However, like I said, as there is more demand for the cloud, that will hopefully help to ease some of the tasks that go along with managing high performance computing clusters on-prem. You free up time to explore those technologies and maybe see greater proliferation of them across the HPC community.

Neil Hanlon:

Coming from the enterprise space, you can wait years to get servers and switches, especially with the pandemic now. There is definitely a cost-benefit analysis you can do for how much you are spending, not only on power but also on space and internet usage, including how much time you are spending on human resources, making sure everything is configured and patching those systems as well as the switches, etc. Also, for the smaller shops that might not have reached the critical mass to have a team of people responsible just for the networking for those servers, one of the things the cloud offers is that you don't need those people.

That is not to say those people are not necessary; they are, in a lot of ways, needed in these cloud environments. I could go back and forth and spend a whole hour talking about the centralization of everything to these clouds, and you see outages too. I think there are a lot of things you have to consider when you are moving to the cloud. It is not necessarily as simple as: I have 10 servers and that is going to be this many cores on some EC2 instances. You have to think about your disaster recovery when moving to the cloud. It is a really great time to think about configuration management and to take the time; as you said, there are still a few years left on CentOS 7. Take the time to plan your migration and think about how you can do these things in an ephemeral manner, where it doesn't matter too much if your server falls off the roof or if us-east-1 goes down.

Zane Hamilton:

That is something we ran across from a consulting background: people who did not take any of that into account. They would think just moving a server to the cloud would mean they would never have an outage problem. They would also leave things on all the time. That is where people run into the cost problem: not actually identifying whether a resource needs to be on 24 hours a day or not. By being able to turn things off, and doing it in an automated way, you can save a lot of money if you identify those different types of use cases and workflows. I think it is important. It is interesting for us as we start looking at HPC in a hybrid model, where you are not just in that on-prem HPC environment but are able to scale into the cloud. It is new and exciting.

Forrest Burt:

As you both touched on, there are a lot of complexities to setting these things up in the cloud. It is already a complex enough task to set up an on-prem supercomputer: you are standing up high-speed networks like InfiniBand, you are standing up management networks. There are all kinds of different considerations that you have to take into account when you are building something on-prem, and it is the same way in the cloud. I know that some of the major cloud providers have different types of HPC stacks that you can deploy. I cannot speak to the maturity of them; I have not deployed some of AWS's different massive GPU deployments.

There are a lot of complexities involved that are very similar to what goes on on-prem: setting up resource managers, databases for those resource managers, and setting up the MPI stacks that allow these instances in the cloud to talk to each other, and figuring out which instances can support the high performance networking that is necessary to make a compute cluster work. There are a lot of considerations that go into this. A lot of it translates pretty simply; you are already, probably, used to working with InfiniBand networks in your on-prem HPC, and there are ways to get equivalents from the cloud providers. But it is not as simple as spinning up some large compute nodes and starting to run stuff on them.

There is a lot of complexity there and, as we have talked about, configuration that can be automated and should be automated for the sake of people's time and getting things up and running quickly. Some of that comes from the ability to scale. There are specific configurations that are useful in a cloud environment beyond what a simple cloud cluster would give you. A good example is auto scaling: if a user submits a job to your cluster that is far bigger than you have capacity for, being able to automatically scale up the nodes you are running in your cloud cluster is obviously something that you cannot easily do on-prem. There is a lot of complexity that has to be taken care of there as well, in similar ways to how it would be on-prem.
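
For reference, the Slurm side of that auto scaling is usually built on its cloud node support; a rough sketch of the relevant slurm.conf entries, where the scripts and node names are placeholders and the real logic lives in whatever resume/suspend scripts (or cloud-provider tooling) a site chooses:

    cat >> /etc/slurm/slurm.conf <<'EOF'
    ResumeProgram=/usr/local/sbin/start-cloud-nodes.sh
    SuspendProgram=/usr/local/sbin/stop-cloud-nodes.sh
    ResumeTimeout=300
    SuspendTime=600
    NodeName=cloud[001-064] State=CLOUD CPUs=16 RealMemory=63000
    PartitionName=cloud Nodes=cloud[001-064] MaxTime=INFINITE State=UP
    EOF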

The hybrid cloud model is very interesting, because we see that coming up as institutions and different HPC sites have very large clusters on which they have expended huge amounts of time and money over the years. Being able to still take advantage of those resources is a big concern; they do not want to waste money in the end by mothballing them. There is a big need for hybrid cloud solutions in high-performance computing, especially solutions that can serve things out to clusters in the cloud and clusters on-prem at once, as things move to the cloud in general.

Neil Hanlon:

What kinds of things is the HPC world looking to solve in regard to the inherent data gravity problems that come from hybrid cloud models, where you have maybe even petabytes of data that you want to run a workload on in AWS as well as on-prem, or maybe GCP or something? Do we have good solutions?

Forrest Burt:

I am not quite sure yet. This is the era where we are really getting into big data. It has only been within the last 5 to 10 years that we have seen the mass proliferation of, for example, one of the biggest consumers of big data: machine learning. I think that, in general, there are a lot of solutions out there for that type of data warehousing and large-scale storage of data. But as far as moving data between the cloud and these on-prem stacks, where there may be high-performance computing clusters with petabytes of storage attached to them that obviously cannot be easily moved up into the cloud, I am not 100% sure exactly what solutions are being explored. I imagine we will see a proliferation similar to what we are seeing with the specialized hardware coming out to process this big data; we have different accelerators and things like that.

If it hasn't already started, we will see a very similar proliferation in the big data sphere, much like, as I mentioned, what we are seeing with all these different specialized hardware accelerators coming out. Perhaps before, we were not as concerned with the data, because we were running small models; we were running things that were not difficult to move around. But definitely now that things are getting into petabytes and petabytes of data that one model might need, for these machine learning examples, there will need to be tools that allow for that easy movement of data between cloud and on-prem.

Neil Hanlon:

It is a cost to be concerned about too. 

Forrest Burt:

I was just going to say, because obviously that type of ingress and egress is pretty costly, even if you are using some archival storage solution; those have pretty high egress costs to them. I think a lot of it is just figuring out at your organization what the budget model is and what fits. If you are going to be using the cloud extensively, do you want to spend the money it is going to take to move all of those petabytes of data and store them in S3, so you can integrate them quickly into your AWS cluster? I think in the end it comes down to what solution works best for your organization and what works best with the budget, because there are multiple different ways of going about it. Even per cloud, there are multiple different storage solutions. It comes down to what you think your overall cloud utilization is going to be and how much you are going to have to deal with that data gravity over time. I think that can guide the decisions on whether and where you move that data off of these on-prem systems and into the cloud.
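
As a hedged illustration of that kind of data movement (bucket name and storage class are placeholders, and at true petabyte scale bulk transfer services come into play rather than a single sync):

    aws s3 sync /scratch/project-data s3://example-hpc-data/project-data \
        --storage-class INTELLIGENT_TIERING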

Zane Hamilton:

I start thinking about the way people started using the cloud and being able to get a network connection from your local environment into the cloud, so that you had that transparency. I wonder if it is going to become that way with data as well, where you will physically have your own infrastructure sitting right next to the cloud infrastructure with some sort of connection between the two for speed and scale. It seems like that would be the next progression, if it has not already happened. It probably has, but it seems like that is where we're headed.

Neil Hanlon:

There are a lot of people in Ashburn.

Zane Hamilton:

Yes, yes. There are. This has been a great conversation. The last thing we wanted to talk about is pkexec. I know it has been a hot topic on the internet lately, and everybody is talking about pkexec and the vulnerability, so now is a good time. Neil, if you would like to chime in on that.

Neil Hanlon:

Yeah, absolutely. That was a weird bug that had been sitting around in the open source world for 12, 15 years, to say the least. Looking at the changes that have been made and the discussions that have been happening in the open source community, in the Linux kernel, for example, there is a patch upstream, I don't know if it has been accepted yet, to fix this at the execve level, where it will just error out if you try to supply these things. There are some considerations there too. I think we are in for a rough couple of months here, as people find more SUID binaries that are going to be affected.

For those people that do not know, there was a vulnerability disclosed last week in polkit, which provides a layer between the operating system and how privileged things are supposed to be executed. There is this program called pkexec, which could be exploited by sending it some bad environment information; essentially any user could compile a small C program and get root. It points towards the importance of having a strong security model. It also lends itself toward things like rootless containers, where you can't escape even if you try really hard. I think that is something Greg picked up on the day he was asking me, "Hey, the script isn't working."
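
For reference, the immediate responses most Enterprise Linux distributions recommended for this vulnerability (CVE-2021-4034, "PwnKit") looked like this:

    # Apply the fixed polkit package as soon as it is available.
    sudo dnf update polkit

    # Stopgap mitigation in the meantime: drop the setuid bit from pkexec.
    sudo chmod 0755 /usr/bin/pkexec

    # Audit for other setuid binaries worth keeping an eye on.
    sudo find / -xdev -perm -4000 -type f 2>/dev/null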

And why isn't it working on Rocky? The exploit script wasn't working. It turns out it was just a GCC flag that needed to be set. But he went through and ran it; I think he has a YouTube video on it, on Apptainer, formerly Singularity, showing the model with the SIF files and how the container runtime is orchestrated. It does not allow people to escape from that shell, even with that exploit. It was pretty neat to see.

Forrest Burt:

Especially with the vast proliferation of containers across everything at the same time, a lot of these container runtimes are not exactly built with the kind of security model Apptainer has, which is unique among container runtimes. We were pretty pleased to see it able to actually stop that exploit because of the security features that are built inherently into its runtime, into how containers are executed, and into how users are able to interact with them.

Neil Hanlon:

That is a great point. And there are more security-minded distributions out there, like Alpine Linux, for example, which was not affected by this at all, because they don't use glibc; they use musl as their C library. There are other options out there, and Alpine has container images that are great; I use them myself. But having a container runtime that is really thoughtful and secure is something that I think is really powerful and important.

Closing Remarks [44:42]

Forrest Burt:

Absolutely.

Neil Hanlon:

And that is not to say that Kubernetes and everything else aren't getting there and working on it. But, I don't know.

Forrest Burt:

It should be noted that Apptainer was specifically built for those types of secure use cases. This is a really common need in high-performance computing: the need for that level of privilege isolation. There are also ways that Apptainer integrates capabilities of the host into the container differently than other runtimes do; there are high-performance computing reasons for that. But yes, it was very neat to see Apptainer stop that wholesale.
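
A small sketch of the model being described: the container runs as the invoking user, so even inside the container there is no root to escalate to (the image name is illustrative).

    apptainer exec rocky8.sif whoami   # prints your own user, not root
    apptainer exec rocky8.sif id -u    # same UID as on the host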

Zane Hamilton:

That is excellent. We have reached the end of our time. To be respectful of everyone's time, I will wrap up; I appreciate Neil and Forrest coming on.

Neil Hanlon:

Thanks for having me.

Zane Hamilton:

It was great talking with you guys. I appreciate your time and look forward to part three of our conversations here in the next couple of weeks. So thank you again. Go like, subscribe, and follow us. Follow along with us. We appreciate it.

Neil Hanlon:

Smash That like button.

Forrest Burt:

Yes. Thank you, Zane. And thank you for having us. Thanks everyone for watching.

Neil Hanlon:

Thanks guys, bye.