CIQ

Fuzzball (HPC 2.0): Demos & Updates

June 23, 2022

Webinar Synopsis:

Speakers:

  • Zane Hamilton, Director Sales Engineering at CIQ

  • Gregory Kurtzer, CEO at CIQ

  • Michael L. Young, Linux Support Engineer at CIQ

  • Brian Phan, Solutions Architect at CIQ

  • Dave Godlove, CIQ


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Full Webinar Transcript:

Meeting the Team [00:00]

Zane Hamilton:

Welcome back to another CIQ webinar. This week we are going to talk more about Fuzzball, and we have some people joining us. If we could bring in Brian and Forrest; I think Dave is with us as well. Hello everyone. I think everybody has met Forrest and Dave. There is Greg, but Brian is new to CIQ. Brian, I would like you to tell us about yourself, what your background is, and what you do for CIQ.

Brian Phan:

Hi everyone. My name is Brian, I am a solutions architect here at CIQ. My area of expertise is in running cloud HPC workflows, mainly in the CAE space and also in the area of genomics.

Zane Hamilton:

Forrest and I have been talking quite a bit this morning and we would like to get a little bit of audience participation. I am going to call on Dave Ingram real quick. He was the first one in chat. If you guys could provide us with some random ideas and topics. Just a few words, it does not have to be complicated, but I think it is kind of fun the more off the wall that it becomes. Forrest, give us an idea of what we are looking for here.

Forrest Burt:

We are looking for something along the lines of a cat standing on a hill by a lighthouse or a Picasso style painting of Gregory Kurtzer. Something like that is what we are looking for here. If we could just get some suggestions in the chat, that would be cool.

Zane Hamilton:

And it does not have to be just one; if you guys want to just start, as you are thinking of them, throw random topics in there; we would appreciate it. It would be fantastic. Just real quick while everybody is still here, if you guys want to introduce yourselves, too. Forrest, you are in the top right, so I will let you introduce yourself real quick.

Forrest Burt:

Hey everyone, I am Forrest Burt. I am a high performance computing systems engineer here at CIQ. I work a lot with our Fuzzball system, representing different workloads, which we want to be able to run on it. I am very excited to be demoing and to be on the webcast today.

Zane Hamilton:

Excellent. I am going to go to Dave next.

Dave Godlove:

Hey everybody, my name is Dave Godlove. My background is in basic research. I have worked at the NIH and a few other places in that capacity. Also, I have some background in high performance computing and a little bit of background also with Apptainer.

Zane Hamilton:

Excellent. Thank you, Dave. We are going to go to Greg. 

Gregory Kurtzer:

I think the trend here is biologists. I am a biochemist by degree, but I turned into a high performance computing and open source guy. I have been lucky enough in my career to be part of a variety of different open source projects, community endeavors, and some cool companies. I don't know if everybody can see this: Rocky Linux v9. So be ready for it.

Zane Hamilton:

Was that the background that won?

Gregory Kurtzer:

No, it was just a random one that came up when I first installed and you can see I am actually using it. I put it on a screen that just has a nice, pretty background showing that it is nine.

Demoing Fuzzball [03:49]

Zane Hamilton:

Very cool. All right. So Brian, I know you have spent quite a bit of time getting something together to show us. Go ahead and tell us what you are going to show us and tell us a little bit about it. Show us and tell us why you think it is important.

Brian Phan:

What I have prepared today is an OpenFOAM demo to run through Fuzzball. With CFD simulations, a lot of the time, depending on the size of your model, you may require a significant amount of compute resources. This can be on the order of hundreds of cores. When you are on your on-prem HPC system, you submit your job, and sometimes you will be stuck in the queue waiting. Through the use of Fuzzball, we can connect your on-prem resources with cloud resources; that way you can bypass the queue, get your simulation running on the cloud, and get your results a lot faster. Let's jump into it. Let me start off by sharing my screen.

Zane Hamilton:

While Brian's doing that. We do have a few ideas that have come in. If you have anything else, shoot it over to us.

Brian Phan:

I am in this OpenFOAM demo directory right now. In this directory, I have an openfoam.yaml. While running the demo today, I am going to kick off the workflow, and the workflow will provision some cloud resources. Once those cloud resources are up, the workflow will begin running on them. Let me just start off by kicking off the workflow first. While we wait for the resources to provision, we can jump into the openfoam.yaml and talk about what is going on there. First, to kick off the workflow, I am going to run fuzzball workflow start users/bphan@ciqco/webinar. I am going to name this job webinar-openfoam-demo, give it a unique ID of 1, and then I will pass in openfoam.yaml. Cool. As you can see, my workflow has started, and to check the status of this workflow, I can call fuzzball workflow status and watch it here.

As you can see, our workflow has started, the volume has been created, the image has been pulled, and our job is in a pending state. What is happening in the background as we speak is cloud resources are being provisioned. Once they come up, our job should be in a started state. Let's jump into the openfoam.yaml and let's see what is going on here. Okay. Within this workflow YAML file, we have a volume. This volume is an ephemeral volume and this is where our workflow is going to execute out of. In this workflow, we have a single job. Its job is called run-motorbike-simplefoam. Within this job, we are going to be using an image and this image is being pulled from Docker Hub.

It is OpenCFD's latest default OpenFOAM container. The command we are running here is some bash. Basically, we are going to take the motorbike example from within the installation, copy it to our working directory, set up our environment by sourcing this bashrc file, and then execute the Allrun script. Within the Allrun script, we are going to take our motorbike model and decompose it into six parts, because we are going to be running on six cores. The model will be meshed, and then the simulation will be run using the solver simpleFoam for 500 iterations. Once the iterations have completed, the results will be reconstructed. Next in our workflow YAML, we have some environment variables that we are setting. WM_PROJECT_DIR will just point to our OpenFOAM installation.

The next two allow us to run some MPI commands as root. These are running within the container, so that should be all good. Next, we have our current working directory; this job will be executing out of /data. Jumping into our resources, like I mentioned earlier, we are going to be running this simulation on six cores, and we will be requesting one gig of memory. With our ephemeral volume, we are going to be mounting it to /data, which is also our working directory for our job. Let's jump back and see where our job is at. It looks like our job is still pending and we are waiting for the resources to provision. Once the job is in a started state, we can begin live tailing the logs of this job, and to do that, we can execute fuzzball workflow log. We can add a -f here to tail it, and we can put in the name of our workflow. Let's just copy and put that there. We can also give it the name of our job as well. As you can see above, our job has started and we can begin live tailing. Let's do that. Okay. As you can see, our model has been decomposed with decomposePar, and currently we are running snappyHexMesh, which will mesh our motorbike model. I think it will take a couple of minutes for that to complete.
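Putting the pieces Brian describes together, the workflow file might look roughly like the sketch below. The key names and paths here are reconstructed from the spoken walkthrough, not taken from the authoritative Fuzzball workflow schema, and the two MPI-as-root variables are an assumption based on the common Open MPI settings:

```yaml
# Illustrative reconstruction of the openfoam.yaml walked through above.
# Field names approximate the spoken description only.
volumes:
  data:
    type: EPHEMERAL            # exists only for the lifetime of the workflow
jobs:
  run-motorbike-simplefoam:
    image:
      uri: docker://opencfd/openfoam-default:latest
    command:
      - /bin/bash
      - -c
      - |
        cp -r "$WM_PROJECT_DIR"/tutorials/incompressible/simpleFoam/motorBike .
        cd motorBike
        source "$WM_PROJECT_DIR"/etc/bashrc   # set up the OpenFOAM environment
        ./Allrun                              # decompose, mesh, solve, reconstruct
    env:
      - WM_PROJECT_DIR=/usr/lib/openfoam/openfoam2206   # illustrative path
      - OMPI_ALLOW_RUN_AS_ROOT=1                        # assumed MPI-as-root vars
      - OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
    cwd: /data
    resource:
      cpu:
        cores: 6
      memory:
        size: 1GB
    mounts:
      data:
        location: /data
```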

What is OpenFOAM? [10:55]

Dave Godlove:

So I am sorry, I got bounced out of the studio for just a minute due to some technical reasons, but I am back in. I don't know if you already covered this, but I wonder for somebody who does not really know what OpenFOAM is, might not have heard of OpenFOAM, could you talk a little more high-level description about what it is for and what the motorbike simulation is?

Brian Phan:

Yes. OpenFOAM is an open source CFD software package with which engineers can run various types of physics simulations; it is typically used for fluid dynamics simulations. What is happening in this model is that we are basically going to visualize the velocity of the wind against the motorbike model. After running these 500 iterations, we should be able to visualize the results in a visualization software, such as ParaView.

Dave Godlove:

You might use OpenFOAM, I guess, if you were trying to design a more aerodynamic motorcycle or maybe trying to design an airplane wing, which would provide optimal lift or something along those lines.

Brian Phan:

Yes, exactly.

Dave Godlove:

Cool.

Back to Fuzzball [12:22]

Brian Phan:

Cool. As you can see, our workflow has progressed. Our model has been meshed, the initial conditions have been set, and right now we are at the stage of running simpleFoam for 500 iterations, which should take another couple of minutes. Cool. We can jump into the comment section to see if we have any questions.

Gregory Kurtzer:

I think we have a question coming in from YouTube. 

Brian Phan:

Can you modify the controlDict in the same way as you would without Fuzzball, if runTimeModifiable is true? Yes, you should be able to modify the controlDict. Within your workflow, once your resources are set, you would modify the controlDict before executing, and then from there, when you execute decomposePar, your model should be decomposed accordingly.
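For readers unfamiliar with the file being discussed: controlDict is the OpenFOAM system dictionary that governs run control, and the runTimeModifiable flag is what lets mid-run edits take effect. A minimal sketch, with illustrative values only, might look like:

```
// Illustrative system/controlDict fragment; values are examples only.
application         simpleFoam;
startFrom           startTime;
startTime           0;
stopAt              endTime;
endTime             500;      // the 500 iterations mentioned above
deltaT              1;
writeControl        timeStep;
writeInterval       100;
runTimeModifiable   true;     // lets OpenFOAM re-read this dict mid-run
```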

Zane Hamilton:

Thank you, Brian. Yeah, by the way, these comments coming in are hilarious.

Gregory Kurtzer:

I have been contributing and talking about that in the YouTube channel. There are a number of good ideas and comments coming through. 

Brian Phan:

Jumping back into our demo, simpleFoam has completed and reconstructParMesh has run, so the mesh has been reconstructed, and then reconstructPar will take all of the partitions that the model was decomposed into and combine them all back together. In a more practical case, if you have a bigger model, you would want to run this multi-node on hundreds of cores. If you are interested in that, please leave it in the comments and we will be happy to address it in a future webinar. Cool. That will conclude my demo for today, and I will pass the mic back to Zane.

Zane Hamilton:

We do have one more question, Brian. 

Brian Phan:

Yes. For sure.

Monitoring Features in Fuzzball [16:09]

Zane Hamilton:

Is there some resource monitoring feature (e.g., to find out if all the cores are at 100% for the simpleFoam case)?

Brian Phan:

Yes. I believe you should be able to. How I would check if the resources were fully being used is: within Fuzzball you can get a shell into your job, and from there, you could check if all the CPUs are being utilized.
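Shelling into the job and eyeballing a tool like top works; for a scripted check, the per-core busy fraction can be computed from two /proc/stat samples taken a moment apart. This is a generic Linux sketch, not a Fuzzball feature:

```python
# Generic Linux sketch: per-core busy fraction from two /proc/stat samples.
# In /proc/stat's "cpuN ..." lines, the 4th field (idle) and 5th field
# (iowait) count as not busy; everything else counts as busy time.

def core_busy_fraction(before: str, after: str) -> float:
    """before/after are the same 'cpuN ...' line sampled a moment apart."""
    f1 = [int(x) for x in before.split()[1:]]
    f2 = [int(x) for x in after.split()[1:]]
    total = sum(f2) - sum(f1)
    idle = (f2[3] + f2[4]) - (f1[3] + f1[4])
    return (total - idle) / total if total else 0.0

def sample_cpu_lines(path: str = "/proc/stat") -> dict:
    """Collect the per-core lines (cpu0, cpu1, ...) keyed by core name."""
    lines = {}
    with open(path) as fh:
        for line in fh:
            name = line.split()[0]
            if name.startswith("cpu") and name != "cpu":
                lines[name] = line
    return lines
```

Sampling twice about a second apart and printing core_busy_fraction for each core would show whether all six cores of the simpleFoam job sit near 100%.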

Zane Hamilton:

Yes, there, the answer is absolutely. You can do that.

Gregory Kurtzer:

You could do it. That is a little bit of a brute force method, but that is a really good idea. I am going to bring that back, and I think, you know, Forrest, Brian, and Dave will probably do the same, bringing this back over to the engineering crew and see if real-time monitoring is something that we could put in for a later release. That is a great idea.

AI Image Demo [17:25]

Zane Hamilton:

All right. Thank you very much, Brian. That was great. We appreciate it. I do not know if you have been reading the comments. This is getting funny. I am going to turn it over to Forrest. Forrest, we were playing with this earlier today. This has been a lot of fun. You talk about a way to kill a lot of time really quickly and get some really interesting laughable results. Forrest, what are we doing today?

Forrest Burt:

Hi everyone. The other demo that we are looking at today is an AI model that is making the rounds around the metaverse at the moment. There is a company called OpenAI that has produced a few different, very interesting AI models over time: text-based ones, I think they call them GPT or something like that, and also some image-based ones. They have one at the moment called DALL-E 2. DALL-E 2 is massively proprietary. It is something they have trained, you have to wait on a waitlist to get access to it, and it is a little bit difficult to really mess around with. But someone has built a miniature version of it that they call DALL-E mini. DALL-E mini, much like the larger versions such as DALL-E 2, essentially takes in some text input.

It has been trained on hundreds of millions, even billions, of different annotated images of all different types. You give it a prompt and it generates images based on that prompt. If you give it "a cat sitting on a cliff by a lighthouse", you will hypothetically get images of cats sitting on cliffs near lighthouses. It extends out to being able to do things in different art styles. I think I might pull these up here in a second, because I do have some results for this. For example, you can do things like a Picasso-style painting of Gregory Kurtzer, or of a person, something like that. Or basically anything with a cat, and it will give you a Picasso-style painting of a cat.

Typically you would get access to DALL-E mini through a version of it that runs online; there is a website you can go to and put in your request. I got sick of having to wait for other people's requests to get done. I looked into it and realized that this model is freely available out there; you can go download it and mess with it yourself. I figured I would turn it into a Fuzzball workflow and we would see how well that works. I am going to go ahead first off and run through the workflow, and then in a bit we will take a look at some of the results that we have generated. We have had some of these comments coming in so far. I think I have generated a rainbow colored cow fortune teller, a farmhouse with a cow on the sun, and a squirrel wearing a sombrero in the jungle. I have some results here. I am going to explain this workflow real quick so we know exactly where these came from and how this works. Then I will take you through the results we have generated so far. I will go ahead and share my screen, and we will take a look at this right here.

Zane Hamilton:

We do have some questions about the schedule we will get to, after we talk about this.

Forrest Burt:

You can see here in front of me, I have got basically just a VS Code window. Yes, the font is a little small now that I look at it; my apologies there. To take you guys through exactly what this workflow is doing: this is broadly an example of running AI inference on Fuzzball. This is not training an AI model; this is taking a pre-trained model and then actually running inference with it, as I said. To explain what we are looking at here: up here at the top, we have a file ingress section – sorry, a data volume setup section. We are going to set up a data volume called V1.

This is an ephemeral volume; it will only exist for the lifetime of this workflow. We are not ingressing any data into it, because the input to this workflow is basically just the model itself and then some piece of text. We are egressing the resulting images out to an S3 bucket; that is how we get access to them in the end. The workflow runs, gets to the end, generates the images, tars them back up, and then pushes them out to an S3 bucket. And I have named this job a little bit wrong; it should have been, you know, infer-model or something like that, but I think I copied this template from elsewhere. So this is not training, this is inference, my apologies. The image that we are pulling down here is basically an Apptainer image that is based on this recipe right here.

I have taken the script. I have basically just taken the code that the DALL-E mini people provide for their inference pipeline as a part of a notebook that they have, like a Jupyter notebook. I have basically taken that code, put it into a script and then taken the rest of the setup that they give you in order to get the Docker file for this working and put it in there as well. I built this, pushed it out to a container registry and I have got it now deployed on Fuzzball. You can see I am providing credentials to that registry there. The command we are actually doing is the script with the inputs. You can see that right now, we are processing a saddle on a cow in a car race, a screaming gopher caught in a spider web, and Richard Nixon in a tutu throwing confetti.

We will see all those come back. We have cwd set to /data, so this command is being run inside /data. We have a few environment variables; the most important one is this right here. The pre-trained model I am pulling down from a website called Weights & Biases. I basically stored my API key as a secret within this Fuzzball cluster, so it can be templated out and I don't have to display it openly or manage it outside of just inserting it into the Fuzzball cluster. We are using an AWS g4dn.8xlarge, I believe, for this. We are requesting 16 cores and 120 gigabytes of RAM, and then one GPU. We are mounting that data volume. Then once this is actually done generating the images, we go down here to tar up the results. We basically just take the nine images that we generated, put them into this, and drop that into this directory here. Then up here at the top, this egress will trigger and we get the images out.
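The tar-up-results step described above can be sketched generically in Python; the directory layout, file names, and function name here are illustrative, not the workflow's actual paths:

```python
import pathlib
import tarfile

def tar_results(image_dir: str, archive_path: str) -> list:
    """Bundle generated images into one gzipped tarball for egress,
    mirroring the tar-up-results step described above. Returns the
    archived member names in sorted order."""
    members = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for img in sorted(pathlib.Path(image_dir).glob("*.png")):
            tar.add(img, arcname=img.name)   # store flat, without parent dirs
            members.append(img.name)
    return members
```

An egress rule pointed at the resulting archive would then push it to the S3 bucket once the job completes.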

As mentioned, the script file I am using is basically just the code they provide for the inference pipeline, but I have taken it out of their notebook and put it into a script so we can run it that way. We have more results now sitting in S3 to look at. To explain what I did there on the command line, I am basically able to control this from the integrated terminal that VS Code provides. I am doing fuzzball workflow status, and I can do fuzzball workflow log, and this produces a lot of log barf. We are probably going to see a bunch of random stuff. We get a bunch of different kinds of log output there. You can see the prompts being printed and you can see it saving off the images. I can control the execution of this workflow from within my IDE.

Without further ado, let's actually take a look at the results that we have so far. This is in essence the DALL-E mini AI model running inference, orchestrated with the Fuzzball system, and we are taking live requests of what we want that to do. Let's take a look at what those have yielded, shall we?

Zane Hamilton:

And the exciting part.

Forrest Burt:

I know, right? Let's see here. Go ahead and download this latest pack of images. For time's sake, I have it set to generate three images per prompt; I think if you go to their website and mess around with it, it generates nine images per prompt. I do want to point out that the container we are using is running on Rocky Linux. This model we are looking at here is running from an Apptainer that is based on a Rocky Linux base image. I just want to point that out. Here is our file of results from the first batch.

Zane Hamilton:

I am already smiling. That is hilarious.

Results of DALL-E Mini

Forrest Burt:

Once again, this was a rainbow colored cow fortune teller, a farmhouse with a cow on the sun, and a squirrel wearing a sombrero in the jungle. 

There is our rainbow colored cow fortune teller. I don't know if this counts as an LOL cow demo, but you know, we are getting there. I think. We have that. I think “on the sun” may have gotten lost. It looks like we are “in the sun” here, but here are some cows in a field. It works remarkably well, apparently, for generating squirrels in a sombrero. We have that. Here is another cow.

Zane Hamilton:

Oh, there you go.

Forrest Burt:

Here's more cows in the sun.

Zane Hamilton:

Headless cow. 

Forrest Burt:

Here is another squirrel with a sombrero. Here is I think our final of the three cows and there is us back in the sun again, and then here is our last squirrel with a sombrero. You can see that this works quite well. I will go ahead and pull up the other files so we can see some more of our very creative audiences' results. This one right here is going to be, let's see, a saddle on a cow in a car race, a screaming gopher caught in a spider web, and Richard Nixon in a tutu throwing confetti. Let's see how this turns out. There is a saddle on a cow, I think. Let's see here, there is a gopher maybe nebulously caught in a spider web, perhaps.

Zane Hamilton:

Spider web. Yeah.

Forrest Burt:

Maybe there is Richard Nixon throwing confetti.

Zane Hamilton:

Maybe not the tutu.

Forrest Burt:

Apparently "tutu" might be a little bit too specific for this model, but we will see. There is another half-created cow.

Zane Hamilton:

Part of a cow. 

Forrest Burt:

There is another gopher doing something. Here's Richard Nixon partying down again. Then we have our final few, which in general did not come out as close as we got for the last ones. 

Zane Hamilton:

He has hooves.

Forrest Burt:

It looks like he has given us his signature wave there actually. 

Zane Hamilton:

Wow,

Forrest Burt:

That is fantastic. That is the DALL-E mini model. I will go ahead and start running some of these additional requests. I think our plan is to follow up with a post or something like that to show off some of the results that we got. I will fire off some of these if we want to maybe hop over to some questions. That was perfect. Thank you all for tuning in. That was DALL-E mini running on Fuzzball.

Fuzzball Compared to Slurm [29:48]

Zane Hamilton:

Thanks for putting that together for us. It is also nice to actually show running somebody else's model. That is great. Thank you very much. All right. Let's start in on some questions. I think there have been quite a few come through here. I know Greg has been answering some of them as they go. I am trying to go back as far as we can here. We are talking about schedulers. Does Fuzzball offer scheduler services similar to something like Slurm?

Gregory Kurtzer:

Kind of. The way it does scheduling is very different from Slurm, because it is more like an orchestrator than just a straight scheduler, in the sense that it can do services and other types of processes. But it is also very much like Slurm, as opposed to an orchestrator, in the fact that it can do rich policies and allocation of very specific hardware configurations. It is NUMA aware, it can do multi-node processing, and so on and so forth. It can do management of consumable resources, and it knows when those consumable resources have been consumed; again, more like a Slurm scheduler in that case. But it is running as a microservice entity within our microservice stack.

The general architecture of Fuzzball you can kind of envision as two clusters running as one cluster. There is the management side, which is running Kubernetes, and on top of the management side we have a bunch of orchestrating services that run. We have an image service, a data mover service, a volume service, a scheduler service, and all these different microservices that are all working together and able to scale up independently in traditional microservice style. Then we have the compute cluster. The compute cluster is running a very lightweight container runtime or VM hypervisor, and this lower level, by the way, is called Fuzzball Substrate. You can ask Substrate, for example: I am going to need 10 GPUs, 30 cores, and this much memory. If Fuzzball Substrate has that available and can actually allocate it for the amount of time requested, it will respond and say, here is a lease ID.

Whenever you want to use that resource, just send me this lease ID, like a token, and then we can go ahead and schedule that. Now, when you have thousands of Substrate instances, you have to manage them with Orchestrate, which is the next level up in the Fuzzball stack, and Orchestrate is that microservice platform. You can think of it as running your Orchestrate instance on top of anything from vanilla Kubernetes to VMware Tanzu, to any one of the Kubernetes stacks. We have a version of Kubernetes that we put together that is super easy to install, and that facilitates the deployment. You can very easily deploy that management cluster, and then you can run the compute cluster using Warewulf, for example, or whatever you want to use to provision your resource. Then that cluster can scale independently.
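The request/lease/release handshake Greg describes can be modeled conceptually. The sketch below is a toy illustration of that pattern, under stated assumptions, and is in no way Fuzzball Substrate's actual API:

```python
# Toy model of a lease-based resource handshake: ask for resources,
# get back a lease ID if they can be reserved, redeem or release later.
import uuid

class Substrate:
    def __init__(self, cores: int, gpus: int):
        self.free = {"cores": cores, "gpus": gpus}
        self.leases = {}

    def request(self, cores: int, gpus: int):
        """Reserve resources and return a lease ID, or None if unavailable."""
        if cores <= self.free["cores"] and gpus <= self.free["gpus"]:
            lease_id = str(uuid.uuid4())
            self.free["cores"] -= cores
            self.free["gpus"] -= gpus
            self.leases[lease_id] = {"cores": cores, "gpus": gpus}
            return lease_id
        return None

    def release(self, lease_id: str):
        """Return a lease's resources to the free pool."""
        held = self.leases.pop(lease_id)
        self.free["cores"] += held["cores"]
        self.free["gpus"] += held["gpus"]
```

The point of the pattern is that the scheduler holds only an opaque token; the node-level agent remains the source of truth for what is actually reserved.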

Now when you are running in the cloud, it is a little bit different in the fact that that Orchestrate cluster is going to run on whatever cloud service provider is offering as a Kubernetes stack. You can run your own Kubernetes stack in the cloud, but it is easier just to use whatever they are providing and you can run Fuzzball Orchestrate on that stack. When it is not doing anything, that is the only thing running. It is a very lightweight small stack sitting on that Kubernetes resource. As soon as jobs start coming into that Fuzzball Orchestrate instance, it automatically provisions the compute resources that you need. It will automatically scale up and scale down. What you just saw in today's demos is I believe we ran everything up in one of the clouds. I don't even know which cloud it was, probably AWS or GCP, but one of the clouds, and when a workflow was submitted into Fuzzball, it automatically provisioned those cloud instances to then go and run the job.

It tears those cloud instances down when the job is done. It gives you that ability to elastically scale up and scale down as needed, as opposed to when you are running on-prem, where you could theoretically run out of resources and then a job queue develops, and that queue is managed according to priorities, just like you would see in Slurm, and not something you would typically see out of a Kubernetes scheduler. I kind of used this question to touch on a number of questions that came up in the chat and to talk through all of those. So sorry for going on so long, but I was actually answering a few different questions there.

Zane Hamilton:

Nope, that is great. I was scrolling through, I think that answers Todd's question as well.

Dave Godlove:

Can I jump in and underscore one of the things that Greg said?

Zane Hamilton:

Certainly.

Dave Godlove:

He covered so many things. I am fairly new to Fuzzball and I am still looking at it with fresh eyes, as is Brian. One of the things that blew me away about Fuzzball when I first started looking at it and understanding it, and Greg mentioned this, but I just want to underscore it: with Slurm, on a traditional cluster, you have all these nodes sitting there, spinning, doing nothing or working on other jobs, but they are all sitting there. There are all these resources. Then Slurm says, oh, okay, you have a job you want to run on these resources, let me go look at them and see what they are doing and then run your job. But that big pool of resources is always there. The cool thing about Fuzzball, when it runs in the cloud especially, is that there are no resources there at first. Fuzzball says, oh, let me build your cluster for you based on what you need. Then it basically builds you out a custom cluster and says, here is your new cluster just for your job, runs everything, and then tears it all back down. That is super cool.

Gregory Kurtzer:

There is another facet of this, which we have not officially announced. This is, I guess, a little bit of a soft leak that we are putting out there, but there is a lot of interest recently in composable hardware, specifically around CXL and PCIe switching. Imagine if a workflow could come in and say, I do not need 10 GPUs, I need 24 GPUs. You obviously do not have that in one system, but using something like a composable switch, we can now actually build that resource and set it up. Just like in the cloud, spin that up with that specific hardware configuration, then run the workflow on it, then tear that back down and put all of the resources back into the pool.

It is quite capable in terms of how you would do things like this. A lot of that is because we did a couple things when we built and architected Fuzzball. One of them was we took what was working in the HPC community, and we took what was working in the enterprise cloud and hyperscale communities, and we basically picked and chose what is the best for everything that we need to do across the entire ecosystem; let's pull those together. Then let's further innovate on top of that to give us the best platform that we can possibly imagine for doing these sorts of computing tasks. Fuzzball is the result of that. That is our version of what the best HPC platform could look like. One of the motivating factors for this was, you know, we have been building HPC clusters pretty much the same way for the last 30 years.

The Beowulf design has been incredibly fantastic for us to create HPC systems and drive science, research, and innovation, and to build and scale these sorts of capabilities. Fairly recently, we started to see a lot of new types of innovations, again coming out of enterprise cloud and hyperscale, in such a way that they were not really compatible with how we have been doing HPC systems. We have also seen a much greater increase in the diversity of workflows, to the point where this long tail of science, as we used to call it, is now actually starting to become the lion's share of the workloads. We are starting to see that traditional HPC applications, those tightly coupled, highly parallelized, MPI-focused applications, are no longer the dominant and majority applications running on these systems.

All of this together really makes right now the perfect time to start thinking about how we modernize our HPC environments. How do we take a better look, a new look, a fresh look at cloud? How do we merge cloud and on-prem in a way that absolutely makes sense, so you are running the right workloads where they need to be run? All of that is what went into Fuzzball and what we were thinking about when we did this.

Fuzzball and Warewulf [39:36]

Zane Hamilton:

Yeah, the next question. Great question too. Does Fuzzball replace some of those capabilities for provisioning like Warewulf or is it complementary?

Gregory Kurtzer:

Great question. For anybody out there who is not familiar, Warewulf is a cluster management and provisioning toolkit. It was created back in 2001 when I was at the Department of Energy, and it is still actively maintained and highly utilized today as an open source cluster toolkit. It does a few things somewhat differently from how you may be familiar with managing systems at scale. It works on the concept of imaging. Instead of managing operating systems on n number of nodes, you manage an image that goes out to all of those nodes, and those nodes boot it in a dynamic sense. Every time you turn on a node, it automatically does a net boot and is provisioned with the image that you have defined, and you can use the same image for a single node or thousands of nodes.

That makes something like Warewulf very advantageous for building and managing clusters. Now, Fuzzball sits on top of Warewulf, generally speaking, just like Slurm and/or Kubernetes; if somebody wanted to provision a Kubernetes cluster statelessly, you can do that on top of Warewulf as well. Warewulf is the tool responsible for managing the operating system on all of these resources, but what you run on top of that operating system is completely up to you. In this particular case, Fuzzball would not replace anything in Warewulf; as a matter of fact, it is 100% complementary, in the sense that Warewulf would be booting and running the operating system that is now running the Fuzzball services. It works very, very well together from that perspective. You can use Warewulf both for the compute portion of the Fuzzball cluster, as well as for the orchestration and microservice portion of the Fuzzball cluster. However, running Kubernetes statelessly is a little bit different, because most people think of running Kubernetes on stateful resources. There are a few things you have to do there just to make Kubernetes work properly and feel at home on a stateless system, but it absolutely can be done. That is one way that we do deploy Fuzzball for customers today.

Zane Hamilton:

Thanks, Greg. There was a question from Mystic Knight. Welcome back, it is always good to see you. He is asking if there is going to be an IDE plugin that has a menu bar for workflow management in real time.

Fuzzball and VS Code [42:22]

Gregory Kurtzer:

I am going to open that one up to, I think Forrest, you have been thinking about this and maybe as well others, but Forrest, tell us how you were using Fuzzball from VS Code and maybe what are some of the plans and ideas around that?

Forrest Burt:

To elaborate on VS Code a little bit: I use it a lot for working on generating my workflows, that type of stuff. It has an integrated terminal that I can use to run Fuzzball commands right from the IDE. It basically makes it an all-in-one solution for easy text editing and working on workflows. There are some other things that we are exploring around that. I think we are still settling exactly what our plans are going to be for integrating a Fuzzball plugin for that type of thing.

On the workflow side itself, something that we are exploring is the ability to take VS Code and run it as a workflow, so you can get a VS Code terminal into a compute node somewhere. That is something that we are looking at. Then we also have, of course, our GUI that we are working on, which is complementary to the CLI that you have seen Brian and me demo; most of the functionality that you see on the CLI will be available through it. So we will be able to do some of that workflow design, execution, and management, that type of stuff, through our GUI system as well. That is one kind of big solution there. I know we are still figuring out what exactly IDE plugins themselves will look like for Fuzzball.
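For anyone curious what the workflows being edited here look like: Fuzzball workflows are defined as YAML files. The sketch below is a rough guess at the general shape based on public demos; the version string, field names, and resource keys are assumptions, not a schema reference:

```yaml
version: v1
jobs:
  hello:
    image:
      uri: docker://rockylinux:8
    command: ["/bin/echo", "hello from Fuzzball"]
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
```

A definition like this is what gets authored in VS Code and then submitted to the cluster from the integrated terminal.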

Gregory Kurtzer:

I have heard talk from the engineers about doing a plugin as well. I think there is already one for Apptainer.

Forrest Burt:

There is.

Zane Hamilton:

I think this is a great idea too. Forrest, will you ever put that demo into a tutorial document?

Forrest Burt:

Yeah, we can definitely put that out there. I want to point out that it is an open source model. If you just look up the DALL-E mini GitHub, you will find the repository for it. Like I said, my modifications have been to move their Docker-based container over to Apptainer and then to take their code out of the notebook and reconfigure it a little bit, so it can save off the images instead of just displaying them in a notebook. All credit for the model and everything like that goes to, I believe his name is Boris Dayma, if you look at his GitHub there. We can definitely make available some of the tutorial materials for people to mess around with. Like I said, I modified it a little bit, pretty simply, to make it take input on the command line and then save off those images. That is something that I am sure we can make available.
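The reconfiguration Forrest describes, taking the prompt in on the command line and saving the images off to files instead of displaying them in a notebook, could look something like this minimal Python sketch. The `generate_images` call and the PIL-style `.save()` interface are assumptions standing in for the actual model code from the repository:

```python
import re
import sys

def slugify(prompt: str) -> str:
    """Turn a text prompt into a safe filename stem,
    e.g. 'Three dogs on Mars!' -> 'three_dogs_on_mars'."""
    stem = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    return stem or "image"

def save_images(prompt: str, images, out_dir: str = ".") -> list:
    """Write each generated image to disk instead of displaying it in a
    notebook. `images` is whatever the model returns; this sketch assumes
    objects with a PIL-style .save() method."""
    paths = []
    for i, img in enumerate(images):
        path = f"{out_dir}/{slugify(prompt)}_{i}.png"
        img.save(path)
        paths.append(path)
    return paths

if __name__ == "__main__" and len(sys.argv) > 1:
    prompt = " ".join(sys.argv[1:])  # the prompt comes in on the command line
    # generate_images() stands in for the real model call in the repo;
    # it is not defined here.
    # save_images(prompt, generate_images(prompt))
```

Wrapping the notebook code in a plain script like this is what lets a batch system run it non-interactively and collect the output files afterwards.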

Zane Hamilton:

How long did it take you to get it to work, Forrest? When you started to play with it?

Forrest Burt:

It took a little bit. Most of the work I ended up doing was just code level debugging. As far as the actual Fuzzball part of it, once I had the image going and the script itself sorted out so there were not any errors, it was essentially as simple as dropping it into a Fuzzball workflow and then figuring out what resources it took to run optimally. The workflow creation process was fairly simple. The script debugging took a bit, but dropping it into a workflow and getting that workflow running was pretty simple.

Gregory Kurtzer:

Of course, the bigger lift would be getting a Fuzzball cluster running. This is something that we are planning to put up as demo instances on the net, so people can actually use this. Timing for that is still a little bit further out, because that is technically a cloud resource, a cloud platform. But that will be something that you are going to see here pretty soon.

Workstations [46:31]

Zane Hamilton:

Very cool. Thank you. Thomas Knight has another question. I think we talked about this a little bit last week in the round table, but given that university labs have a lot of high-end workstations, can you federate over a campus, or multiple campuses, to those high-end workstations? And can you put them into a single cluster during their idle times?

Gregory Kurtzer:

Yes. In terms of workstations, it is a little tricky. Most people are not building HPC systems out of workstations. That was actually kind of a big deal a little while ago; as a matter of fact, it was one of the reasons why a lot of people liked the name Warewulf, because they could convert their labs of workstations to a cluster at night under the full moon. That was one of the reasons Warewulf went out and became very popular, because Warewulf is completely stateless. You can actually have locally installed Windows systems, have them boot to PXE first, and then when you reboot those systems with your Warewulf control service turned on, they will automatically all turn into cluster nodes.

There were a lot of people in computer labs doing that. Now, that was in the earlyish to mid-2000s. Since then, most people have been focused on building dedicated production clusters for those sorts of workloads. In either case, if you want to build up multiple HPC systems: I have talked a little bit about Fuzzball Substrate and Fuzzball Orchestrate. Those two are what make up a single cluster resource. There is another level of Fuzzball which sits above both of those, called Fuzzball Federate, which does exactly what you are asking: it gives us the ability to basically link together, join, and unite any number of separate Fuzzball clusters. These clusters do not always have to be on-prem.

They can be in the cloud. You can have, let's say, Fuzzball resources in different colleges within a campus. You can have different campuses. You can have a Fuzzball cluster in AWS and a Fuzzball cluster in Azure, or maybe in AWS you actually want a Fuzzball cluster in multiple availability zones, to ensure that you can always get GPUs when you need them. They are actually getting hard to get in various clouds. This gives you the ability to automatically choose where you want to run. Maybe that is in Azure, maybe that is in GCP, maybe that is in another campus' server room or data center. There is a lot of optionality and a lot of functionality in terms of this architecture. Everything we demoed today was just going straight to a single cluster. At some point in the future, we will start really demonstrating and showing off Federate. Right now, we are still focused on Orchestrate and just demonstrating through Orchestrate.

Zane Hamilton:

That is great. Thank you. I think that was all the questions we had. If you guys have any more questions, post them now. Give that a minute. I want to thank Brian. Thank you for putting that together. That was great. Thank you for spending the time, Forrest, as always we really appreciate it. Dave…

Dave Godlove:

I want to say, I really, really love demos. A demo like this, what Brian put together as far as modeling, is something you can visualize and think about, and you can really understand what is going on at an intuitive level. And also just really fun demos: as everybody who knows my history knows, I really like to make sure that there is some fun and some humor in demos to keep you engaged, and they are just fun to work on, too. I just love stuff like this.

Zane Hamilton:

Absolutely. 

Second set of Fuzzball Images [50:47]

Forrest Burt:

Really quickly, I have the last two batches of results from the other requests that we had, if we want to close out by looking through those.

Zane Hamilton:

Absolutely. Let's do it.

Forrest Burt:

Once again, this is a Fuzzball workflow, which I engineered to use this model, with Fuzzball orchestrating the actual runs. I just want to make it very clear that everything you are seeing here was generated from that workflow and was all run on Fuzzball in an AWS data center, on resources which it orchestrated there. So very, very cool. Let me just go ahead and share my screen and I will show you our last Fuzzball results here.

Gregory Kurtzer:

Could we also post all of these images somewhere? Maybe from the YouTube link, so people can actually go and download them and hang them on their walls, and maybe make some amazing NFTs from them.

Forrest Burt:

Exactly. Yeah. Okay. These were a diver drinking coffee in a mountaintop cafe, a screaming spider caught in a gopher web, and three dogs playing with an elephant on Mars. Let's see how these turned out. Let's see, here. There it is. Okay. There is a diver, not exactly on a mountaintop, but at least a spider and a web. Maybe it is a little too novel to have a, whatever I just said, a screaming, what is it? A screaming spider caught in a gopher web. Maybe that spider is screaming and we just cannot hear it because this AI does not generate sound. Here is a bunch of elephants and dogs, it looks like maybe playing some soccer or something like that on the Martian surface. Here some divers look like they are having a cuppa at the bottom of the ocean.

Zane Hamilton:

Coffee. There you go.

Forrest Burt:

That is cool. It looks like they maybe even have a little coffee cake or something like that right next to it. Here is another spider in a web, but noticeably distorted. Here is a little bit more of the great, I guess, animal conference going on on Mars, a little bit of a misshapen diver there, another random spider, and then more going on on Mars. Then I have the last one. I hope I have them all; I think I have all the requests that came in, and I do not think I missed any. This last one right here is a bison on a keyboard and a space shuttle backpack riding on an elephant. I did not have a third one that I could see, so I just put in a CIQ branded supercomputer, just to see what we get.

Zane Hamilton:

I see you did not do the Van Gogh style painting of Gregory Kurtzer.

Forrest Burt:

I did not in this round. We can look at a couple of those. I think that we could, oh yeah, that looks really good. 

Gregory Kurtzer:

Could we not do me? I am just saying.

Forrest Burt:

Let's see. We have a bison on a keyboard. Let's see, what was this? A space shuttle backpack riding on an elephant.

Gregory Kurtzer:

That one is actually pretty accurate.

Forrest Burt:

That is looking good. Yeah. Looks like we have a strap there and stuff. Here is our CIQ branded supercomputer. This looks like CIQ colors. It has a little bit of a green hue to it, this looks like the rack of servers that Jonathon is in, that photo that is everywhere.

Gregory Kurtzer:

You know what, I think you are right. We have to bring that up with Jonathon.

Forrest Burt:

Here it looks like royalties. Here is a bison sitting on a space bar, it looks like. Here is another elephant with a space shuttle backpack.

Zane Hamilton:

Maybe backwards.

Forrest Burt:

Maybe it might be a green supercomputer. Yeah, that is definitely, definitely getting a little bit more specific.

Gregory Kurtzer:

Random. I have a lot of presentations nowadays. I think I am going to use that one. 

Forrest Burt:

Perfect. Here is a bison in a kind of free form. Maybe a touchpad keyboard.

Gregory Kurtzer:

I think I might have to use that one too. I am not sure where, but I am going to put it into a presentation somewhere.

Forrest Burt:

It looks like we have not only the spaceship, but the whole crew here. One more supercomputer. So the supercomputers, I guess, are maybe not that uniform when you look at millions of pictures of them at once. Cool. Yeah. That is all of those. Fun times. Thanks everyone for tuning in and for having some fun with us on those.

Gregory Kurtzer:

I think what I just learned is that every time I need an image for a presentation, I am going to start bugging Forrest with a bunch of keywords. Forrest, I think it is going to be in your best interest to post that workflow as a template somewhere live, so I am not bugging you every time I make a new presentation.

Forrest Burt:

Perfect.

Zane Hamilton:

Set it up as a service in the marketplace, so you can just go download your own.

Forrest Burt:

Really quickly, I have shared my VS Code screen, and I just want to link it back to what we said about that earlier. Once again, this is the workflow this ran from, and these are the logs being printed out of it. All of the execution, for example starting this workflow, pulling its logs, checking its status, that type of thing: as I mentioned, I did all of that from within VS Code and the integrated terminal there. Very useful, but just kind of a random note to show, once again, how I have been running these commands. Typically I pull up a terminal, but I wanted to show you guys where those are coming from.
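For context, the commands being run in that integrated terminal are ordinary Fuzzball CLI invocations, roughly of this shape. The subcommand names are recalled from the demo and may differ between Fuzzball releases, and the workflow file name and ID placeholder are made up:

```
fuzzball workflow start dalle-mini.yaml   # submit the workflow definition
fuzzball workflow status <workflow-id>    # check the state of its jobs
fuzzball workflow logs <workflow-id>      # stream the logs shown on screen
```

Since these are just CLI commands, any editor with an embedded terminal gives the same edit-submit-watch loop without leaving the IDE.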

Gregory Kurtzer:

And we did all that while you were in the middle of a webinar. Very cool.

Forrest Burt:

I appreciate the creativity and the responses that we got that made it very, very fun.

Zane Hamilton:

I also like Dave's last comment, Dave Ingram's last comment about Rocky Linux 9 code name, bison keyboard.

Gregory Kurtzer:

Blue Onyx is the code name for Rocky Linux 9, but bison keyboard could be a good one for the next release. Maybe not; it is not really a rock or a color, but you know.

Forrest Burt:

Exactly. Most ergonomic keyboard ever, looking through the comments. Exactly. Cool.

Zane Hamilton:

I love it. Well, thank you guys very much. We appreciate you joining us this week. Look for these images to be posted somewhere. We will link to it from YouTube. We will get those out there and Forrest will work on getting a tutorial on how to get this done for yourself so you can waste a lot of time spitting out random images. I really appreciate it. Thanks for your time. Thanks guys for joining us.