Research Computing Roundtable - Turnkey HPC: Monitoring
Up next in our Research Computing Roundtable series, our HPC experts will be discussing Turnkey HPC and will be focusing on monitoring. We will be revealing why you may prefer a monitoring system like Prometheus (monitoring system and time series database) over traditional tools, how to get new value out of a “modern” monitoring stack, how we help our end-users with application monitoring, and more!
Webinar Synopsis:
Speakers:
-
Zane Hamilton, Vice President of Sales Engineering, CIQ
-
Gregory Kurtzer, Founder of Rocky Linux, Singularity/Apptainer, Warewulf, CentOS, and CEO of CIQ
-
Forrest Burt, High Performance Computing Systems Engineer, CIQ
-
Jonathan Anderson, HPC Systems Engineer, Sr., CIQ
-
Glen Otero, Director of Scientific Computing, CIQ
-
John Hanks, HPC Principal Engineer, CZBiohub
-
Alan Sill, Managing Director, High Performance Computing Center, Texas Tech
-
Chris Stackpole, Advanced Clustering Technologies
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Full Webinar Transcript:
Zane Hamilton:
Good morning. Good afternoon. Good evening. Welcome back to another webinar with CIQ. We will continue with our HPC round tables and talk about monitoring. We wanted to bring on a panel of people to talk about HPC monitoring and a Turnkey HPC environment. We'll do a quick round of introduction. John, we'll start with you.
John Hanks:
John Hanks, Griznog. I do HPC in life sciences.
Zane Hamilton:
Excellent. Jonathan.
Jonathon Anderson:
I am a solutions architect with CIQ. I have a background in systems admin.
Zane Hamilton:
Greg. You're next on the screen.
Gregory Kurtzer:
Hi, everybody. I'm Greg. I've been doing HPC for a very, very, very long time.
Zane Hamilton:
Glen,
Glen Otero:
I've been doing HPC with Greg for a long time. Now I'm the director of scientific computing at CIQ and specialize in HPC and the life sciences.
Zane Hamilton:
Finally, Forrest.
Forrest Burt:
Hey everyone. I'm Forrest Burt. I'm an HPC systems engineer here at CIQ. I've been doing HPC for a little bit less time than some of these gentlemen here, but I'm excited to be doing it.
HPC Monitoring vs. Turnkey HPC Environment [00:01:41]
Zane Hamilton:
I want to dive right into it. When you look at traditional monitoring tools in an HPC environment, you're looking at things like Nagios, RRDtool, and Ganglia, but what is the advantage of going to something newer like Prometheus? I'll start with you, John.
John Hanks:
The advantage of going to anything over Nagios is not having to deal with the syntax or the config. Not many configuration syntaxes are worse than trying to configure Nagios. Other than that, this is a weird area for me because I don't do a lot of monitoring. When I want fine-grained monitoring from a node, I tend to go to the node and collect it in real time, not try to collect it over time. And for the cluster as a whole, Slurm handles health monitoring with the node health check. I tend not to do a lot of monitoring, but I don't like configuring Nagios or any of its derivatives.
Zane Hamilton:
Does anyone else have any thoughts on that?
Jonathon Anderson:
I can speak to that specific point. For whatever reason, at some point in the past, I got myself way inside the Nagios config and got to the point where I did not find it confusing or a pain. And so often, I know there are better tools, but I have historically tended to fall back into Nagios because it's what I know. It's what I know how to configure, and I know how it behaves. I look at what the web-scale people especially use, big tools like Prometheus; I've used InfluxDB and its whole suite of tools for a while. I appreciate that at an academic or an architectural level, but I often find it difficult to advocate figuring all that out when I already understand the Nagios part.
Zane Hamilton:
I can see that. Glen, you've got to have some thoughts on this.
Glen Otero:
Ganglia was my first experience in monitoring. I still have scars, and not because it came out of Berkeley or anything like that. I was thinking about this topic, and Ganglia seemed to do enough, but its ancient interface and inability to give you a picture of hundreds of nodes is where I was running into limitations. I thought interfaces that could show compute nodes with a color-coded scale, using a really small number of pixels so you could see 128 nodes or so at once, would be the future. I think things like Prometheus and other cloud-scale monitoring tools are more effective.
Glen Otero:
Also, when you can put things into a database, whether it's Influx or Elastic, you can play things back. You can play back what your cluster was doing over a certain time. That is helpful, particularly for trying to debug why someone's job affected all of its nearest neighbors. I'd like to hear what other folks think about what they find important in a monitoring tool. Is it just the CPU? Is it percent memory being consumed, or something else?
Zane Hamilton:
Forrest, coming out of an academic environment recently, what did you guys do?
Forrest Burt:
I was thinking about this in the run-up to this meeting to make sure I had the answer, but it's murky in my memory. If I recall correctly, we were a Nagios and Ganglia shop. I can see the charts that we had to monitor in my head; I think they were from Grafana. We were using the Nagios and Ganglia stack. I only have a few specific thoughts there. We were a small cluster, and it worked for our purposes. We were mostly looking at the standard things you would expect, CPU and that type of thing, trying to watch out for jobs sitting on nodes without utilizing them at all.
Forrest Burt:
As far as this went, some of the stuff I did on the side was creating custom scripts to parse what Slurm would give you about different bits of info. I could see all at once what nodes had stuff scheduled on them, all that type of thing, without putting in four or five different Slurm commands. I had some stuff that I had set up just as my own custom thing, but for the most part, as I recall, we were Nagios and Ganglia, about what you'd expect.
Zane Hamilton:
Thank you, Forrest. Greg, I know you've got some thoughts in this area.
Early Days of Warewulf [00:07:15]
Gregory Kurtzer:
First, there's a bunch of legacy here that I can talk about. When we started doing Warewulf in 2001-ish, there was a big request. There were many people interested, not just in provisioning but also management. There was a piece of Warewulf which did monitoring. I did a Google search because I don't have any screenshots of WW top running anymore. I wanted to share something so everybody could see what it looked like.
Zane Hamilton:
Let me guess, it looks like top.
Gregory Kurtzer:
Yeah. But instead of looking at independent systems, it was real time monitoring of your entire cluster, and you could sort by each field. This one is sorted by node name; you can see it's bold. You can go through and take a look at your entire cluster. You can also sort according to idle, uptime, and all these things. You can find out what systems are dead and what systems are alive. This was never hooked into the resource management system and whatnot, but with the appropriate configuration, you could easily link it into the resource manager and show the nodes running a particular job, for example. This had a daemon process that sat on every compute node and, basically, every second would query the local resources of that system.
Gregory Kurtzer:
It would come up with how many CPUs you have, the architecture and whatnot, and what your current memory utilization is. You can see everything; this person has temp in there as well, how many processes are running, et cetera and so forth. This gave a really good, real time analysis. Now, we already talked a little bit about Ganglia, and at the time (I'm going to stop sharing now), Ganglia was the prominent solution being used. One of the issues that other people and I had with Ganglia was that it used a multicast channel to do all of the updates. If you're not familiar with multicast, think of it as broadcasting to a particular group or to a particular port.
Gregory Kurtzer:
Every Ganglia daemon that's running on every single node is listening and broadcasting to every other node in your cluster. This means every node, sometimes dozens or hundreds of times a second, is hearing about every other node in your cluster. Now, it was convenient for smaller clusters because every node always knew everything. You always had backup data, though I'm not sure how relevant that backup data was, but every node knew everything. That was never the major benefit, though; the major benefit was that there was no configuration. You turn it on, it jumps into a multicast channel, and it starts broadcasting and receiving data. Then you can put a web front end on it and whatnot and make it really pretty.
Gregory Kurtzer:
But when you're dealing with bigger clusters, you don't want all nodes listening to, hearing, and parsing XML from all other nodes on every single status change of those nodes. It was kind of a heavyweight solution. So, what we did differently with Warewulf was make it very lightweight. It did create a context switch once a second, interrupting once a second as it collected resources and current status and then shot that over the wire if certain things had reached certain thresholds of change. It tried to be as optimal as it could, but there's no getting away from it: you're still causing context switches, you're causing operating system noise and jiffies, and if you're doing tightly coupled HPC, every single thing you do is going to have an effect on the overall job and performance of the system.
Gregory Kurtzer:
We tried to be as lightweight as we could. We sent a UDP packet with everything we needed for that node, directed to particular head nodes. No other node heard it, but this was still incredibly rudimentary. There's so much more that could be monitored effectively. The question is, how do we do it effectively, and how do we do it in a way that's not redundant with the resource manager? The resource manager is typically already doing some amount of monitoring. How do we augment the resource manager without adding any additional load or resource consumption to that system? That's my big perspective on monitoring: what is the information we need? How do we get that information without causing latency, bottlenecks, and whatnot? And then how do we portray and visualize it in an effective way for users?
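As a rough illustration of the kind of lightweight, per-node collector Greg describes, here is a minimal Python sketch: it samples a few values from /proc once per second and pushes them to a head node in a single UDP datagram, only sending when something has changed enough to matter. The head-node address, port, payload layout, and threshold are illustrative assumptions, not the actual Warewulf protocol.

```python
#!/usr/bin/env python3
"""Minimal sketch of a lightweight node-stats sender (assumptions, not Warewulf itself)."""
import socket
import time

HEAD_NODE = ("127.0.0.1", 9999)   # assumed collector address and port
INTERVAL = 1.0                    # seconds between samples

def read_loadavg():
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])

def read_meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])   # values are reported in kB
    return info["MemTotal"], info["MemAvailable"]

def main():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    node = socket.gethostname()
    last_load = None
    while True:
        load = read_loadavg()
        mem_total, mem_avail = read_meminfo()
        # Only send when the load has changed past an (assumed) threshold,
        # mirroring the threshold idea described in the transcript.
        if last_load is None or abs(load - last_load) > 0.05:
            payload = f"{node} load={load:.2f} mem_used_kb={mem_total - mem_avail}"
            sock.sendto(payload.encode(), HEAD_NODE)
            last_load = load
        time.sleep(INTERVAL)

if __name__ == "__main__":
    main()
```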
BMCs: Time Series Databases and Organizations [00:12:41]
Zane Hamilton:
I think it's interesting that Glen brought up being able to go back and replay stuff that had happened in the past to troubleshoot. But then John, maybe you don't have that need, or it's not relevant, or it just takes too much time. From your perspective, John, why wouldn't that be something you would do, or is it just too much to deal with?
John Hanks:
Back up to why you're going to monitor; I have a cynical view of this. If I'm monitoring because I'm a sysadmin and I need to troubleshoot stuff, I would rather do that in real time: run the application and debug in real time to see what's happening, or look at the cluster with a workload on it to see what's happening. I wouldn't need playback. But if I have a boss who loves to click on pretty things in a GUI, that's a fantastic feature to keep them busy so they're not bothering me while I'm trying to do real work. In that case, I would like to have playback. Why don't you go see what the cluster was doing a month ago and leave me alone? That'd be a great feature for me to have.
Zane Hamilton:
Excellent. There's a lot of chat going on. Alan posted in there what they're doing with the BMCs, collecting that into a time series database and organizing things. There are a lot of opinions going through. Thank you all for chiming in.
John Hanks:
The BMC is a good point. The only way you can collect metrics from a node without impacting the node is to collect them from the BMC. Anything you do in the OS is going to cause a problem. That's the last time I used Ganglia; we stopped using it because it severely impacted OpenFOAM performance. We just dropped Ganglia from the cluster completely, and that's when I first stopped collecting metrics on nodes.
Zane Hamilton:
Raymond just pointed out that they have Ganglia monitoring 2,000 servers and had to have a dedicated server to do it. He says that's very cool. How could we help people get more value out of a more modern tool, or is that something we should even worry about? Jonathan, I'll go to you first, and I'll pick on John in a minute.
Jonathon Anderson:
I've struggled to find a best-practice or best-in-class, and fully open source (I'll say, because that's a preference for me anyway), monitoring stack to point people at and say: not only is this good, and representative of what people who know about monitoring would recommend you do today, it is also not complex to deploy. I've run into Elasticsearch as a sink for data many times. The last time I tried to do an Elasticsearch deployment, it had a lot of licensing encumbrance issues, but it also very opinionatedly wants you to have a three-node cluster to store your data, because it expects you to be doing web-scale work: why would you ever have a single point of failure in your monitoring system? Finding that middle ground, where you want a good solution that you could scale up in the future but that gets you up and running quickly and demonstrates its value early, is something I've frankly struggled with. I don't know the answer to it yet. It's one of the things I wanted to get out of this conversation.
John Hanks:
If you think about the stuff you want to collect, you're talking about enough ints, maybe enough doubles, to fill a jumbo frame in Ethernet. Suppose you set out at the beginning and say: I want a monitoring system that will update my information every two seconds or every five seconds and give me enough data to fit in a jumbo frame, and set that as a constraint. If you can only send a single jumbo frame full of information, by definition you will wind up with a pretty efficient collection system. And if you drop TCP, do it over UDP, and accept that it may be lossy, you've made that thing super efficient. That's what I would like to see. If a tool like that existed, I would probably be interested in monitoring again. I have no desire to set up Prometheus or Ganglia ever again. I will do some stuff with Grafana, but usually just from my Slurm controller, dumping a few minor things out of Slurm to see what Slurm's up to.
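To make the arithmetic behind that constraint concrete, here is a small sketch that counts how many double-precision metrics fit in a single jumbo frame; the IPv4/UDP header sizes are standard, and the metric values are placeholders.

```python
import struct

# How much telemetry fits in one Ethernet jumbo frame?
# 9000-byte payload minus IPv4 (20) and UDP (8) headers.
JUMBO_PAYLOAD = 9000 - 20 - 8                 # 8972 bytes usable
DOUBLES_PER_FRAME = JUMBO_PAYLOAD // 8        # about 1121 double-precision values

metrics = [0.0] * DOUBLES_PER_FRAME           # placeholder metric values
packet = struct.pack(f"!{DOUBLES_PER_FRAME}d", *metrics)

print(f"{DOUBLES_PER_FRAME} metrics -> {len(packet)} bytes per node per interval")
```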
Gregory Kurtzer:
You're describing what Warewulf was: WW top and the Warewulf monitoring system. But the question I would push back with, because I last developed that literally a decade and a half ago and the last time I used it in my HPC systems was at least a decade ago, maybe even a decade and a half, is this: what information do we need to expose via monitoring to users and system administrators? What's missing right now?
John Hanks:
There is nothing I want to expose to users. They can engage Slurm profiling and get an HDF5 file with fine-grained metrics on their multi-node job, whatever they're running, as detailed as they could want. For user troubleshooting, between SSH to the node and what the built-in Slurm profiling does, there's nothing else I want to show those people. This would be purely for me, as a troubleshooting tool, to see what's going on cluster-wide.
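For readers who haven't used it: Slurm's HDF5 profiling plugin writes per-job files that can typically be merged with sh5util and then inspected with any HDF5 reader. A minimal, hedged sketch (the file name is a placeholder, and the internal group layout varies by Slurm version) just walks the file and lists the time-series datasets it contains:

```python
import h5py  # third-party: pip install h5py

# Path is illustrative; Slurm's hdf5 profiling plugin writes per-job files
# that can usually be merged with `sh5util -j <jobid>`.
PROFILE = "job_12345.h5"

def show(name, obj):
    # Print every dataset (time series) found anywhere in the profile.
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")

with h5py.File(PROFILE, "r") as f:
    f.visititems(show)
```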
Gregory Kurtzer:
You're a system administrator, but what about those people? What about those users? What do they need to be more effective, and is it already there? You brought up profiling; profiling is critical, but most HPC users don't spend much time profiling. It's like, what do we need to be exposing? Sorry, Jonathan.
Jonathon Anderson:
One of the things that I've seen be useful in the past, and wish we had more data on, is the time spent in file I/O, especially differentiating actual data reads and writes from metadata. The characteristics of different file systems are different, whether it's a clustered file system, a remote file system, or local storage. The biggest overlap we've had between systems monitoring and improving a user's life was showing them that if they adjusted their file I/O pattern, they would get drastically better performance for their overall application. But that's difficult to collect and visualize, in my experience, with the tools I've used.
Zane Hamilton:
Alan keeps putting stuff in there. I wish we could have Alan pop in to tell his story. They're doing interesting stuff, collecting data to go back and do replays of multi-node MPI, and even memory, CPU, and power consumption. Thank you for adding all the color; we appreciate it. Steve's adding some stuff in as well.
Forrest Burt:
I was going to say that it's interesting what he is talking about there, pulling the info from the native Slurm APIs and the BMC network. That's pretty interesting.
How Are Users Managed and Monitored [00:20:35]
Zane Hamilton:
Steve asked the question earlier about how users are managed and monitored from allocation, utilization, or scheduling. Being able to monitor an individual user, are they consuming your entire cluster? How is that being done, or do people even pay attention anymore?
Forrest Burt:
Slurm provides a lot of that type of functionality. For example, on the cluster I worked on, we had to address one specific thing: can a user use the entire cluster? We had something set within Slurm that would only allow a request of up to eight nodes at once from a certain queue. A lot of sites have stuff built natively into Slurm that allows them to keep track of that; squeue is the classic command that shows you everything running and everything pending. That's all controllable through Slurm natively.
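As a small illustration of the kind of visibility Forrest mentions, a sketch like the following (assuming squeue is on the PATH and using its %u, %D, and %T format fields) summarizes how many nodes each user currently holds:

```python
import subprocess
from collections import defaultdict

# Summarize running node usage per user from squeue output.
# Format specifiers: %u = user, %D = node count, %T = job state.
out = subprocess.run(
    ["squeue", "-h", "-o", "%u %D %T"],
    capture_output=True, text=True, check=True
).stdout

usage = defaultdict(int)
for line in out.splitlines():
    user, nodes, state = line.split()
    if state == "RUNNING":
        usage[user] += int(nodes)

for user, nodes in sorted(usage.items(), key=lambda kv: -kv[1]):
    print(f"{user:15s} {nodes} nodes")
```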
Zane Hamilton:
So policies. Setting policies of what they can mostly do.
Forrest Burt:
Yeah.
Jonathon Anderson:
For that accounting information, I often see people deploy a tool called XDMoD, which came out of the XSEDE project. I know NSF, at least, liked it when you deployed it and liked the reports that came out of it. It made more sense in a proper XSEDE environment; we were also using it as a tier-three site at one point. I never got very familiar or adept with it, but it integrated with Slurm, pulled in all that accounting data, and let you show multi-user and multi-account usage graphs over time.
Forrest Burt:
Earlier, when I said I couldn't quite remember what we used at Boise State, I had "X" at the front of my mind, but I was wondering if that was just because I'd been messing around with a lot of X11 debugging. XDMoD was something that we used; it produced a lot of graphs and things like that which we were able to serve back out. I know that's still popular out there.
Zane Hamilton:
Greg, any thoughts on this before I change topics?
Gregory Kurtzer:
I'm still really interested in the user side. John, not to put words in your mouth, but you're looking at it from the system administration perspective. Jonathan brought up the storage perspective and identifying bottlenecks in the storage via I/O locks and whatnot. That's incredibly valuable as well, but that's not user-related. When we first did WW top, the reason was that users wanted to know exactly what their applications were doing. Are they effectively using the cores and the memory of that system? What are the holes now for users? Are there problems that we still need to solve, or are there tools that already do this effectively on the user side?
John Hanks:
Our users can SSH, so they should know where the job is running and can run top, htop, or atop. I haven't ever given them access to iotop, but I should figure out a way for them to run iotop. They can SSH to a node with a running job and run any tool they can run. I have some users who would be happy to have that in a web interface. The ones that understand what they're doing when they're tuning, profiling, and troubleshooting would prefer to SSH to the node and work on the thing directly.
Jonathon Anderson:
Is that true for users who have multi-node, MPI-like workloads? That's where I'd expect the high-level application profiling paradigm through node monitoring to fall over, but I've never been an MPI developer, so I don't know.
John Hanks:
I don't have those users, but back when I did have those users, which was ten years ago or more, they still SSHed to the node. People wanted to run things like debuggers; there was a debugger tool I used to use called TotalView, and there were other debuggers that would do parallel debugging. They just wanted SSH access to the nodes to start that debugger and let the debugger do the work.
Zane Hamilton:
Forrest and I have talked about this quite a bit, and it's an interesting topic. We don't have it on the outline, so it will be new to most people here, and I want to get reactions and thoughts. What happens when we start throwing things like AI into this, having HPC have AI associated with it and being able to help with this in a more artificially intelligent way? I'll let Forrest dive into what we've talked about, what he's seen, and his thoughts.
AI Associated With HPC [00:25:46]
Forrest Burt:
We focus a lot on the human side of monitoring, providing something that a person can sit there and look at and derive insight from. To my knowledge, there's never been any real effort to do AI-driven analytics on a supercomputer. I'm talking about things like intelligent scheduling: instead of scheduling being governed by a fixed algorithm, something is looking at the cluster and trying to decide where things fit best. As clusters become very, very big, we stop viewing them as collections of nodes and more like a single industrial machine that is being leveraged for a certain purpose.
Forrest Burt:
It would be useful if we had an all-cluster way of analyzing these very large streams of info from these very large clusters at once. Part of the answer there might be AI, to derive those types of analytics from the data being generated on these clusters. Of course, one of the first questions is whether there's a data set out there, say a million scheduled jobs, that we could look at and begin to train AI on. This is something that hasn't been explored. Many of these solutions are pretty manual and require a human operator looking at them. What about AI-driven monitoring? What about something designed to look at a cluster and see where things could be more efficient and improved? Are there any efforts going on with that? Any thoughts on connecting AI with supercomputer management?
Glen Otero:
I don't know if any efforts are going on, but as you describe this, where my head goes is: if we can instrument whatever's spawning the job, we can monitor the job in a way that pulls in that context and feeds it back into a model. We can start making intelligent decisions or recommendations: is this good hardware for you to run this job on? Are you sure you don't want to give more memory to this job? Are you sure you want to run on this architecture? You may want to run on a GPU, which is available right now. We can glean a lot of intelligence with the appropriate monitoring and modeling of that data. A hundred percent, yes.
John Hanks:
That would depend on the type of AI. The modern AI that we're all doing on GPUs now is just pattern recognition, and I'd be hard-pressed to think of a pattern in my logs or any metrics I might produce that I couldn't just grep myself and write a simple rule for. I wouldn't need to train a model for that stuff. But for what Greg's talking about, that's like the old AI stuff, where you would do a tree search to take a derivative of something to solve a calculus problem. In that case, once you know the rules to produce a decision tree, somebody can say, I'm going to run this job that does this, and that decision tree produces a valid suggestion for them. There is a ton of use for that in presenting stuff to user space in a better way. People would get more optimal use out of the system by being able to follow that decision tree.
Zane Hamilton:
That's great. Thanks, John. Jonathan.
Jonathon Anderson:
I have two primary experiences with this, reflecting different parts of what Forrest brought up. One was a project at a past site of mine that ended up being someone's graduate thesis. At the time, the Lustre environment for that system was pretty unstable and kept falling over, and particular user workloads would cause Lustre to become unstable and crash. Beyond trying to figure out why it was happening in the first place, they were looking for ways to predict that it would happen. There was an effort to dump all of the high-verbosity Lustre logs and file system telemetry and then train an AI model to see if it could predict when Lustre would fail.
Jonathon Anderson:
My understanding was that it got to better than even odds. It was an interesting academic project, but not something that proved particularly useful, at least not from the data set we had fed it, which was a single cluster system with a single set of user workloads. You could develop something like that which you could train up for a given environment, and it would learn the kinds of things that happen in that environment, but generalizing it to any Lustre environment or any cluster, at least, wasn't part of the effort. The other thing that I've seen is people trying to do AI-enhanced scheduling. All of this comes from the fact that, ultimately, if users were truthful or fully accurate about what their job was going to do ahead of time, you could schedule the system perfectly, and you wouldn't need something like AI.
Jonathon Anderson:
They aren't. They specify friendly values for wall time, memory use, CPU requirements, and things like that. Their requirement definitions are not precisely tuned to what their job will do in the real world. I've seen AI-driven attempts to take job history and see what jobs did versus what they said they were going to need, and then make different scheduling decisions based on those historical patterns: oversubscribe a node beyond what the jobs requested and try to schedule based on what the jobs are actually going to require. For example, maybe they requested more memory, but the AI model has learned over time that this kind of job tends to use half the memory it requests, and it schedules the resource more tightly that way.
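A hedged sketch of the simplest version of that history-based comparison: given accounting data exported to a CSV (the file name and column names are assumptions; the numbers could be derived from fields like sacct's ReqMem and MaxRSS), compute how much of their requested memory each user's jobs typically use.

```python
import csv
from statistics import median

# jobs.csv is an assumed export of accounting data with columns:
# user, req_mem_mb, max_rss_mb
ratios = {}
with open("jobs.csv") as f:
    for row in csv.DictReader(f):
        requested = float(row["req_mem_mb"])
        used = float(row["max_rss_mb"])
        if requested > 0:
            ratios.setdefault(row["user"], []).append(used / requested)

for user, values in ratios.items():
    typical = median(values)
    if typical < 0.5:
        print(f"{user}: jobs typically use {typical:.0%} of requested memory; "
              f"a smaller memory request could be suggested")
```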
Jonathon Anderson:
And you can get interesting results that way. I see two problems with it. One is that the impact of a failure of the AI model is very high. If you say you needed 10 gigabytes of memory, the scheduler overprovisions the node on that basis, and your job fails because you needed that memory and the AI model thought you wouldn't, that's a bad outcome. It would be bad for the scheduler to create a job failure that would not have happened otherwise. The other is that it's not just an automated system; getting users to trust the scheduler is already difficult enough. One of the most frequent questions we would get is: why is my job not running yet? We could detail how the fair-share scheduling algorithm works and why their job isn't running yet by plugging all the values into the formula that determines job priority.
Glen Otero:
Over time we could, without fail, show them the historical records and say: you have been allocated this share of the machine, they have been allocated this share of the machine, and over the past month you've gotten exactly that. It's scheduling to produce that outcome, but that's with a relatively well-characterized algorithm where you can point to the values that go into it and say, this is why it's happening. The minute it becomes a fuzzy AI model that you must trust to make good scheduling decisions, it becomes even more difficult to get users to trust that the scheduling is fair or correct. The worst case is when scheduling becomes all reservation-based and everyone is just reserving time on the cluster. You lose all of the benefits of batch scheduling: this is my time on it, I have it all Monday, that kind of thing. When people stop trusting the algorithm, that's what they start doing, trying to get dedicated time instead.
Zane Hamilton:
Glen, you're our AI guy around here. I wanted your thoughts on this. And then we'll go to Chris. Thanks for joining Chris.
Glen Otero:
AI could be used to make better suggestions, like someone was saying: oh, you're going to submit this job, do you want to run it here? Historically it runs better on a GPU; maybe you should think about scheduling it on a GPU. I think that will be useful, but trying to beat the current algorithms at backfilling and things like that is probably a bad idea and would lead to a lot of chaos.
Zane Hamilton:
Okay. Chris, introduce yourself if you don't mind.
Chris Stackpole:
Chris Stackpole, I've been doing HPC for a long time, and currently, I work for Advanced Clustering Technologies. We're a systems integrator and builder for people who need HPC systems.
Monitoring With Zabbix [00:35:35]
Zane Hamilton:
Thanks for joining. We're talking about monitoring, and I hope you have some thoughts on monitoring and any of the topics we've covered so far that you want to touch on.
Chris Stackpole:
I do a decent amount of monitoring. I've been working with Zabbix in some capacity since about 2007, and I do like it for monitoring, especially for HPC systems, from an admin perspective, for a couple of different reasons. Anytime I have a problem, I monitor it, whether that's as simple as an SSL cert going bad or a particular use case where we had a user who, in one aspect of their job, was writing so many small files that they would tank our storage solution. I mean hundreds and hundreds of thousands of small files, fast. I monitor those situations, and with Zabbix I can add rules to perform actions. For instance, we had this one application that could have cleaned up after itself better.
Chris Stackpole:
And every once in a while, when it got into a weird, stuck state, it would chomp memory until it just took over the node. I can look for those conditions, reset that service, and have Zabbix do it for me automatically. That way, I don't always have to go in and mess with the different nodes. Somebody earlier mentioned user performance data; with Zabbix you can create dashboards for users so that they can see: here's my CPU usage, here's what my memory usage was. Things like that helped, especially for users who didn't quite know what they were doing; they only have a few skills, and many of them are just connecting and writing a script or something. For the entry-level ones, I like high-level graphs, not only for monitoring but for how the system's being used. Can I address problems before they're problems that tank the system, have that done automatically, and even provide useful information to a user about what their jobs look like?
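Outside of Zabbix, a standalone sketch of the same automated-remediation idea might look like the following; the service name, memory threshold, and use of psutil plus systemctl are assumptions for illustration, not Chris's actual Zabbix action.

```python
import subprocess

import psutil  # third-party: pip install psutil

SERVICE = "example-app"        # assumed name of the runaway service
RSS_LIMIT_GB = 48              # assumed memory threshold

def service_rss_gb(name):
    """Sum resident memory of every process whose name matches the service."""
    total = 0
    for proc in psutil.process_iter(attrs=["name", "memory_info"]):
        if proc.info["name"] == name:
            total += proc.info["memory_info"].rss
    return total / 2**30

if service_rss_gb(SERVICE) > RSS_LIMIT_GB:
    # Restart the runaway service and leave a trace for the historical record.
    subprocess.run(["systemctl", "restart", SERVICE], check=False)
    print(f"restarted {SERVICE}: RSS exceeded {RSS_LIMIT_GB} GiB")
```

Run periodically (for example from cron), this is roughly the "fix it, and only tell me if the fix fails" pattern described above.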
Zane Hamilton:
That's great. I appreciate it. Thanks for elaborating on that. Chris, do you have any thoughts on AI in modern HPC?
Chris Stackpole:
I don't know that I have a ton of those. There was an article in the news recently, it may have been on Slashdot, talking about the downsides of using AI for too-small projects: if your data set isn't big enough, you can't draw meaningful conclusions. There are a lot of valid points in that. I can understand it for a major company that's got a lot of data they can churn through, to make sure the data the AI is acting on is good data and that it has seen all these little use cases. The vast majority of the time, we see many small problems creep up, and as sysadmins, we're usually good about saying: hey, this is one problem that keeps coming up.
Chris Stackpole:
I will write a script or something to fix it for me, and then we move on. Dealing with these oddball problems is a good chunk of our job: what happened this time that I have to go figure out something new? It only gets easier with AI if it has seen a lot of the data. Unless you're talking about something the size of XSEDE, where you're pulling in logs from all the XSEDE clusters, you could probably get a good idea of what use cases are generating problems.
Forrest Burt:
To touch on what I said earlier, one of the biggest things about this is that I don't think there are any unified HPC data sets that could be looked at, or that anyone has attempted to aggregate data of this nature. Jonathan touched on cluster performance; it would be a huge effort to get that from many different sites. That tends to be a problem with new things in machine learning: the lack of data would prohibit that development for the moment, because there's no unified way to get all of XSEDE to put their logs into one thing, for example. Whether that's even possible is another question entirely.
Zane Hamilton:
Sure. We've had Alan join. Thank you, Alan.
Alan Sill:
Hey Folks.
Zane Hamilton:
Short notice, pop in. Appreciate it.
Alan Sill:
You guys are doing a good job, so don't take any criticism from my previous comments.
Zane Hamilton:
You've been posting many comments, and it's an interesting segue. You have a great story, and you're doing something we weren't thinking of or looking at. If you don't mind, elaborate on what you are doing.
Precursors to AI [00:40:25]
Alan Sill:
First, I'll go back and reinforce the point made earlier by someone else in the chat. It reminded me of when we were building the databases for the first distributed computing at CDF in the early days of the grid, and our database administrator said: keep your freaking monitoring off my machine, I will tell you when it's working right. I fully endorse that point of view if you know what your machines are doing. But this is the modern day and age; we're spinning off clusters as cloud jobs, and we have to have automated systems to keep track of them. We've been working on what I call precursors to AI. There's a lot of discussion about AI in the data center, and you can understand the marketing motivation for this language.
Alan Sill:
First of all, I'm not a big believer in AI at all; I'll put my biases right on the table. I tell people that machines aren't smart. What we have is obscure scripting with undetermined failure outcomes. What we don't want in the monitoring world is any of that; you don't want, as John said, pretty pictures for management to click on, get bored with, and never look at again. We've been working on these kinds of precursors to AI, things that could be automated. You ask yourself why that hasn't happened up to now. One very obvious limitation is the lack of sufficient data. As mentioned earlier, people who implement in-band monitoring quickly take it out if they're doing anything serious, because they don't want it interfering with their HPC.
Alan Sill:
That automatically limits the rate at which you can get information, making it useless for computer science purposes. We've been examining the assumptions there. The first obvious thing is that baseboard management controllers are way smarter than they used to be. I mean, there are 16-core Arm systems with 16 gigabytes of memory; that's the most advanced one, and the smallest ones I've seen are four cores. You can get a decent amount of computing done just in the BMC. Then you examine why it is not being used. Well, IPMI sucks as a protocol, so let's not use that. The industry has been busily making that better through various open source projects to program BMCs. The Redfish standard from the DMTF has a streaming telemetry feature.
Alan Sill:
Let's turn that on. All of a sudden, you're getting multiple measurements at multiple hertz in a manageable way that scales to thousands and tens of thousands of nodes. That's all in the links that I put in the chat. I'm interested in what you guys have been talking about most recently: how do we sensibly instrument the code? When we're talking about HPC, by and large we're talking about MPI code. Other computing is very valuable and useful, don't get me wrong, but how do we instrument MPI code without getting in the way of it? For that, we have to turn back to the MPI community itself and look at some of the stuff being done by projects like MVAPICH, and Livermore has this Caliper tool, but that requires instrumenting your code. We are working on some things; the center that I posted a link to is an industry-supported center, and we have some stuff I can't talk about yet, but it is aimed at this goal of being non-intrusive and coming with the package. You can do something lightweight and shove it in a time series database. I'll stop here because I've been talking too long.
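For readers who haven't touched Redfish, here is a minimal, hedged sketch of polling thermal readings from a BMC over plain HTTP; the BMC hostname and credentials are placeholders, and the older per-chassis Thermal resource is used because it is still widely implemented, even though newer schema versions prefer ThermalSubsystem. The streaming telemetry Alan describes goes further than this simple polling.

```python
import requests
from requests.auth import HTTPBasicAuth

# Placeholder BMC address and credentials; verify=False is only for lab use.
BMC = "https://bmc.example"
AUTH = HTTPBasicAuth("monitor", "secret")

def get(path):
    r = requests.get(BMC + path, auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    return r.json()

# Enumerate chassis, then read temperature sensors from each one.
for member in get("/redfish/v1/Chassis")["Members"]:
    chassis = member["@odata.id"]
    thermal = get(chassis + "/Thermal")   # older but widely supported schema
    for sensor in thermal.get("Temperatures", []):
        print(f"{chassis} {sensor.get('Name')}: {sensor.get('ReadingCelsius')} C")
```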
Zane Hamilton:
You last talked a little while ago; that's fantastic. I just remembered that I forgot to ask you to introduce yourself. I know who you are; I just started talking.
Alan Sill:
I'm a managing director at the High Performance Computing Center at Texas Tech. I'm one of several co-directors of a multi-university, National Science Foundation-funded industry-university cooperative research center called Cloud and Autonomic Computing. You always want at least one piece of jargon your boss doesn't understand; that's the naming principle for that center. I also wanted to put in a plug for TimescaleDB. If you haven't been playing with it: if you like Postgres and wish it were better at time series, it's fantastic. The open source implementation is very powerful. Yes, you can get commercially supported versions, but we didn't need that for what we're doing.
Zane Hamilton:
That data that you're collecting, Alan, are you going back and doing replays of that data? Are you people asking to go back and do that?
Alan Sill:
That's why we put it together the way we did. We didn't know ahead of time what would be interesting to know. We know that sometimes nodes become unreachable. Why? What was happening before that node became unreachable? What was it doing? We are still looking for a solution for file I/O; it's a harder problem than it sounds. I agree with Forrest that that's immensely interesting, and we're looking for a smart way to do it. But with the BMC, you can ask how much power you are spending in this CPU package and how much power you are spending on memory. In principle, with Redfish, we can get the error rates of the HCAs for the network cards.
Alan Sill:
We still need to get that instrumented, but one thing about Redfish that people need to understand is that it's more than just a replacement protocol for IPMI. It's how the components of the motherboard talk to each other; they use Redfish internally to compose the platform you're working on, though they don't use the HTTP parts. The big difference between Redfish and IPMI from a system manager's point of view is that IPMI is UDP and streaming, and any one message is not necessarily going to get through. Redfish is TCP, and it's deterministic, but it suffers from the chattiness of TCP. We were thinking of trying to do clever things with HTTP/3, but then we found we were able to talk to the standards organization about implementing this telemetry feature. You can tell the BMC: here's a list of things I want you to tell me about; you check them internally as fast as you can, or as fast as I tell you to, and do a server-side push to me anytime you get a result. It inverts the bandwidth and gets rid of the chattiness problem. If for no other reason, learn Redfish for telemetry.
Chris Stackpole:
I'd like to know more about it, because the last time I looked at Redfish, a while ago, there wasn't a great standard for it. If you're trying to pull metrics back for CPU temperature, it will be different for Gigabyte versus ASUS versus Intel.
Alan Sill:
It's all standard,
Chris Stackpole:
But their hierarchy is different, though.
Alan Sill:
But it's discoverable. Like any API, you ask it: what have you got? It tells you what it's got, and you say: okay, give me those. Any particular manufacturer can customize it there. If you want to think of the design structure of Redfish, think of an outline; it's just an outline of what's in your server.
Chris Stackpole:
I think it was that Intel had "CPU 1 temperature" but other cases would start at "CPU 0", or somebody else would have spaces, or they would have "MB temp." They weren't consistent in what the actual names were.
Alan Sill:
Well, neither is the hardware. You go to the thermal properties and say: what have you got listed under thermal properties? Then you say: I want these. In any given data center, you'll have a relatively modest number of such things.
Chris Stackpole:
But then you're querying for every level because you have to query each level to know what level is below it. Then you end up with these request strings that are long and different for every motherboard type.
Alan Sill:
Yes. Think of cloud services. Every cloud service out there will probably be different each time you query it. A well-designed microservices architecture will use the discoverability features of an OpenAPI or AsyncAPI spec to navigate. I have a nice tool if you want; it's in our github.com/nfac, called redfish-prompt. Have you ever played with HTTP Prompt? It turns an API into an interactive session that you can navigate through like directories. What redfish-prompt does is go pull the OpenAPI spec of the Redfish service you're talking to and prepopulate that, so you can navigate through it. It has auto-completion and things like that. The short answer is yes, it's standardized. You have to learn how to use the standard, but the benefit of doing so is that any graduate student can write the code that fetches things. You don't need SNMP traps or any of that crap. I can prove that statement because most of our code is written by graduate students. If you haven't looked at the visualization in the links, it's astonishing.
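In the same spirit as the discoverability Alan describes, here is a hedged sketch that simply walks a Redfish tree by following @odata.id links to a fixed depth; the BMC address, credentials, and depth limit are assumptions, and redfish-prompt itself does considerably more than this.

```python
import requests
from requests.auth import HTTPBasicAuth

BMC = "https://bmc.example"            # placeholder BMC address
AUTH = HTTPBasicAuth("monitor", "secret")
MAX_DEPTH = 3                          # arbitrary depth limit for the walk

def links(resource):
    """Collect every @odata.id reference found anywhere in a resource body."""
    found = []
    def scan(obj):
        if isinstance(obj, dict):
            for key, value in obj.items():
                if key == "@odata.id" and isinstance(value, str):
                    found.append(value)
                else:
                    scan(value)
        elif isinstance(obj, list):
            for item in obj:
                scan(item)
    scan(resource)
    return found

seen = set()
frontier = ["/redfish/v1/"]
for _ in range(MAX_DEPTH):
    nxt = []
    for path in frontier:
        if path in seen:
            continue
        seen.add(path)
        r = requests.get(BMC + path, auth=AUTH, verify=False, timeout=10)
        if r.ok:
            print(path)
            nxt.extend(links(r.json()))
    frontier = nxt
```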
Chris Stackpole:
Yeah. I would like a link to that Redfish tool, if you wouldn't mind, please.
Zane Hamilton:
We will take the links Alan provided and post them with this video, so they will be available. Jonathan, you had something you wanted to add.
Jonathon Anderson:
I was going to say that I have been meaning to get into Redfish for a while. I would talk to the sales people at the different server vendors, and they'd be talking up that support, but it's the same thing I was talking about with Nagios: I know IPMI, I'm used to using it, and I've been stuck with the limitations of that. It's encouraging to hear about. You said redfish-prompt is a possible on-ramp for getting familiar with the protocol and learning what it can do; I'd like to know more. It also makes me wonder what kind of integration for Redfish we might be able to bake into future versions of Warewulf, or something like that, where it's using IPMI today. That might be a good way to get into it.
Alan Sill:
Let me add some realism here. It's great; it solves the problems of talking to BMCs, and the direction we're pushing now is moving beyond just the BMC to try to talk to APIs, to talk to the MPI layers. That's our research work right now, and it's still not ready for prime time. The other direction you can go with Redfish, however, is out into the data center: your rack PDUs and CRAC units and power systems and cooling systems. Redfish is no longer just a standard protocol for talking to the baseboard management controllers on servers; it's a data center API. In my center, we have a lot of work focused on getting automation to the point that we can put data centers at renewable energy and power sources and operate them remotely.
Alan Sill:
We're not new to this, but we think we have some special tools to try to help save the planet and such, and we have a lot of wind power out here in West Texas. You need to be able to talk to everything in the data center on an equal basis. You can't just say: oh, I can talk to my servers a lot, but I have no idea what's happening with the air conditioning. Redfish has extensive tools for data center automation. Now, I'll put a hyperlink in and stop talking.
Q&A Session [00:53:55]
Zane Hamilton:
Thanks, Alan. I appreciate it. We do have some questions coming in; I've seen several. George asks what needs to be monitored in a cluster other than CPU and I/O traffic. I think, Jonathan, you had some thoughts on that.
Jonathon Anderson:
Within the cluster, you have those things, and you also have memory over time and watermarks. Sometimes we've seen smaller things that were far outside our ability to monitor, like CPU cache hits and misses, where we saw applications behave very differently on seemingly similar CPUs that were just differentiated by the memory and the amount of level-three cache they had. Things like that could be interesting. Still, it's one of these games where you can't predict what you'll need to know until you have the problem, and then you don't have the information anymore. There's a whole bunch of stuff that you could want to know.
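For a single application run, the cache behavior Jonathan mentions can be sampled without any cluster-wide tooling; a minimal sketch using the Linux perf tool (the application command line is a placeholder, and perf must be installed with suitable permissions) looks like this:

```python
import subprocess

# Placeholder application command line to profile.
cmd = ["./my_app", "--input", "data.in"]

# Ask perf for cache counters around the run; perf writes its summary to stderr.
result = subprocess.run(
    ["perf", "stat", "-e", "cache-references,cache-misses"] + cmd,
    capture_output=True, text=True
)
print(result.stderr)
```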
Zane Hamilton:
John... sorry, Forrest, no. Anything in particular, John, that you think matters other than those simple things, CPU and I/O?
John Hanks:
Storage. That's the most critical thing to monitor in high-throughput computing. Jonathan hit on it earlier when he said monitoring storage would be really good, and there is no solution. I've tried tackling this before and failed miserably. You're getting to a level where the only thing in the entire mix of tools we're playing with that can answer that question is the resource manager. With Slurm finally supporting cgroups v2, I'm hoping that having jobs track I/O in cgroups will give us a way to find out what each job is doing with regard to I/O.
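As a hedged sketch of what that could look like once a job's processes land in a cgroup v2 hierarchy: sum the block-I/O counters from the job's io.stat files. The path layout below is an assumption and varies with Slurm version and cgroup configuration, so the glob would need adjusting per site.

```python
from pathlib import Path

JOB_ID = "12345"   # placeholder Slurm job ID
totals = {"rbytes": 0, "wbytes": 0}

# io.stat lines look like: "8:0 rbytes=4096 wbytes=0 rios=1 wios=0 ..."
for stat in Path("/sys/fs/cgroup").glob(f"**/job_{JOB_ID}/**/io.stat"):
    for line in stat.read_text().splitlines():
        for field in line.split()[1:]:      # skip the device major:minor
            key, value = field.split("=")
            if key in totals:
                totals[key] += int(value)

print(f"job {JOB_ID}: read {totals['rbytes'] / 2**20:.1f} MiB, "
      f"wrote {totals['wbytes'] / 2**20:.1f} MiB")
```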
Zane Hamilton:
Forrest, I think you had something you wanted to say.
Forrest Burt:
We can broaden the category of things we're looking for with monitoring if we look at things from an automation standpoint, just like what we're touching on with storage here. Jonathan, you touched on someone's thesis that pulled interesting logs out of Lustre and that type of thing and tried to do analytics over them. Alan, it sounds like the system you're putting together provides huge amounts of different analytics that can be looked at.
Alan Sill:
So yeah, that was all good.
Forrest Burt:
I was going to say we may have more conventional needs. Of course we need to see the CPUs; of course we need to see what's running, that type of thing. But if we looked at this from an automation perspective, the question is what small, real-time tuning we could generate from the different analytics a cluster is producing, if we had a way to automatically look at those and correlate them with each other so that a person doesn't have to sit there and munge through all that data. That would broaden what we want to monitor if something can, as Jonathan said, analyze something like a Lustre log against the temperatures and look at those different things to find correlations. Real-time tuning like that may be obscure to a human operator but possible with a computer's view of all the data at once.
Alan Sill:
I'll add that Slurm, of course, has the ability to work with Lustre file systems through their job stats. We found that remarkably difficult to turn on and utilize well, because you have to coordinate across all your Lustre servers, the metadata and object storage servers, and then correlate the Slurm job ID with activity on those separate machines and put it back together. We almost had an implementation, and there is some stuff in the LIKWID tool (I love the definition of that tool). Then TACC, when we were talking to them about this, pointed out that they have an I/O limiting tool, which I'll put in the chat here, that might give people some ideas. We are still looking for a solution; we're going to try filling out the MPI stuff first and then go back to I/O. As for what to measure beyond CPU:
Alan Sill:
Once you get started, there are dozens of things. To go back to the BMC: there are 187 different quantities on our current model of Dell server that we can monitor through Redfish at a timescale of hertz if we need to. You have to ask yourself what's interesting about that. One of the most trivial but very instructive things you can get from the BMC is what I mentioned earlier, the memory power as opposed to the CPU power. If you look at the operating system, it'll tell you how much memory you've allocated, but that's not going to tell you anything about how that memory is being accessed. Memory power is directly correlated with what you're doing to access that memory, which I call memory churn. Sometimes you can get very interesting insights into a code's limitations, without doing a single bit of code instrumentation, just by looking at the memory power to CPU power ratio. There are dozens of things you can monitor. The old saying: if you're not measuring it, you can't manage it.
Zane Hamilton:
That's very true. Thank you, Alan. Chris, I think you had something you wanted to add.
Chris Stackpole:
I'm going to borrow Henry Neeman's phrasing. There are a lot of users, and by users I mean just users of HPC, whether they're doing some administration on a smaller cluster they've got shoved under their desk or they're running a big data center. A lot of people are using HPC architecture to do a lot of work, and a lot of them only care about some of it. They want to know whether the system is on and whether they can run their jobs. Your monitoring at that point is the very basics: is the system up? Is it responsive? Is it working? There are also a lot of other details that you start to care about as you grow in your resource usage and the number of nodes.
Chris Stackpole:
Especially if you're a limited shop, which a lot of the people I tend to work with are: a couple of admins for a big campus cluster, with limited time. The more you can monitor, the better, but I won't go through and say, all right, I'm going to set up all the data sets first. I typically grow it out, so I start pretty small with each new cluster and then add things that I know have bitten me. Alan mentioned power; we had a system where we didn't have the power calculation right, and at a certain load it would trip and we would lose the nodes. Now I care about my power level at a certain spot. Things like temperature are very important, especially when you're running many different types of jobs.
Chris Stackpole:
In a condensed space, and some of these systems I've worked with are just shoved into a closet, that makes a huge difference; you want to monitor temperature at that point. There are a lot of things, and it's hard to say, here are all the things you should be monitoring, because it will vary a bit per group and with what information you actually care about. I take the approach of: if it's bitten me, I want to monitor it and automate that monitoring. I use Zabbix: when you see this case, fix it, and only tell me about it if you can't fix it or your attempt to fix it fails. That way, I'm only alerted to the critical things that require me to step in. Anything I can automate in the monitoring in the meantime is great, and I can go back through my reports and say: hey, this one node always seems to trip this one thing; the automation fixes it, but it's always tripping, and it's always this one node. That's the value of the historical aspect of it.
Chris Stackpole:
It depends on how much you want to do and on your monitoring tool, because some monitoring tools aren't going to give you a lot; with others, you can get a whole lot. You can do a lot of stuff with Zabbix, but I'll be the first to admit it's complicated. There are so many things you can add that it takes a bit of investment. At this point, I can throw in all sorts of values and never really worry about the impact on performance. Generally, either the cluster isn't being hit hard enough for it to matter, or we've already segregated the monitoring traffic onto its own network, so it's not that much of an impact. The value of having that data is more important to me.
Zane Hamilton:
Thanks, Chris. That's great. I see one question that Tron just popped up: is anybody familiar with ELK? Has anybody used ELK? I've seen ELK in the enterprise quite a bit, with people choosing it over things like Splunk for log aggregation.
Jonathon Anderson:
We were trying at CU to get that stack up and running with the staff we had and the available documentation, mostly starting with the Elasticsearch part of it.
Chris Stackpole:
Yeah, it's a powerful tool, especially if you're ingesting a lot of data and trying to look for patterns in it. I've known a lot of people who are ingesting logs from hundreds of systems, from different types of systems, whether web servers or data servers or HPC or whatever, and they're looking for certain patterns, because it is very Splunk-like in that regard. For the work I've done with HPC, it's been more overkill than what I've needed. If somebody's familiar with the tool and can do it, that's a different story, but I'm not going to jump in with this monitoring tool without doing a bit more to understand what ELK gives you first.
Zane Hamilton:
I've seen that one be quite complicated; once it's up and running and you get it configured to manage and monitor the things you want, it's great, but there is a learning curve to get it there. I've seen people take up to a year or 18 months to get it set to where it's meaningful to them and get something back out of it. Great tool. We are up on time; we went over a little bit. John, Alan, and Chris, I really appreciate you jumping on and helping us talk through this. I think we all got a lot of really good information; I know I have things to play with and read from Alan's links. Thank you very much, and thanks for hopping on with short notice. We will see you next week, and we appreciate you joining us today.