CIQ

Research Computing Roundtable: A Forum Discussion on User Management

February 16, 2023

Our Research Computing Roundtable will be discussing User Management in the HPC industry. Our panelists bring a wealth of knowledge and are happy to answer your questions during the live stream.

Webinar Synopsis:

Speakers:

  • Zane Hamilton, VP of Solutions Engineering, CIQ

  • Gary Jung, HPC General Manager, LBNL and UC Berkeley

  • Forrest Burt, HPC Systems Engineer, CIQ

  • Brian Phan, Solutions Architect, CIQ


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Full Webinar Transcript:

Zane Hamilton:

Hello and welcome to the CIQ HPC Roundtable again this week. Really appreciate you being here. I'm Zane Hamilton with CIQ. At CIQ, we're dedicated to driving the future of software infrastructure, utilizing cloud, hyperscale, and HPC technologies. We cater to a wide range of enterprise and research customers who trust us for unparalleled customer service and support for Rocky Linux, Warewulf, and Apptainer. Our support is tailored to meet their unique needs and delivered in a collaborative spirit of open source. This week we're going to be talking about user management in HPC, and I think we have three people joining us today. Perfect, we do. Welcome back, Gary, good to see you. Forrest, Brian, welcome again. We'll go through a quick round of introductions, if you don't mind. As always, Gary, introduce yourself.

Gary Jung:

Sure. My name is Gary Jung. I manage the scientific computing group at Berkeley Lab, and I manage the HPC program down at UC Berkeley.

Zane Hamilton:

Perfect. Thank you, Forrest.

Forrest Burt:

Good afternoon, everyone. I'm Forrest Burt. I'm an HPC systems engineer here at CIQ. I work with a lot of our products, but I work a lot on our Fuzzball product and on the HPC operations side. So, happy to be on. Thank you for having me.

Zane Hamilton:

Absolutely. Brian, welcome back.

Brian Phan:

Hey everyone. My name is Brian Phan. I'm a solutions architect here at CIQ. My background is in HPC administration and architecture with a little bit of experience in CAE and genetics.

Zane Hamilton:

That's great. Thank you, guys. So this topic is a little different for me. I've never had to administer users in an HPC environment, so it's a bit different from the enterprise, from what I've learned since I've been here. I'm going to be looking over at my notes to make sure we cover these topics. I did drop in the outline, guys, so if you want to follow along. The first thing I want to dive into is a real high-level overview of what user management is in an HPC environment. Because from an enterprise perspective, and from my perspective of building servers, once I build it, I really don't give users access to it. That's really not the way that it works, and I know it's very different in HPC. I'll start with you, Gary: give me an overview of what HPC user management really means.

What Is HPC User Management [7:25]

Gary Jung:

We think of HPC user management as the next level of interface up from HPC systems management. The HPC systems administrator will keep the infrastructure running and the software stack up to date, and they can get users onto the system. But beyond that, that's where user services, as we call them in our group, take over. They help get users situated on the system: accounts, allocations, any hurdles they need to clear to get their codes running. They are the frontline of getting users productive on the system.

Zane Hamilton:

Excellent. So, Forrest, I know that you've spent time in academia managing users and systems and things of that nature. From your perspective, tell me a little bit about user management.

Forrest Burt:

User management is essentially keeping everyone using the cluster in line. It's an interesting environment, as you've noted, Zane, to be doing user management in. When we think of enterprise computing, we may think of a server sitting in a data center that's only ever going to be touched by sysadmins. The entire point of the HPC cluster is that users can get on it, explore it, do things on it, load modules on it, do their compute on it. Ultimately, you end up with a very unique multi-tenant environment where you have people at all technical skill levels, from undergraduates all the way up to PhD students and people who do computing for a living, all interacting and coexisting within the same environment and trying to use the same systems.

It becomes a very interesting environment because, like I said, it's very unique to have potentially hundreds of users on the system at once, all trying to submit jobs, trying to get interactive nodes, trying to use IDEs, all kinds of stuff like that. It ends up presenting some interesting challenges with how you can deploy and work with software, and then the things that you have to watch out for. Overall, users tend to be happy more than anything that you're there to help them and to accelerate their compute. So 99% of the time it's a pleasant experience.

Zane Hamilton:

Thank you Forrest. Brian, I know that you're going through and you're working on designing a system or have been over the last couple of months, so I know user management in that environment was a little different than probably what we all would've chosen, but tell me about the importance, especially in an instance like that. What is the importance of user management?

The Importance Of User Management [10:15]

Brian Phan:

The importance of user management to me means knowing who is accessing your system and what they're accessing. So, depending on what type of data you're working with, ideally you want to know that the right people are accessing the right data and that you don't have any security vulnerabilities that would give the wrong people access to that data.

Zane Hamilton:

That's great. Thank you, Brian. I'm going to go back around and switch to tools for user management. Gary, I don't know how you guys are doing that today, but I hear there are standards that a lot of people use, I mean LDAP, Active Directory, and all that. How do you guys approach that from a tools perspective for account management?

Tools Used For Account Management [11:03]

Gary Jung:

Actually, one of the things that we're just putting into production is an HPC web portal for managing user accounts and allocations. It's based on software called ColdFront. We got into this early and it has been a couple of years in the making. It's been a huge help to have a web portal for people to request accounts, approve accounts, and manage their allocations. That's really new for us, and we use it both at Berkeley Lab and UC Berkeley.

Zane Hamilton:

And that's just the account management piece, right? Or is that actually the authorization and authentication piece as well?

Gary Jung:

Yeah, well, that's the account management piece. I guess your question was also about what we use for authorization.

Zane Hamilton:

No, several questions.

Gary Jung:

We were using flat files and we're moving towards using LDAP. We've just gotten this figured out where we've got the attributes into LDAP so that we can use it for authorization also.
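
To make that concrete, a minimal sketch of the kind of directory-backed authorization lookup Gary is describing might look like the following, using the Python ldap3 library. The hostname, base DN, bind account, and group names here are placeholders for illustration, not Berkeley Lab's actual directory layout.

    from ldap3 import Server, Connection, ALL

    # Placeholder connection details; a real deployment would use TLS and a service account.
    server = Server("ldap://ldap.example.org", get_info=ALL)
    conn = Connection(server, user="cn=reader,dc=example,dc=org",
                      password="secret", auto_bind=True)

    def user_is_authorized(username, required_group):
        """Return True if the user's LDAP entry carries the group membership
        that the cluster checks before authorizing access."""
        conn.search(
            "ou=people,dc=example,dc=org",
            f"(uid={username})",
            attributes=["uidNumber", "memberOf"],
        )
        if not conn.entries:
            return False
        entry = conn.entries[0]
        groups = entry.entry_attributes_as_dict.get("memberOf", [])
        return any(required_group in str(g) for g in groups)

    # Example: gate cluster login or allocation access on a directory group.
    print(user_is_authorized("jdoe", "cn=hpc-users"))

The same attributes (uidNumber, group memberships, and so on) are what services on the cluster can consult for authorization decisions once they live in the directory instead of flat files.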

Zane Hamilton:

Very good. Forrest, what have you seen done and what have you worked with?

Forrest Burt:

One of the past installs that I was at was, if I recall correctly, LDAP-based; it was one of those authentication platforms, but it's been a little while now and I believe it was LDAP. I think there was some type of Active Directory system we had for another part of the site we were at, but it wasn't interfacing with what we were doing as much as LDAP was. That's what I've seen on the authentication piece. I like the idea of having these graphical interfaces to be able to manage the allocations and manage the users and stuff like that. I've had a little bit of that before with some of the cluster managers that I've used. In general, most of what I've seen has been wire-ups around LDAP. For example, an institution I was at basically had a campus VPN that got you access to the campus networks.

And so the clusters and things were behind that network. So first off, you had to have university access, and then from there, when we would set up accounts, it linked back into that existing structure so that we could keep track of which legitimate users at our site were using the cluster. We didn't have mismatches between the existing LDAP that was used for everything else and what we were doing. Eventually, we automated that to the point where we could basically just give a script a name and an email, and it would automatically set up all the LDAP entries, the account, the partitions, and send them an email with their credentials and stuff like that, which was very nice. When I first started, we were doing a lot of that stuff manually.

So one of the things that I worked on was making that process a little bit more automatic. That was a very nice system to work with, being able to have all of that integrated. A lot of that LDAP tooling we built up over time while I was there, and it was really, really nice to see in action. That was unique to the site, or I shouldn't say unique; not all sites are at that same level of sophistication. I've also seen places where it's like an anarchy cluster: everyone has an account, everyone is getting themselves and whoever else access, and there's no integration with the rest of the authentication structure. So it varies widely, but I've been very pleased to have had the chance to work with some pretty well-architected systems in that regard.
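
As a rough illustration of the kind of onboarding automation Forrest describes (a script that takes a name and an email, creates the directory entry, and mails the new user), here is a minimal sketch using the Python ldap3 library and the standard library's smtplib. The hostnames, DNs, UID numbers, and mail server are hypothetical placeholders; the original site's script is not shown in the transcript.

    import smtplib
    from email.message import EmailMessage
    from ldap3 import Server, Connection

    # Hypothetical directory and mail settings.
    LDAP_URI = "ldap://ldap.example.edu"
    ADMIN_DN = "cn=admin,dc=example,dc=edu"
    PEOPLE_BASE = "ou=people,dc=example,dc=edu"

    def provision_user(full_name, email, username, uid_number):
        """Create a posixAccount entry for the new user, then email them."""
        conn = Connection(Server(LDAP_URI), user=ADMIN_DN,
                          password="secret", auto_bind=True)
        conn.add(
            f"uid={username},{PEOPLE_BASE}",
            object_class=["inetOrgPerson", "posixAccount"],
            attributes={
                "cn": full_name,
                "sn": full_name.split()[-1],
                "uid": username,
                "uidNumber": uid_number,
                "gidNumber": 1000,                  # assumed default users group
                "homeDirectory": f"/home/{username}",
                "loginShell": "/bin/bash",
                "mail": email,
            },
        )
        conn.unbind()
        # A real script would also register the user with the scheduler
        # (partitions, accounts) and create the home directory.

        msg = EmailMessage()
        msg["Subject"] = "Your cluster account is ready"
        msg["From"] = "hpc-support@example.edu"
        msg["To"] = email
        msg.set_content(f"Hello {full_name}, your account '{username}' has been created.\n"
                        "See the cluster guide for how to log in and submit jobs.")
        with smtplib.SMTP("smtp.example.edu") as smtp:
            smtp.send_message(msg)

    provision_user("Jane Doe", "jdoe@example.edu", "jdoe", 12045)
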

Zane Hamilton:

Thank you, Forrest. Brian, I know you've seen a lot of different environments outside of just academia, but how have you seen some of the resource management done? What tools?

Brian Phan:

My experience with on-prem systems is very similar to Forrest's. We mainly ran LDAP, and at my first HPC job, one of the projects I worked on was writing scripts to generate LDIF files so we could automate the creation of users on the system. I have worked with an HPC SaaS product before as well. With that, you go to the website, you create your account, and your account information is usually just stored in the application's database. We also had features where you could do something like SSO and just log in with your Google or Gmail account.
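
For context, a script like the ones Brian mentions, which turns a simple roster into an LDIF file that can then be loaded with standard LDAP tools, could look roughly like this. The roster format, base DN, and attribute defaults are invented for illustration.

    import csv

    BASE_DN = "ou=people,dc=example,dc=edu"   # assumed directory layout

    def ldif_entry(uid, cn, sn, uid_number, mail):
        """Render one user as an LDIF add entry."""
        lines = [
            f"dn: uid={uid},{BASE_DN}",
            "objectClass: inetOrgPerson",
            "objectClass: posixAccount",
            f"cn: {cn}",
            f"sn: {sn}",
            f"uid: {uid}",
            f"uidNumber: {uid_number}",
            "gidNumber: 1000",                    # assumed default group
            f"homeDirectory: /home/{uid}",
            "loginShell: /bin/bash",
            f"mail: {mail}",
        ]
        return "\n".join(lines)

    # Roster columns assumed: username,full_name,surname,uid_number,email
    with open("new_users.csv", newline="") as roster, open("new_users.ldif", "w") as out:
        for row in csv.DictReader(roster):
            out.write(ldif_entry(row["username"], row["full_name"], row["surname"],
                                 row["uid_number"], row["email"]) + "\n\n")   # blank line between entries

    # The resulting file can then be loaded with a standard tool such as ldapadd.
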

Zane Hamilton:

Very nice. All right, Gary, it's getting to the point now where I know there are challenges. I know there are successes, but from a challenge perspective, when you start managing a lot of users and not necessarily all from within your organization, how challenging can that be and how can we make it less challenging? So I know you've implemented the portal that has to help, but is that available to everybody outside as well?

Gary Jung:

Yeah, yes, it is. Maybe I'll give a little context. At Berkeley Lab we have about 1,400 users on our infrastructure across about 500 projects. And then down at UC Berkeley, we have about 4,000 users across another 800 projects. We deal with a lot of users, and just getting organized and having enough staff is probably a big challenge, along with making sure that you have the right policies and procedures in place so you can manage that. The challenge is in trying to get all these people into the same environment, and it gets really challenging moving from a national laboratory into academia, because the population is much more diverse. The readiness of people to move onto the system is much more diverse.

Some people who've used computing for a long time go onto a system easily, but for people who are newer, or in fields that are just starting to make use of a lot of computation, getting their environments moved from something they've done for years into an HPC environment becomes really challenging. We're talking about user services, and I've almost broken this into three areas. The group is divided into three areas: we have the HPC systems administrators that I described earlier, and then we have user services, which deals with people who know how to use a system but are having problems with the scheduler or running their jobs.

But then we also have a set of consultants who meet with the users, figure out what their needs are, and help them move from their existing environment into our HPC cluster. It's almost like expanding concentric circles, where now we're reaching out to users who haven't used computing before. There's a real challenge in that because it's a lot more work for these users. Depending on the audience, you may be able to get away with just having HPC systems administrators at your site if everybody knows how to use clusters, but in a broader sense you're going to have to put more into user services. There are a lot of challenges in that, which we could get into. But just setting up that structure so you can onboard people and they can be successful is a challenge.

Zane Hamilton:

Gary, from the user management perspective you're talking about, with consultants there trying to figure out what everybody needs, it sounds like that could be challenging. You're not able to have just one generic policy for users; there are going to be a bunch of different policies based on different types of users, and that seems like something that could be ever changing and growing.

Different Users Different Policies and Procedures [20:05]

Gary Jung:

You know, we try not to have different policies. The challenge then is getting people to accept our policy or adapt to it, or figuring out workarounds for them so that they can use it. For example, we were just talking about authentication. We've always used one-time tokens or passwords, or whatever you want to call it, multi-factor authentication. We've used that since we started back in 2003, and we've always had pushback on it. Now that you can get it with your Google or your AT&T login it's more common, but people just want to do what's easy, and getting people to adapt to our policies is always a process. It's a matter of meeting with them, talking with them, and making sure they're fully aware of what they're moving into before they move into your environment.

Zane Hamilton:

So continual education.

Gary Jung:

Yeah. Yeah.

Zane Hamilton:

Very good. Forrest, I feel like you've had to deal with some of this as well, and Brian, you're going through some of this right now: dealing with different policies and bringing people in who may be used to doing things one way, and now we're changing them or moving to something else. How is that working, and is it as painful as it sounds?

Forrest Burt:

I definitely agree with what Gary says there about it being a concentric-circles type of model for the different HPC needs and how you serve them out to the users on a given cluster. When I was at my prior institution, we had I think 500 to 600 users on our cluster. I can't tell you how many active research projects, but about 500 to 600 users or so. So not a massive cluster, but still a pretty decent number of people on there. We were always very thinly staffed there. So while at some sites there definitely becomes that delineation between sysadmins, user support people, that type of thing, we found that we all blended together. We had the people that we had, and it was a pretty small team in general.

We all ended up working on all of these different things together. It's interesting what you end up having to do to meet the needs of, as Gary mentions, these different users coming from all these different places and skill levels. We would interface with our users through email and that type of thing, but we'd also do office hours where we would invite people, especially very new people, to come in, sit down with us, and work face-to-face. Ultimately, there are fairly robust ways to do the technical administration of a cluster.

There's a lot of automation, but doing that user outreach, not only to the traditionally computational fields but also to fields where people have a need for HPC without maybe realizing it, and having meaningful ways for both of those groups of users to be onboarded, is absolutely critical to growing and maintaining a practice of HPC at a place. Something I spent a decent amount of time on was trying to meet researchers where they were at, in education departments, biology departments; education is a better example of one that was a little further out from computing. But my point is, especially in academia, you're always going to end up with environments where all different skill levels of users are interacting with each other. Being able to effectively manage the skills of a team, to not only keep the technical side of a cluster running but also respond to all skill levels of user queries and do things like office hours, is essential.

Zane Hamilton:

Brian go ahead.

Brian Phan:

Yeah, similar to what Gary and Forrest said, I think the biggest challenge is getting the user base to some baseline level of experience in using the HPC system. From my experience, it's been a lot of one-on-one sessions with grad students, just teaching them how to use a terminal and stuff like that, and going through the basics of how to submit a job, what you should and shouldn't do, how you should be requesting resources, stuff like that.

Zane Hamilton:

Thank you, Brian. And we do have a question from Dave. Do you want to throw that up? He makes a nice movie reference, but the serious question is: does Apptainer make the management of elevated access easier? I think I know the answer, but I'm going to start with Gary again.

Does Apptainer Make Management Of Access Easier? [25:36]

Gary Jung:

Yeah, it does, especially as we're talking about moving people; it's one of the tools we can use to onboard people onto the cluster. If they have a very different environment, a specialized custom environment, it's a way to package it up and put it onto the cluster. There are a lot of other great reasons for using Apptainer, but we also use it as a tool for onboarding people. We just had this problem come up this morning where somebody wants full control over their directory; they're the person who used to manage it. So here's an example where somebody used to manage their own system, and now their department is moving to us for the computing, but this person usually takes care of the files and directories and ownerships for all the people.

So he essentially wants to have full control; he's asking for root access so he can manage all this. And we're thinking, well, what's a way around this? One way we think might work is that on one of our other systems, we could set up an Apptainer instance, have it mount only the file systems that he is responsible for, and then have him manage it that way, so that we have a way of segregating control of a file system. That's an example of something we're kicking around right now, another way we could do something. So the answer to the question is yes, Apptainer has a lot of uses.
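
To give a flavor of the approach Gary is kicking around, a sketch of starting an Apptainer instance that sees only the directories a delegated admin is responsible for might look like the following, wrapped in Python for illustration. The image name, bind paths, and instance name are hypothetical, and whether this fully satisfies the ownership-management use case he describes would still need to be verified.

    import subprocess

    # Hypothetical image and the only tree this person should manage.
    IMAGE = "mgmt.sif"
    LAB_SHARE = "/srv/lab-share"
    INSTANCE = "lab_admin"

    # Start a long-running instance that binds only the lab's file system.
    # --fakeroot gives root-like privileges inside the container's user
    # namespace only; it does not grant root on the host or on other data.
    subprocess.run([
        "apptainer", "instance", "start",
        "--fakeroot",
        "--bind", f"{LAB_SHARE}:/data",
        IMAGE, INSTANCE,
    ], check=True)

    # The delegated admin can then get a shell scoped to that instance:
    #   apptainer shell instance://lab_admin
    # and manage files under /data without broader access to the cluster.
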

Zane Hamilton:

Go ahead, Forrest. I know you have something you want to add and if you didn't, I was going to ask you to.

Forrest Burt:

I was just going to say, one of the classic reasons Apptainer is useful in these environments with regard to elevated access is the standard one. While rootless has become the thing that everyone is striving towards, and it's implemented more or less in different containerization technologies these days (to what extent it helps with this problem, I'm not quite sure), for a long time Docker, which was the main containerization platform, required you to have elevated privileges to interact with the Docker daemon that actually manages container execution and that type of thing. And as Gary notes, in a multi-tenant HPC environment it doesn't quite work to give root access to anybody who needs to manage their own environment.

At that point, you obviously have root access not only to your own stuff but to everyone else's stuff on the cluster. It's a security nightmare; you can't really implement it effectively. Apptainer has always provided a way for people to build containers, these execution environments that are fundamentally untrusted relative to the cluster, and then take those and run them in a trusted way. That's the biggest thing Apptainer does. Especially these days, with the newer versions of Apptainer that are out, it supports a wider range of rootless operations than it did before through newer integrations around fakeroot and other things that have been added. All in all, it gives you the ability to manage your own separate, entirely containerized environment that doesn't require a module system, that type of thing. It allows you as a system administrator to give that capability to your users in a manner that's not going to over-privilege them relative to the cluster.

Zane Hamilton:

Thank you, Forrest. Gary, I know you touched earlier on resource management, and there are tools out there, Slurm, for example. As a user, if I come in and I want to submit something, get resources allocated, and do some amount of work, and I know every use case is going to be different across the different fields of research, at what point do I actually need to be able to get onto the system? If I'm submitting my work to Slurm or any other resource manager to go do it, what would I need to do as a user that would require me to be able to log onto those compute nodes anyway?

Submission Of Work And Access To A System [30:31]

Gary Jung:

Yeah, I'm not sure, so you mean outside of just submitting the job?

Zane Hamilton:

So it seems like, I mean, Jupyter Notebooks are really popular for people being able to get on and actually interact with something that's going on, and I hear a lot about people having to SSH into a compute node for some reason beyond I have to submit a job and I need to SSH into something. What are those use cases like? I'll open it up to all of you guys, Forrest and Brian too. What does that look like? Why would they do that?

Gary Jung:

You know, I'm going to let somebody else answer first this time, and then maybe I'll have a better answer to the question.

Zane Hamilton:

Sure, sorry. Go ahead Brian, Forrest, either one.

Forrest Burt:

Brian, do you have anything?

Brian Phan:

Yeah, so when users run simulations, they're typically large MPI jobs that run across multiple nodes. Sometimes you might run into issues with LDAP on one of the nodes, and eventually your simulation basically stalls and you don't know why. So you go through, start logging into these nodes, and investigate: is there something actually wrong with LDAP that's causing my simulation to stall? That's typically what I've done. Forrest, what have you experienced?

Forrest Burt:

For the most part, I've seen users needing actual SSH access to compute nodes in the context of being able to do interactive jobs, though not specifically in the simulation space. At my prior institution, for example, we had a command that users could run; we actually had two. One was basically a development session that would drop you onto a compute node for 12 hours and give you essentially a shell on it, and you could then go use that compute node to test your code, that type of thing. We also had a debug session, which was just a half-hour-long session on a compute node that people could use. It was meant for quick debugging, quick job testing, that type of thing. In general, it's sometimes not a good thing to have users running around your compute nodes, because if they can run around your compute nodes, especially while they don't have jobs going on them, you probably have some type of security problem, and you're going to find that users start doing things like executing work outside of your job scheduler on your nodes.
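
For illustration, site wrappers like the two commands Forrest describes are often thin shims around the scheduler. A minimal sketch of that idea with Slurm's srun might look like this; the partition name and time limits are assumptions, not the original site's actual settings.

    import subprocess
    import sys

    # Assumed session profiles mirroring a 12-hour dev shell and a 30-minute debug shell.
    SESSIONS = {
        "dev":   {"time": "12:00:00", "partition": "compute"},
        "debug": {"time": "00:30:00", "partition": "compute"},
    }

    def interactive_session(kind):
        """Request a single node through the scheduler and drop the user into a shell."""
        opts = SESSIONS[kind]
        cmd = [
            "srun",
            f"--partition={opts['partition']}",
            f"--time={opts['time']}",
            "--nodes=1",
            "--pty", "bash", "-l",   # allocate a pseudo-terminal and start a login shell
        ]
        subprocess.run(cmd)

    if __name__ == "__main__":
        interactive_session(sys.argv[1] if len(sys.argv) > 1 else "debug")

Because the session goes through the scheduler, the time limit is enforced and the usage is accounted for, which is exactly what free-form SSH onto compute nodes does not give you.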

Again, you might suddenly find a user coming to you: well, this GPU job has been sitting here for a day and it's not doing anything. You get there and 100% of the usage is actually going to something that's basically not being tracked by Slurm. So you can run into those instances where users end up running around the cluster. My point being, in general it's not usually good to have free and open access to the compute nodes. Users don't usually need that. In cases of debugging something that's going wrong, they'll sometimes need to be on there with a job while it's active. In a lot of cases, the need for users to be on a node like that is just basic interactivity.

They want to be able to test their code and make sure that what they're putting into Slurm is actually going to execute at the small scale on the cluster before they run it at the large scale. Testing that on the cluster itself is more effective than on a laptop or something similar. As a funny aside, you will also find that sometimes users, especially undergrads, get creative. One time we were looking around trying to find zombie processes on our head node and cleaning up, and we discovered that there were five or six VS Code instances sitting there actively connected to the head node, using it as a development and execution environment. So being able to provide capabilities like JupyterHub and VS Code dev environments to your users in some structured way is important, because enterprising researchers will sometimes work out their own little backdoor through your cluster. It's good to have those abilities for people in a managed way that your job scheduler is aware of, so you can keep that control and keep users from using things they shouldn't be.

Zane Hamilton:

Thank you, Forrest. Does that help clarify what I was trying to ask Gary?

Gary Jung:

You know, I think Forrest gave a great answer. Probably the only thing I'll add is, you started to mention JupyterHub and other things for people doing interactive work; we've been using Open OnDemand, and that's worked out really well for users. You can spawn JupyterHub instances there, and you can do a lot of your interactive work there. It also has a desktop, so we can put applications in the desktop. The nice thing about this is that people used to want to log in and try to do some type of visualization, or copy files back to their local system to do visualization. Something like Open OnDemand allows you to do the visualization on the compute node instead. That's just another tool and another way of doing interactive work now.

Zane Hamilton:

That's great, thank you, Gary. I know we popped up a comment that Fernanda made about biology communities being particularly dependent on having full control of machines. Good to see you, Fernanda, thank you for joining. Then we do have one question from Opportunity Knocks: for systems that aren't connected via a shared public file system, are numeric UIDs still important to keep in sync between systems?

Shared File Systems [36:31]

Forrest Burt:

I think in the case of an HPC cluster, it usually is the case that they are connected via a shared file system. You would expect, for example, that your users would have a small /home space, maybe with a hundred-gigabyte limit, where they can store data or code development work they want to persist on the cluster, that type of thing. You would also typically find that there's a scratch file system that's meant as the target for intensive compute job I/O. I'm not entirely sure of the exact thrust of the numeric UIDs question here, and there might be something that I'm missing, but in general you would expect that the nodes and machines in a compute cluster are connected via some type of shared file system. You're obviously keeping track of your systems and numbering them; you're not treating your compute as an amorphous blob of servers or just a bunch of cores. You still have separate, delineated machines that the job scheduler is aware of granularly, and in an HPC environment those usually have one or more shared file systems attached to them, if that answers the question.

Zane Hamilton:

Thank you, Forrest.

Gary Jung:

Maybe I can add something. If the systems aren't connected by a shared POSIX file system, then yeah, you don't need to have the UIDs synced, but you don't know whether that may change at some point, either through an NFS mount, moving users between systems, or retrieving data from somewhere else that has the UIDs stamped on all the files. At our site we do register our UIDs so that everybody has a unique UID, in anticipation of people working together in unexpected ways.
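
A quick way to check the kind of UID consistency Gary is describing is to compare account dumps from two systems (for example, the output of getent passwd saved to a file on each) and flag usernames whose numeric UIDs differ. The file names in this sketch are illustrative only.

    def load_passwd_dump(path):
        """Parse passwd-format lines (name:x:uid:...) into a {username: uid} map."""
        users = {}
        with open(path) as fh:
            for line in fh:
                fields = line.strip().split(":")
                if len(fields) >= 3 and fields[2].isdigit():
                    users[fields[0]] = int(fields[2])
        return users

    cluster_a = load_passwd_dump("clusterA_passwd.txt")   # e.g. saved `getent passwd` output
    cluster_b = load_passwd_dump("clusterB_passwd.txt")

    for name in sorted(set(cluster_a) & set(cluster_b)):
        if cluster_a[name] != cluster_b[name]:
            print(f"UID mismatch for {name}: {cluster_a[name]} vs {cluster_b[name]}")
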

Zane Hamilton:

Okay. Thank you.

Forrest Burt:

Just to riff a little bit more off the SSH thing, I wanted to say one of the headaches that has faced HPC for a while is the SSH-based login plane that's been required for most people to get onto a cluster. These are Linux clusters; a lot of the time when users interact with them, they're literally just SSHing onto some type of node that gets them onto the cluster. In some sophisticated environments, you'll find that there's a login node and a head node. The login node is exclusively for users to sit on, log into, and run jobs from, while the head node is the actual controller for the rest of the cluster, where all of the system services, databases, schedulers, monitoring, all that type of stuff is run.

There are interesting efforts at the moment to remove the SSH login via some of the new HPC 2.0 solutions out there. There's a lot of desire to eliminate the SSH-based login flow, because if you can find a different paradigm that keeps users from having to interact with the cluster via the command line at all, it makes it far easier to onboard users. At that point you're not having to deal so much with teaching them computing as you are just giving them the plane to do their science. So in general, the SSH-based login is what you usually see; like I said originally, users will use that to get onto a cluster, and it's usually best in that case to have a login node so users can't hammer the head node. In general, there's a lot of desire to move away from that main SSH-based flow at the moment and toward something that's a little more user-friendly for people who aren't familiar with computing from the start.

Zane Hamilton:

Thank you, Forrest. When we look at user management, I know that, just like everything else, there are people like Gary doing the actual management of those users. What types of tasks do those folks have to do? There are obviously tasks they probably should be doing daily that may or may not be getting done, but there's a daily, a weekly, and a monthly cadence; there has to be a process, right? You're looking at logs, you're verifying, you're doing audits. What would that look like, and what typically gets ignored first? Everybody gets busy, everybody's resource constrained, so what does that look like and what should we be doing better? Let's start with Gary.

User Managers Tasks And How To Improve [41:37]

Gary Jung:

Wow. You know, that could actually be a pretty big list of things we could be doing better. We focus a lot on onboarding people, but things change all the time. Research groups actually have a fairly high amount of turnover, especially in academia. This means that you may have met with the users and gotten them all settled on the system, but a year from now half the group may have turned over. So there are always things I want to do to keep in touch more with users. Your original question was what things we would want to do on a regular basis. We do want to check in with our user population regularly, not only to see how they're doing, but because it also helps us with future planning, strategic planning, and hearing about projects. So many things require long lead times, like data center space. If somebody's getting a couple-million-dollar grant and a whole bunch of users, or they're moving a large user facility to your institution, just keeping in touch on a regular basis is probably one of the most important things we do.

Zane Hamilton:

Thank you, Gary. Forrest.

Forrest Burt:

There's any number of small technical things that you need to do on a daily, weekly, or monthly basis. Something that immediately comes to mind is, as I alluded to, continuously checking users' quotas on those file systems and warning them if they're starting to get toward their quota. There's taking care of data cleanup; the scratch space usually does not have a very long lifetime for data, so you want to ensure that data isn't staying in there and filling it up for longer than it needs to. There's keeping track of, as Gary mentioned, who's actually on the cluster, who's actively using it, that type of thing. There are any number of technical things you could do. Ultimately, as Gary alluded to, the bigger thing is keeping in contact with the users and figuring out what they need from it.
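
As one small example of the routine housekeeping Forrest mentions, a report of stale scratch data might be as simple as the sketch below. The mount point and the 30-day age threshold are assumptions standing in for whatever purge policy a site actually publishes.

    import os
    import time

    SCRATCH = "/scratch"        # assumed scratch mount point
    MAX_AGE_DAYS = 30           # assumed purge policy
    cutoff = time.time() - MAX_AGE_DAYS * 86400

    stale_files = []
    for root, dirs, files in os.walk(SCRATCH):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    stale_files.append(path)
            except OSError:
                continue        # file disappeared mid-scan; skip it

    print(f"{len(stale_files)} files in {SCRATCH} are older than {MAX_AGE_DAYS} days")
    # A real cleanup job would notify the owners and only then delete;
    # this sketch only reports what a purge script would act on.
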

Like I said, if grant money comes in for a new cluster, you figure out which use cases they're actually working with most and what the cluster should be used for. One thing that comes to mind that's done on a yearly basis is an in-house HPC conference. Every year we would put on a day or two of talks from people around campus, people from around the state computing community, that type of thing. That generates interest and awareness in what we did: hey, there's an HPC department, here's what people are using it for, here's what you can use it for, here are the people. That type of thing is really, really useful for keeping track of your users, because it's a great way to bring together, all at once, the people who are really passionate and interested and want to advance and build a computing community on the campus or at the site.

So that type of thing: making sure there's a really good cadence between the researchers and the research computing team, especially, like I said, if you can get everyone together and do things like that. It really goes a long way towards finding out even the technical side of what your users need. You may find issues with how the cluster is operating, where the inefficiencies are, that type of thing. So keeping in contact with the users is ultimately critical to finding out, even on a basic level, what needs to be done regularly on the cluster.

Gary Jung:

I have to chime in here because, that thing about the conference, we do that and it's really successful. Our IT division puts it on, but for research computing we actually take over a track. We run a whole track for the day, with long and short sessions, covering everything from how-tos to what's new, and people really like that.

Forrest Burt:

We would do a couple of software carpentry type things. We would have someone do a Julia one, someone do an R one, someone do a Python one. So there were not only talks but workshops and stuff taught by people on campus. It's pretty cool to see it all come together.

Zane Hamilton:

That's very cool. Brian, I know we were talking to a group that was having some direct access challenges, and I don't know that they had checked their logs for unauthorized access attempts. I know that's something people should probably be paying attention to, but I don't know if you have anything you want to add from a what-should-we-be-doing perspective, things like that.

Brian Phan:

For me, from my experience at a genomic sequencing company, instead of keeping in touch with the users, I would try to interface with the product teams. Understanding what type of sequencing products they want to come out with, how big the input data is, how much compute it would require to actually run the pipeline, and whether we have enough capacity in our storage system to support the product at scale. And if we don't, then, like Gary mentioned, we would have to start the lead time on ordering more storage and actually implementing it so that we could get the product launched.

Zane Hamilton:

Thank you, Brian. All right, Gary, I have a question for you specifically, since you have a portal to do onboarding that makes the interface easier, and we talked about having to go back and stay close to users. You bring a bunch of people on, and there's a lot of turnover. How do you handle off-boarding? Is that something that's done on a regular basis, or do you go back once a year and try to clean it up? How does that work, especially with a new tool like the portal?

How To Handle Off-Boarding [48:10]

Gary Jung:

Yeah, you know, that falls under the category of things that are a little more aspirational and lower on the priority list, so we don't get to it as much. Right now our current policy is, because in research, people leaving the institution doesn't mean they've stopped doing their research, it just means they could be doing it from another place, we tend to take more of a passive role towards offboarding people and ask them to let us know when they don't need their account. But we will check on that periodically. One of the things that we do is charge for user accounts, so there is a motivation for people to help us with that. It's not like it's free and ignoring it doesn't cost them anything; it does cost them. So maybe the one thing that helps us with offboarding is the fact that we charge for accounts, and it's a per-account charge.

Zane Hamilton:

And does that go back to a research lead or a department lead, or how does that work? 

Gary Jung:

Yeah, a more centralized person. Because in the research that we do, there are tons of collaborators that are part of the project but maybe not part of the institution. So we can't do something as simple as saying, well, you're no longer in our HR database so we'll take your account out, because that's just way too restrictive. We have to be a little more open in order to facilitate the research.

Zane Hamilton:

Thank you, Gary. The web portal for requesting user access seems fairly new in HPC, so that's very cool. But Forrest, Brian, when it comes to requesting access, I've spent a lot of time talking to other universities, and it seems like everybody has a very different process. Usually you have to dig around on their HPC site to find out how to get access, and then you have to start sending emails or trying to find the right person. How have you seen that process work at other places, and where have you seen it be more successful than others?

Requesting User Access In HPC [50:33]

Forrest Burt:

We basically just had, on our research computing website: if you're looking for an account, email us at this address with your name, the name of your PI, what project you're on, and, I believe, what the purpose of your use of the HPC resources is. Nine times out of 10 there was nothing more necessary than that. Usually I would quickly ping the PI just to check from their side that this was a valid user, because in some cases when you're adding these users, they might be in labs that have condo nodes or something like that, so you might be adding them to custom queues that give them access to certain resources or certain data, that type of thing.

It was always important to make sure these were legitimate users going into the lab. But I shouldn't say nine times out of 10; 99 times out of 100 it was a legitimate user, undergrads pinging us or whatever. So for us, like I said, it was pretty simple. You just had to traverse the website to the cluster guide, and getting an account was basically the first little bullet point: just email us with the info. That also gave us the ability to open a dialogue: oh, what code are you working with? What resources are you planning to use? Basically, starting to find out how we can further assist this person. In general, the process for us was just ping us manually and we will get back to you within a business day.

Zane Hamilton:

What type of crypto are you planning on mining?

Forrest Burt:

Yeah, exactly. I believe there was actually a little cluster guidelines document that went out with that, and one of the things on it was: do not mine crypto on my cluster. We will become aware of it very fast, trust us, and you'll get nixed. So yeah, there was a specific note that said, do not mine cryptocurrency on my cluster.

Zane Hamilton:

And maybe this is something you can or can't talk about, but I am always interested in those funny stories of users doing something, something dumb, something great. Do you guys have any fun stories about users on HPC?

Fun Stories About Users On HPC [53:02]

Forrest Burt:

The first one that comes to me is, like I said, it caused quite a shock when we found half a dozen or more VS Code instances actively running on our head node. We've had, like I said, users that figure out their way around the cluster and SSH their way around. I think the most egregious thing I can share is people dumping incredible amounts of data onto the scratch file system. I mean, we're talking gigabytes, dozens of gigabytes, hundreds of gigabytes, and scratch is starting to fill. This was one of our regular users who we talked to quite a bit. There was sometimes a little bit of friction, but usually not like this. Scratch is filling and filling, and it's getting to the point where it's going to start kicking off the deletion scripts that begin cleaning it up, and we just can't get ahold of him.

No one can get ahold of him, no one can reach him, no one can talk to him. Eventually it came out that he was just not being very nice to the rest of the users on the cluster, in a somewhat deliberate manner. So that access was revoked in the end, but that's the most egregious thing I can think of: dumping hundreds of gigabytes worth of data onto scratch and ignoring the research computing team's communications of, what's up? Scratch is 50, 60, 75% full, and it's all coming from you. And yeah, no one could raise the guy, so that was pretty funny.

Zane Hamilton:

That's fantastic. Gary, anything you can share?

Gary Jung:

Well, I do have an interesting story, and I think I might have mentioned it in this venue once before, but it's a good one. In user services, not only do we onboard people onto the local HPC, but we'll help people get onto other resources, for example, cloud computing. We work with the cloud providers, and we onboard people onto the cloud and help them, not just provisioning the account but helping them over hurdles so that they can get productive on the cloud as well. We did have somebody who was just trying something out, and they fired off a job on a Friday afternoon, and I think they misunderstood how the charging worked. They had a loop that went through BigQuery for like 35,000 lines, and we got this very alarmed call on Monday morning about a user who had already racked up $490,000 worth of computing over the weekend. And not only that, the reporting lags behind what's actually happening, so even if we shut it off, there were still some things that had to be tallied up. That was probably the most egregious use of resources I've seen recently.

Zane Hamilton:

Yeah, that's a big one. Brian, how about you?

Brian Phan:

I have experienced something similar, where instances were left on and not turned off and ended up racking up quite the AWS bill. As for funny stories, I'm not sure I have any; they're more just painful. I spent months trying to convince a team to put memory requests into their pipeline so that it would be more reliable. It took me months to convince them to do it, and eventually they did, and lo and behold, hey, we don't have as many P1s going on. I'm glad they ended up doing it, but I wasn't too happy that it took so long.

Final Thoughts From The Panel [57:15]

Zane Hamilton:

Thank you, Brian. So we are getting close to time, guys. I would like to go back around: if you have anything else to add on user management, go ahead, and then we will wrap up for the week. I really appreciate you coming, Gary. Brian, Forrest, I expect you to come, so thanks for being here. Just kidding. So I'll start with you, Gary.

Gary Jung:

I think user services is sometimes an underappreciated job. Most HPC administrators would prefer to work on the system rather than work with the users, so it gets pushed onto the user services people. So if anybody out there does this for a living, kudos to you. You're doing a great service.

Zane Hamilton:

Thank you, Gary. Brian.

Brian Phan:

I guess to wrap up, let's see. Enable your users. Keep your users in check and ensure that they're using the system correctly. Make sure that your users are achieving their goals, whether it's submitting research papers or something business related. If they're productive, you're winning. Yeah, that's how I'd put it.

Zane Hamilton:

Excellent. Brian, thank you. Forrest, last words.

Forrest Burt:

HPC is a really powerful field that can help a lot of different people with a lot of different types of research. Ultimately, HPC is about the users, the people doing research there. I always come back to the mantra I found in HPC: we don't generally tell people how to do their science, we just help them do it more efficiently. And I love that. It's always great working with the users. It's always fascinating to see what people are working on and to hear the passion they bring to what they do for a living. It's always a pleasure getting to hear from researchers and see what they do. I love getting to see what people actually use HPC for.

Zane Hamilton:

Absolutely. Thank you, Forrest. Thank you, guys, for joining. Dave, Fernanda, Art, really appreciate you being there. Opportunity Knocks, thank you for the questions. Looking forward to seeing you guys next week. Appreciate you joining. Thanks, guys.