Fuzzball Federate: Unify Complex HPC and AI/ML Jobs Across Cloud and On-Prem Resources
Last summer, we introduced CIQ Fuzzball, a modern computing platform for PIC (Performance-Intensive Computing) that simplifies the development and execution of complex HPC (High Performance Computing) and AI/ML (artificial intelligence/machine learning) workloads. Fuzzball gives scientists and researchers a powerful API (Application Programming Interface) to automate the provisioning and management of the infrastructure their workloads need, and it includes a graphical environment both for interacting with the system and for developing these workflows.
Often, however, virtual resources and physical infrastructure will span availability zones, regions, geographies, and even clouds. While Kubernetes and traditional HPC clusters are effective for many workloads, these hybrid infrastructure requirements present massive challenges for both compute and data in complex deployments.
At Supercomputing 2024, we will provide a first-look demo of Fuzzball Federate, where we’ll show the transparent unification of discrete Fuzzball clusters into a single, virtual resource. This lets workflows land where it makes the most sense, which includes finding the right GPUs, balancing the cost of compute and data movement, and managing the data.
Ultimately, Federate lets users focus on what they do best, without having to decide whether a job should run on a particular system or cloud. With Federate, you can now connect on-prem clusters with your AWS clusters into a hybrid computing environment. So, even for more complex HPC and AI/ML workloads, you no longer need a PhD in infrastructure to define, deploy, and execute these important jobs.
Fuzzball: Substrate, Orchestrate, and Federate!
Historically, the infrastructure management layer of Fuzzball has had two main components. Substrate delivers a custom container runtime and per-node resource manager, and Orchestrate manages and schedules complex, multi-step workflows as well as the data necessary to run them.
These two components form the basis of an Orchestrate Cluster and have helped our customers deploy Fuzzball in the cloud and on-prem.
With this release, we introduce the third component, Federate, which unifies Orchestrate Clusters and provides seamless access to compute resources across on-prem clusters and cloud computing regions.
In a federated Fuzzball environment, users define and submit workflows with the same web UI and command-line interface they would use in a single Orchestrate deployment. However, where workflows submitted directly to an Orchestrate Cluster will run only on the resources available to that cluster, workflows submitted to a Federate Cluster may run on any of the Orchestrate Clusters joined to the federation. These Orchestrate Clusters may be dynamically provisioned cloud resources (e.g., running compute jobs on AWS EC2) or local “on-prem” compute clusters.
Federate evaluates the CPU, memory, accelerator, and storage requirements of the workflow against the resources available in each attached Orchestrate Cluster and dispatches the workflow to an appropriate cluster for execution. The Orchestrate Cluster then provisions the necessary resources (in cloud environments) and dispatches individual compute jobs via Substrate.
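The matching step described above can be pictured with a small sketch. This is not Federate's actual scheduler, whose internals are not described here; it is a minimal illustration that assumes a first-fit policy. The `Resources` and `Cluster` types, the `choose_cluster` function, and the capacity figures are all hypothetical. Only the cluster names come from the demo environment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource vector: CPU cores, memory, accelerators, storage."""
    cpus: int
    memory_gib: int
    gpus: int
    storage_gib: int

    def fits_within(self, available: "Resources") -> bool:
        # Every requirement must be satisfied by the cluster's capacity.
        return (
            self.cpus <= available.cpus
            and self.memory_gib <= available.memory_gib
            and self.gpus <= available.gpus
            and self.storage_gib <= available.storage_gib
        )


@dataclass(frozen=True)
class Cluster:
    name: str
    available: Resources
    ready: bool = True


def choose_cluster(required: Resources, clusters: list[Cluster]) -> Optional[Cluster]:
    """Return the first ready cluster that satisfies every requirement, or None.

    First-fit is an assumption made for illustration; a real scheduler could
    also weigh cost, data locality, or queue depth.
    """
    for cluster in clusters:
        if cluster.ready and required.fits_within(cluster.available):
            return cluster
    return None


# Capacity numbers below are invented for the example.
clusters = [
    Cluster("fuzzball-on-prem-orchestrate-federate", Resources(64, 256, 0, 2000)),
    Cluster("fuzzball-aws-orchestrate-federate", Resources(128, 512, 8, 10000)),
]

cpu_job = Resources(cpus=4, memory_gib=16, gpus=0, storage_gib=10)
gpu_job = Resources(cpus=8, memory_gib=64, gpus=2, storage_gib=100)

print(choose_cluster(cpu_job, clusters).name)
print(choose_cluster(gpu_job, clusters).name)
```

With these made-up capacities, the CPU-only job lands on the first cluster in the list, while the GPU job skips it because that cluster advertises no accelerators.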
Standalone deployments of Orchestrate are still supported, of course, and these deployments can be joined with additional clusters in a federation at any time.
“Write Once, Run Anywhere” with Fuzzball Federate
Fuzzball Federate furthers our vision of a unified, comprehensive, and complete performance computing platform. It uses the same interfaces and workflow definition format as Orchestrate, so you can now define and deliver workloads on-prem and then scale them in the cloud without modification.
Conversely, workloads can be prototyped in the cloud before you commit capital to local production resources. Either way, you can be confident that your Fuzzball workflow will run repeatably, reliably, and performantly, no matter what resources you provide it.
How It Works: Example Use Case
Using Fuzzball Federate will be familiar to anyone who has used Fuzzball before. Simply log into a Fuzzball Federate context, rather than a Fuzzball Orchestrate context.
# fuzzball context login
Logging into current cluster context...
Using a secure browser, open the link to complete login:
https://auth.federate.fuzzball-demo.ciq.dev/auth/realms/[redacted]
Waiting for login completion... Account "User Account (admin@ciq.com)" in use
From here, you can view the clusters available.
# fuzzball cluster list
ID | NAME | KIND | STATUS
5e1918d4-d789-93df-ed97-a5328eab69e5 | fuzzball-aws-orchestrate-federate | CLUSTER_KIND_ORCHESTRATE | CLUSTER_STATUS_READY
893b0adf-0582-60ae-c1db-ba7240d7237e | fuzzball-on-prem-orchestrate-federate | CLUSTER_KIND_ORCHESTRATE | CLUSTER_STATUS_READY
c3dd66cf-d4b0-bbec-9f4e-b1b729ac20fd | fuzzball-aws-federate | CLUSTER_KIND_FEDERATE | CLUSTER_STATUS_READY
At this point, Fuzzball Federate can receive and dispatch any existing Fuzzball workflow.
# fuzzball workflow start -w hello-world.fz
Workflow "c54f7397-ffb6-465c-ab50-b3df6a34ff68" started.
Name: hello-world.fz
Email: admin@ciq.com
UserId: 02ec4bc8-7862-4df6-b125-90be7f57f064
Status: STAGE_STATUS_FINISHED
Cluster: fuzzball-aws-orchestrate-federate
Created: 2024-11-14 04:05:03AM
Started: 2024-11-14 04:05:04AM
Finished: 2024-11-14 04:05:56AM
Error:
Stages:
KIND | STATUS | NAME | STARTED | FINISHED
Workflow | Finished | c54f7397-ffb6-465c-ab50-b3df6a34ff68 | 2024-11-14 04:05:03AM | 2024-11-14 04:05:56AM
Image | Finished | docker://alpine:latest | 2024-11-14 04:05:08AM | 2024-11-14 04:05:33AM
Job | Finished | hello-world | 2024-11-14 04:05:53AM | 2024-11-14 04:05:55AM
# fuzzball workflow logs c54f7397-ffb6-465c-ab50-b3df6a34ff68 hello-world
Hello, world!
And, as described, workflows may be dispatched to any of the available clusters, depending on the resources that they require and the resources available.
# fuzzball workflow list -t 10
ID | NAME | EMAIL | STATUS | CLUSTER
4123f09e-d794-4d8b-ae66-475dab9b28b6 | hello-world.fz | admin@ciq.com | Finished | fuzzball-aws-orchestrate-federate
3d60fab6-5e21-44bc-9c66-2911e29877c0 | hello-world.fz | admin@ciq.com | Finished | fuzzball-aws-orchestrate-federate
92f27e95-edd0-405d-b386-8798714f727d | hello-world.fz | admin@ciq.com | Finished | fuzzball-aws-orchestrate-federate
71fdd2cf-e89f-4b93-94fe-a2c470464906 | hello-world.fz | admin@ciq.com | Finished | fuzzball-on-prem-orchestrate-federate
7938b0ba-1c9e-4af4-8d52-edabfe06e915 | hello-world.fz | admin@ciq.com | Finished | fuzzball-on-prem-orchestrate-federate
2a68d475-236d-4149-b221-c84a5b975c9c | hello-world.fz | admin@ciq.com | Finished | fuzzball-on-prem-orchestrate-federate
b8467dfe-e7da-4110-baa2-8084f8770bce | on-prem-volume.fz | admin@ciq.com | Finished | fuzzball-on-prem-orchestrate-federate
4be8a874-956b-4b02-b400-4b87dbeada5a | aws-volume.fz | admin@ciq.com | Finished | fuzzball-aws-orchestrate-federate
6d1fc5c3-8c69-40c8-9161-a1139375fca6 | aws-volume-sleep.fz | admin@ciq.com | Finished | fuzzball-aws-orchestrate-federate
c54f7397-ffb6-465c-ab50-b3df6a34ff68 | hello-world.fz | admin@ciq.com | Finished | fuzzball-aws-orchestrate-federate
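To see how work was spread across the federation, a listing like the one above can be tallied by cluster. The following is a rough sketch that assumes the pipe-delimited layout shown in this post; the real CLI output may be aligned differently, and `parse_table` is a hypothetical helper, not part of Fuzzball. The sample rows are copied from the listing above.

```python
from collections import Counter


def parse_table(text: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited table whose first non-empty line is the header."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


# A few rows taken from the `fuzzball workflow list` output shown above.
sample = """
ID | NAME | EMAIL | STATUS | CLUSTER
4123f09e-d794-4d8b-ae66-475dab9b28b6 | hello-world.fz | admin@ciq.com | Finished | fuzzball-aws-orchestrate-federate
71fdd2cf-e89f-4b93-94fe-a2c470464906 | hello-world.fz | admin@ciq.com | Finished | fuzzball-on-prem-orchestrate-federate
b8467dfe-e7da-4110-baa2-8084f8770bce | on-prem-volume.fz | admin@ciq.com | Finished | fuzzball-on-prem-orchestrate-federate
4be8a874-956b-4b02-b400-4b87dbeada5a | aws-volume.fz | admin@ciq.com | Finished | fuzzball-aws-orchestrate-federate
"""

rows = parse_table(sample)
per_cluster = Counter(row["CLUSTER"] for row in rows)

for cluster, count in sorted(per_cluster.items()):
    print(f"{cluster}: {count}")
```

In practice the text would come from capturing the command's output rather than from a pasted string. The same parser also works on the `fuzzball cluster list` table, since it follows the same layout in this post.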
Come See Fuzzball Federate in Action at SC24 or Schedule a Demo Now
We’re excited to share Fuzzball Federate next week at SC24 in Atlanta, and if you’re there, we’d love to talk to you about how it might fit your requirements. We’ll run demos and formal presentations on Federate and more in the booth, and dozens of our experts will be on site to answer your questions about Fuzzball and the rest of the CIQ Enterprise Linux and Performance Computing ecosystem. You can check out the full roster of booth presentations here.
So if you’re in Atlanta next week, please stop by our booth (#4131) for a chat, and if you can’t join us at SC24, please feel free to reach out to our team and we’ll schedule a call to walk you through CIQ Fuzzball Federate.