Fuzzball HPC workflows now run natively on Microsoft Azure

Contributors

Chris Wolford, Director of Engineering

Fuzzball now deploys natively on Microsoft Azure, joining AWS, GCP, OCI, and on-prem bare metal as a first-class deployment target. A computational chemist can move a GROMACS simulation from AWS to Azure without changing anything in the workflow. This post shows what that looks like in practice, what Azure brings to high-performance computing (HPC) workloads, and what happens under the hood when a single fuzzball cluster azure deploy command stands up an entire production-grade cluster.

Write once, run on any cloud

Multi-cloud is a goal most engineering teams share, and one that's genuinely hard to execute. Teams maintain parallel toolchains: Terraform modules for AWS, separate deployment scripts for GCP, ARM templates on Azure, different monitoring stacks, and different Identity and Access Management (IAM) models. The "portable workflow" rarely makes it past the planning stage.

Fuzzball takes a different approach. The workflow definition (the file that describes your compute jobs, data movement, container images, and resource requirements) is provider-agnostic by design. Fuzzball's orchestration layer translates that abstract definition into concrete infrastructure on whatever cloud or bare-metal cluster sits underneath.
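
To make "provider-agnostic" concrete, here is a minimal Python sketch of the kind of information such a definition carries. This is not Fuzzball's workflow file format. The `Job` fields, the placeholder image names, and the `execution_stages` helper are all hypothetical. The sketch shows two points from the paragraph above. First, nothing in the definition names a region, VM size, or storage backend. Second, job sequencing follows from the dependency graph alone, so it is the same on every target.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Job:
    # Hypothetical job description: compute needs, image, and dependencies,
    # with no cloud region, VM SKU, or storage-backend field anywhere.
    name: str
    image: str
    cpus: int
    memory_gb: int
    gpus: int = 0
    depends_on: list[str] = field(default_factory=list)


def execution_stages(jobs: list[Job]) -> list[list[str]]:
    """Group jobs into stages; jobs within one stage may run in parallel."""
    sorter = TopologicalSorter({j.name: set(j.depends_on) for j in jobs})
    sorter.prepare()  # also raises CycleError if dependencies loop
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)  # unlocks the jobs that depended on this stage
    return stages


# Illustrative GROMACS-style pipeline; image names are placeholders.
pipeline = [
    Job("stage-in", "alpine:3.19", cpus=1, memory_gb=1),
    Job("prepare", "gromacs:example", cpus=4, memory_gb=16, depends_on=["stage-in"]),
    Job("md-run", "gromacs:example", cpus=4, memory_gb=16, gpus=1, depends_on=["prepare"]),
    Job("analyze", "gromacs:example", cpus=2, memory_gb=8, depends_on=["md-run"]),
    Job("stage-out", "alpine:3.19", cpus=1, memory_gb=1, depends_on=["md-run"]),
]

print(execution_stages(pipeline))
# [['stage-in'], ['prepare'], ['md-run'], ['analyze', 'stage-out']]
```

Only the orchestration layer underneath knows which cloud it is, so the same `pipeline` object could be handed to a cluster on any provider.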

A computational chemist running GROMACS molecular dynamics simulations can target AWS today and Azure tomorrow. The container images, data ingress/egress paths, job sequencing, and resource requests stay identical across both environments.

The same architecture already supports on-prem deployments on clusters built with Warewulf or VMware or assembled manually, and cloud deployments on AWS, GCP, and OCI. Azure is the newest target in a provisioning model built to accommodate many.

Deploy a production AKS cluster in minutes

Deploying Fuzzball on Azure starts with fuzzball cluster azure deploy. What happens next is a two-phase provisioning process that stands up a complete, production-ready cluster without requiring the user to touch the Azure portal.

Phase 1: Bootstrap. The CLI submits an ARM template to Azure Resource Manager that creates the foundational resources: a resource group, Managed Identities for the Pulumi runner, an Azure Storage Account and Blob Container for state management, Azure Key Vault for secret encryption, and an Azure Functions handler that drives the rest of the deployment.

Phase 2: Application infrastructure. Once bootstrap completes, the Azure Functions handler runs a Pulumi program that provisions everything Fuzzball needs to operate: a private, regional Azure Kubernetes Service (AKS) cluster with Workload Identity enabled and the cluster autoscaler configured per node pool; Azure Database for PostgreSQL Flexible Server with high availability; Azure Files Premium NFS, reached through a Private Endpoint, for workflow I/O and container image caching; Azure DNS, an Azure CDN profile and endpoint, and Log Analytics for monitoring; networking with a Virtual Network (VNet), private subnets, and NAT egress; and the Fuzzball operator itself, deployed via Helm.
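
Phase 2 can only start after Phase 1 because it consumes what Phase 1 creates. The Pulumi runner needs its Managed Identity and a place to keep its state, and the handler that drives it must already exist. The Python sketch below models that gate. The phase names follow the description above, but the output keys and the `run_phases` helper are hypothetical. The sketch does not show how the Fuzzball CLI or its Azure Functions handler is implemented.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    requires: tuple[str, ...]  # outputs needed from earlier phases (hypothetical keys)
    provides: tuple[str, ...]  # outputs published for later phases


def run_phases(phases: list[Phase]) -> dict[str, str]:
    """Run phases in order, refusing to start one whose inputs are missing."""
    outputs: dict[str, str] = {}
    for phase in phases:
        missing = [key for key in phase.requires if key not in outputs]
        if missing:
            raise RuntimeError(f"{phase.name}: missing inputs {missing}")
        for key in phase.provides:
            # A real deployment would record resource IDs here; this sketch
            # only records which phase produced each output.
            outputs[key] = phase.name
    return outputs


bootstrap = Phase(
    "bootstrap (ARM template)",
    requires=(),
    provides=("resource_group", "runner_identity", "state_container",
              "key_vault", "functions_handler"),
)
application = Phase(
    "application (Pulumi)",
    requires=("runner_identity", "state_container", "key_vault", "functions_handler"),
    provides=("vnet", "aks_cluster", "postgres", "azure_files_nfs",
              "log_analytics", "fuzzball_operator"),
)

print(run_phases([bootstrap, application])["aks_cluster"])  # application (Pulumi)
```

Splitting the deployment this way keeps the ARM template small and static. The stateful, dependency-heavy work runs in Pulumi, which has already been given the identity, state storage, and secrets backend it needs.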

Both phases complete without manual intervention. Fuzzball uses the same approach on Azure as it does on AWS, GCP, and OCI, sharing core operator and observability components while using cloud-native resources appropriate to each platform.

If your team is already evaluating multi-cloud HPC infrastructure, request a demo to walk through a deployment in your own Azure subscription.

Production-grade security you configure once

For practitioners who want to understand what's actually running in their Azure subscription, here are the key architectural decisions.

Compute. The AKS cluster runs as a private, regional deployment with the AKS cluster autoscaler enabled per node pool and configurable minimum and maximum node counts. Standard workloads land on Standard_D and Standard_E series VMs; GPU-accelerated jobs use Standard_NC, Standard_ND, or Standard_NV series VMs with NVIDIA accelerators. Substrate VMs, the actual compute nodes where your containers execute, run a Rocky Linux 9 image and are provisioned dynamically based on workflow demand, with configurable disk sizes, memory minimums, and GPU image support.
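
The per-pool minimum and maximum counts set hard bounds on how far each node pool can shrink or grow. Below is a deliberately simplified Python sketch of that clamping. The real AKS cluster autoscaler simulates scheduling of pending pods rather than dividing total CPU demand, and the `desired_nodes` helper and its numbers are hypothetical.

```python
import math


def desired_nodes(requested_cpus: float, cpus_per_node: int,
                  min_nodes: int, max_nodes: int) -> int:
    """Nodes a pool would need for the requested CPUs, clamped to its bounds.

    Simplified model: assumes CPU is the only constraint and that work
    packs perfectly onto nodes.
    """
    if cpus_per_node <= 0:
        raise ValueError("cpus_per_node must be positive")
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    needed = math.ceil(requested_cpus / cpus_per_node)
    return max(min_nodes, min(needed, max_nodes))


print(desired_nodes(0, 8, min_nodes=1, max_nodes=10))    # 1: floor keeps a warm node
print(desired_nodes(30, 8, min_nodes=1, max_nodes=10))   # 4
print(desired_nodes(200, 8, min_nodes=1, max_nodes=10))  # 10: capped at the maximum
```

The minimum trades idle cost for faster job starts. The maximum caps spend and keeps a burst within subscription quota.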

Storage. Two tiers: Azure Blob Storage handles log storage; Azure Files Premium provides NFS volumes for workflow I/O data and container image caching, mounted into the cluster through a Private Endpoint with Managed Identity RBAC for access.

Database. Azure Database for PostgreSQL Flexible Server with zone-redundant high availability, private network access only, mandatory Transport Layer Security (TLS), automated backup retention, and point-in-time recovery.

Security. Managed Identities and Federated Identity Credentials map AKS workloads directly to Azure RBAC roles for the fuzzball-provision, fuzzball-secret, and billing service identities. No static keys to rotate or leak. Secrets and encryption keys live in Azure Key Vault. The AKS cluster uses private nodes with NAT egress for outbound access. Every layer follows least-privilege RBAC bindings.
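
The "no static keys" property comes from how AKS Workload Identity federation works. Each Managed Identity carries a federated identity credential that trusts tokens issued by the cluster's OIDC issuer for exactly one Kubernetes service account. A pod running as that service account exchanges its projected token for a Microsoft Entra ID token, so no secret is ever stored. The Python sketch below builds the three fields such a credential contains. The subject format and audience are the standard ones for AKS Workload Identity. The issuer URL is a placeholder, and the `fuzzball` namespace and one-service-account-per-identity naming are assumptions for illustration.

```python
# Standard audience for AKS Workload Identity token exchange.
AUDIENCE = "api://AzureADTokenExchange"


def federated_credential(issuer: str, namespace: str, service_account: str) -> dict:
    """Fields of a federated identity credential trusting one service account."""
    return {
        "issuer": issuer,  # the AKS cluster's OIDC issuer URL
        "subject": f"system:serviceaccount:{namespace}:{service_account}",
        "audiences": [AUDIENCE],
    }


# Placeholder issuer; a real cluster exposes its own OIDC issuer URL.
ISSUER = "https://oidc.example.invalid/cluster-issuer/"

# Hypothetical namespace and service-account names, one per identity.
credentials = {
    name: federated_credential(ISSUER, "fuzzball", name)
    for name in ("fuzzball-provision", "fuzzball-secret", "billing")
}

print(credentials["fuzzball-secret"]["subject"])
# system:serviceaccount:fuzzball:fuzzball-secret
```

Because each subject names a single service account, a compromised pod in another namespace or under another account cannot assume these identities. This is what lets the RBAC role assignments stay least-privilege per workload.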

Monitoring. A Log Analytics workspace collects cluster and application telemetry, configurable alert rules fire on the conditions you set, and notification channels deliver alerts by email or webhook.

Portability without the penalty

Practitioners should not need to think about any of that during daily work.

When a Fuzzball user writes a workflow, they define compute requirements in abstract terms: "I need 4 CPUs, 16 GB of memory, and one NVIDIA GPU." They specify container images, data sources, and job dependencies. They don't specify cloud regions, VM SKUs, or storage backends. Fuzzball's scheduler and provisioner translate those abstract requirements into concrete cloud resources at runtime.
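
This post does not describe Fuzzball's placement logic. The Python sketch below only illustrates the general idea of translating an abstract request into a concrete VM size. The `pick_sku` helper and its "smallest fit" policy are assumptions. The SKUs are real Azure sizes with their published vCPU, memory, and GPU counts, but the catalog is deliberately tiny and ignores region availability, quota, price, and the memory the OS and Kubernetes reserve on each node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sku:
    name: str
    vcpus: int
    memory_gib: int
    gpus: int


# Tiny illustrative catalog; a real provisioner would query what the
# region and subscription actually offer.
CATALOG = [
    Sku("Standard_D4s_v5", 4, 16, 0),
    Sku("Standard_D8s_v5", 8, 32, 0),
    Sku("Standard_E4s_v5", 4, 32, 0),
    Sku("Standard_NC4as_T4_v3", 4, 28, 1),
    Sku("Standard_NC8as_T4_v3", 8, 56, 1),
]


def pick_sku(cpus: int, memory_gib: int, gpus: int = 0) -> Sku:
    """Return the smallest catalog entry that satisfies the request."""
    fits = [s for s in CATALOG
            if s.vcpus >= cpus and s.memory_gib >= memory_gib and s.gpus >= gpus]
    if not fits:
        raise LookupError("no SKU satisfies the request")
    # "Smallest" = fewest GPUs, then vCPUs, then memory: a stand-in for price.
    return min(fits, key=lambda s: (s.gpus, s.vcpus, s.memory_gib))


print(pick_sku(4, 16, gpus=1).name)  # Standard_NC4as_T4_v3
print(pick_sku(4, 16).name)          # Standard_D4s_v5
```

The workflow author only ever writes the left-hand side ("4 CPUs, 16 GB, one GPU"). Swap the catalog for another provider's instance types and the same request resolves there instead.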

The hard part of multi-cloud is the gap between "we support it" and "I moved my pipeline in five minutes." Fuzzball closes that gap. A genomics researcher who developed and validated a sequencing pipeline on an AWS-hosted Fuzzball cluster can run it on Azure without modifying the workflow file. Fuzzball is container-first and supports both Docker and Apptainer images, so the containers, data orchestration, and job sequencing all carry over.

For organizations running Fuzzball Federate (the layer that brokers workloads across multiple Orchestrate clusters), Azure support opens a new dimension. A federated deployment can now span an on-prem cluster, an AWS deployment, a GCP deployment, an OCI deployment, and an Azure deployment simultaneously. Federate's scheduling routes jobs to whichever environment offers the best combination of cost, performance, and data locality: a training job that needs H100s lands on Azure's ND-series VMs, and a data-sensitive simulation stays on-prem.
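
The routing described above can be read as hard constraints first and preferences second. The Python sketch below takes that reading. It does not describe Federate's actual scheduler. The `Site` and `JobPolicy` fields, the site inventory, the relative costs, and the lexicographic policy (data locality, then cost) are all hypothetical. Here, "performance" is reduced to whether a site has the required accelerator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    name: str
    gpu_types: frozenset[str]
    on_prem: bool
    cost_per_hour: float   # hypothetical relative cost
    has_input_data: bool   # is the job's input data already here?


@dataclass
class JobPolicy:
    gpu_type: Optional[str]
    data_sensitive: bool   # must stay on-prem


def route(job: JobPolicy, sites: list[Site]) -> str:
    # Hard constraints: required accelerator and data residency.
    eligible = [
        s for s in sites
        if (job.gpu_type is None or job.gpu_type in s.gpu_types)
        and (not job.data_sensitive or s.on_prem)
    ]
    if not eligible:
        raise LookupError("no site satisfies the job's constraints")
    # Preferences: sites that already hold the data first, then lower cost.
    return min(eligible, key=lambda s: (not s.has_input_data, s.cost_per_hour)).name


# Made-up inventory for illustration only.
sites = [
    Site("on-prem", frozenset({"A100"}), True, 1.0, True),
    Site("aws", frozenset({"A100"}), False, 2.0, False),
    Site("azure", frozenset({"H100"}), False, 3.0, False),
]

print(route(JobPolicy("H100", data_sensitive=False), sites))  # azure
print(route(JobPolicy(None, data_sensitive=True), sites))     # on-prem
```

A production broker would more likely weight these factors than rank them strictly. However, putting residency and hardware requirements ahead of any cost trade-off matches the two examples in the paragraph above.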

The workflow author hits run. Infrastructure policies handle routing.

One identity model across AWS, Azure, GCP, OCI, and bare metal

Maintaining consistent security across clouds is a real challenge. AWS IAM, Azure RBAC, Google Cloud IAM, and OCI Identity are different systems with different permission models, which often means organizations build separate access models and compliance workflows for each provider.

Fuzzball addresses this by implementing its own identity and access management layer on top of cloud-native primitives. Role-based access control governs who can submit workflows, access data, and manage clusters, regardless of which cloud hosts the cluster.
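
A cloud-independent access check is a lookup from roles to permitted actions, with no provider in the inputs. The Python sketch below illustrates that shape. The role names, the `Action` values, and the `is_allowed` helper are hypothetical, not Fuzzball's actual roles or API.

```python
from enum import Enum


class Action(Enum):
    SUBMIT_WORKFLOW = "submit_workflow"
    READ_DATA = "read_data"
    MANAGE_CLUSTER = "manage_cluster"


# Hypothetical roles mapped to the actions they grant.
ROLE_PERMISSIONS = {
    "viewer": {Action.READ_DATA},
    "researcher": {Action.SUBMIT_WORKFLOW, Action.READ_DATA},
    "admin": set(Action),  # every defined action
}


def is_allowed(roles: set[str], action: Action) -> bool:
    """True if any of the user's roles grants the action.

    The check takes no cloud argument, so the same decision applies whether
    the cluster runs on AWS, Azure, GCP, OCI, or bare metal. Unknown roles
    grant nothing.
    """
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in roles)


print(is_allowed({"researcher"}, Action.SUBMIT_WORKFLOW))  # True
print(is_allowed({"researcher"}, Action.MANAGE_CLUSTER))   # False
```

The cloud-native layer (Azure RBAC, AWS IAM, and so on) still governs what the platform's own service identities can touch. This application-level layer governs what people can do, and it stays identical across providers.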

On Azure specifically, the deployment uses Managed Identities and Federated Identity Credentials to eliminate static credentials, Azure Key Vault for encryption key management and secret storage, and Azure RBAC for fine-grained authorization. These are Azure-native security services, wired into Fuzzball's unified security model automatically during deployment.

Deploy your first Fuzzball workflow on Azure

Fuzzball's Azure deployment is available now. If your team is evaluating cloud HPC infrastructure, or if you're already running Fuzzball on AWS, GCP, or OCI and want to extend to Microsoft Azure, talk to the CIQ team to see it in action, or request a demo to walk through a deployment in your own Azure subscription.
