Utilizing the AWS EFA with Apptainer
Apptainer is the most popular container solution for HPC. It integrates well with specialized HPC hardware and software such as GPU devices and MPI software stacks, and its container model adds the security and reproducibility that make it ideal for deployment in a multi-user HPC cluster environment. One major use case for Apptainer is GPU-accelerated workloads, where it makes packaging an application together with its GPU support simple and effective in an HPC environment.
The Elastic Fabric Adapter (EFA) is an HPC interconnect offered by AWS on some of its compute instances. The EFA provides remote direct memory access (RDMA) in the same manner as InfiniBand, Omni-Path, or other HPC interconnect implementations, and using it to link up your compute instances on AWS can lead to significant speedups in your HPC workloads. The EFA is not difficult to work with, but does require some specialized installation procedures on both the host and in the container.
A basic requirement for this demo is a small, two-node cluster on AWS that can be used for testing. The instance types used for these nodes must support the EFA, so it’s recommended that you use two c5n.9xlarge instances (which the author used) or something similar. The author also set the CPU options to present only one thread per core. If you’d like to follow along, please use the guide at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html to set up your two nodes, and ensure they can connect to each other with fi_pingpong before proceeding.
For example, on the head, or “leader”, node instance, run:
$ fi_pingpong -p efa
And then, on the compute node instance, run:
$ fi_pingpong -p efa <IP of head node instance>
The -p efa option in both cases indicates we’d like to communicate over the EFA. If the EFA is working correctly, you should get a result like:
$ fi_pingpong -p efa 172.31.31.216
bytes #sent #ack total time MB/sec usec/xfer Mxfers/sec
64 10 =10 1.2k 0.00s 1.72 37.20 0.03
256 10 =10 5k 0.00s 12.46 20.55 0.05
1k 10 =10 20k 0.00s 48.42 21.15 0.05
4k 10 =10 80k 0.00s 168.56 24.30 0.04
[error] util/pingpong.c:1876: fi_close (-22) fid -1420723196
These are the performance results of the fabric interface pingpong test run over the EFA. If this setup doesn’t work after following the AWS guide, check your security group configuration and ensure the outbound rule for “All Traffic” within the instances’ security group is in place, and check that the passwordless SSH between the instances, also described in the guide, is working correctly. If you’re building the container on one of the cloud instances instead of uploading it after building it elsewhere, the instances also need to reach the internet at large to download components for the build. To allow that, add an outbound rule for “All TCP” to “Anywhere-IPv4” in addition to the “All Traffic” outbound rule within the security group that the guide tells you to create. This enables inter-instance communication over the EFA while also allowing communication with the open internet.
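If you want to script this check rather than read the output by eye, the following is a minimal sketch. The helper name pingpong_rows is hypothetical, and the row format it looks for (a message size in the first column and an "=N" acknowledgement count in the third) is assumed from the sample output shown above rather than taken from any libfabric specification.

```shell
# Count the data rows in fi_pingpong output read from stdin.
# A data row is assumed to look like: "64 10 =10 1.2k 0.00s 1.72 37.20 0.03"
pingpong_rows() {
    awk '$1 ~ /^[0-9]+[kmg]?$/ && $3 ~ /^=[0-9]+$/ { n++ } END { print n + 0 }'
}

# Self-contained demonstration using two rows copied from the sample output.
printf '%s\n' \
    'bytes #sent #ack total time MB/sec usec/xfer Mxfers/sec' \
    '64 10 =10 1.2k 0.00s 1.72 37.20 0.03' \
    '256 10 =10 5k 0.00s 12.46 20.55 0.05' | pingpong_rows
```

On the compute node, the same function could be fed the live output by piping fi_pingpong -p efa <IP of head node instance> into it; a count of zero would suggest that no transfers completed.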
With our instances set up and EFA verified as working, we’ll use the container below as an example for this article. This definition installs the EFA, and then builds the GROMACS molecular dynamics application against the EFA so GROMACS can utilize the EFA for acceleration. Here’s our definition:
Bootstrap: docker
From: rockylinux:8
%environment
export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH
export PATH=/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:$PATH
export PATH=/usr/local/gromacs/bin/:$PATH
%post
dnf update -y
dnf -y groupinstall "Development Tools"
dnf -y install tar curl wget cmake
cd /
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
cd aws-efa-installer/
./efa_installer.sh -y --skip-kmod
source ~/.bashrc
export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH
export PATH=/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:$PATH
cd /
mkdir gromacs
cd gromacs
wget https://ftp.gromacs.org/gromacs/gromacs-2022.tar.gz
tar xfz gromacs-2022.tar.gz
cd gromacs-2022
mkdir build
cd build
cmake -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON -DCMAKE_C_COMPILER=/opt/amazon/openmpi/bin/mpicc -DCMAKE_CXX_COMPILER=/opt/amazon/openmpi/bin/mpicxx -DGMX_MPI=on -DGMX_SIMD=AVX_512 -DGMX_OPENMP=ON ..
make -j8
make install
source /usr/local/gromacs/bin/GMXRC
rm -rf /gromacs/gromacs-2022.tar.gz
Let’s break this down line by line and see what we’re up to here.
Our first section, at the very top, defines what the base image for our container will be. In this case, we are using an image from the Docker Hub that represents a Rocky Linux 8 base:
Bootstrap: docker
From: rockylinux:8
The next section, %environment, is important, but we’ll skip over it for the moment: %environment is added to the container at the end of the build process, and the variables we set there are easier to understand once we have the context of the rest of the container. We move instead to the %post section, where we can execute commands in the base image to add more software and functionality to our container.
We first update the container, then install the standard development tools and a small handful of other tools we’ll need: tar, curl, wget, and cmake.
%post
dnf update -y
dnf -y groupinstall "Development Tools"
dnf -y install tar curl wget cmake
We then run the following chain of commands. Collectively, they navigate to the container filesystem root, download and unpack the EFA installation package from AWS, and then install it via the script provided in the package.
cd /
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
cd aws-efa-installer/
./efa_installer.sh -y --skip-kmod
source ~/.bashrc
The command that actually runs the EFA installer is the ./efa_installer.sh -y --skip-kmod line. The -y option automatically answers “yes” to, or otherwise skips, the points in the script where it expects user input, which we can’t provide interactively while the container build is running. The --skip-kmod option skips the attempt to install the EFA kernel modules. A container has no kernel of its own; it is effectively a swappable userspace run on top of the host’s kernel, so the relevant kernel modules must be loaded on the host instead.
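Because the kernel module lives on the host, it can be worth confirming it is loaded there before running anything from the container. The sketch below is one way to do that; the helper name module_loaded is hypothetical, and it assumes the EFA driver appears under the module name efa in /proc/modules, which you should verify against your own instance.

```shell
# Succeed if the named kernel module appears in a /proc/modules-style listing.
# The optional second argument lets the check be pointed at another file.
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}" 2>/dev/null
}

# Run this on the host, not inside the container build.
if module_loaded efa; then
    echo "efa kernel module is loaded on this host"
else
    echo "efa kernel module was not found on this host"
fi
```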
Once the installation is complete, we source the .bashrc, and move to the next part of the install.
We’ve now installed the EFA into the container, and from here, we can install additional applications that we want to accelerate with the EFA. While this is toggleable, our install here also included OpenMPI and libfabric that the EFA is installed against and that we can subsequently use to fulfill the MPI and libfabric requirements for downstream applications. So, for example, we can use this OpenMPI/libfabric/EFA stack to build an application like GROMACS with MPI support that will then also work with the EFA device on AWS.
We then add the paths to some of our MPI and EFA libraries/binaries so that we can use them during the container build for compiling GROMACS.
export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH
export PATH=/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:$PATH
We’ll then set up the build/compile for GROMACS itself. If you have a different application you’d like to build with EFA support, feel free to use this as a general guideline for what it takes to build something with EFA support. In this block, we create a space to install GROMACS into in the root directory of the container filesystem, use wget to pull down the GROMACS code, unpack, set up the environment for cmake/make to use for building GROMACS, build, install, and clean up.
One important aspect of this is the set of cmake options we apply, as these are generally the same type of options you may encounter elsewhere when trying to build applications against the EFA. Most of the options shown apply only to GROMACS, but notice -DCMAKE_C_COMPILER and -DCMAKE_CXX_COMPILER: with these two variables, we specify the paths to the MPI-aware compiler wrappers provided by OpenMPI (which was, again, installed with the EFA) so that the application is built with support for running over MPI. These paths were determined when we installed the EFA, so installing it into a different directory than the ones under /opt used here would require changing the MPI compiler paths in the cmake variables.
cd /
mkdir gromacs
cd gromacs
wget https://ftp.gromacs.org/gromacs/gromacs-2022.tar.gz
tar xfz gromacs-2022.tar.gz
cd gromacs-2022
mkdir build
cd build
cmake -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON -DCMAKE_C_COMPILER=/opt/amazon/openmpi/bin/mpicc -DCMAKE_CXX_COMPILER=/opt/amazon/openmpi/bin/mpicxx -DGMX_MPI=on -DGMX_SIMD=AVX_512 -DGMX_OPENMP=ON ..
make -j8
make install
source /usr/local/gromacs/bin/GMXRC
rm -rf /gromacs/gromacs-2022.tar.gz
Once this is built, we’ll have GROMACS installed into our container and be able to run it like any other system command. This completes the %post section of the container build.
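If your OpenMPI ends up somewhere other than /opt/amazon/openmpi, the two compiler variables are the part of the cmake line that has to change. As a rough sketch, a small helper can derive both flags from a single prefix; the function name mpi_compiler_flags is hypothetical, and the default prefix is simply the location the EFA installer reports in this article.

```shell
# Print the two cmake compiler flags for a given Open MPI install prefix.
# Defaults to the prefix used by the EFA installer in this article.
mpi_compiler_flags() {
    prefix="${1:-/opt/amazon/openmpi}"
    printf '%s\n' \
        "-DCMAKE_C_COMPILER=${prefix}/bin/mpicc" \
        "-DCMAKE_CXX_COMPILER=${prefix}/bin/mpicxx"
}

# Show the flags for the default prefix.
mpi_compiler_flags
```

Inside %post, the output could then be spliced into the configure step, for example cmake $(mpi_compiler_flags) -DGMX_MPI=on .. followed by the remaining GROMACS options.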
The last thing this container build will do is add the %environment section to the container, before running the final mksquashfs command that builds the Apptainer image proper. These exports are the same as the ones we ran earlier during the container build before compiling GROMACS, but here they add the AWS OpenMPI and EFA binaries/libraries to the container’s PATH and LD_LIBRARY_PATH environment variables permanently, rather than only for the lifetime of the container build. We also add the path to the location where GROMACS was installed, so that it can be run like any other system command.
%environment
export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH
export PATH=/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:$PATH
export PATH=/usr/local/gromacs/bin/:$PATH
With that container definition in gromacs.def, we can run our container build:
$ apptainer build ~/gromacs.sif ~/gromacs.def
INFO: Starting build...
Getting image source signatures
Copying blob d28605281af9 skipped: already exists
Copying config 2c0d3de157 done
Writing manifest to image destination
Storing signatures
2023/06/14 18:44:30 info unpack layer: sha256:d28605281af974c51e91c5adf06b4e61692f43483e6909dae2f1ad77b19cc489
INFO: Running post scriptlet
+ dnf update -y
…
During which, we should see the EFA install correctly…
== Testing EFA device ==
Starting server...
Starting client...
Error: fi_pingpong test timed out.
==============================================================================
An EFA device has been detected but a ping test has failed. Please consult the
EFA documentation to verify your configuration.
==============================================================================
===================================================
EFA installation complete.
- Please logout/login to complete the installation.
- Libfabric was installed in /opt/amazon/efa
- Open MPI was installed in /opt/amazon/openmpi
===================================================
…Before compiling GROMACS successfully and building everything into the final container SIF:
…
[ 96%] Built target libgromacs
[ 96%] Building CXX object api/gmxapi/CMakeFiles/gmxapi.dir/cpp/resourceassignment.cpp.o
[ 96%] Linking CXX executable ../../bin/gmx_mpi
[ 96%] Building CXX object api/gmxapi/CMakeFiles/gmxapi.dir/cpp/context.cpp.o
[ 96%] Building CXX object api/gmxapi/CMakeFiles/gmxapi.dir/cpp/exceptions.cpp.o
[ 96%] Building CXX object api/nblib/CMakeFiles/nblib.dir/box.cpp.o
[ 96%] Building CXX object api/nblib/CMakeFiles/nblib.dir/gmxcalculatorcpu.cpp.o
[ 96%] Building CXX object api/gmxapi/CMakeFiles/gmxapi.dir/cpp/gmxapi.cpp.o
[ 96%] Building CXX object api/gmxapi/CMakeFiles/gmxapi.dir/cpp/md.cpp.o
[ 96%] Built target gmx
…
INFO: Adding environment to container
INFO: Creating SIF file...
INFO: Build complete: /home/rocky/gromacs.sif
And we can then test GROMACS over the EFA to see how it works. From some averaged non-EFA tests, we know to expect about 4.15 ns/day without the EFA, so let’s see if having the EFA enabled improves this.
First, run the commands below to download the GROMACS benchmark data and navigate to the directory with the data we’ll be using. You may have to install wget on the instance to do so.
$ sudo dnf install -y wget
$ wget ftp.gromacs.org/pub/benchmarks/water_GMX50_bare.tar.gz
$ tar -xf water_GMX50_bare.tar.gz
We’ll then transfer the container and the data over to the other instance we set up for this demo. We could also use a shared storage drive between the two, but for this simple demo, we’ll make a copy.
$ scp ~/gromacs.sif <other instance IP>:~/
$ scp -r ./water-cut1.0_GMX50_bare <other instance IP>:~/
This will first copy the container to your home directory on the other instance, and then recursively copy the data files. With this done, let’s head into the directory with the data we’ll be using for this demo.
$ cd water-cut1.0_GMX50_bare/1536
Then we’ll use apptainer exec to run a command inside the container to preprocess the GROMACS data:
$ apptainer exec ~/gromacs.sif gmx_mpi grompp -f pme.mdp
To run our container over the EFA, we’ll then use the command below. Remember that we’re using c5n.9xlarge instances, without hyperthreading, so we expect 18 cores available for use per instance.
$ mpirun -np 36 --host <IP of instance 1>:18,<IP of instance 2>:18 apptainer exec ~/gromacs.sif gmx_mpi mdrun -pin auto -v -noconfout -nsteps 5000 -ntomp 1 -s topol.tpr -g ./mdlog.log
Breaking down this command, we first use mpirun -np 36 to tell MPI to run the container with 36 MPI processes. We then use --host <IP of instance 1>:18,<IP of instance 2>:18 to tell mpirun what the IP addresses of the compute hosts are, and that they each have 18 processor slots (essentially, cores) available for use. apptainer exec ~/gromacs.sif then allows us to run a command inside the container at the path ~/gromacs.sif, and we run the GROMACS command gmx_mpi mdrun -pin auto -v -noconfout -nsteps 5000 -ntomp 1 -s topol.tpr -g ./mdlog.log.
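With more than two instances, typing the --host value by hand gets error-prone. The following is a small sketch of how it could be assembled; build_host_list is a hypothetical helper, and the addresses are placeholders rather than real instance IPs.

```shell
# Build the value for mpirun's --host option: "host1:slots,host2:slots,...".
# The first argument is the slot count per host; the rest are host addresses.
build_host_list() {
    slots="$1"
    shift
    list=""
    for host in "$@"; do
        list="${list:+${list},}${host}:${slots}"
    done
    printf '%s\n' "$list"
}

# Placeholder addresses, for illustration only.
hosts=$(build_host_list 18 10.0.0.11 10.0.0.12)
printf '%s\n' "--host ${hosts}"
# prints: --host 10.0.0.11:18,10.0.0.12:18
```

The -np value should then match the total slot count, which is 36 for two 18-core instances.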
Information about the computation should start to print:
$ mpirun -np 36 --host <IP of instance 1>:18,<IP of instance 2>:18 apptainer exec ~/gromacs.sif gmx_mpi mdrun -pin auto -v -noconfout -nsteps 5000 -ntomp 1 -s ./topol.tpr -g ./mdlog.log
…
5000 steps, 10.0 ps.
step 0
step 100, remaining wall clock time: 199 s
And it will eventually finish, giving us a higher ns/day of simulation time than we’d achieved before without the EFA:
               Core t (s)   Wall t (s)        (%)
       Time:     7283.015      202.306     3600.0
                 (ns/day)    (hour/ns)
Performance:        4.272        5.619
Running this a few times shows a small but consistent increase: our average is about 4.27 ns/day, roughly 3% higher than our non-EFA average of 4.15 ns/day. With such a small cluster, the difference is minor, but with a larger cluster the gains would be much more significant.
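To compare runs without copying numbers by hand, the ns/day figure can be pulled out of the GROMACS log and turned into a percentage. This is only a sketch: ns_per_day and percent_gain are hypothetical helper names, and the parsing assumes the "Performance:" line has the layout shown in the output above.

```shell
# Print the ns/day figure from the "Performance:" line of a GROMACS log on stdin.
ns_per_day() {
    awk '$1 == "Performance:" { print $2 }'
}

# Percentage change from a baseline value to a new value, to one decimal place.
percent_gain() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# The two averages quoted in this article.
percent_gain 4.15 4.27
# prints: 2.9
```

After a run, ns_per_day < ./mdlog.log would give the value to pass as the second argument.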
Now you know how to use the AWS EFA with Apptainer. Keep an eye out for further articles about modern use cases with Apptainer from CIQ, and if you have any questions, please reach out to us at info@ciq.com.