
Workaround for Communication Issue with MPI Apps & Apptainer without Setuid

Yoshiaki Senda, September 7, 2023

In this article, we (1) introduce the intra-node communication issue that occurs with MPI applications and Apptainer without setuid (the default mode of Apptainer v1.1.x and v1.2.x), which relies on unprivileged user namespaces and affects certain intra-node communication transports; (2) show test results for various MPIs, communication frameworks, interconnects, and intra-node communication transports; and (3) explain how unprivileged user namespaces work with Apptainer behind the scenes and show the best known workaround in the meantime.

Introduction

The investigation started when multiple users reported an MPI runtime issue with Apptainer v1.1.x and v1.2.x (without setuid). They reported a "Permission denied" shared memory error caused by a failure to open a file descriptor under /proc/$PID/fd with UCX, a communication framework for high-bandwidth and low-latency networks. The typical error message is the following:

parser.c:2045 UCX  INFO  UCX_* env variable: UCX_LOG_LEVEL=info
ucp_context.c:2119 UCX  INFO  Version 1.15.0 (loaded from /opt/hpcx-v2.16/ucx/lib/libucp.so.0)
ucp_worker.c:1855 UCX  INFO    0xa505d0 self cfg#0 tag(self/memory knem/memory)
ucp_worker.c:1855 UCX  INFO    0xa505d0 intra-node cfg#1 tag(sysv/memory knem/memory)
ucp_context.c:2119 UCX  INFO  Version 1.15.0 (loaded from /opt/hpcx-v2.16/ucx/lib/libucp.so.0)
ucp_worker.c:1855 UCX  INFO    0xe31f00 self cfg#0 tag(self/memory knem/memory)  rma(self/memory)
ucp_worker.c:1855 UCX  INFO    0xe31f00 intra-node cfg#1 tag(sysv/memory knem/memory)  rma(rc_mlx5/mlx5_0:1 sysv/memory posix/memory)
mm_posix.c:233  UCX  ERROR   open(file_name=/proc/785917/fd/61 flags=0x0) failed: Permission denied
mm_ep.c:171  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xc000000f400bfdfd: Shared memory error

We tested (1) OpenMPI and UCX with Omni-Path, (2) IntelMPI and libfabric with Omni-Path, and (3) HPC-X (OpenMPI, UCX) with InfiniBand against Apptainer v1.1.x and v1.2.x and observed the same error users reported. With Apptainer v1.0.x and earlier, where setuid-root mode is the default, we did not face this issue. We will show the test results in later sections.

Apptainer and unprivileged user namespace

Since Apptainer v1.1.0, the default mode of Apptainer has changed to "without setuid" mode, which depends on the "unprivileged user namespaces" kernel feature and mount namespaces. The other mode, Apptainer with setuid-root, which was the default until v1.1.0 was released, runs the container in a new mount namespace but not in a new user namespace. The "Permission denied" error above appears to derive from this change.

mpirun -np 2 apptainer run hpc.sif
|_ apptainer
|  |_ process 1 (new user namespace #1/new mount namespace #1)
|_ apptainer
   |_ process 2 (new user namespace #2/new mount namespace #2)

Apptainer v1.1.x and v1.2.x (default mode)

mpirun -np 2 apptainer run hpc.sif
|_ apptainer
|  |_ process 1 (default user namespace #0/new mount namespace #1)
|_ apptainer
   |_ process 2 (default user namespace #0/new mount namespace #2)

Apptainer v1.0.x and earlier (or apptainer-suid)

Apptainer releases the apptainer package and the apptainer-suid package for EL8/EL9 (e.g., Rocky Linux 8 and Rocky Linux 9) via EPEL and for Ubuntu Focal/Jammy via a PPA.
apptainer is the default package that uses "unprivileged user namespaces." apptainer-suid adds setuid-starter, a setuid-root helper. With the apptainer-suid package, users can still run containers in a new user namespace with the apptainer exec --userns command.
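
A quick way to check which mode is in effect at runtime is to compare the user namespace of a containerized process with the host's. This is a minimal sketch; image.sif stands in for any container image, and we verify this in more detail later in the article:

$ readlink /proc/self/ns/user                            # host user namespace ID
$ apptainer exec image.sif readlink /proc/self/ns/user   # ID seen inside the container

With the default (non-setuid) apptainer the two IDs differ; with apptainer-suid (and no --userns flag) they match.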

Test with various MPIs, communication frameworks, interconnects, and intra-node communication transports

In this section, we briefly explain how we replicate the issue, how we performed tests in various environments, our experiments, and our findings. We used the Intel MPI Benchmark as an example MPI application for these tests. Please see the appendix for the definition files of containers we used in this article.

Test Case 1: OpenMPI, UCX, and Omni-Path

The cluster we used here for the tests is a Warewulf v4 managed HPC cluster with a Rocky Linux 8 based image, AMD EPYC CPU, and Intel Omni-Path HFI. (Thank you, Dell!) The workload manager we used here is Slurm, and the OpenHPC repository was used for its installation. The Apptainer we used here is apptainer 1.2.2-1.el8 from EPEL, and the host OS is Rocky Linux 8.8.

$ lspci | grep HFI
e1:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)

The following is the job script for the first test. Thanks to the fully containerized model, we can use srun to launch MPI processes without using host-side MPI.

#!/bin/bash
#SBATCH --partition=opa
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

OMPI_MCA_pml=ucx \
OMPI_MCA_btl='^vader,tcp,openib,uct' \
UCX_LOG_LEVEL=info \
UCX_TLS=all \
UCX_NET_DEVICES=hfi1_0:1 \
SLURM_MPI_TYPE=pmi2 \
srun apptainer run ompi-imb.sif alltoall

Posix Transport Error

This is a part of the output that illustrates the errors. This posix transport error is due to the failure to open a file descriptor under /proc/$PID/fd with "Permission denied." (The UCX library inside this container is v1.10.1.)

ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: tag(self/memory cma/memory rc_verbs/hfi1_0:1); 
ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: tag(posix/memory cma/memory rc_verbs/hfi1_0:1); 
mm_posix.c:195  UCX  ERROR open(file_name=/proc/158012/fd/17 flags=0x0) failed: Permission denied
mm_ep.c:155  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc00000044002693c: Shared memory error

The posix transport needs the ptrace capability when it uses the proc link (ref). The ptrace capability only works within the same or child user namespaces. As we saw in the previous section, Apptainer without setuid relies on unprivileged user namespaces, and the MPI application processes spawned by the MPI launcher via apptainer belong to different user namespaces. Each of those processes launches child processes such as UCP Workers during MPI initialization. A UCP Worker checks the connectivity of intra-node communication transports to the other UCP Workers, and it reports a posix transport connectivity error at this phase because the other UCP Workers are children of processes under a different user namespace.

mpirun -np 2 apptainer run hpc.sif
|_ apptainer
|  |_ IMB-MPI1 (new user namespace #1/new mount namespace #1)
|     |_UCX instance (UCP Worker) 1 (new user namespace #1/new mount namespace #1)
|_ apptainer
   |_ IMB-MPI1 (new user namespace #2/new mount namespace #2)
      |_UCX instance (UCP Worker) 2 (new user namespace #2/new mount namespace #2)

For more about the UCP Worker, please see section 3.3 of the UCX API Standard document: http://openucx.github.io/ucx/api/latest/ucx.pdf

Workaround for Posix Transport Error

This error can be avoided by setting the UCX_POSIX_USE_PROC_LINK=n environment variable at runtime. This switches the posix transport to a mode that does not use the proc link, so it does not require the ptrace capability and therefore works in this case.
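
For example, the earlier job script can be adapted as follows; this is a minimal sketch, and only the UCX_POSIX_USE_PROC_LINK line is new compared to the original script:

#!/bin/bash
#SBATCH --partition=opa
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

OMPI_MCA_pml=ucx \
OMPI_MCA_btl='^vader,tcp,openib,uct' \
UCX_LOG_LEVEL=info \
UCX_TLS=all \
UCX_NET_DEVICES=hfi1_0:1 \
UCX_POSIX_USE_PROC_LINK=n \
SLURM_MPI_TYPE=pmi2 \
srun apptainer run ompi-imb.sif alltoall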

Verify Posix Transport Works with apptainer-suid

With UCX v1.10.1 and apptainer-suid, we don't see any errors either. This is a sample output with apptainer-suid.

ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: tag(self/memory cma/memory rc_verbs/hfi1_0:1); 
ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: tag(posix/memory cma/memory rc_verbs/hfi1_0:1); 
ucp_worker.c:1720 UCX  INFO  ep_cfg[2]: tag(rc_verbs/hfi1_0:1);

UCX version difference

With the UCX master branch (before 5f00157 was merged on Aug 25th) and apptainer, the results are a bit different from UCX v1.10.0. The UCX master branch uses the sysv transport for intra-node communication rather than the posix transport when both transports are available; UCX v1.10.0 uses the posix transport first in the same situation.

ucp_context.c:2127 UCX  INFO  Version 1.16.0 (loaded from /opt/ucx/lib/libucp.so.0)
parser.c:2044 UCX  INFO  UCX_* env variables: UCX_TLS=all UCX_LOG_LEVEL=info UCX_NET_DEVICES=hfi1_0:1
ucp_context.c:2127 UCX  INFO  Version 1.16.0 (loaded from /opt/ucx/lib/libucp.so.0)
ucp_worker.c:1877 UCX  INFO    0x12af4a0 self cfg#0 tag(self/memory cma/memory rc_verbs/hfi1_0:1)
ucp_worker.c:1877 UCX  INFO    0x12af4a0 intra-node cfg#1 tag(sysv/memory cma/memory rc_verbs/hfi1_0:1)
ucp_worker.c:1877 UCX  INFO    0x12af4a0 inter-node cfg#2 tag(rc_verbs/hfi1_0:1)
ucp_worker.c:1877 UCX  INFO    0x12af4a0 inter-node cfg#3 tag(rc_verbs/hfi1_0:1)

CMA transport error

When the sysv transport is used instead of the posix transport, the Intel MPI Benchmark Alltoall only runs for transfers below 8192 bytes; above 8192 bytes we got a cma transport error.

UCX has a threshold for switching to zero-copy transfer, UCX_ZCOPY_THRESH, whose default value is UCX_ZCOPY_THRESH=16384 (ref). The 16384-byte transfer comes right after 8192 bytes in an Alltoall test, which explains why it failed after the 8192-byte transfer: cma is one of the zero-copy transfer mechanisms.
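
If the ucx_info utility is available inside the container (the appendix build installs UCX under /opt/ucx), you can inspect the configured thresholds and override them per run with the same environment-variable mechanism used elsewhere in this article; the exact variable names and output format vary between UCX versions:

# Show zero-copy threshold settings known to this UCX build
apptainer exec ompi-imb.sif /opt/ucx/bin/ucx_info -c | grep ZCOPY_THRESH

# Override the threshold for a single run (value in bytes)
UCX_ZCOPY_THRESH=16384 srun apptainer run ompi-imb.sif alltoall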

The cma transport also needs the ptrace capability, so it fails with apptainer in the same situation we explained in the section above.

#----------------------------------------------------------------
# Benchmarking Alltoall 
# #processes = 2 
# ( 14 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.04         0.04
            1         1000         0.86         0.87         0.87
            2         1000         0.84         0.84         0.84
            4         1000         0.84         0.84         0.84
            8         1000         0.84         0.85         0.85
           16         1000         0.84         0.85         0.84
           32         1000         1.38         1.39         1.39
           64         1000         1.37         1.38         1.38
          128         1000         1.38         1.39         1.39
          256         1000         1.52         1.57         1.54
          512         1000         1.74         1.74         1.74
         1024         1000         1.96         1.96         1.96
         2048         1000         2.38         2.39         2.38
         4096         1000         3.19         3.21         3.20
         8192         1000         5.42         5.55         5.48
Assertion failure at /builddir/build/BUILD/opa-psm2-PSM2_11.2.230/ptl_am/ptl.c:153: nbytes == req->req_data.recv_msglen
ucp_context.c:2127 UCX  INFO  Version 1.16.0 (loaded from /opt/ucx/lib/libucp.so.0)
parser.c:2044 UCX  INFO  UCX_* env variable: UCX_LOG_LEVEL=info
ucp_context.c:2127 UCX  INFO  Version 1.16.0 (loaded from /opt/ucx/lib/libucp.so.0)
IMB-MPI1: Reading from remote process' memory failed. Disabling CMA support
IMB-MPI1: Reading from remote process' memory failed. Disabling CMA support

The trade-off between intra-node communication transports

UCX switches transports for intra-node communication depending on the size of the data being transferred because there is a trade-off between single-copy and zero-copy transfer. Zero-copy transfer requires a system call, and that system call adds latency when transferring relatively small data. On the other hand, single-copy transfer requires extra time for the data copy. If that extra time exceeds the latency introduced by the system call, single-copy will be slower than zero-copy, and vice versa.

Zero-copy actually means "single copy": the name "zero" comes from "zero extra copies," while single-copy means "one extra copy" (ref).

There are other zero-copy type transports such as knem and xpmem. We will use knem with HPC-X later. xpmem is known for using fewer system calls than other zero-copy transports, so we expect better scalability from it, but we didn't test it in this article since we are focusing on the specific intra-node communication issue.
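
As a side note, you can list the transports and devices that a given UCX build detects on a node (or inside a container) with ucx_info, assuming it is available at the install prefix used in the appendix; the output format differs between versions:

# List transports/devices visible to UCX inside the container
apptainer exec ompi-imb.sif /opt/ucx/bin/ucx_info -d | grep -E 'Transport|Device'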

[Figure: UCX architecture diagram]

Image from: https://github.com/openucx/ucx/blob/master/docs/doxygen/Architecture.png
LICENSE https://github.com/openucx/ucx/blob/master/LICENSE

Workaround A: manual fallback, excluding Posix and CMA for intra-node communication

We confirmed that the posix (proc link mode) and cma transports for intra-node communication don't work between processes under different user namespaces spawned by the MPI launcher via apptainer, and that the sysv transport works in the same case.

Excluding posix and cma transports by using UCX_TLS=^'posix,cma' may work with apptainer, so let’s test this scenario.

#!/bin/bash
#SBATCH --partition=opa
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

OMPI_MCA_pml=ucx \
OMPI_MCA_btl='^vader,tcp,openib,uct' \
UCX_LOG_LEVEL=info \
UCX_TLS=^'posix,cma' \
UCX_NET_DEVICES=hfi1_0:1 \
SLURM_MPI_TYPE=pmi2 \
srun apptainer run ompi-imb.sif alltoall

It works and it uses sysv and ib for intra-node communication.

ucp_context.c:2127 UCX  INFO  Version 1.16.0 (loaded from /opt/ucx/lib/libucp.so.0)
parser.c:2044 UCX  INFO  UCX_* env variables: UCX_TLS=self,sysv,ib UCX_LOG_LEVEL=info UCX_NET_DEVICES=hfi1_0:1
ucp_context.c:2127 UCX  INFO  Version 1.16.0 (loaded from /opt/ucx/lib/libucp.so.0)
ucp_worker.c:1877 UCX  INFO    0xedf1d0 self cfg#0 tag(self/memory rc_verbs/hfi1_0:1)
ucp_worker.c:1877 UCX  INFO    0xedf1d0 intra-node cfg#1 tag(sysv/memory rc_verbs/hfi1_0:1)
ucp_worker.c:1877 UCX  INFO    0xedf1d0 inter-node cfg#2 tag(rc_verbs/hfi1_0:1)
ucp_worker.c:1877 UCX  INFO    0xedf1d0 inter-node cfg#3 tag(rc_verbs/hfi1_0:1)

This is an example output of an Alltoall result, using 16 processes (4 processes per node x 4 nodes).

The conditions and compute node specs are quite different from the other cases shown in this article. The times displayed here are not directly comparable to other cases.

#----------------------------------------------------------------
# Benchmarking Alltoall
# #processes = 16
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.04         0.04
            1         1000        34.03        45.82        40.83
            2         1000        34.20        47.19        41.27
            4         1000        24.13        29.33        26.50
            8         1000        24.64        29.89        26.90
           16         1000        25.49        30.33        27.89
           32         1000        24.13        40.25        33.37
           64         1000        39.09        67.24        57.80
          128         1000        45.20        61.63        55.32
          256         1000        45.06        61.75        55.71
          512         1000        88.64       106.03        97.98
         1024         1000        50.36       103.29        65.57
         2048         1000        53.70       101.88        69.13
         4096         1000        64.13        79.01        70.60
         8192         1000        89.85       106.78        99.54
        16384         1000       153.38       174.32       164.66
        32768         1000       283.60       399.88       359.97
        65536          640       484.47       661.20       597.43
       131072          320       890.77      1249.68      1112.92
       262144          160      1818.17      2060.94      1981.74
       524288           80      4034.32      5203.53      4689.51
      1048576           40      8448.32     10160.09      9205.72
      2097152           20     19451.08     21342.62     20755.74
      4194304           10     37126.31     39543.59     38329.14

Workaround B: automatic fallback and user namespaces detection by UCX PR #9213

Pull request #9213 by Thomas Vegas on the UCX repository addresses this intra-node communication issue; it proposes detecting the user namespace difference between UCX instances and automatically falling back to the sysv transport for both posix and cma.

We tested this fallback feature with UCX v1.10.1. We accessed this PR on July 17th, 2023, and used the patch available at that time.

On Aug 25th, 2023, https://github.com/openucx/ucx/pull/9213 was merged into the master branch.
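
For reference, one way to try the change before it appeared in a release is to build UCX from a checkout that includes the pull request; the local branch name below is arbitrary, and the configure options follow the build recipe in the appendix:

git clone https://github.com/openucx/ucx.git && cd ucx
# Fetch the pull request into a local branch named pr-9213
git fetch origin pull/9213/head:pr-9213
git checkout pr-9213
./autogen.sh
./configure --prefix=/opt/ucx
make -j $(nproc) && make install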

This is the sample output; it indicates sysv and ib are selected for intra-node communication.

ucp_worker.c:1720 UCX  INFO  ep_cfg[0]: tag(posix/memory cma/memory rc_verbs/hfi1_0:1); 
ucp_worker.c:1720 UCX  INFO  ep_cfg[1]: tag(sysv/memory rc_verbs/hfi1_0:1); 
ucp_worker.c:1720 UCX  INFO  ep_cfg[2]: tag(rc_verbs/hfi1_0:1);

It runs successfully; this is the sample output of Alltoall for 16 processes (4 processes per node x 4 nodes).

The conditions and compute node specs are quite different from the other cases shown in this article. The times displayed here are not directly comparable to other cases.

#----------------------------------------------------------------
# Benchmarking Alltoall 
# #processes = 16 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.04         0.04
            1         1000        34.72        47.37        41.16
            2         1000        35.29        46.93        41.33
            4         1000        24.99        28.98        26.37
            8         1000        25.00        29.48        26.55
           16         1000        25.22        29.96        27.06
           32         1000        28.70        37.31        32.11
           64         1000        44.68        66.88        57.40
          128         1000        31.30        38.62        35.43
          256         1000        33.18        40.86        36.88
          512         1000        53.59        69.53        63.76
         1024         1000        47.96       110.59        69.14
         2048         1000        49.43       110.67        69.90
         4096         1000        59.86        72.14        66.62
         8192         1000        66.97        87.16        77.50
        16384         1000       133.63       142.04       138.21
        32768         1000       213.63       254.62       232.93
        65536          640       423.80       489.25       441.71
       131072          320       788.05       918.56       858.12
       262144          160      1685.61      1876.19      1804.51
       524288           80      4046.91      4349.73      4180.03
      1048576           40      6056.49      7065.89      6744.36
      2097152           20     13290.67     14372.73     14010.01
      4194304           10     27122.13     28860.79     28179.71

Test case 2: IntelMPI, libfabric, and Omni-Path

The cluster we used here for the tests is a Warewulf v4 managed HPC cluster with a Rocky Linux 8 based image, AMD EPYC CPU, and Intel Omni-Path HFI. (Thank you, Dell!) The workload manager we used here is Slurm, and the OpenHPC repository was used for its installation. The Apptainer we used here is apptainer 1.2.2-1.el8 from EPEL and the host OS is Rocky Linux 8.8. IntelMPI is version 2021.6.0 from Intel oneAPI HPC Toolkit 2022.2.0.

[Figure: Intel MPI Library over libfabric architecture]

Image from: https://www.intel.com/content/www/us/en/developer/articles/technical/mpi-library-2019-over-libfabric.html
© Intel Corporation

The following is the job script for this environment. Thanks to the fully containerized model, we can use srun to launch MPI processes without using host-side MPI. For details on the runtime switches we used here, please see the Intel MPI Library Over Libfabric documentation and the Libfabric Programmer's Manual.

#!/bin/bash
#SBATCH --partition=opa
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so.0
export I_MPI_PMI=PMI2
export I_MPI_DEBUG=4
export I_MPI_OFI_PROVIDER=psm2
export PSM2_IDENTIFY=1
export PSM2_DEVICES="shm,self,hfi1_0:1"
srun --mpi=pmi2 apptainer run impi-imb.sif alltoall

PSM2 with libfabric 1.13.2 and PSM2 with libfabric 1.17.0

This is a sample output when we use the PSM2 OFI provider with both libfabric 1.13.2 (the IntelMPI-bundled version) and libfabric 1.17.0 (the Rocky Linux 8 system default version); both fail in the same way with a cma error once data sizes exceed 8192 bytes.

#----------------------------------------------------------------
# Benchmarking Alltoall 
# #processes = 2 
# ( 14 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.04         0.04
            1         1000         1.23         1.31         1.27
            2         1000         1.24         1.33         1.28
            4         1000         1.23         1.34         1.29
            8         1000         1.23         1.34         1.28
           16         1000         1.24         1.38         1.31
           32         1000         1.56         1.73         1.64
           64         1000         1.51         1.66         1.59
          128         1000         1.51         1.66         1.59
          256         1000         1.56         1.71         1.63
          512         1000         1.69         1.86         1.78
         1024         1000         1.93         2.02         1.97
         2048         1000         2.11         2.29         2.20
         4096         1000         2.73         2.90         2.82
         8192         1000         4.60         4.75         4.67
c5.150027IMB-MPI1: Reading from remote process' memory failed. Disabling CMA support
c5.150028IMB-MPI1: Reading from remote process' memory failed. Disabling CMA support
c5.150027Assertion failure at /builddir/build/BUILD/opa-psm2-PSM2_11.2.230/ptl_am/ptl.c:153: nbytes == req->req_data.recv_msglen
c5.150028Assertion failure at /builddir/build/BUILD/opa-psm2-PSM2_11.2.230/ptl_am/ptl.c:153: nbytes == req->req_data.recv_msglen
/.singularity.d/runscript: line 4: 150028 Aborted                 (core dumped) /opt/mpi-benchmarks/src_c/IMB-MPI1 "$@"
/.singularity.d/runscript: line 4: 150027 Aborted                 (core dumped) /opt/mpi-benchmarks/src_c/IMB-MPI1 "$@"

Workaround A: OPX OFI provider with libfabric 1.17.0

The following is the job script for using an OPX OFI provider instead of a PSM2 OFI provider.

#!/bin/bash
#SBATCH --partition=opa
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so.0
export I_MPI_PMI=PMI2
export I_MPI_DEBUG=4
export I_MPI_OFI_PROVIDER=opx
export FI_LOG_LEVEL=Debug
srun --mpi=pmi2 apptainer run impi-imb.sif alltoall

It runs successfully with an OPX OFI provider (ref1, ref2) with libfabric 1.17.0. (libfabric 1.13.2 doesn’t support the OPX OFI provider.)

#----------------------------------------------------------------
# Benchmarking Alltoall 
# #processes = 16 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.03         0.05         0.04
            1         1000         7.56         9.02         8.47
            2         1000         7.42         8.88         8.33
            4         1000         8.91        10.52         9.64
            8         1000         8.89        10.32         9.55
           16         1000        17.41        20.10        18.79
           32         1000        17.62        20.63        19.16
           64         1000        19.29        22.00        20.67
          128         1000        22.14        25.63        24.01
          256         1000        27.05        32.93        30.07
          512         1000         9.91        12.91        11.27
         1024         1000        12.63        17.31        15.15
         2048         1000        19.19        25.31        21.93
         4096         1000        34.55        41.74        37.46
         8192         1000        74.98        98.33        87.40
        16384         1000       133.60       165.94       150.87
        32768         1000       181.36       243.23       214.27
        65536          640       318.24       467.85       413.83
       131072          320       624.44       918.17       797.29
       262144          160      1366.07      1849.15      1599.16
       524288           80      3954.24      4015.58      3988.23
      1048576           40      8239.24      8445.37      8354.97
      2097152           20     16091.88     16453.54     16266.20
      4194304           10     27105.76     27548.96     27392.86
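
As a sanity check for this kind of setup, the fi_info utility shipped with libfabric can list the OFI providers a given build exposes inside the container, assuming the utility is installed; psm2 and opx should appear when the corresponding providers are available:

# List available libfabric (OFI) providers inside the container
apptainer exec impi-imb.sif fi_info -l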

Test case 3: HPC-X, UCX, and NVIDIA ConnectX-6 InfiniBand

The cluster we used here for the tests is a Warewulf v4 managed HPC cluster with a Rocky Linux 9 based image, Intel Xeon Processor E5 v4 CPU, NVIDIA ConnectX-6 InfiniBand, MOFED (ref), and the NVIDIA HPC-X Software Toolkit (ref).

The following is the job script for this environment. Thanks to the fully containerized model, we can use srun to launch MPI processes without using host-side MPI. We rebuilt the OpenMPI bundled with HPC-X against the Slurm PMI library (see the appendix for details).

#!/bin/bash
#SBATCH --partition=hpcx
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

UCX_LOG_LEVEL=info \
srun --mpi=pmi2 apptainer run hpcx-imb.sif alltoall

We got a posix transport (proc link mode) error and an hcoll error.

ucp_context.c:2119 UCX  INFO  Version 1.15.0 (loaded from /opt/hpcx-v2.16-gcc-mlnx_ofed-redhat9-cuda12-gdrcopy2-nccl2.18-x86_64/ucx/lib/libucp.so.0)
parser.c:2045 UCX  INFO  UCX_* env variable: UCX_LOG_LEVEL=info
ucp_worker.c:1855 UCX  INFO    0x1c16bd0 self cfg#0 tag(self/memory knem/memory)
ucp_worker.c:1855 UCX  INFO    0x1c16bd0 intra-node cfg#1 tag(sysv/memory knem/memory)
ucp_worker.c:1855 UCX  INFO    0x1c16bd0 inter-node cfg#2 tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_2:1)
ucp_worker.c:1855 UCX  INFO    0x1c16bd0 inter-node cfg#3 tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_2:1)
ucp_worker.c:1855 UCX  INFO    0x254c710 self cfg#0 tag(self/memory knem/memory)  rma(self/memory)
ucp_worker.c:1855 UCX  INFO    0x254c710 intra-node cfg#1 tag(sysv/memory knem/memory)  rma(rc_mlx5/mlx5_0:1 sysv/memory posix/memory)
mm_posix.c:233  UCX  ERROR   open(file_name=/proc/71969/fd/99 flags=0x0) failed: Permission denied
mm_ep.c:171  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xc0000018c0011921: Shared memory error
[LOG_CAT_P2P] UCX returned connect error: Shared memory error
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed

But it ran successfully and produced a result. The notable difference here is that knem was used instead of cma.

#----------------------------------------------------------------
# Benchmarking Alltoall 
# #processes = 16 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.03         0.04         0.03
            1         1000         4.89         5.74         5.26
            2         1000         4.89         5.76         5.23
            4         1000         9.60        11.77        11.01
            8         1000        10.36        12.46        11.67
           16         1000        10.68        12.75        11.95
           32         1000        11.21        13.31        12.44
           64         1000        11.97        13.89        13.09
          128         1000        12.34        14.55        13.72
          256         1000        13.42        16.15        15.06
          512         1000         7.65        10.39         9.25
         1024         1000         8.76        12.10        10.63
         2048         1000        12.14        15.89        14.42
         4096         1000        19.92        23.51        21.97
         8192         1000        38.49        45.05        41.75
        16384         1000        70.44        94.53        76.79
        32768         1000       132.40       179.49       148.36
        65536          640       156.87       161.01       158.83
       131072          320       286.91       292.65       289.76
       262144          160       729.10       803.72       772.81
       524288           80      1241.64      1296.77      1273.55
      1048576           40      2399.29      2529.89      2485.62
      2097152           20      4652.42      4853.07      4753.89
      4194304           10      9854.13     10021.56      9944.17

KNEM transport

The difference between test case 1 and the case above is that the OpenMPI bundled with HPC-X is built against knem, another option for zero-copy transfer. Unlike cma, knem does not fail due to user namespace separation, so HPC-X keeps working.
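
To confirm that knem is actually usable, a quick check is whether the knem kernel module is loaded on the host and whether the /dev/knem device is visible inside the container (Apptainer binds /dev by default); this is a minimal sketch:

# On the host: is the knem kernel module loaded?
lsmod | grep -i knem

# Inside the container: is the knem device visible?
apptainer exec hpcx-imb.sif ls -l /dev/knem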

Explicitly use posix and cma

If we intentionally use posix and cma, it fails to run, as expected.

#!/bin/bash
#SBATCH --partition=thor
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

export UCX_LOG_LEVEL=info
export UCX_TLS=self,posix,cma,ib
srun --mpi=pmi2 apptainer run hpcx-imb.sif alltoall

ucp_context.c:2119 UCX  INFO  Version 1.15.0 (loaded from /opt/hpcx-v2.16-gcc-mlnx_ofed-redhat9-cuda12-gdrcopy2-nccl2.18-x86_64/ucx/lib/libucp.so.0)
ucp_worker.c:1855 UCX  INFO    0x24ef030 self cfg#0 tag(self/memory cma/memory rc_mlx5/mlx5_0:1)
ucp_worker.c:1855 UCX  INFO    0x24ef030 intra-node cfg#1 tag(posix/memory cma/memory rc_mlx5/mlx5_0:1)
mm_posix.c:233  UCX  ERROR   open(file_name=/proc/69802/fd/27 flags=0x0) failed: Permission denied
mm_ep.c:171  UCX  ERROR   mm ep failed to connect to remote FIFO id 0xc0000006c000ff24: Shared memory error
pml_ucx.c:424  Error: ucp_ep_create(proc=13) failed: Shared memory error

Explicitly use sysv and cma

This time with sysv and cma.

#!/bin/bash
#SBATCH --partition=thor
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

export UCX_LOG_LEVEL=info
export UCX_TLS=self,sysv,cma,ib
srun --mpi=pmi2 apptainer run hpcx-imb.sif alltoall

We got a cma error as expected.

#----------------------------------------------------------------
# Benchmarking Alltoall 
# #processes = 2 
# ( 14 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.03         0.03         0.03
            1         1000         0.59         0.59         0.59
            2         1000         0.58         0.59         0.58
            4         1000         0.58         0.59         0.58
            8         1000         0.57         0.59         0.58
           16         1000         0.58         0.59         0.58
           32         1000         0.63         0.66         0.64
           64         1000         0.63         0.65         0.64
          128         1000         0.78         0.80         0.79
          256         1000         0.83         0.87         0.85
          512         1000         0.74         0.85         0.80
         1024         1000         0.85         0.95         0.90
         2048         1000         1.11         1.23         1.17
         4096         1000         1.59         1.70         1.65
         8192         1000         2.55         2.64         2.60
ucp_ep.c:1504 UCX  DIAG  ep 0x1492f8017040: error 'Connection reset by remote peer' on cma/memory will not be handled since no error callback is installed
cma_ep.c:81   process_vm_readv(pid=601257 {0x2b5a0d0,16384}-->{0x3571e30,16384}) returned -1: Operation not permitted
ucp_context.c:2119 UCX  INFO  Version 1.15.0 (loaded from /opt/hpcx-v2.16-gcc-mlnx_ofed-redhat9-cuda12-gdrcopy2-nccl2.18-x86_64/ucx/lib/libucp.so.0)
parser.c:2045 UCX  INFO  UCX_* env variables: UCX_TLS=self,sysv,cma,ib UCX_LOG_LEVEL=info
ucp_context.c:2119 UCX  INFO  Version 1.15.0 (loaded from /opt/hpcx-v2.16-gcc-mlnx_ofed-redhat9-cuda12-gdrcopy2-nccl2.18-x86_64/ucx/lib/libucp.so.0)
ucp_worker.c:1855 UCX  INFO    0x2037070 self cfg#0 tag(self/memory cma/memory rc_mlx5/mlx5_0:1)
ucp_worker.c:1855 UCX  INFO    0x2037070 intra-node cfg#1 tag(sysv/memory cma/memory rc_mlx5/mlx5_0:1)
ucp_worker.c:1855 UCX  INFO    0x2037070 intra-node cfg#2 tag(sysv/memory cma/memory rc_mlx5/mlx5_0:1)
ucp_context.c:2119 UCX  INFO  Version 1.15.0 (loaded from /opt/hpcx-v2.16-gcc-mlnx_ofed-redhat9-cuda12-gdrcopy2-nccl2.18-x86_64/ucx/lib/libucp.so.0)
ucp_worker.c:1855 UCX  INFO    0x2037070 inter-node cfg#3 tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_2:1)

Workaround A: explicitly use SysV and KNEM transport

HPC-X doesn't need a workaround so far, but if you don't want to see a runtime error message, explicitly using sysv and knem is one way to go.

#!/bin/bash
#SBATCH --partition=thor
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

export UCX_LOG_LEVEL=info
export UCX_TLS=self,sysv,knem,ib
srun --mpi=pmi2 apptainer run hpcx-imb.sif alltoall

It runs successfully as expected.

#----------------------------------------------------------------
# Benchmarking Alltoall 
# #processes = 16 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.03         0.03         0.03
            1         1000         4.20         5.82         5.10
            2         1000         3.51         4.74         4.19
            4         1000         4.05         5.35         4.63
            8         1000         4.87         5.71         5.29
           16         1000         4.34         5.56         4.92
           32         1000         4.94         7.82         6.47
           64         1000         5.69         8.41         6.72
          128         1000         6.24         9.76         7.39
          256         1000         7.63         9.03         8.28
          512         1000         8.89        14.61        11.10
         1024         1000        10.38        15.64        12.82
         2048         1000        12.31        18.51        15.07
         4096         1000        19.75        27.75        22.88
         8192         1000        42.02        50.78        44.64
        16384         1000        65.80        80.80        72.45
        32768         1000       132.69       175.87       146.15
        65536          640       147.89       153.71       150.48
       131072          320       273.06       282.86       278.59
       262144          160       770.78       802.63       787.77
       524288           80      1435.77      1584.30      1539.48
      1048576           40      2828.50      2931.49      2881.55
      2097152           20      4228.95      5338.12      5182.40
      4194304           10      8364.29     10358.30     10013.75

Workaround B: exclude posix and cma and leave room for automatic transport selection

For zero-copy transfer there is another transport, xpmem, so it's better to exclude posix and cma and leave room for automatic transport selection to make sure you use the best transport for your task. This is an improved version of workaround A.

#!/bin/bash
#SBATCH --partition=thor
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

export UCX_LOG_LEVEL=info
export UCX_TLS=^'posix,cma'
srun --mpi=pmi2 apptainer run hpcx-imb.sif alltoall

It runs successfully as expected.

#----------------------------------------------------------------
# Benchmarking Alltoall 
# #processes = 16 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.03         0.04         0.03
            1         1000         3.73         5.71         4.79
            2         1000         3.37         5.21         4.36
            4         1000         3.35         4.84         4.14
            8         1000         3.58         5.55         4.62
           16         1000         3.99         5.91         5.03
           32         1000         4.96         7.41         6.38
           64         1000         5.51        12.23         9.18
          128         1000         6.38         8.96         7.61
          256         1000         7.89        11.13         9.19
          512         1000         7.93        15.19        12.40
         1024         1000         9.41        17.23        14.25
         2048         1000        13.53        17.91        15.82
         4096         1000        22.22        26.42        24.31
         8192         1000        40.75        49.02        44.47
        16384         1000        68.31        77.20        73.00
        32768         1000       134.80       191.12       163.20
        65536          640       150.61       153.65       152.39
       131072          320       273.53       281.53       278.26
       262144          160       687.33       808.74       777.42
       524288           80      1381.60      1631.20      1556.62
      1048576           40      2811.75      3019.41      2947.42
      2097152           20      4282.89      5225.62      5057.58
      4194304           10      8219.96     10075.15      9753.87

Performance impact

We showed that some intra-node transports don't work with apptainer (Apptainer without setuid). In some cases, the loss of those transports impacts performance. We show some examples of such cases here.

Posix transport vs. SysV transport

The graphs show Intel MPI Benchmark Alltoall results for posix,cma with apptainer-suid vs. sysv,cma with apptainer-suid to illustrate the performance difference between the posix and sysv transports.

UCX has a threshold for switching to zero-copy transfer, UCX_ZCOPY_THRESH, whose default value is UCX_ZCOPY_THRESH=16384; the following graphs are separated into below-threshold and above-threshold data sizes.
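
We do not reproduce the full job scripts for these runs; a plausible way to pin each configuration, following the UCX_TLS pattern used earlier in this article, is shown below (the exact transport lists are illustrative):

# posix + cma (run with apptainer-suid)
UCX_TLS=self,posix,cma,ib srun apptainer run ompi-imb.sif alltoall

# sysv + cma (run with apptainer-suid)
UCX_TLS=self,sysv,cma,ib srun apptainer run ompi-imb.sif alltoall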

2 processes per node, data size smaller than 8192 bytes

In the 2 processes per node case, sysv performed slightly better than posix. (This result takes the best result from 3 trials.)


2 processes per node, data size larger than 8192 bytes

In the 2 processes per node case, both used cma and performed almost the same, as expected. (This result takes the best result from 3 trials.)


8 processes per node, data size smaller than 8192 bytes

In the 8 processes per node case, posix and sysv performed almost the same. (This result takes the best result from 3 trials.)


8 processes per node, data size larger than 8192 bytes

In the 8 processes per node case, both used cma but unexpectedly performed slightly differently. (This result takes the best result from 3 trials.)


64 processes per node, data size smaller than 8192 bytes

In the 64 processes per node case, posix performed slightly better than sysv. (This result takes the best result from 3 trials.)


64 processes per node, data size larger than 8192 bytes

In the 64 processes per node case, both used cma and performed almost identically, as expected. (This result takes the best result from 3 trials.)


Loss of zero-copy transfer

These graphs show Intel MPI Benchmark Alltoall results for sysv,cma with apptainer-suid vs. sysv,ib with apptainer to illustrate the performance difference between "with cma" and "without cma."

2 processes per node, data size smaller than 8192 bytes

In the 2 processes per node and data size smaller than 8192 bytes case, both used sysv and showed almost identical performance, as expected. (This result takes the best result from 3 trials.)


2 processes per node, data size larger than 8192 bytes

In the 2 processes per node and data size larger than 8192 bytes case, with cma and without cma showed a huge performance difference.


64 processes per node, data size smaller than 8192 bytes

In the 64 processes per node and data size smaller than 8192 bytes case, both used sysv and were expected to show almost identical performance, but we observed a performance difference at 8192 bytes. (This result takes the best result from 3 trials.)


64 processes per node, data size larger than 8192 bytes

In the 64 processes per node and data size larger than 8192 bytes case, with cma and without cma showed a huge performance difference.


The case where there is no impact on performance

These graphs show Intel MPI Benchmark Alltoall results for sysv,knem with apptainer vs. sysv,knem,hcoll with apptainer-suid to illustrate that, when using HPC-X, there is not much difference between the two, at least for the Intel MPI Benchmark on just 4 nodes, thanks to knem zero-copy transfer.

128 processes (32 processes per node x 4 nodes), data size smaller than 8192 bytes

With 128 processes (32 processes per node x 4 nodes) and data sizes smaller than 8192 bytes, without setuid and with setuid performed almost identically.


128 processes (32 processes per node x 4 nodes), data size larger than 8192 bytes

With 128 processes (32 processes per node x 4 nodes) and data sizes larger than 8192 bytes, without setuid and with setuid performed almost identically.


Summary

If we carefully choose intra-node communication transports case by case, for each MPI library and communication framework, MPI applications with apptainer perform the same as MPI applications with apptainer-suid in some cases. In other cases, we observed a significant performance drop with apptainer. The best user experience for most HPC users would be automatic selection of intra-node communication transports that always performs equivalently to apptainer-suid. Hence, a workaround that makes it possible to use any intra-node communication transport while securely running MPI applications with apptainer (without setuid) is in demand.

The processes spawned by the MPI launcher with apptainer run under different user namespaces (see details in the next section). If the MPI launcher can spawn processes within the same user namespace, those processes can communicate using any intra-node communication transport (see details in a later section).

User namespace and mount namespace of processes spawned by MPI launcher with Apptainer

The IDs of user namespaces and mount namespaces can be checked with the file command or the lsns command. Create a file.def file with the following content.

Bootstrap: docker
From: rockylinux/rockylinux:9

%post
    dnf install -y file

%runscript
    file "$@"

and build file.sif to verify the difference between apptainer (the default mode of v1.1.x and v1.2.x) and apptainer-suid (the default mode of v1.0.x and earlier).
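
For example, assuming file.def is in the current directory:

apptainer build file.sif file.def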

Running 3 processes of file.sif using mpirun and checking the mount namespace ID for each process, we can verify that each process is using a different mount namespace.

$ mpirun -np 3 apptainer run file.sif /proc/self/ns/mnt
/proc/self/ns/mnt: symbolic link to mnt:[4026534271]
/proc/self/ns/mnt: symbolic link to mnt:[4026534274]
/proc/self/ns/mnt: symbolic link to mnt:[4026534276]

Running 3 processes of file.sif using mpirun and checking the user namespace ID for each process, we can verify that each process is using a different user namespace.

$ mpirun -np 3 apptainer run file.sif /proc/self/ns/user
/proc/self/ns/user: symbolic link to user:[4026534266]
/proc/self/ns/user: symbolic link to user:[4026534269]
/proc/self/ns/user: symbolic link to user:[4026534270]

For reference, we can check the default mount namespace ID and default user namespace ID (the namespaces that the host uses). We can verify that the above 3 processes differ from the default namespaces that the host uses.

$ file /proc/self/ns/mnt
/proc/self/ns/mnt: broken symbolic link to mnt:[4026531841]
$ file /proc/self/ns/user
/proc/self/ns/user: broken symbolic link to user:[4026531837]

Let's move on to apptainer-suid, the default mode of Apptainer v1.0.x and earlier. Running 3 processes of file.sif using mpirun and checking the mount namespace ID for each process, we can verify that each process is using a different mount namespace.

$ mpirun -np 3 apptainer run file.sif /proc/self/ns/mnt
/proc/self/ns/mnt: broken symbolic link to mnt:[4026532971]
/proc/self/ns/mnt: broken symbolic link to mnt:[4026532970]
/proc/self/ns/mnt: broken symbolic link to mnt:[4026532969]

Running 3 processes of file.sif using mpirun and checking user namespace ID for each process, we can verify every process is using the same user namespace as the host.

$ mpirun -np 3 apptainer run file.sif /proc/self/ns/user
/proc/self/ns/user: broken symbolic link to user:[4026531837]
/proc/self/ns/user: broken symbolic link to user:[4026531837]
/proc/self/ns/user: broken symbolic link to user:[4026531837]

The basic idea of a workaround

A workaround for the issue is using the Apptainer instance command. The instance command was originally implemented for running services such as web servers or databases.

The instance start command allows you to create a new named instance from an existing container image that will begin running in the background. If a startscript is defined in the container metadata, the commands in that script will be executed with the instance start command as well.

We can execute a command inside an instance using apptainer exec instance://INSTANCE_NAME COMMAND, and the user namespace and mount namespace for those processes are the same as the instance's. If the user namespace is the same among all the processes spawned by the MPI launcher on each node, the intra-node communication transports work with apptainer (without setuid). This idea was suggested by David Dykstra from Fermilab at the Apptainer Community Meeting in response to our report.

apptainer instance start hpc.sif hpc (new user namespace #1/new mount namespace #1)

mpirun -np 2 apptainer run instance://hpc IMB-MPI1 alltoall
|_ apptainer
|  |_ process 1 (user namespace #1/mount namespace #1)
|_ apptainer
   |_ process 2 (user namespace #1/mount namespace #1)

Let's verify with an example. Start the Apptainer instance with the following command.

$ apptainer instance start ys/file.sif file
INFO:    instance started successfully

Spawn processes inside the instance using the MPI launcher mpirun and check the mount namespace.

$ mpirun -np 4 apptainer run instance://file /proc/self/ns/mnt
/proc/self/ns/mnt: symbolic link to mnt:[4026534270]
/proc/self/ns/mnt: symbolic link to mnt:[4026534270]
/proc/self/ns/mnt: symbolic link to mnt:[4026534270]
/proc/self/ns/mnt: symbolic link to mnt:[4026534270]

And also check the user namespace.

$ mpirun -np 4 apptainer run instance://file /proc/self/ns/user
/proc/self/ns/user: symbolic link to user:[4026534266]
/proc/self/ns/user: symbolic link to user:[4026534266]
/proc/self/ns/user: symbolic link to user:[4026534266]
/proc/self/ns/user: symbolic link to user:[4026534266]

It works as expected. Next, let’s verify this workaround works with the intra-node communication transport that didn't work before.

The Apptainer instance workaround for intra-node communication issue with MPI applications and Apptainer without setuid

The Apptainer instance workaround is the best-performing workaround known at this time.

Single node test

This is the single-node test case. Start the instance, and then spawn processes inside the instance. This workaround works for both the posix and cma intra-node communication transports.

apptainer instance start ys/hpcx-imb.sif hpcx

UCX_LOG_LEVEL=info \
mpirun -np 16 apptainer run instance://hpcx alltoall

Multi-node test with Slurm

This is a Slurm job script for the multi-node test case. Spawn the instance on each node, then spawn processes inside the instance on each node. This workaround works for posix, cma, and hcoll.

#!/bin/bash
#SBATCH --partition=thor
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

for node in $(srun --nodes=4 --ntasks=4 --ntasks-per-node=1 hostname)
do
    ssh $node apptainer instance start ys/hpcx-imb.sif $SLURM_JOB_ID
done

UCX_LOG_LEVEL=info srun --mpi=pmi2 apptainer run instance://$SLURM_JOB_ID alltoall

srun --nodes=4 --ntasks=4 --ntasks-per-node=1 apptainer instance stop $SLURM_JOB_ID

The future plan for Apptainer v1.3.0

The above workaround works great with intra-node communication transports, but it would be even better if Apptainer had an option to apply this workaround (or an equivalent) automatically in an MPI context. https://github.com/apptainer/apptainer/issues/1583, filed by Cedric Clerget, is the RFE for this feature. Apptainer is planning to ship this feature in the Apptainer v1.3.0 release. Stay tuned!

Appendix: How we built a container for each environment

Building an Intel MPI benchmark container for OpenMPI, UCX, and Omni-Path environment

Create an ompi-ucx-psm2-imb.def file with the following content. This definition file follows the fully containerized model.

Bootstrap: docker
From: rockylinux/rockylinux:8

%post
    dnf -y install wget git gcc gcc-c++ make file gcc-gfortran bzip2 \
        dnf-plugins-core findutils librdmacm-devel epel-release unzip
    crb enable
    dnf -y group install "Development tools"
    dnf -y install hwloc slurm-pmi slurm-pmi-devel

    dnf install -y libpsm2 libpsm2-devel numactl-devel

    cd /usr/src
    git clone https://github.com/openucx/ucx.git ucx
    cd ucx
    git checkout v1.10.1
    ./autogen.sh
    mkdir build
    cd build
    ../configure --prefix=/opt/ucx
    make -j $(nproc)
    make install

    cd /usr/src
    git clone --recurse-submodules -b v4.1.5 https://github.com/open-mpi/ompi.git
    cd ompi
    ./autogen.pl
    mkdir build
    cd build
    ../configure --prefix=/opt/ompi --with-ucx=/opt/ucx \
      --with-slurm --with-pmi=/usr --with-hwloc=internal
    make -j $(nproc)
    make install

    cd /opt
    git clone https://github.com/intel/mpi-benchmarks.git
    cd mpi-benchmarks/src_c
    export PATH=/opt/ompi/bin:$PATH
    make all

%runscript
    /opt/mpi-benchmarks/src_c/IMB-MPI1 "$@"

Build ompi-imb.sif.

apptainer build ompi-imb.sif ompi-ucx-psm2-imb.def

Building the Intel MPI benchmark container for IntelMPI, libfabric, and Omni-Path environment

Create an impi-libfabric-psm2-imb.def file with the following content.

Bootstrap: docker
From: rockylinux/rockylinux:8

%post
    dnf -y install wget git gcc gcc-c++ make file gcc-gfortran bzip2 \
        dnf-plugins-core findutils librdmacm-devel epel-release
    crb enable
    dnf -y group install "Development tools"
    dnf -y install hwloc slurm-pmi slurm-pmi-devel

    dnf install -y libpsm2 libpsm2-devel numactl-devel #For PSM2
    dnf install -y libfabric #For OPX

    cat << EOS > /etc/yum.repos.d/oneAPI.repo
[oneAPI]
name=Intel® oneAPI repository
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOS
    dnf install -y intel-hpckit-2022.2.0

    source /opt/intel/oneapi/setvars.sh
    source /opt/intel/oneapi/mpi/2021.6.0/env/vars.sh

    cd /opt
    git clone https://github.com/intel/mpi-benchmarks.git
    cd mpi-benchmarks/src_c
    make all

%runscript
    source /opt/intel/oneapi/setvars.sh
    source /opt/intel/oneapi/mpi/2021.6.0/env/vars.sh
    # To use system libfabric 1.17.0 instead of Intel oneAPI bundled libfabric 1.13.2
    export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH
    /opt/mpi-benchmarks/src_c/IMB-MPI1 "$@"

Build impi-imb.sif.

apptainer build impi-imb.sif impi-libfabric-psm2-imb.def

Building Intel MPI benchmark container for HPC-X, UCX, and NVIDIA ConnectX-6 InfiniBand environment

Create an hpcx-ucx-mlx5.def file with the following content. HPC-X provides a pre-built OpenMPI that works out of the box. We rebuild OpenMPI with the Slurm PMI2 library for better integration with Slurm’s srun launcher.

Bootstrap: docker
From: rockylinux/rockylinux:9

%files
    hpcx-v2.16-gcc-mlnx_ofed-redhat9-cuda12-gdrcopy2-nccl2.18-x86_64.tbz  /opt/hpcx-v2.16-gcc-mlnx_ofed-redhat9-cuda12-gdrcopy2-nccl2.18-x86_64.tbz

%post
    dnf -y install wget git gcc gcc-c++ make file gcc-gfortran bzip2 \
        dnf-plugins-core findutils librdmacm-devel epel-release
    crb enable
    dnf -y group install "Development tools"
    dnf -y install hwloc slurm-devel
    dnf -y install binutils-devel

    # If you wish to install the following packages from the repository hosted on mellanox.com, uncomment the line below
    # dnf config-manager --add-repo https://linux.mellanox.com/public/repo/mlnx_ofed/23.04-0.5.3.3/rhel9.2/mellanox_mlnx_ofed.repo
    dnf -y install rdma-core rdma-core-devel libibverbs libibverbs-utils libibumad  librdmacm librdmacm-utils

%post
    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
    dnf -y install cuda

%post
    cd /opt
    tar xvf hpcx-v2.16-gcc-mlnx_ofed-redhat9-cuda12-gdrcopy2-nccl2.18-x86_64.tbz
    cd hpcx-v2.16-gcc-mlnx_ofed-redhat9-cuda12-gdrcopy2-nccl2.18-x86_64
    HPCX_HOME=$(pwd)
    source $HPCX_HOME/hpcx-init.sh
    hpcx_load
    cd sources
    tar zxvf openmpi-gitclone.tar.gz
    cd openmpi-gitclone
    ./configure --prefix=${HPCX_HOME}/ompi \
        --with-libevent=internal \
        --without-xpmem \
        --with-hcoll=${HPCX_HOME}/hcoll \
        --with-ucx=${HPCX_HOME}/ucx \
        --with-ucc=${HPCX_HOME}/ucc \
        --with-platform=contrib/platform/mellanox/optimized \
        --with-slurm --with-pmi=/usr \
        --with-cuda=/usr/local/cuda-12.2
    make -j $(nproc)
    make install

%runscript
    HPCX_HOME=/opt/hpcx-v2.16-gcc-mlnx_ofed-redhat9-cuda12-gdrcopy2-nccl2.18-x86_64
    source $HPCX_HOME/hpcx-init.sh
    hpcx_load
    "$@"

Build hpcx-ucx-mlx5.sif.

apptainer build hpcx-ucx-mlx5.sif hpcx-ucx-mlx5.def

Create an imb.def file with the following content.

Bootstrap: docker
From: rockylinux/rockylinux:9

%post
    dnf -y install wget git gcc gcc-c++ make file gcc-gfortran bzip2 \
        dnf-plugins-core findutils librdmacm-devel epel-release
    crb enable
    dnf -y group install "Development tools"

%post
    git clone --recurse-submodules -b v4.1.5 https://github.com/open-mpi/ompi.git
    cd ompi
    ./autogen.pl
    mkdir build
    cd build
    ../configure --prefix=/opt/openmpi/4.1.5
    make -j $(nproc)
    make install

%post
    cd /opt
    git clone https://github.com/intel/mpi-benchmarks.git
    cd mpi-benchmarks/src_c
    export PATH=/opt/openmpi/4.1.5/bin:$PATH
    make all

%runscript
    /opt/mpi-benchmarks/src_c/IMB-MPI1 "$@"

Build imb.sif.

apptainer build imb.sif imb.def

Create hpcx-imb.def with the following content. This multi-stage approach swaps in the images built above so we don't have to rebuild HPC-X and the Intel MPI Benchmark again and again.

Bootstrap: localimage
From: hpcx-ucx-mlx5.sif
Stage: mpi

Bootstrap: localimage
From: imb.sif
Stage: imb

Bootstrap: docker
From: rockylinux/rockylinux:9
Stage: runtime

%files from mpi
    /opt/hpcx-v2.16-gcc-mlnx_ofed-redhat9-cuda12-gdrcopy2-nccl2.18-x86_64 /opt/hpcx-v2.16-gcc-mlnx_ofed-redhat9-cuda12-gdrcopy2-nccl2.18-x86_64
    /usr/local/cuda-12.2 /usr/local/cuda-12.2

%files from imb
   /opt/mpi-benchmarks /opt/mpi-benchmarks

%post
    dnf -y install epel-release
    crb enable
    dnf -y install rdma-core libibverbs libibverbs-utils libibumad  librdmacm librdmacm-utils
    dnf -y install hwloc numactl slurm-devel
    #dnf -y install nvidia-driver cuda-drivers
    dnf -y clean all

%runscript
    HPCX_HOME=/opt/hpcx-v2.16-gcc-mlnx_ofed-redhat9-cuda12-gdrcopy2-nccl2.18-x86_64
    source $HPCX_HOME/hpcx-init.sh
    hpcx_load
    /opt/mpi-benchmarks/src_c/IMB-MPI1 "$@"

Build hpcx-imb.sif.

apptainer build hpcx-imb.sif hpcx-imb.def
