CIQ

Remote Direct Memory Access, RDMA

January 19, 2024

RDMA is one of the most important technologies underlying HPC, and it is a capability that the HPC interconnect provides. RDMA allows compute nodes to move the data involved in a given computational job directly between each other's memory, without involving the OS running on those nodes. In essence, a portion of RAM on each machine is registered with its NIC so that RDMA operations can read and write that memory without the operating system taking part in each transfer. HPC interconnects provide these RDMA capabilities in addition to raw speed. This allows, for example, a massive MPI-based computational job to pass data between its nodes without running a very large amount of OS-level networking code, which greatly speeds up the application.
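As a rough illustration of the idea above, the following Python sketch models a memory region that is registered once and then written to repeatedly without any further per-transfer OS involvement. It is a toy, in-process model and not a real RDMA API: ToyNIC, register_memory, remote_write, and the os_calls counter are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNIC:
    """Toy stand-in for an RDMA-capable NIC (hypothetical, not a real verbs API)."""
    regions: dict = field(default_factory=dict)  # key -> registered buffer
    os_calls: int = 0  # how many times the (simulated) OS was involved

    def register_memory(self, buf: bytearray) -> int:
        """Register a buffer once, up front; return a key a peer can use."""
        self.os_calls += 1  # assumption: setup goes through the OS exactly once
        key = len(self.regions) + 1
        self.regions[key] = buf
        return key

    def remote_write(self, key: int, offset: int, data: bytes) -> None:
        """One-sided write into a registered region; the OS counter is untouched."""
        buf = self.regions[key]  # an unknown key raises KeyError
        if offset < 0 or offset + len(data) > len(buf):
            raise ValueError("write falls outside the registered region")
        buf[offset:offset + len(data)] = data


# The target node registers 16 bytes of its RAM with its NIC.
target_nic = ToyNIC()
target_ram = bytearray(16)
key = target_nic.register_memory(target_ram)

# A peer then performs four writes directly into that memory.
for i in range(4):
    target_nic.remote_write(key, i * 4, b"data")

print(bytes(target_ram), target_nic.os_calls)
```

In this sketch the simulated OS is involved only during registration, however many writes follow. Real RDMA stacks involve considerably more setup than this model shows.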

Even with a high-speed interconnect, RDMA is often the component that makes HPC levels of speed possible. Without it, the operating system on each compute node has to process every network request, and it will inevitably bog down long before the full speed of the network is reached.
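The bottleneck described above can be sketched with simple arithmetic: throughput is capped either by the link or by how quickly a core can process messages, whichever is lower. The function name and all numbers below are illustrative assumptions, not measurements of any real system.

```python
def achievable_gbps(link_gbps: float, msg_bytes: int, cpu_ns_per_msg: float) -> float:
    """Return the lower of the link rate and the per-message processing rate.

    One bit per nanosecond equals one Gbit/s, so bits / ns gives Gbit/s directly.
    """
    cpu_bound_gbps = msg_bytes * 8 / cpu_ns_per_msg
    return min(link_gbps, cpu_bound_gbps)


# Hypothetical inputs: a 100 Gbit/s link carrying 4096-byte messages.
LINK_GBPS = 100.0
MSG_BYTES = 4096

# Assumed per-message software cost when the OS handles every message.
os_path = achievable_gbps(LINK_GBPS, MSG_BYTES, cpu_ns_per_msg=2000.0)

# Assumed much smaller per-message cost when the NIC moves the data itself.
rdma_path = achievable_gbps(LINK_GBPS, MSG_BYTES, cpu_ns_per_msg=100.0)

print(f"OS path:   {os_path:.1f} Gbit/s")
print(f"RDMA path: {rdma_path:.1f} Gbit/s")
```

Under these assumed costs, the OS-mediated path is limited by message processing well below the link rate, while the lower-overhead path is limited only by the link.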