Containers & HPC

The benefits of containerization are pretty clear these days: achieving process isolation, avoiding dependency hell, and having the same environment locally as in production. This blog post gives an overview of some of the containerization solutions available today.

Docker

Ever since Docker popularized containers, there have been many concerns regarding its security. Docker manages containers through a daemon that runs as root, and the Docker command-line binary talks to this daemon using a REST API. Regular users can become root inside a container and, for example, mount paths from the host that they normally don't have access to. For some sysadmins, this has been reason enough to not allow Docker on their clusters.
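
As an illustration, any user in the docker group can bind-mount the host's root filesystem and operate on it as root (the image name here is just an example):

# Bind-mount the host root filesystem into the container and chroot into it;
# the user now effectively has a root shell on the host.
docker run --rm -it -v /:/host ubuntu chroot /host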

Singularity

Singularity is a container solution designed specifically with HPC in mind. It integrates nicely into existing workflows and requires almost no extra effort from the HPC user. For example, when using Slurm with the Singularity plugin, one extra header in the batch script is enough to make the entire batch job run inside a given image:

#SBATCH --singularity-image=/path/to/my/image

Images can be created as directories (as in the above example) or as a single .img file, which makes them convenient to copy around. The actual program or script that does the work can also be included in the image, but this is not mandatory. The fun thing about the single-file approach is that the image can be made executable and treated as a regular executable (./my-singularity-container.img), with all dependencies self-contained.
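
Roughly, building such a single-file image from an existing Docker image and running it looks like this (the image names are just examples):

# Build a single-file image from a Docker Hub image (sudo may not be needed,
# depending on the Singularity version and image source)
sudo singularity build my-singularity-container.img docker://python:3.7-slim

# Run a command inside the container as a regular user
singularity exec my-singularity-container.img python --version

# Or execute the image directly, which runs its runscript
./my-singularity-container.img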

Container security in Singularity

Singularity executes images directly, without an intermediate daemon, and uses SUID permissions on its executable. Privilege escalation is only used in the parts of the code that really need it. Before any user code is loaded into memory, all privileges are dropped again, and kernel flags such as PR_SET_NO_NEW_PRIVS (available since Linux 3.5) are set so that child processes cannot gain extra privileges.
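
As a quick sanity check (assuming a kernel new enough, 4.10 or later, to expose this field in /proc), you can verify from inside a container that the no-new-privileges flag is set:

# The NoNewPrivs flag should be reported as 1 for the container process
singularity exec my-singularity-container.img grep NoNewPrivs /proc/self/status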

Kubernetes

Kubernetes running on top of Docker inherits its security concerns. Users don’t have to be in the docker group themselves, but if they can spawn a Pod and mount host paths, they can become root outside of the container through the paths they mounted.
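
A minimal sketch of such a Pod (all names here are examples):

# Anyone allowed to create this Pod effectively gets root access to the node's filesystem
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: host-mount-example
spec:
  containers:
  - name: shell
    image: ubuntu
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: host-root
      mountPath: /host
  volumes:
  - name: host-root
    hostPath:
      path: /
EOF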

Container security in Kubernetes

Using Kubernetes’ Pod Security Policies feature, you can disallow privileged Pods, restrict user-id ranges and namespaces, control the allowed types and paths of mounts, and more. For finer control, it is possible to use an admission webhook or reach for tools like Gatekeeper (a policy controller for Kubernetes). Outside of Kubernetes itself, there are also projects like gVisor that improve the situation by adding sandboxing at the container-runtime level. This does imply some performance loss, most notably for I/O.
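
A minimal sketch of such a policy (the field values are just examples):

cat <<EOF | kubectl apply -f -
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false                # no privileged Pods
  allowPrivilegeEscalation: false
  runAsUser:
    rule: MustRunAsNonRoot         # restrict which user ids are allowed
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:                         # control which volume types may be mounted
  - configMap
  - emptyDir
  - persistentVolumeClaim
EOF

Note that such a policy only takes effect once the PodSecurityPolicy admission controller is enabled and the policy is granted to users via RBAC.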

HPC != cloud-native?

Kubernetes is typically considered a good fit for cloud-native applications, which usually means the applications have some coupling with cloud infrastructure (such as that provided by Amazon, Google, or Microsoft). The rise in popularity of service meshes for Kubernetes also illustrates a difference in focus: sacrificing network performance for features such as traceability.

A pipeline working on streaming data, or Pods spawning Jupyter Lab notebooks with GPUs on a pool of nodes for data scientists doing exploratory work, are examples where some overhead is acceptable. For HPC workloads where overhead is not acceptable, and where fast networking, accelerators, and so on are important, Kubernetes offers the Device Plugin framework. This way you can still use InfiniBand, FPGAs, and NVIDIA/AMD GPUs.
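
For example, with the NVIDIA device plugin deployed on the cluster, a Pod can request a GPU as an extended resource (a minimal sketch; the image and names are examples):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.1-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # scheduled onto a node with a free GPU
EOF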

Regarding scheduling, there are also efforts like kube-batch that bring HPC batch scheduling to Kubernetes. Something to keep in mind for Kubernetes clusters is that Kubernetes supports a maximum of 5,000 nodes, 150,000 total pods, 300,000 total containers, and 110 pods per node. At KubeCon Barcelona 2019 it was announced that the focus will be on Scalability (along with Extensibility and Reliability), so these limits will hopefully be raised in the near future.

Singularity CRI (Container Runtime Interface)

Singularity has been fully OCI (Open Containers Initiative) compliant since version 3.1.0, which was released last February. Singularity can integrate with Kubernetes through Singularity-CRI. This requires running a service on each Kubernetes node and reconfiguring the kubelet to use Singularity as the container runtime instead of the default, Docker. The Docker daemon is then no longer needed, which arguably makes Kubernetes more suitable for HPC.
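
In practice this boils down to running the Singularity-CRI service on every node and pointing the kubelet at its socket, roughly like this (the socket path below is an assumption; check the Singularity-CRI documentation for the exact value):

# Start the kubelet with a remote container runtime instead of the Docker shim
kubelet --container-runtime=remote \
        --container-runtime-endpoint=unix:///var/run/singularity.sock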

Conclusion

Docker is not the best choice of container runtime for HPC workloads. Singularity is easy to use with existing workload managers and scripts, making it a good place to start when containerizing HPC workloads. Moving from Singularity to Kubernetes later is also seamless thanks to Singularity-CRI.

There are already organizations running HPC workloads on Kubernetes today. Whether it makes sense to jump to Kubernetes for scheduling will differ per organization. Bright Cluster Manager supports Kubernetes and Singularity, and will also support Singularity-CRI in Bright Cluster Manager 9.0.