Introduction
- We present a hybrid architecture composed of an HPC cluster and a Cloud cluster, where container orchestration on the HPC cluster can be performed by the container orchestrator (i.e. Kubernetes) located in the Cloud cluster. Little modification is required on the HPC systems;
- We propose a dual-level scheduling scheme for container jobs on HPC systems;
- We describe the implementation of a tool named Torque-Operator, which bridges TORQUE and Kubernetes.
Background
Workload managers for HPC
TORQUE
Users submit jobs with the command `qsub`. The job is normally written as a PBS (Portable Batch System) script. A job ID is returned to the user as the standard output of `qsub`.
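As a minimal illustration (the script name and its contents are hypothetical), a PBS job script and its submission could look as follows:

```bash
# hello.pbs: a hypothetical minimal PBS job script
# requests one core on one node for five minutes of walltime
#PBS -N hello_job
#PBS -l nodes=1:ppn=1,walltime=00:05:00
echo "Hello from $(hostname)"
```

Submitting the script with `qsub hello.pbs` prints the assigned job ID (e.g. `1234.master`) on standard output, and `qstat` can then be used to follow its state.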
Slurm
Implementation of AI in HPC
Containerisation
- Run with user privileges and need no daemon process. Acquisition of root permission is only necessary when building images, which can be performed on users' own working machines (see the example after this list);
- Seamless integration with HPC. Singularity natively supports GPU, Message Passing Interface (MPI) [29] and InfiniBand. In contrast with Docker, it demands no additional network configuration;
- Portable via a single image file. By contrast, Docker images are built up from layers of files.
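A minimal sketch of this workflow (the definition file name is hypothetical; `lolcow.sif` is the image used later in the paper):

```bash
# Build the image where root privileges are available, e.g. on the user's own workstation
# (lolcow.def is a hypothetical Singularity definition file).
sudo singularity build lolcow.sif lolcow.def

# Copy the single .sif file to the HPC system and execute it with plain user privileges;
# no daemon is required, and --nv would additionally expose the host GPU driver stack.
singularity run lolcow.sif
```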
Container orchestration for cloud clusters
- Resource limit control. This feature reserves CPU and memory for a container, which restrains interference among containers and provides information for scheduling decisions (see the sketch after this list);
- Scheduling. Determines the policies on how to optimise the placement of containers on specific nodes;
- Load balancing. Distributes the load among container instances;
- Health check. Verifies whether a faulty container needs to be destroyed or replaced;
- Fault tolerance. Creates containers automatically if applications or nodes fail;
- Auto-scaling. Adds or removes containers automatically.
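As an illustration of the first point, a Kubernetes pod specification can declare resource requests and limits that the scheduler and runtime enforce; the names and values below are purely illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod                # hypothetical pod name
spec:
  containers:
  - name: demo
    image: ubuntu:18.04
    resources:
      requests:                 # reserved capacity used for scheduling decisions
        cpu: "1"
        memory: 512Mi
      limits:                   # hard caps that restrain interference among containers
        cpu: "2"
        memory: 1Gi
```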
Related work
Architecture and tool description
 | HPC workload manager | Container orchestrator |
---|---|---|
Deployment | Batch queue (queueing time from seconds to days) | Often immediate |
Workload type | Binary | Container, pod |
Resource unit | Bare-metal nodes | Pods, VM nodes |
Application execution length | Run to completion | Continuously running |
Application specifics | Distributed memory jobs (e.g. MPI jobs) | Cloud microservices |
DevOps environment provision | No | Yes |
Architecture and the testbed setting
- Provide a unified interface for users to access the Cloud and HPC clusters. Jobs are submitted in the form of yaml files (see the commands after this list);
- Except for Singularity, no additional software packages need to be installed on the HPC system. Hence it has little impact on the working environment of existing HPC users;
- Users have the flexibility to run containerised and non-containerised applications (on the HPC side);
- The performance of compute-intensive or data-intensive applications can be significantly improved via execution on the HPC cluster;
- Containers scheduled by Kubernetes to the HPC cluster can take advantage of the container scheduling strategies of Kubernetes, an area in which TORQUE is less efficient.
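For example, from the unified interface a user might submit and follow a job as sketched below (`cow_job.yaml` is the example file discussed later with Fig. 7; the exact submission command may differ):

```bash
kubectl apply -f cow_job.yaml   # submit the containerised job through Kubernetes
kubectl get torquejob           # status of the jobs forwarded to the TORQUE cluster
kubectl get pods                # status of the pods created for the jobs
```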
Architecture of Torque-Operator
- generate the virtual node;
- fetch the queue information of the TORQUE cluster and add the information to the virtual node as its node label. A node label indicates the restrictions of a node; jobs that meet these restrictions can be scheduled to the node. For example, the label of the virtual node (wlm.sylabs.io/nodes=2 in Fig. 6) denotes that the TORQUE cluster contains 2 compute nodes, thus a Kubernetes job which requests more than 2 compute nodes will remain in a pending status;
- launch TORQUE jobs to the Kubernetes cluster by means of pods;
- transfer the TORQUE jobs to the TORQUE cluster and return the results obtained on the TORQUE cluster to the directory specified in the yaml file submitted by users, as indicated in Fig. 7 Line 20 (a sketch of such a yaml file is given after this list).
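As a rough sketch of what such a yaml job description might look like: the `TorqueJob` kind follows from the `kubectl get torquejob` command discussed below, while the API group and field names are assumptions, and the embedded PBS script only mirrors the description of Fig. 7 rather than its exact contents:

```yaml
# cow_job.yaml (illustrative sketch only; field names are assumptions)
apiVersion: wlm.sylabs.io/v1alpha1      # assumed API group, cf. the wlm.sylabs.io node label
kind: TorqueJob
metadata:
  name: cow-job
spec:
  batch: |                              # PBS script forwarded to the TORQUE cluster
    # requests 30 minutes of walltime on one compute node
    #PBS -l walltime=00:30:00,nodes=1
    #PBS -o low.out
    #PBS -e low.err
    # make the Singularity executable in /usr/local/bin visible
    export PATH=/usr/local/bin:$PATH
    singularity run $HOME/lolcow.sif
  resultsDir: /home/user/results        # directory to which the results are returned
```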
Besides using `kubectl get pods` to display the pod information, users can perform `kubectl get torquejob` to show the status of jobs submitted to TORQUE. Users can also view the status of TORQUE jobs using the PBS command `qstat` on the login node (marked by the red dashed line in Fig. 4).

The example job `cow_job.yaml` is illustrated in Fig. 7. The PBS job is included from Line 7 to Line 14. More precisely, the job requests 30 minutes of walltime and one compute node. It executes a Singularity image called `lolcow.sif` located in the `home` directory, which is pulled from the Sylabs registry. Its results and error messages will be written to `low.out` and `low.err`, respectively. Line 12 sets the necessary `PATH` environment variable to find the Singularity executable located in `/usr/local/bin` on the HPC system.
The job script is then passed to `qsub`, which submits it to the TORQUE cluster. When the job completes, Torque-Operator creates a Kubernetes pod which redirects the results to the directory that the user specifies.

User permission management
Use cases
Use cases | Software packages and dependencies |
---|---|
Pilot Wheat Ear | PyTorch, Python, libjpeg, libpng-dev, libnccl2, libibverbs, libnuma, librdmacm, libmlx4, libmlx5
Pilot Soybean Farming | Python, Numpy, Pandas, gdal, scipy, scikit-learn, openpyxl, xlrd |
BPMF | Open MPI |
The keyword `%environment` starts the section that defines the environment variables to be set at runtime (from Line 5 to Line 8). Commands in the `%post` section (Line 10) are executed after the base OS (i.e. Ubuntu 18.04) has been installed at build time; more specifically, from Line 11 onwards, all the necessary software packages and dependencies are installed.
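A sketch of a definition file with the structure described above could look as follows (the package list is abridged from the use-case table above, and the exact contents differ per pilot):

```
# Illustrative Singularity definition file; the real file lists the pilot-specific packages.
Bootstrap: docker
From: ubuntu:18.04

%environment
    # variables set at runtime
    export LC_ALL=C
    export PATH=/usr/local/bin:$PATH

%post
    # executed at build time, after the base OS has been installed
    apt-get update -y
    apt-get install -y python3 python3-pip libjpeg-dev libpng-dev
    pip3 install numpy pandas scikit-learn
```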
Pilot and benchmark description
Names | Description | ML applications |
---|---|---|
Pilot Wheat Ear | Provision of autonomous robotic systems within arable frameworks | Yes |
Pilot Soybean Farming | Yield prediction for soybean farming | Yes |
BPMF | MPI benchmark: predicts compound-on-protein activity | Yes |
- The MPI version used to compile the application in the container must be compatible with the MPI version available on the host;
- Users must know where the host MPI is installed;
- Users must ensure that the directory where the host MPI is installed can be bind-mounted into the container (see the illustrative invocation after this list).
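A hypothetical invocation of the containerised BPMF benchmark illustrating these constraints (paths, image name and executable name are assumptions):

```bash
# The host Open MPI location must be known, bind-mounted into the container,
# and version-compatible with the MPI the image was built against.
HOST_MPI=/opt/openmpi
mpirun -np 16 singularity exec --bind "$HOST_MPI:$HOST_MPI" bpmf.sif bpmf
```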
Performance evaluation
Cluster name | HPC cluster (bare-metal) | VM cluster |
---|---|---|
Total number of nodes | 3 (2 compute nodes) | 4 (3 worker nodes) |
Number of cores per node | 20 (10 cores per CPU, 2 CPUs per node) | 2 |
RAM per node (NUMA) | 128 GB | 7.79 GB |
CPU | Intel(R) Xeon(R) CPU E5-2630 v4, 2.20 GHz | Intel i7 9xx (Nehalem Core i7, IBRS), 2.79 GHz |
Operating System | Ubuntu 18.04.3 LTS | Ubuntu 18.04.3 LTS |
VM type | - | QEMU 2.11.1, KVM 2.11.1 |
Cluster types | Cloud cluster | HPC cluster |
---|---|---|
Orchestrator | Kubernetes | TORQUE |
Container runtime & interface | Singularity, Singularity-CRI | Singularity |
Plugin | Torque-Operator, Romana | - |
Compiler | Golang compiler | Golang compiler |
Cores (on 1 node) | Pilot Wheat Ear | Pilot Soybean Farming | BPMF |
---|---|---|---|
Cloud (VM) 1 core | - | 28.79 (1.87) | - |
Cloud (VM) 2 cores | 1337.13 (0.00) | - | 41.78 (0.03) |
HPC (bare-metal) 1 core | - | 14.84 (0.02) | - |
HPC (bare-metal) 2 cores | 480.50 (4.97) | - | 25.82 (0.01) |
HPC (bare-metal) 4 cores | 282.39 (0.28) | - | 13.90 (0.00) |
HPC (bare-metal) 8 cores | 179.54 (1.02) | - | 7.99 (0.00) |
HPC (bare-metal) 16 cores | 144.76 (0.12) | - | 5.03 (0.00) |
HPC (bare-metal) 20 cores | 134.99 (0.55) | - | 4.84 (0.01) |
Discussion
Data transfer between the Cloud cluster and the HPC cluster is performed via `ssh`, although this could cause performance degradation when there is a large amount of data to transmit.