By Martijn de Vries | May 14, 2020 | Bright 9.0
Typically, when an organization extends an on-premises cluster into Amazon Web Services (AWS) or Microsoft Azure, the storage mounted on the on-premises nodes is not available on the cloud nodes. With a very fast connection you could simply mount your on-premises storage on the cloud nodes, and all data would be available on both sides. Without such a link, however, the input data for a job has to be moved to the cloud, and the output data eventually has to come back to the on-premises cluster.
In an ideal world, you would hide this activity from the end-users of a cluster. So, a few versions ago, Bright created the 'cmsub' command to handle it. cmsub worked nicely, but it had a few shortcomings. For example, if multiple jobs operated on the same input data, the data would be transferred from on-premises to the cloud multiple times. And in a complex workflow of jobs, where the output data of one job was the input data of another, the data would bounce back and forth between on-premises and the cloud.
In Bright Cluster Manager 9.0, it is now possible to label both input and output datasets so that they can be reused. If you are running multiple jobs that need the same input data, refer to the label in your job description and Bright Cluster Manager will stage the data once, making it available to the cloud nodes involved in running the jobs. Similarly, in complex workflows, jobs can use labels to refer to the output data of another job (which may not have run yet).
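The exact directive syntax is documented in the user guide; purely as an illustration, a workload-manager job script that attaches labeled datasets might look like the sketch below. The directive names and paths here are hypothetical placeholders, not the documented flags:

```shell
#!/bin/bash
#SBATCH -J align-job
# Hypothetical cmjob-style directives (illustrative only, not the real syntax):
# attach a labeled input dataset, so a second job using the same label
# does not trigger another on-prem-to-cloud transfer.
#CMSUB --input-dataset=genome-ref
# label the output so downstream jobs can refer to it by name
#CMSUB --output-dataset=aligned-reads

./align --reference /data/genome-ref --out /data/aligned-reads
```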
As a result, in a pipeline of jobs, only the initial input data and the final output data (of the last job in the chain) will be transferred between on-premises and the cloud; interim results remain in the cloud. Input data can be staged on storage nodes (which can be spun up dynamically on demand), on Amazon FSx for Lustre instances (AWS's Lustre-as-a-service offering), or on Azure NetApp Files (ANF) volumes (Azure's high-performance NFS offering).
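To make the pipeline idea concrete, here is a sketch of two chained jobs, again using hypothetical directive names (see the user guide for the actual syntax). The second job declares the first job's labeled output as its input, so the intermediate dataset never leaves the cloud:

```shell
# job-a.sh: stages 'raw-data' from on-premises once, labels its result
#CMSUB --input-dataset=raw-data
#CMSUB --output-dataset=intermediate
./step1

# job-b.sh: consumes 'intermediate' directly in the cloud; only this
# job's final output is transferred back to the on-premises cluster
#CMSUB --input-dataset=intermediate
#CMSUB --output-dataset=final-results
./step2
```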
The new cmjob command is an enhanced replacement for cmsub: it adds features such as managing labeled datasets and controlling ANF volumes and FSx for Lustre instances.
I invite you to read more about running jobs on cluster extension cloud nodes with cmjob in section 4.7 of our user guide: https://support.brightcomputing.com/manuals/9.0/user-manual.pdf#section.4.7.
If you would like further information, or if you have any questions, please contact us.