Job Manager for remote execution of ATK scripts

Version: 2017.0

In this tutorial you will learn how to use the Job Manager for execution of ATK jobs on remote computing clusters. In particular, you will learn how to:

  • add a remote machine to the Machine Manager;
  • use custom Machine Settings for individual jobs;
  • add several different machines and import/export machine settings.

Important

You will set up a remote machine for running jobs in parallel using MPI, as well as with threading. We strongly recommend you go through the tutorial Job Manager for local execution of ATK scripts before continuing with this one.

Since ATK 2017, Intel’s mpiexec.hydra is shipped with both the Windows and Linux versions, and it is the recommended way to run ATK in parallel. The mpiexec.hydra binary shipped with ATK is located in the folder libexec/mpiexec.hydra present in your installation folder.

Note

There are two essential requirements when using the Job Manager for executing and managing ATK jobs on a remote cluster. You need:

  1. ATK and VNL installed on the local machine and on the cluster;
  2. an SSH connection from your local machine to the cluster.

Please refer to the tutorial SSH keys if you need help setting up the SSH connection.
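
Before configuring the Job Manager, it can be useful to confirm that key-based SSH login works without a password prompt. The following is a minimal Python sketch, assuming the OpenSSH client ssh is installed on your local machine. The host name, user name, and key path are placeholders, not values from this tutorial.

```python
import subprocess


def build_ssh_check_command(hostname, username, key_path, timeout_s=10):
    """Return an ssh command that logs in non-interactively and runs `true`."""
    return [
        "ssh",
        "-i", key_path,                       # private key used for authentication
        "-o", "BatchMode=yes",                # fail instead of prompting for a password
        "-o", f"ConnectTimeout={timeout_s}",  # give up if the host cannot be reached
        f"{username}@{hostname}",
        "true",                               # remote no-op; exit code 0 means login worked
    ]


def ssh_login_works(hostname, username, key_path):
    """Run the check and return True if the key-based login succeeded."""
    command = build_ssh_check_command(hostname, username, key_path)
    try:
        return subprocess.run(command, capture_output=True).returncode == 0
    except FileNotFoundError:
        # No ssh client found on this machine.
        return False


# Placeholder values; replace them with your own cluster details.
print(" ".join(build_ssh_check_command("cluster.example.org", "alice", "/home/alice/.ssh/id_rsa")))
```

If ssh_login_works returns False for your own settings, the SSH setup should be fixed before you continue with the Machine Manager.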


A single remote machine

Open the Machine Manager and click New to add a new machine.

snap25

Note

Please note that the Machine Manager options may differ between versions of VNL.

The menu that appears has six options: Local, Remote PBS, Remote LSF, Remote SLURM, Remote Direct, and ATK On-Demand.

snap_machinemanager

Choose the type of remote machine that matches the job scheduler on your remote Linux cluster, and start setting up the connection. In this tutorial, you will choose Remote PBS to set up a machine with a PBS job scheduling system. The option Remote Direct is appropriate for clusters that have no queue system. The ATK On-Demand option is described in its own documentation.

The Machine Settings widget pops up. It has five main tabs:

  • Settings (remote connection),
  • Environment (software on the remote cluster),
  • Resources (allocated computing resources and time),
  • Notifications (job progress updates),
  • Diagnostics (check the current setting).

tabs

Hint

Most options have default values, which you should nevertheless verify. Fields marked in red are mandatory and have no default.

Settings

Connection to the remote cluster.

snap27

You need to specify the following fields:

  • Machine name
  • Hostname
  • Username
  • Use ssh key

The following options vary from cluster to cluster.

  • Private key path: the directory containing your private SSH key.
  • Node type names: optional, but in this example you have access to node type “orange”.
  • Queue names: optional, could for example be “long”.
  • Path to PBS binaries: the directory containing qsub and the other PBS executables, which must be specified.

To locate the PBS binaries, log on to the cluster and use the which command (e.g. /usr/local/torque-4.2.8/bin):

$ which qsub
/usr/local/torque-4.2.8/bin/qsub
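
The same lookup can be done from Python when run on the cluster. This is a minimal sketch using the standard-library function shutil.which. The helper name find_binary_dir is an illustrative choice and not part of ATK or PBS.

```python
import os
import shutil


def find_binary_dir(executable="qsub", search_path=None):
    """Return the directory that holds `executable`, or None if it is not found.

    `search_path` is an optional os.pathsep-separated string of directories;
    by default the current PATH is searched, as the shell's `which` does.
    """
    full_path = shutil.which(executable, path=search_path)
    if full_path is None:
        return None
    return os.path.dirname(full_path)


# Prints the directory to enter as "Path to PBS binaries",
# or None if qsub is not on the PATH of the machine running this script.
print(find_binary_dir("qsub"))
```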

Once all settings are added, you can check whether they are correct by navigating to the Diagnostics tab and clicking the “Run Diagnostics” button.

snap28

Tip

The Diagnostics tab checks if the options in the Settings and Environment tabs allow your local computer to connect to the remote cluster and execute the commands needed for job submission and management.

If some field is not specified, the diagnostics will check the default setting. If the connection to the cluster works well with that default, or is at least not disrupted, it will be marked as OK.

You therefore need to run an actual test job to make absolutely sure that all settings are indeed OK.

Environment

Computing environment on the cluster.

snap28b

This tab concerns the environment (directories, executables, modules, etc.) on the remote cluster, not on your local computer. $HOME is therefore your home directory on the cluster, and VNL-ATK must of course be installed on the cluster.

Since ATK 2017, Intel’s mpiexec.hydra is shipped with both the Windows and Linux versions, and it is the recommended way to run ATK in parallel. The mpiexec.hydra binary shipped with ATK is located in the folder libexec/mpiexec.hydra present in your installation folder.
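
To make the role of the launcher concrete, the sketch below assembles an MPI command line of the kind a job script might contain. It rests on several assumptions: the launcher is taken to be the file libexec/mpiexec.hydra inside the installation folder, the ATK interpreter is assumed to be bin/atkpython, and the installation path /opt/atk is a placeholder. The -n and -ppn options are standard Intel MPI launcher options. The command that the Job Manager actually generates may differ.

```python
import posixpath  # the remote cluster is Linux, so always join with "/"


def build_mpi_command(install_dir, script, n_processes, processes_per_node=None):
    """Return an mpiexec.hydra command line as a list of arguments."""
    # Assumed locations inside the ATK installation folder.
    launcher = posixpath.join(install_dir, "libexec", "mpiexec.hydra")
    atkpython = posixpath.join(install_dir, "bin", "atkpython")

    command = [launcher, "-n", str(n_processes)]  # total number of MPI processes
    if processes_per_node is not None:
        command += ["-ppn", str(processes_per_node)]  # MPI processes per node
    command += [atkpython, script]
    return command


# Placeholder installation folder; 16 processes spread as 8 per node.
print(" ".join(build_mpi_command("/opt/atk", "silicon.py", 16, 8)))
```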

Any scripts that should be sourced in order to get the environment working must be listed. The same goes for required export statements (may be needed to correctly set the QUANTUM_LICENSE_PATH) and cluster modules that should be loaded.
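
The three kinds of entries (sourced scripts, export statements, and modules) amount to a short shell preamble that runs before the job. The sketch below shows one way such a preamble could be composed. The ordering, the script path, the module name, and the license value are all illustrative assumptions, and the preamble that the Job Manager writes may look different.

```python
def environment_preamble(source_scripts=(), exports=None, modules=()):
    """Return the shell lines that prepare the environment on the cluster."""
    lines = []
    for script in source_scripts:
        lines.append(f"source {script}")          # scripts to be sourced
    for name, value in (exports or {}).items():
        lines.append(f"export {name}={value}")    # export statements
    for module in modules:
        lines.append(f"module load {module}")     # cluster modules to load
    return lines


# All values below are placeholders for illustration only.
preamble = environment_preamble(
    source_scripts=["$HOME/setup_env.sh"],
    exports={"QUANTUM_LICENSE_PATH": "<your license setting>", "OMP_NUM_THREADS": "1"},
    modules=["intel-mpi"],
)
print("\n".join(preamble))
```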

Resources

Computing resources requested at job submission.

../../_images/jmr_10.png

This tab specifies the default computing resources (nodes, cores, queue, time, etc.) requested at job submission. Note that MKL_DYNAMIC is disabled by default, which means that MKL is not allowed to dynamically decrease the number of threads at runtime.
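
For orientation, the sketch below shows how such a resource request is commonly written as directives in a Torque-style PBS job script. The nodes=...:ppn=... syntax is specific to Torque, and other PBS flavours use a different form. The helper name is hypothetical, and the exact script the Job Manager submits is not shown in this tutorial.

```python
def pbs_resource_directives(nodes, cores_per_node, walltime, queue=None, node_type=None):
    """Return Torque-style #PBS lines for a resource request."""
    node_spec = f"nodes={nodes}:ppn={cores_per_node}"
    if node_type:
        node_spec += f":{node_type}"  # restrict the request to a node type (property)
    lines = [
        f"#PBS -l {node_spec}",
        f"#PBS -l walltime={walltime}",  # maximum wall-clock time, HH:MM:SS
    ]
    if queue:
        lines.append(f"#PBS -q {queue}")  # target queue
    return lines


# Two nodes with 8 cores each, one hour, using the example queue and node type names.
print("\n".join(pbs_resource_directives(2, 8, "01:00:00", queue="long", node_type="orange")))
```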

Tip

For a job with only MPI parallelization (no threading), the number of nodes times the number of cores per node should equal the (total) number of MPI processes. In the example above, you have 2 x 8 = 16 cores, so you ask for 16 MPI processes and specify that each node should have 8 of those processes (only one MPI process per core).
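
The arithmetic in this tip can be written as a small helper, which is handy for checking other node and core combinations. This is only an illustration of the rule stated above; the function name is not part of ATK.

```python
def pure_mpi_layout(nodes, cores_per_node):
    """Return (total MPI processes, MPI processes per node) for an MPI-only job.

    With one MPI process per core and no threading, the total equals
    nodes * cores_per_node, and each node hosts cores_per_node processes.
    """
    if nodes < 1 or cores_per_node < 1:
        raise ValueError("nodes and cores_per_node must be positive integers")
    return nodes * cores_per_node, cores_per_node


total, per_node = pure_mpi_layout(nodes=2, cores_per_node=8)
print(f"{total} MPI processes in total, {per_node} per node")
# prints "16 MPI processes in total, 8 per node"
```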

Notifications

Job progress reports.

../../_images/jmr_12.png

The Job Manager will regularly check the job progress on the remote cluster and report it in the log file. You can also receive e-mail notifications from the PBS scheduler when the job starts, finishes, or aborts.
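
In a PBS job script, e-mail notifications are requested with the -m option (a for abort, b for begin, e for end, n for no mail) and the recipient is given with -M. The sketch below builds these two directives. The helper name and the e-mail address are placeholders, and the lines the Job Manager generates may differ.

```python
def pbs_mail_directives(address, on_begin=True, on_end=True, on_abort=True):
    """Return #PBS lines that request e-mail on the selected job events."""
    events = ""
    if on_abort:
        events += "a"  # mail when the job is aborted
    if on_begin:
        events += "b"  # mail when the job begins
    if on_end:
        events += "e"  # mail when the job ends
    if not events:
        return ["#PBS -m n"]  # no mail at all
    return [f"#PBS -m {events}", f"#PBS -M {address}"]


print("\n".join(pbs_mail_directives("user@example.org")))
```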

Diagnostics

Use this tab to test the machine settings. A green check mark indicates that all settings appear to be OK. A red circle indicates a problem that should be fixed in one of the tabs.

snap28

Save and test the new machine

Click OK to add the machine to the Machine Manager.

../../_images/jmr_14.png

Next, you should run the test scripts mpi_check.py and test_mpi.py to test the machine settings. In the VNL main window, drag and drop a script onto the Job Manager. Choose the newly added machine (in this example “Sabalcore”), and click OK.

../../_images/jmr_select_mahine.png

Click the play icon to submit the job and watch how the Task State changes from Pending to Finished. For a longer job, you can click Download log to regularly retrieve the log file during the remote job execution.

../../_images/test1.png

Once a job has finished, you should inspect the job output in the log file to see whether the expected number of processes was used. Click the log icon to open the log. In the two examples below, 16 cores were used in total on two different nodes.

../../_images/test1_log.png

snap29

The job outputs have of course also appeared on the LabFloor:

snap30

Custom job settings

As explained in the tutorial Job Manager for local execution of ATK scripts, you can customize many of the job settings before submitting a job. Use the script silicon.py as an example ATK script. Download the script, send it to the Job Manager, and choose the newly created remote machine. Then click the Job Settings plug-in.

You can now customize the Resources and Notifications tabs, and thereby submit the job with settings different from the defaults you specified above. For example, you can change the number of requested cores, the cluster queue, and the maximum wall-clock time. You can also change the notification settings.

Set up the job settings as you like, e.g. 4 cores on a single node and 4 MPI processes. Then click OK and submit the calculation to the remote cluster.

The job log is automatically retrieved when the job finishes. For long jobs, however, you need to click “Download log” if you want to see the log while the job is running.

../../_images/run_silicon_finished_ok.png

Debugging

If an error occurs during the execution of the job, this will be indicated by a red square in the queue, as shown below. You can then click the Debug logs icon to open the job debug information window, which will show you details about the error.

../../_images/run_silicon_error.png ../../_images/run_silicon_error_mensa.png

Adding several remote machines

Several remote machines can be added to the Machine Manager. You can of course add new machines from scratch, but you can also export/import the settings of an existing machine and use those as a template for new machines.

Tip

The import/export functionality is very convenient for sharing machine settings within a group of researchers.

In the following, you will rename and export the settings of the newly created machine, and then add one more remote machine suitable for threaded calculations.

Rename and export

In the Job Manager, remove the jobs that are in the queue of the newly added machine (named “Sabalcore” in our example).

Click Machine Settings and choose to Edit the machine.

snap33

Note

You always need to empty the queue of a machine before you can edit its settings.

This machine is already set up for MPI parallelization. Therefore, append “: MPI” to the machine name, click Export, and save the settings in a file.

../../_images/jmr_export.png

Click OK to accept the renaming and return to the Machine Manager window.

../../_images/jmr_machine_manager_mpi.png

A machine for threaded jobs

Next, add a new remote machine with default settings for “Remote PBS”. Then click Import and import the settings file you just saved.

You can now modify the settings to create a machine with default settings that are suitable for a threaded calculation:

  • In the Environment tab, remove the OMP_NUM_THREADS=1 export statement.
  • In the Resources tab, use only a single MPI process but enable dynamic scheduling of MKL threads.
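
The difference between the two machines can be summarized in a few lines. The sketch below encodes only what this tutorial states: the MPI machine uses one MPI process per core with OMP_NUM_THREADS=1 exported and dynamic MKL threading disabled, while the threading machine uses a single MPI process, no OMP_NUM_THREADS export, and dynamic MKL threading enabled. The function and dictionary keys are illustrative names, not Job Manager settings.

```python
def parallel_defaults(mode, cores):
    """Return default parallel settings for an "mpi" or "threading" machine."""
    if mode == "mpi":
        return {
            "mpi_processes": cores,                # one MPI process per core
            "exports": {"OMP_NUM_THREADS": "1"},   # keep each process single-threaded
            "mkl_dynamic": False,                  # MKL may not adjust thread counts
        }
    if mode == "threading":
        return {
            "mpi_processes": 1,                    # a single MPI process
            "exports": {},                         # OMP_NUM_THREADS export removed
            "mkl_dynamic": True,                   # MKL schedules threads dynamically
        }
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("mpi", "threading"):
    print(mode, parallel_defaults(mode, cores=16))
```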

snap38

snap39

Give the new machine a reasonable name, e.g. “Sabalcore: Threading”, and click OK to add the machine to the Machine Manager. This machine should be convenient for submitting ATK-ForceField calculations.

../../_images/jmr_machine_manager_threading.png