Job control for parallel and remote execution

The following topics are discussed:

Installing job control files
MPI configuration
Phoenix Model job status

Users have the option to execute Phoenix NLME jobs remotely using the NLME Job Control System (JCS) or Phoenix JMS. The JCS option requires the following software:

GCC

R (batchtools, XML, reshape, Certara.NLME8; a quick availability check is sketched after this list)

ssh

MPI (Open MPI for Linux platforms, MPICH for Windows)

and supports the following parallelization:

MPI for within-job parallelization

Linux Grid (TORQUE, SGE, LSF)* or MultiCore for between-job parallelization

*TORQUE = Terascale Open-source Resource and QUEue Manager, SGE = Sun Grid Engine, LSF = Platform Load Sharing Facility.
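For the R requirements above, a quick availability check on a given host is sketched below; this is a minimal sketch using base R only and is not part of the Phoenix installer or the Certara.NLME8 package.

# Minimal sketch: check that the R packages required by JCS are installed.
# The package names are taken from the requirements list above.
required <- c("batchtools", "XML", "reshape", "Certara.NLME8")
missing <- required[!vapply(required, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) {
  message("Missing packages: ", paste(missing, collapse = ", "))
} else {
  message("All JCS R package requirements are satisfied.")
}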

This section focuses on the Job Control Setup; for information on JMS, refer to “Job Management System (JMS)”. Phoenix NLME jobs can be executed on a number of different platform setups, enabling the program to take full advantage of available computing resources. All of the run modes can be executed locally as well as remotely.

One NLME job can be executed either locally or remotely; example configuration profiles for each are shown below:

Single_Local_Config.png 

Single_Remote_Config.png 

For Windows platforms, the default profile with MPI parallelization is as follows:

Parallel_MPI_Windows_Config.png 

For Linux remote runs, an example profile for MPI parallelization is as follows:

Parallel_MPI_Linux_Config.png 

Execution can be parallelized at the job level for the following run modes:

Simple (Sorted datasets)
Scenarios
Bootstrap
Stepwise Covariate search
Shotgun Covariate search
Profile

The implemented methods of “by job” parallelization are MultiCore (local or remote) and submission to a Linux grid (TORQUE, SGE, or LSF), with or without MPI.

An example of Windows local configuration is as follows:

Parallel_Multicore_Windows_Config.png 

An example of Linux remote configuration is as follows:

Parallel_Multicore_Linux_Config.png 

An example profile for submission to the TORQUE grid is as follows:

TORQUE_Linux_Config.png 

Caution: In some grid configurations, if the number of cores specified for a grid exceeds the total number of cores actually available, the job can remain in the queue. If the job cannot be canceled from within Phoenix, a direct cancellation through ssh is required. Take particular care with burstable grids, where additional resources (slots) can be requested but not used. Periodic monitoring of the running jobs for the current user is recommended.
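As a hedged illustration of the monitoring and cancellation mentioned in the caution above, the following R sketch issues TORQUE/PBS commands over ssh; the host name, user name, and job id are placeholders, and the corresponding commands differ for SGE and LSF grids.

# Illustrative sketch: monitor and cancel grid jobs over ssh from R.
# "user@headnode" and the job id 12345 are placeholders for your own grid settings.
system2("ssh", c("user@headnode", "qstat -u user"))   # list the current user's queued/running jobs
system2("ssh", c("user@headnode", "qdel 12345"))      # cancel a job that is stuck in the queue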

The NLME jobs submitted to the grid can be parallelized using MPI if the system has the appropriate MPI service installed and the Parallel mode is set to one of the three *_MPI options (LSF_MPI, SGE_MPI, or TORQUE_MPI), which parallelize the runs by job as well as by sample/subject within each job.

For any of the *_MPI modes, the number of cores to be used for each job is calculated as the smaller of the following two numbers:

(1) the number of cores in the configuration divided by the number of jobs, or

(2) the number of unique subjects in a specific job divided by 3. If the number of unique subjects differs between replicates, the smallest number of subjects is used for the calculation.

Example 1: There are 300 cores available according to the configuration profile, 4 jobs requested (replicates), and 200 subjects in each replicate. Each of the 4 replicates would parallelize across 66 cores (300/4 = 75; 200/3 ≈ 66; 66 < 75). Total cores used = 264.

Example 2: There are 100 cores available according to the configuration profile, 3 jobs requested (replicates), and 300 subjects in each replicate. Each of the 3 replicates would parallelize across 33 cores (100/3 ≈ 33; 300/3 = 100; 33 < 100). Total cores used = 99.
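The calculation above can be expressed as a small helper; the following is an illustrative R sketch (not part of the Phoenix or Certara.NLME8 API) that reproduces Examples 1 and 2.

# Sketch of the per-job MPI core calculation described above:
# cores per job = min(floor(total cores / number of jobs), floor(smallest subject count / 3))
cores_per_job <- function(total_cores, n_jobs, subjects_per_replicate) {
  min(floor(total_cores / n_jobs), floor(min(subjects_per_replicate) / 3))
}
cores_per_job(300, 4, 200)   # Example 1: 66 cores per replicate (4 * 66 = 264 in total)
cores_per_job(100, 3, 300)   # Example 2: 33 cores per replicate (3 * 33 = 99 in total)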

An example of an LSF configuration profile is as follows:

LSF_Linux_Config.png 

Caution: For some grid configurations, the number of MPI cores calculated for a particular job can exceed the total number of hosts available on the grid. This causes the software to ask for more hosts than are available for the computation and can result in the job freezing or exiting with an error. In such cases, it is advised to switch to the grid mode without MPI.

Installing job control files

Additional software and libraries are required for certain platform setups.

For within-job parallelization on the local host, the MPICH2 1.4.1 software package is required; it is installed during a complete installation of Phoenix or can be selected during a custom Phoenix installation. This application facilitates message-passing for distributed-memory applications used in parallel computing. (If needed, the mpich2-1.4.1p1-win-x86-64.msi file is located in <Phoenix_install_dir>\Redistributables.) MPICH2 needs to be installed on all Windows machines included in the MPI-ring or used to submit a job to the MPI-ring. If you have another MPICH service running, you must disable it. Currently, MPI is only supported on Windows.
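If you need to confirm which MPI launcher will be picked up before running a local parallel job, a simple check from R is sketched below; it only assumes that MPICH2's mpiexec should be found on the PATH.

# Sketch: confirm which mpiexec will be used for local MPI runs.
Sys.which("mpiexec")   # full path of the launcher on the PATH; an empty string means none was found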

For parallel processing on a remote host or for using multicore parallelization locally, the software listed in the JCS requirements above must be installed: GCC, R (with the batchtools, XML, reshape, and Certara.NLME8 packages), ssh, and MPI (Open MPI for Linux platforms, MPICH for Windows).

If the current version of a package is intended to be the default version across the grid, install it with elevated privileges.
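For the R part of these requirements, installation might look like the sketch below; it assumes the packages are available from your configured repositories and should be run from an elevated R session (or as root on Linux) if they are to go into the site-wide library used by all grid nodes.

# Sketch: install the R packages required by JCS.
# Run with elevated privileges if these should become the default (site-wide) versions.
install.packages(c("batchtools", "XML", "reshape", "Certara.NLME8"))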

MPI configuration

MPI configuration is done by the Phoenix installer at the time of installation.

Note: If one of the MPI processes crashes, the other MPI processes and the parent process mpiexec.exe may be left running, putting Phoenix NLME in an unpredictable state. Use the Task Manager to check whether mpiexec.exe is still running and, if it is, stop it. This will also stop any other MPI processes.
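As an alternative to the Task Manager, the same check and cleanup can be done from the command line; the sketch below wraps the standard Windows tasklist and taskkill commands in R calls.

# Sketch: detect and force-stop a leftover mpiexec.exe from R on Windows.
system2("tasklist", c("/FI", "\"IMAGENAME eq mpiexec.exe\""))   # is mpiexec.exe still running?
system2("taskkill", c("/F", "/IM", "mpiexec.exe"))              # stop it; per the note above, this also ends the other MPI processes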

Phoenix Model job status

The NLME Job Status window is displayed automatically when a Phoenix Model object is executed, or it can be opened by selecting Window > View NLME Jobs from the menu.

The window provides easy monitoring of all NLME jobs that are executing as well as a history of jobs that have completed during the current Phoenix session.

During local execution of a Phoenix model, the status window displays the parameter values and gradients of each major iteration of the algorithm, with the most recent iteration at the top. The model name, protocol, stage, time that the execution started and stopped, the status of the job, number of runs, and the number of runs that completed or failed are listed at the top. (See “Job control for parallel and remote execution” for more execution options.)

The job can be canceled (no results are saved) or stopped early (results obtained up to the time the job is ended are saved) from the right-click menu in the Status window. Specifically, if Stop early is selected during an iteration, the run is ended once the current iteration is completed. If Stop early is selected during the standard error calculation step, Phoenix stops the run and prepares outputs without standard error results.

Status_Window.png 

When Stop Early is executed, the Overall results worksheet will show a Return Code of 6, indicating that the fitting was not allowed to run to convergence.

Next to the name of the model that is executed in a run, several pieces of information are shown, providing a quick summary of the job’s current state. The Status of a job can be In Progress, Finished, or Canceled.

If a job involves multiple iterations, information for each completed iteration is added as a row. Use the button at the beginning of the job row to expand/collapse the iteration rows.

Right-clicking a row displays a menu from which the job can be stopped early (the results up to the last completed iteration are saved) or canceled (no results are saved).

