Submitting jobs using --job

Quick introduction to --job
Using the --job command line option, you can instruct EasyBuild to submit jobs for the installations that should be performed, rather than performing the installations locally on the system you are on.
If dependency resolution is enabled using --robot (see also Enabling dependency resolution (--robot / -r and --robot-paths)), EasyBuild will submit separate jobs and set dependencies between them to ensure they are run in the order dictated by the software dependency graph(s).
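For example (assuming the job backend has been configured as described below), the following submits installation jobs for an easyconfig file and any missing dependencies, rather than performing the builds locally:

$ eb OpenMPI-1.8.4-GCC-4.9.2.eb --robot --job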
Configuring --job
Selecting the job backend (--job-backend)
The job backend to be used can be specified using the --job-backend
EasyBuild configuration option.
Since EasyBuild 3.8.0, three backends are supported:

- GC3Pie (default) (supported since EasyBuild 2.2.0)
  - GC3Pie version 2.5.0 (or more recent) required (https://gc3pie.readthedocs.org)
  - works with different resource managers and job schedulers, including TORQUE/PBS, Slurm, etc.
  - note: requires that a GC3Pie configuration file is provided, see Configuring the job backend
- PbsPython
  - pbs_python version 4.1.0 (or more recent) required (https://oss.trac.surfsara.nl/pbs_python)
  - note: requires the TORQUE resource manager (https://adaptivecomputing.com/cherry-services/torque-resource-manager)
- Slurm (supported since EasyBuild 3.8.0)
  - requires Slurm version 17.0 (or more recent) (https://slurm.schedmd.com/)
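Like any EasyBuild configuration option, the job backend can be set on the command line or via the corresponding environment variable. For example, to select the Slurm backend (example.eb is a placeholder easyconfig name):

$ eb example.eb --job --job-backend=Slurm

or, equivalently:

$ export EASYBUILD_JOB_BACKEND=Slurm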
Configuring the job backend (--job-backend-config)
To configure the job backend, the path to a configuration file must be specified via --job-backend-config.

- for the PbsPython backend: (irrelevant, no configuration file required)
- for the GC3Pie backend: see https://gc3pie.readthedocs.org/en/latest/users/configuration.html
  - example configuration files are available at Example configurations for GC3Pie job backend
- for the Slurm backend: (irrelevant, no configuration file required)
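For example, to point the GC3Pie backend at a configuration file (the file location is illustrative):

$ eb example.eb --job --job-backend=GC3Pie --job-backend-config=$HOME/gc3pie.cfg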
Number of requested cores per job (--job-cores)
The number of cores that should be requested for each job that is submitted can be specified using --job-cores
(default: not specified).
The mechanism for determining the number of cores to request in case --job-cores
was not specified depends on
which job backend is being used:
- if the PbsPython job backend is used, the (most common) number of available cores per workernode in the target resource is determined; this usually results in jobs requesting full workernodes (at least in terms of cores) by default
- if the GC3Pie or Slurm job backend is used, the requested number of cores is left unspecified, which results in falling back to the default mechanism used by GC3Pie/Slurm to pick a number of cores; most likely, this results in single-core jobs being submitted by default
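For example, to explicitly request 16 cores for each submitted job:

$ eb example.eb --job --job-cores=16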
Job dependency type (--job-deps-type)
The type of dependency that is set by EasyBuild when submitting a job that depends on one or more other jobs
can be specified via the --job-deps-type
configuration setting:
- with --job-deps-type=abort_on_error, job dependencies will be set such that a job that depends on other jobs will be aborted if one of those jobs completes with an error
  - for both PbsPython and Slurm, this is equivalent to setting job dependencies using afterok
- with --job-deps-type=always_run, job dependencies will be set such that a job that depends on other jobs is always run, regardless of whether or not those jobs completed successfully
  - for both PbsPython and Slurm, this is equivalent to setting job dependencies using afterany
The default value for --job-deps-type depends on the job backend being used (see Configuring the job backend):
- for the GC3Pie and Slurm backends, --job-deps-type=abort_on_error is the default;
- for the PbsPython backend, --job-deps-type=always_run is the default (for historical reasons, and for the sake of backward compatibility)
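For reference, these dependency types map onto the standard dependency specifications of the resource manager; a hand-crafted Slurm equivalent of what EasyBuild sets up would look something like this (job ID and script name are illustrative):

# abort_on_error: job only starts if job 123 completed successfully
$ sbatch --dependency=afterok:123 install_openmpi.sh

# always_run: job starts once job 123 has finished, regardless of its exit status
$ sbatch --dependency=afterany:123 install_openmpi.sh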
Maximum walltime of jobs (--job-max-walltime)
An integer value specifying the maximum walltime for jobs (in hours) can be specified via --job-max-walltime
(default: 24).
For easyconfigs for which a reference required walltime is available via the build_stats
parameter in a matching
easyconfig file from the easyconfig repository (see Easyconfigs repository (--repository
, --repositorypath
)),
EasyBuild will set the walltime of the corresponding job to twice that value (unless the resulting value is higher
than the maximum walltime for jobs).
If no such reference walltime is available, the maximum walltime is used.
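For example, with the default maximum of 24 hours: a reference walltime of 3 hours results in a job requesting 6 hours of walltime, while a reference walltime of 15 hours results in a job requesting only 24 hours (since twice that value would exceed the maximum).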
Job output directory (--job-output-dir)
The directory where job log files should be placed can be specified via --job-output-dir
(default: current directory).
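For example, to collect job log files in a dedicated directory (path is illustrative):

$ eb example.eb --job --job-output-dir=$HOME/eb-job-logs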
Job polling interval (--job-polling-interval)
The frequency with which the status of submitted jobs should be checked can be specified via --job-polling-interval
,
using a floating-point value representing the number of seconds between two checks (default: 30 seconds).
Note
This setting is currently only relevant to GC3Pie; see also Submitting jobs to a GC3Pie backend.
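For example, to poll for job status updates every two minutes instead (illustrative value):

$ eb example.eb --job --job-polling-interval=120.0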
Target resource for job backend (--job-target-resource)
The target resource that should be used by the job backend can be specified using --job-target-resource.
- for the PbsPython backend: hostname of the TORQUE PBS server to submit jobs to (default: $PBS_DEFAULT)
- for the GC3Pie backend: name of the resource to submit jobs to (default: none, which implies weighted round-robin submission across all available resources)
- for the Slurm backend: (not used)
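For the GC3Pie backend, the value should match the name of a resource section in the GC3Pie configuration file; for example, assuming a [resource/slurm] section is defined (as in the example configurations below):

$ eb example.eb --job --job-backend=GC3Pie --job-target-resource=slurm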
Usage of --job
To make EasyBuild submit jobs to the job backend rather than performing the installations directly, the --job
command line option can be used.
The following assumes that the required configuration settings w.r.t. the job backend to use are in place, see Configuring --job.
Submitting jobs to a PbsPython or Slurm backend
When using the PbsPython or Slurm backend, EasyBuild will submit separate jobs for each installation to be performed, and then exit, reporting a list of submitted jobs.
To ensure that the installations are performed in the order dictated by the software dependency graph, dependencies between installations are specified via job dependencies (see also Job dependency type (--job-deps-type)).
See also Example: submitting installations to TORQUE via pbs_python.
Note
Submitted jobs will be put on hold until all jobs have been submitted. This is required to ensure that the dependencies between jobs can be specified correctly; if a job would run to completion before other jobs that depend on it were submitted, the submission process may fail.
Submitting jobs to a GC3Pie backend
When using the GC3Pie backend, EasyBuild will create separate tasks for each installation to be performed and supply them to GC3Pie, which will then take over and pass the installations through as jobs to the available resource(s) (see also Configuring the job backend).
To ensure that the installations are performed in the order dictated by the software dependency graph, dependencies between installations are specified to GC3Pie as inter-task dependencies. GC3Pie will then gradually feed the installations to its available resources as their dependencies have been satisfied.
Any log messages produced by GC3Pie are included in the EasyBuild log file, and are tagged with gc3pie.
See also Example: submitting installations to SLURM via GC3Pie.
Note
The eb process will not exit until the full set of tasks that GC3Pie was provided with has been processed. An overall progress report will be printed regularly (see also Job polling interval (--job-polling-interval)). As such, it is advised to run the eb process in a screen/tmux session when using the GC3Pie backend for --job.
Examples
Example configurations for GC3Pie job backend
When using GC3Pie as a job backend, a configuration file must be provided via --job-backend-config.
This section includes a couple of examples of GC3Pie configuration files (see also
https://gc3pie.readthedocs.org/en/latest/users/configuration.html).
Example GC3Pie configuration for local system
[resource/localhost]
enabled = yes
type = shellcmd
frontend = localhost
transport = local
max_memory_per_core = 10GiB
max_walltime = 100 hours
# max # jobs ~= max_cores / max_cores_per_job
max_cores_per_job = 1
max_cores = 4
architecture = x86_64
auth = none
override = no
resourcedir = /tmp/gc3pie
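With this configuration saved as, say, gc3pie.cfg, installations can be submitted as (single-core) jobs that run on the local system itself:

$ eb example.eb --job --job-backend=GC3Pie --job-backend-config=$PWD/gc3pie.cfg --job-target-resource=localhost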
Example GC3Pie configuration for TORQUE/PBS
[resource/pbs]
enabled = yes
type = pbs
# use settings below when running GC3Pie on the cluster front-end node
frontend = localhost
transport = local
auth = none
max_walltime = 2 days
# max # jobs ~= max_cores / max_cores_per_job
max_cores_per_job = 16
max_cores = 1024
max_memory_per_core = 2 GiB
architecture = x86_64
# to add non-std options or use TORQUE/PBS tools located outside of
# the default PATH, use the following:
#qsub = /usr/local/bin/qsub -q my-special-queue
Example GC3Pie configuration for SLURM
[resource/slurm]
enabled = yes
type = slurm
# use settings below when running GC3Pie on the cluster front-end node
frontend = localhost
transport = local
auth = none
max_walltime = 2 days
# max # jobs ~= max_cores / max_cores_per_job
max_cores_per_job = 16
max_cores = 1024
max_memory_per_core = 2 GiB
architecture = x86_64
# to add non-std options or use SLURM tools located outside of
# the default PATH, use the following:
#sbatch = /usr/bin/sbatch --mail-type=ALL
Example: submitting installations to SLURM via GC3Pie
When submitting jobs to the GC3Pie job backend, the eb process will not exit until all tasks have been completed. A job overview will be printed every N seconds (see Job polling interval (--job-polling-interval)).
Jobs are only submitted to the resource manager (SLURM, in this case) when all task dependencies have been resolved.
$ export EASYBUILD_JOB_BACKEND=GC3Pie
$ export EASYBUILD_JOB_BACKEND_CONFIG=$PWD/gc3pie.cfg
$ eb GCC-4.6.0.eb OpenMPI-1.8.4-GCC-4.9.2.eb --robot --job --job-cores=16 --job-max-walltime=10
== temporary log file in case of crash /tmp/eb-ivAiwD/easybuild-PCgmCB.log
== resolving dependencies ...
== GC3Pie job overview: 2 submitted (total: 9)
== GC3Pie job overview: 2 running (total: 9)
== GC3Pie job overview: 2 running (total: 9)
...
== GC3Pie job overview: 4 terminated, 4 ok, 1 submitted (total: 9)
== GC3Pie job overview: 4 terminated, 4 ok, 1 running (total: 9)
...
== GC3Pie job overview: 8 terminated, 8 ok, 1 running (total: 9)
== GC3Pie job overview: 9 terminated, 9 ok (total: 9)
== GC3Pie job overview: 9 terminated, 9 ok (total: 9)
== Done processing jobs
== GC3Pie job overview: 9 terminated, 9 ok (total: 9)
== Submitted parallel build jobs, exiting now
== temporary log file(s) /tmp/eb-ivAiwD/easybuild-PCgmCB.log* have been removed.
== temporary directory /tmp/eb-ivAiwD has been removed.
Checking which jobs have been submitted to SLURM at regular intervals reveals that indeed only tasks for which all dependencies have been processed are actually submitted as jobs:
$ squeue -u $USER
JOBID USER ACCOUNT NAME REASON START_TIME END_TIME TIME_LEFT NODES CPUS PRIORITY
6161545 easybuild example GCC-4.9.2 None 2015-07-01T1 2015-07-01T2 9:58:55 1 16 1242
6161546 easybuild example GCC-4.6.0 None 2015-07-01T1 2015-07-01T2 9:58:55 1 16 1242
$ squeue -u $USER
JOBID USER ACCOUNT NAME REASON START_TIME END_TIME TIME_LEFT NODES CPUS PRIORITY
6174527 easybuild example Automake-1.15- Resources N/A N/A 10:00:00 1 16 1120
$ squeue -u $USER
JOBID USER ACCOUNT NAME REASON START_TIME END_TIME TIME_LEFT NODES CPUS PRIORITY
6174533 easybuild example OpenMPI-1.8.4- None 2015-07-03T0 2015-07-03T1 9:55:59 1 16 1119
Example: submitting installations to TORQUE via pbs_python
Using the PbsPython job backend, eb submits jobs directly to TORQUE for processing, and exits as soon as all jobs have been submitted:
$ eb GCC-4.6.0.eb OpenMPI-1.8.4-GCC-4.9.2.eb --robot --job
== temporary log file in case of crash /tmp/eb-OMNQAV/easybuild-9fTuJA.log
== resolving dependencies ...
== List of submitted jobs (9): GCC-4.6.0 (GCC/4.6.0): 508023.example.pbs; GCC-4.9.2 (GCC/4.9.2): 508024.example.pbs;
libtool-2.4.2-GCC-4.9.2 (libtool/2.4.2-GCC-4.9.2): 508025.example.pbs; M4-1.4.17-GCC-4.9.2 (M4/1.4.17-GCC-4.9.2): 50
8026.example.pbs; Autoconf-2.69-GCC-4.9.2 (Autoconf/2.69-GCC-4.9.2): 508027.example.pbs; Automake-1.15-GCC-4.9.2 (Au
tomake/1.15-GCC-4.9.2): 508028.example.pbs; numactl-2.0.10-GCC-4.9.2 (numactl/2.0.10-GCC-4.9.2): 508029.example.pbs;
hwloc-1.10.0-GCC-4.9.2 (hwloc/1.10.0-GCC-4.9.2): 508030.example.pbs; OpenMPI-1.8.4-GCC-4.9.2 (OpenMPI/1.8.4-GCC-4.9.
2): 508031.example.pbs
== Submitted parallel build jobs, exiting now
== temporary log file(s) /tmp/eb-OMNQAV/easybuild-9fTuJA.log* have been removed.
== temporary directory /tmp/eb-OMNQAV has been removed.
$ qstat -a
example.pbs:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
508023.example.pbs easybuild batch GCC-4.6.0 -- 1 16 -- 24:00:00 R 00:02:16
508024.example.pbs easybuild batch GCC-4.9.2 -- 1 16 -- 24:00:00 Q --
508025.example.pbs easybuild batch libtool-2.4.2-GC -- 1 16 -- 24:00:00 H --
508026.example.pbs easybuild batch M4-1.4.17-GCC-4. -- 1 16 -- 24:00:00 H --
508027.example.pbs easybuild batch Autoconf-2.69-GC -- 1 16 -- 24:00:00 H --
508028.example.pbs easybuild batch Automake-1.15-GC -- 1 16 -- 24:00:00 H --
508029.example.pbs easybuild batch numactl-2.0.10-G -- 1 16 -- 24:00:00 H --
508030.example.pbs easybuild batch hwloc-1.10.0-GCC -- 1 16 -- 24:00:00 H --
508031.example.pbs easybuild batch OpenMPI-1.8.4-GC -- 1 16 -- 24:00:00 H --
Holds are put in place to ensure that the jobs run in the order dictated by the dependency graph(s). These holds are released by the TORQUE server as soon as the jobs on which they depend have completed:
$ qstat -a
example.pbs:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
508025.example.pbs easybuild batch libtool-2.4.2-GC -- 1 16 -- 24:00:00 Q --
508026.example.pbs easybuild batch M4-1.4.17-GCC-4. -- 1 16 -- 24:00:00 Q --
508027.example.pbs easybuild batch Autoconf-2.69-GC -- 1 16 -- 24:00:00 H --
508028.example.pbs easybuild batch Automake-1.15-GC -- 1 16 -- 24:00:00 H --
508029.example.pbs easybuild batch numactl-2.0.10-G -- 1 16 -- 24:00:00 H --
508030.example.pbs easybuild batch hwloc-1.10.0-GCC -- 1 16 -- 24:00:00 H --
508031.example.pbs easybuild batch OpenMPI-1.8.4-GC -- 1 16 -- 24:00:00 H --
...
$ qstat -a
example.pbs:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
508031.example.pbs easybuild batch OpenMPI-1.8.4-GC -- 1 16 -- 24:00:00 R 00:03:46