virgo cluster
The batch farm runs on the virgo.hpc.gsi.de cluster. Batch jobs are managed by the SLURM (SL) scheduler. The documentation on the cluster/SLURM can be found via the links listed below.
New Hades users have to be added to the SLURM account hades to be able to run jobs on the farm. The list of hades users is maintained by J.Markert@gsi.de.
Some rules for working with virgo.hpc.gsi.de:
- Files are written to the /lustre filesystem. Each Hades user owns a directory
  /lustre/hades/user/${USER} to work on the farm (NO BACKUP!).
- SLURM does not support file systems other than /lustre. Batch scripts can use
  neither the user's home directory nor any other filesystem.
- The Hades software is distributed to the batch farm via /cvmfs/hades.gsi.de/ (debian8)
  or /cvmfs/hadessoft.gsi.de (debian10). /cvmfs/hades.gsi.de and /cvmfs/hadessoft.gsi.de
  are also available on desktop machines, depending on the OS version.
- Batch jobs can be submitted to the farm from the virgo.hpc.gsi.de cluster. This machine
  provides our software, the user ${HOME} and a file system mount of /lustre. You can
  compile and test (run) your programs here.
- A set of example batch scripts for Pluto, UrQmd, HGeant, DSTs and user analysis
  can be retrieved with
  svn checkout https://subversion.gsi.de/hades/hydra2/trunk/scripts/batch/GE
The folders contain sendScript_SL.sh+jobScript_SL.sh (SL).
The general concept is to work with file lists as input to the sendScript, which
takes care of syncing from the user's home directory to the submission directory on
/lustre. The files in the list are split automatically into job arrays to
minimize the load on the scheduler. The sendScript finally calls the sbatch
command of SLURM to submit the job. The jobScript is the part which
runs on the batch nodes.
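A minimal sketch of this file-list concept follows; this is not the actual sendScript_SL.sh, the file names, paths and resource values are made up, and the real scripts bundle several files per array task:
//-------------------------------
#!/bin/bash
# sketch only: submit one array task per line of a file list
filelist=files.list                              # one input file per line (hypothetical)
submissiondir=/lustre/hades/user/${USER}/sub     # submission dir on /lustre
mkdir -p ${submissiondir}/log

# sync scripts + list from the home dir to /lustre (batch nodes only see /lustre)
cp jobScript_SL.sh ${filelist} ${submissiondir}/

# one sbatch call submits the whole job array
nfiles=$(wc -l < ${filelist})
sbatch --array=1-${nfiles} --mem=2G --time=08:00:00 \
       -D ${submissiondir} \
       --output=${submissiondir}/log/slurm-%A_%a.out \
       ${submissiondir}/jobScript_SL.sh ${submissiondir}/${filelist}

# inside jobScript_SL.sh each task would then pick its input via the array index, e.g.
#   file=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$1")
//-------------------------------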
Working on the batch farm of GSI:
virgo cluster documentation:
https://hpc.gsi.de/virgo/user-guide/storage.html
ssh key authentication
https://hpc.gsi.de/virgo/user-guide/access/key-authentication.html
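As a quick reminder (the required key type and how to register the public key at GSI are described in the page above), key-based login typically looks like this; the key file name here is just an example:
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_gsi
ssh -i ~/.ssh/id_ed25519_gsi username@virgo.hpc.gsi.de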
Since the 1st of September 2021 we have two software environments
available:
debian10: newest software releases compiled with root 6.24.02 and gcc 8.3.
Use this for the current mar19 beam analysis of DST files. The login to
the farm uses vae23.hpc.gsi.de (debian10, ROOT6) and no special action
for handling the container environment with Singularity is needed.
It works like the no longer available virgo-debian8.hpc.gsi.de did.
debian8: old software releases compiled with root 5.34.34 and gcc 4.9.2
This covers all versions up to hydra2-5.6a, used for the DST productions
of apr12, aug14, jul14 and mar19 (gen5) before the 1st of September 2021.
Use this if you want exactly the same behaviour as before the switch and do
not want to change any software. To make the environment work,
some handling of the Singularity options is needed to start and
submit jobs using this container environment. It is described
below.
#############################################################
virgo cluster usage information:
virgo.hpc.gsi.de (bare bone, start your container yourself),
vae23.hpc.gsi.de (debian10 container started at login)
Problems and tips:
- If your login to virgo.hpc.gsi.de does not work although you have created ssh keys
  (https://hpc.gsi.de/virgo/access/key_authentication.html), try to clean up
  .ssh/authorized_keys from other keys which need other credentials and might cause
  problems.
- ksh does not work with the login, use bash (accounts-service@gsi.de).
  ksh is not installed at virgo and will lead to "permission denied" statements
  without further explanation. accounts-service@gsi.de is responsible for changing
  your login shell. It will take a while to see the change, the syncs are performed
  once per day at 22:00.
- X does not work at virgo at the moment:
  a. use lx-pool.gsi.de (or any desktop machine) to look at /lustre output
  b. use sshfs to mount /lustre on any linux machine which has no mount of /lustre
     already. To mount /lustre use
       sshfs user@virgo.hpc.gsi.de:/lustre mymountpoint
       sshfs user@lustre.hpc.gsi.de:/lustre mymountpoint
     vae23.hpc.gsi.de will not work with sshfs, since the session is closed as soon
     as the sshfs command returns.
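To remove such an sshfs mount again, the standard FUSE tool can be used (not specific to virgo):
fusermount -u mymountpoint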
Our sendScript_SL.sh for batch submission from before the use of Singularity containers
needs a small modification for virgo:

# from inside the virgo container
command="--array=1-${stop} ${resources} -D ${submissiondir} --output=${pathoutputlog}/slurm-%A_%a.out -- ${jobscript} ${submissiondir}/${jobarrayFile} ${pathoutputlog} ${arrayoffset}"

For virgo there is an additional " -- " between the SLURM options and the user script +
parameters of the script which should be started on the farm. This is needed since SLURM
separates them from the container which is started. The container version is chosen
automatically from the submit host.
#############################################################
debian10 container login:
ssh username@vae23.hpc.gsi.de (virgo)
This command will start a container-based instance
of GSI debian10.
debian8 container login:
working environment debian8:
1. login ssh username@virgo.hpc.gsi.de.
2. start debian8 container:
. start_debian8.sh
start_debian8.sh:
//-------------------------------
# container image used for the debian8 environment
export SSHD_CONTAINER_DEFAULT="/cvmfs/vae.gsi.de/debian8/containers/user_container-production.sif"
# bind mounts: SLURM/munge/sss configuration, the VAE cvmfs, and the hadessoft debian8
# install/param/oracle dirs mapped to their old /cvmfs/hades.gsi.de and /cvmfs/it.gsi.de paths
export SSHD_CONTAINER_OPTIONS="--bind /etc/slurm,/var/run/munge,/var/spool/slurm,/var/lib/sss/pipes/nss,/cvmfs/vae.gsi.de,/cvmfs/hadessoft.gsi.de/install/debian8/install:/cvmfs/hades.gsi.de/install,/cvmfs/hadessoft.gsi.de/param:/cvmfs/hades.gsi.de/param,/cvmfs/hadessoft.gsi.de/install/debian8/oracle:/cvmfs/it.gsi.de/oracle"
# the user's login shell, taken from the passwd entry
shell=$(getent passwd $USER | cut -d : -f 7)
# inside the container, wrap srun/sbatch so they skip the default singularity binds,
# then start a login shell
STARTUP_COMMAND=$(cat << EOF
srun() { srun-nowrap --singularity-no-bind-defaults "\$@"; }
sbatch() { sbatch-nowrap --singularity-no-bind-defaults "\$@"; }
export -f srun sbatch
$shell -l
EOF
)
# prompt and container selection propagated into the singularity environment
export SINGULARITYENV_PS1="\u@\h:\w > "
export SINGULARITYENV_SLURM_SINGULARITY_CONTAINER=$SSHD_CONTAINER_DEFAULT
test -f /etc/motd && cat /etc/motd
echo Container launched: $(realpath $SSHD_CONTAINER_DEFAULT)
# replace the current shell by the container running the startup command
exec singularity exec $SSHD_CONTAINER_OPTIONS $SSHD_CONTAINER_DEFAULT $shell -c "$STARTUP_COMMAND"
//-------------------------------
This environment allows you to compile and run code. SL_mon.pl
and access to the SLURM commands will not work on debian8;
virgo.hpc.gsi.de + vae23.hpc.gsi.de allow the use of SL_mon.pl + SLURM.
3. our sendScript_SL.sh for batch submission works
for the vae23.hpc.gsi.de login. NO ADDITIONAL wrap.sh NEEDED
(see below)!
------------------------------------------------------------
submission of batch jobs for debian8 on virgo3:
1. login to virgo.hpc.gsi.de
2. modify your sendScript_SL.sh to use wrap.sh
   to start the debian8 container on the farm.
wrap.sh:
//-------------------------------
#!/bin/bash
# wrap.sh is what sbatch starts on the batch node; it runs the real job script
# inside the debian8 container with the needed bind mounts.
jobscript=$1       # job script to run inside the container
jobarrayFile=$2    # job array file (file list)
pathoutputlog=$3   # directory for the log files
arrayoffset=$4     # offset into the job array
singularity exec \
-B /cvmfs/hadessoft.gsi.de/install/debian8/install:/cvmfs/hades.gsi.de/install \
-B /cvmfs/hadessoft.gsi.de/param:/cvmfs/hades.gsi.de/param \
-B /cvmfs/hadessoft.gsi.de/install/debian8/oracle:/cvmfs/it.gsi.de/oracle \
-B /lustre \
/cvmfs/vae.gsi.de/debian8/containers/user_container-production.sif ${jobscript} ${jobarrayFile} ${pathoutputlog} ${arrayoffset}
//-------------------------------
In sendScript_SL.sh, in the lower part where the sbatch command is built:
#virgo bare bone submit using wrap.sh
wrap=./wrap.sh
command="--array=1-${stop} ${resources} -D ${submissiondir} --output=${pathoutputlog}/slurm-%A_%a.out -- ${wrap} ${jobscript} ${submissiondir}/${jobarrayFile} ${pathoutputlog} ${arrayoffset}"
SLURM tips:
The most relevant commands to work with SL:
sbatch : sbatch submits a batch script to SLURM.
squeue : used to view job and job step information for jobs managed by SLURM.
scancel : used to signal or cancel jobs, job arrays or job steps.
sinfo : used to view partition and node information for a system running SLURM.
sreport : used to generate reports of job usage and cluster utilization for
SLURM jobs saved to the SLURM Database.
scontrol : used to view or modify Slurm configuration including: job,
job step, node, partition,reservation, and overall system configuration.
Examples:
squeue -u user : show all jobs of user
squeue -t R : show jobs in a certain state (PENDING (PD),
RUNNING (R), SUSPENDED (S),COMPLETING (CG),
COMPLETED (CD), CONFIGURING (CF),
CANCELLED (CA),FAILED (F), TIMEOUT (TO),
PREEMPTED (PR), BOOT_FAIL (BF) ,
NODE_FAIL (NF) and SPECIAL_EXIT (SE))
scancel -u user : cancel all jobs of user user
scancel jobid : cancel job with jobid
scancel -t PD -u <username> : cancel all pending jobs of a user
scontrol show job -d <jobid> : show detailed info about a job
scontrol hold <jobid> : hold a pending job (it will not start until released)
scontrol release <jobid> : release a held job
scontrol suspend <jobid> : suspend a running job
scontrol resume <jobid> : resume a suspended job
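As a small end-to-end illustration (script name, paths and resource values are hypothetical), submitting an array job and following it with the commands above:
sbatch --array=1-5 --mem=2G --time=08:00:00 \
       -D /lustre/hades/user/${USER}/test \
       --output=slurm-%A_%a.out ./myJob.sh
squeue -u ${USER}               # watch the tasks pending/running
scontrol show job -d <jobid>    # detailed info about the job
scancel <jobid>                 # cancel the whole array again if needed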