virgo2 cluster

The batch farm runs on the virgo2.hpc.gsi.de cluster. Batch jobs are managed by the SLURM (SL) scheduler. Documentation on the cluster and SLURM can be found here.
New Hades users have to be added to the account hades before they can run jobs on the farm. The list of hades users is maintained by J.Markert@gsi.de.

Some rules to work with virgo2.hpc.gsi.de:

  • Files are written to the /lustre file system. Each Hades user owns a directory /lustre/hades/user/${USER} to work on the farm (NO BACKUP!)
  • SLURM supports no file system other than /lustre: batch
    scripts cannot use the user's home directory or any other file system.
  • The Hades software is distributed to the batch farm via
    /cvmfs/hades.gsi.de/ (Debian 8) or /cvmfs/hadessoft.gsi.de (Debian 10).
    Both /cvmfs/hades.gsi.de and /cvmfs/hadessoft.gsi.de are also available on desktop machines, depending on the OS version.
  • Batch jobs are submitted to the farm from the virgo2.hpc.gsi.de
    cluster. This machine provides our software, the user ${HOME} and
    a file system mount to /lustre. You can compile and test (run) your
    programs here.
  • A set of example batch scripts for Pluto, UrQMD, HGeant, DSTs and user
    analysis can be retrieved via
    svn checkout https://subversion.gsi.de/hades/hydra2/trunk/scripts/batch/GE
    The folders contain sendScript_SL.sh + jobScript_SL.sh (SL).
    The general concept is to work with file lists as input to the sendScript,
    which takes care of syncing from the user's home directory to the submission
    directory on /lustre. The files in each list are split automatically into
    job arrays to minimize the load on the scheduler. The sendScript finally
    calls the sbatch command of SLURM to submit the job. The jobScript is the
    part which runs on the batch nodes; a minimal sketch of this concept is
    shown below.
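
To illustrate the file-list/job-array concept, here is a minimal sketch of a job script and its submission. It is not the real jobScript_SL.sh/sendScript_SL.sh from the repository; the partition name, paths and file names are placeholder assumptions:

    #!/bin/bash
    # jobScript_minimal.sh -- hypothetical example, not the repository version
    #SBATCH --partition=main      # placeholder; check sinfo for real partition names
    #SBATCH --time=02:00:00       # wall-time limit of one array task
    #SBATCH --output=/lustre/hades/user/%u/log/slurm-%A_%a.out  # log dir on /lustre (must exist)

    # each array task picks its own input file from the shared file list
    FILELIST=/lustre/hades/user/${USER}/submit/files.list       # placeholder path
    INFILE=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" "${FILELIST}")

    echo "array task ${SLURM_ARRAY_TASK_ID} processes ${INFILE}"
    # ... run Pluto/HGeant/DST/user analysis on ${INFILE} here ...

A sendScript then essentially counts the lines of the file list and submits one array task per file:

    NFILES=$(wc -l < /lustre/hades/user/${USER}/submit/files.list)
    sbatch --array=0-$((NFILES - 1)) jobScript_minimal.sh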

 

SLURM tips:

The most relevant commands to work with SL:

sbatch   : submits a batch script to SLURM.
squeue   : used to view job and job step information for jobs managed by SLURM.
scancel  : used to signal or cancel jobs, job arrays or job steps.
sinfo    : used to view partition and node information for a system running SLURM.
sreport  : used to generate reports of job usage and cluster utilization for
           SLURM jobs saved to the SLURM database.
scontrol : used to view or modify Slurm configuration including: job,
           job step, node, partition, reservation, and overall system configuration.
 
Examples:
squeue -u <username>         : show all jobs of a user
squeue -t R                  : show jobs in a certain state (PENDING (PD),
                               RUNNING (R), SUSPENDED (S), COMPLETING (CG),
                               COMPLETED (CD), CONFIGURING (CF),
                               CANCELLED (CA), FAILED (F), TIMEOUT (TO),
                               PREEMPTED (PR), BOOT_FAIL (BF),
                               NODE_FAIL (NF) and SPECIAL_EXIT (SE))
scancel -u <username>        : cancel all jobs of a user
scancel <jobid>              : cancel the job with this id
scancel -t PD -u <username>  : cancel all pending jobs of a user
scontrol show job -d <jobid> : show detailed info about a job
scontrol hold <jobid>        : hold a pending job (it will not start until released)
scontrol release <jobid>     : release a held job
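
For completeness, a few more example calls for the commands not covered above; the job id, task range and dates are placeholders:

sbatch jobScript_SL.sh       : submit a batch script to SLURM
scancel 123456_[10-19]       : cancel tasks 10-19 of array job 123456
sinfo -s                     : show a summary of partitions and node states
sreport cluster Utilization start=2024-01-01 end=2024-02-01
                             : report the cluster utilization for January 2024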