
virgo cluster

The batch farm runs on the virgo.hpc.gsi.de cluster. Batch jobs are managed by the SLURM (SL) scheduler. The documentation on the cluster and SLURM can be found in the virgo user guide (see the links below).
New Hades users have to be added to the account hades to be able to run jobs on the farm. The list of Hades users is maintained by J.Markert@gsi.de.

Some rules to work with virgo.hpc.gsi.de:

  • Files are written to the /lustre filesystem. Each Hades user owns a directory /lustre/hades/user/${USER} to work on the farm (NO BACKUP!)
  • SLURM does not support any filesystem other than /lustre. Batch
    scripts cannot use the user's home directory or any other filesystem.
  • The Hades software is distributed to the batch farm via
    /cvmfs/hades.gsi.de/ (debian8) or /cvmfs/hadessoft.gsi.de (debian10).
    /cvmfs/hades.gsi.de and /cvmfs/hadessoft.gsi.de are also available on desktop machines, depending on the OS version.
  • Batch jobs can be submitted to the farm from the virgo.hpc.gsi.de
    cluster. This machine provides our software, the user ${HOME} and
    a filesystem mount of /lustre. You can compile and test (run) your
    programs here.
  • A set of example batch scripts for Pluto, UrQMD, HGeant, DSTs and user
    analysis can be retrieved with
    svn checkout https://subversion.gsi.de/hades/hydra2/trunk/scripts/batch/GE
    The folders contain sendScript_SL.sh + jobScript_SL.sh (SL).
    The general concept is to work with file lists as input to the sendScript, which
    takes care of syncing from the user's home directory to the submission directory on
    /lustre. The files in the list are split automatically into job arrays to
    minimize the load on the scheduler. The sendScript finally calls the sbatch
    command of SLURM to submit the job. The jobScript is the part which
    runs on the batch nodes (see the sketch below).
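
A minimal sketch of this workflow (the file names, the exact command-line interface of sendScript_SL.sh and the resource settings are illustrative only; check the header of the script from the svn checkout for the real parameters):

    # build a file list of input files located on /lustre (hypothetical path)
    ls /lustre/hades/user/${USER}/input/*.root > filelist.txt
    # adjust sendScript_SL.sh (file list, submission directory on /lustre,
    # resources), then run it; it splits the list into job arrays and
    # calls sbatch with jobScript_SL.sh on the batch nodes
    ./sendScript_SL.sh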

Working on the batch farm of GSI:

virgo cluster documentation:

https://hpc.gsi.de/virgo/user-guide/storage.html


SSH key authentication

https://hpc.gsi.de/virgo/user-guide/access/key-authentication.html

 

Since 1st of September 2021 we have two software environments
available:

debian10: newest software releases, compiled with ROOT 6.24.02 and gcc 8.3.
         Use this for the current beam mar19 analysis of dst files. The login to
         the farm uses vae23.hpc.gsi.de (debian10, ROOT6) and no special action
         for handling the container environment with singularity is needed.
         It works like the no longer available virgo-debian8.hpc.gsi.de.
         
debian8: old software releases, compiled with ROOT 5.34.34 and gcc 4.9.2.
        This covers all versions up to hydra2-5.6a used for the dst productions
        of apr12, aug14, jul14 and mar19 (gen5) before 1st of September 2021.
        Use this if you want exactly the same behaviour as before the switch and do
        not want to change any software. To make the environment work,
        some handling of the singularity options is needed to start and
        submit jobs using this container environment. It is described
        below.
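
To check which of the two environments you are currently working in, you can for example compare the OS, compiler and ROOT versions (assuming the Hades environment has been sourced so that root-config is in your PATH):

    cat /etc/debian_version     # 8.x  (debian8)   or 10.x    (debian10)
    gcc -dumpversion            # 4.9.x (debian8)  or 8.x     (debian10)
    root-config --version       # 5.34/34 (debian8) or 6.24/02 (debian10)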

#############################################################

 


virgo cluster usage information:
           virgo.hpc.gsi.de (bare bone, start your container yourself),
           vae23.hpc.gsi.de (debian10 container started at login)


Problems and tips:
  •  if your login to virgo.hpc.gsi.de
     does not work although you have created ssh keys
     (https://hpc.gsi.de/virgo/access/key_authentication.html),
     try to clean up .ssh/authorized_keys and remove other
     keys which need other credentials and might cause
     problems.
  •  ksh does not work with the login,
     use bash instead (contact accounts-service@gsi.de).
     ksh is not installed on virgo and will
     lead to "permission denied" messages without further
     explanation. accounts-service@gsi.de is
     responsible for changing your login shell.
     It will take a while to see the changes; syncs
     are performed once per day at 22:00.
  •  X does not work on virgo at the moment:
     a. use lx-pool.gsi.de (or any desktop machine) to look at /lustre output.
        Use sshfs to mount /lustre on any Linux machine which
        does not already have a mount of /lustre.
        To mount /lustre use

    sshfs user@virgo.hpc.gsi.de:/lustre mymountpoint
    sshfs user@lustre.hpc.gsi.de:/lustre mymountpoint
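    # to unmount again (assuming fusermount from FUSE is available on your machine)
    fusermount -u mymountpoint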

         vae23.hpc.gsi.de will not work with sshfs
         since the session is closed after the sshfs command has returned.

  •  our sendScript_SL.sh for batch submission (from the time before the singularity containers)
     needs a small modification for virgo:

    # from inside virgo container
        command="--array=1-${stop} ${resources} -D ${submissiondir} --output=${pathoutputlog}/slurm-%A_%a.out -- ${jobscript} ${submissiondir}/${jobarrayFile} ${pathoutputlog} ${arrayoffset}"
        For virgo there is an additional " -- " between the SLURM options and
        the user script + parameters which should be started
        on the farm. This is needed since SLURM separates its own options from the
        container command which is started. The container version is chosen from the
        submit host automatically.
    #############################################################

 

debian10 container login:

ssh username@vae23.hpc.gsi.de            (virgo)

This command starts a container-based instance
of GSI debian10.

 


debian8 container login:

working environment debian8:
1. login:   ssh username@virgo.hpc.gsi.de
2. start the debian8 container:
 

 . start_debian8.sh
start_debian8.sh:
//-------------------------------
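# default debian8 user container image on CVMFS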
export SSHD_CONTAINER_DEFAULT="/cvmfs/vae.gsi.de/debian8/containers/user_container-production.sif"
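# bind the SLURM/munge/sssd runtime directories into the container and map the
# debian8 Hades software trees on /cvmfs/hadessoft.gsi.de to their
# /cvmfs/hades.gsi.de (and /cvmfs/it.gsi.de/oracle) paths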
export SSHD_CONTAINER_OPTIONS="--bind /etc/slurm,/var/run/munge,/var/spool/slurm,/var/lib/sss/pipes/nss,/cvmfs/vae.gsi.de,/cvmfs/hadessoft.gsi.de/install/debian8/install:/cvmfs/hades.gsi.de/install,/cvmfs/hadessoft.gsi.de/param:/cvmfs/hades.gsi.de/param,/cvmfs/hadessoft.gsi.de/install/debian8/oracle:/cvmfs/it.gsi.de/oracle"
shell=$(getent passwd $USER | cut -d : -f 7)
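# wrap srun/sbatch so that submissions from inside the container skip the default
# singularity binds, then start the user's login shell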
STARTUP_COMMAND=$(cat << EOF
   srun() { srun-nowrap --singularity-no-bind-defaults "\$@"; }
   sbatch() { sbatch-nowrap --singularity-no-bind-defaults "\$@"; }
   export -f srun sbatch
   $shell -l
EOF
)
export SINGULARITYENV_PS1="\u@\h:\w > "
export SINGULARITYENV_SLURM_SINGULARITY_CONTAINER=$SSHD_CONTAINER_DEFAULT
test -f /etc/motd && cat /etc/motd
echo Container launched: $(realpath $SSHD_CONTAINER_DEFAULT)
exec singularity exec $SSHD_CONTAINER_OPTIONS $SSHD_CONTAINER_DEFAULT $shell -c "$STARTUP_COMMAND"
//-------------------------------
 


This environment allows you to compile and run code. SL_mon.pl
and the SLURM commands will not work inside the debian8 container.
virgo.hpc.gsi.de and vae23.hpc.gsi.de allow the use of SL_mon.pl and SLURM.


3. our sendScript_SL.sh for batch submission works
  with the vae23.hpc.gsi.de login. NO ADDITIONAL wrap.sh NEEDED
  (see below)!
------------------------------------------------------------


submission of batch jobs for debian8 virgo3:

1. login to virgo.hpc.gsi.de
2. modify your sendScript_SL.sh to use wrap.sh
  to start the debian8 container on the farm.

wrap.sh:
//-------------------------------
#!/bin/bash
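# wrapper called by sbatch on the batch node: it forwards the job-script arguments
# and starts the actual job script inside the debian8 container with the Hades
# software trees and /lustre bound in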
jobscript=$1
jobarrayFile=$2
pathoutputlog=$3
arrayoffset=$4
singularity exec \
-B /cvmfs/hadessoft.gsi.de/install/debian8/install:/cvmfs/hades.gsi.de/install \
-B /cvmfs/hadessoft.gsi.de/param:/cvmfs/hades.gsi.de/param \
-B /cvmfs/hadessoft.gsi.de/install/debian8/oracle:/cvmfs/it.gsi.de/oracle \
-B /lustre \
/cvmfs/vae.gsi.de/debian8/containers/user_container-production.sif  ${jobscript} ${jobarrayFile} ${pathoutputlog} ${arrayoffset}
//-------------------------------


In sendScript_SL.sh, in the lower part where the sbatch command is built:
    
    #virgo bare bone submit using wrap.sh 
    wrap=./wrap.sh
    command="--array=1-${stop} ${resources} -D ${submissiondir}  --output=${pathoutputlog}/slurm-%A_%a.out -- ${wrap} ${jobscript} ${submissiondir}/${jobarrayFile} ${pathoutputlog} ${arrayoffset}"
 

SLURM tips:

The most relevant commands to work with SL:

sbatch   : submits a batch script to SLURM.
squeue   : used to view job and job step information for jobs managed by SLURM.
scancel  : used to signal or cancel jobs, job arrays or job steps.
sinfo    : used to view partition and node information for a system running SLURM.
sreport  : used to generate reports of job usage and cluster utilization for
           SLURM jobs saved to the SLURM database.
scontrol : used to view or modify Slurm configuration including: job,
           job step, node, partition, reservation, and overall system configuration.
 
Examples:
squeue  -u  user             : show all jobs of user
squeue -t  R                 : show jobs in a certain state (PENDING (PD),
                               RUNNING (R), SUSPENDED (S), COMPLETING (CG),
                               COMPLETED (CD), CONFIGURING (CF),
                               CANCELLED (CA), FAILED (F), TIMEOUT (TO),
                               PREEMPTED (PR), BOOT_FAIL (BF),
                               NODE_FAIL (NF) and SPECIAL_EXIT (SE))
scancel  -u user             : cancel all jobs of user user
scancel jobid                : cancel job with jobid
scancel -t PD -u <username>  : cancel all pending jobs of a user
scontrol show job -d <jobid> : show detailed info about a job
scontrol hold <jobid>        : hold a job (prevents a pending job from starting)
scontrol resume <jobid>      : resume a suspended job
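
For reference, the command string assembled by sendScript_SL.sh corresponds to an sbatch call like the following (array size, resources and paths are placeholders only; the " -- " separator before the job script is the virgo-specific part described above):

    sbatch --array=1-20 --mem=2G --time=8:00:00 \
           -D /lustre/hades/user/${USER}/sub \
           --output=/lustre/hades/user/${USER}/sub/log/slurm-%A_%a.out \
           -- ./jobScript_SL.sh /lustre/hades/user/${USER}/sub/filelist_array.txt /lustre/hades/user/${USER}/sub/log 0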