SJM - A Simple Job Manager

Overview

1. New
2. Introduction
3. Download
4. Quick Start
5. Example
6. Reference
7. Miscellaneous

New

SJM version V01.11 contains some bug fixes and enhancements:

SJM version V1.09 corrects a bug when printing out a warning message for jobs with log files that have not been updated for more than 1 hour (3600 sec.). Thanks to Olga Igonkina for reporting this bug.

SJM version V1.08 corrects a problem where SJMPrepareJobs crashed when the configuration file contained 'empty' lines with spaces or options that could not be identified. Otherwise V1.08 is identical to V1.07.

SJM version V1.07 includes means for automatic job monitoring and submission and the possibility to run jobs in gdb. See below for more information on these.

In addition, V1.07 has improved job validation; it now uses the same procedure as FjrCheckJob.

V1.07 is fully backward compatible with V1.0, i.e. you only need to unpack the contents of the tar file into your work directory.
SJM V1.07 is compatible with Python 2.2.3 and higher.

Some parts of this document still need to be updated. That doesn't mean that they are wrong, but rather that some new files or features might not be included in all parts of this document.

Introduction

The 'Simple Job Manager' (SJM) is a simplistic framework to manage jobs running in the BABAR framework. It is written in OO-Python, which should make it readable for you hackers out there, and it is fairly lightweight with only about 1000 lines of code (compared to the Task Manager which has more than 30,000 lines). So what does it do?

Download

SJM can be downloaded here as a tar file. 'cd' into your workdir and gunzip/tar/gtar whatever you feel like.

Quick Start

This assumes that you have a working framework application and that you run your application from the workdir in the test release (no fancy stuff here). The steps required to configure SJM are then to provide a short configuration file and a tcl snippet template file.

  1. Download the SJM tar file, e.g. SJM-V01-11.tar, and untar it. The tar file contains the following files:
    roethel@noric04> ls
    SJMConfigFile.txt  SJMTestSnippet.tcl  sjm
    Then copy sjm into a location where it can be found, e.g. ~/bin.
  2. Edit the example configuration file SJMConfigFile.txt to fit your analysis.
  3. Edit the example tcl snippet template SJMTestSnippet.tcl to fit your analysis (and preferably rename it).
  4. Set up the SJM directory tree and create jobs by running sjm prepare SJMConfigFile.txt in your workdir. This should create a subdirectory with the name of the SJM Project you defined in step 2.
  5. If everything worked, you should see a list of tcl files created by BbkDatasetTcl in the subdirectory <SJM Project Name>/tcl and a list of tcl snippets in <SJM Project Name>/prepared. You can check the existence of the jobs by running sjm show <SJM Project Name>
  6. Submit jobs with sjm submit --njobs 2 <SJM Project Name> (don't forget srtpath and condxxboot).
  7. When the jobs have completed (you can check that with sjm show again or by running bjobs) you can check them with sjm check <SJM Project Name>

A Simple Example

This example still uses the old command structure when SJM consisted of a set of separate executables and the main library SJMBase.py. The example is still valid, you just need to replace the old commands with the new ones, e.g. use sjm prepare instead of SJMPrepareJobs.

In the following I will show a simple test case for using SJM. You can't run this example line-by-line (since you can't write to my scratch space - I hope), but it is fairly straightforward to adapt this example to your own analysis.

I prepared a test release for a simple two photon analysis in ~roethel/analysis/analysis-21. My executable is called BetaMiniApp and the main tcl file that drives the analysis is in BetaMiniUser/GamGamTo4pi.tcl. I'm not skimming, but just writing out ntuples, which should all be written to /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/ntuple/ntuple_<ID>.root in scratch space, where <ID> should be the job-id (or job number) of the current job. The run directories (the directories the log files and jobreport files are written to) should be /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/<ID>. The input dataset for this analysis should be
users-phnic-TwoPhotonPentaquarkSkim-BlackDiamond-Run1, and I only want to run (or better 'can run') 250,000 events per job.

Edit the Configuration File:

With that I can edit the configuration file... ok - the example configuration file is already set up for this example, what a coincidence. However, I want to change the name of this SJM Project to SJMPentaquarkRun1, so I edit the line

# define the name of the SJM
SJMName = SJMPentaquarkRun1
The whole configuration file looks like this now. So on with the next step.

Edit the Tcl Snippet Template:

First, what is a tcl snippet template? The main tcl file - in my example GamGamTo4pi.tcl - provides the general configuration that should be used by all analysis jobs I want to run in this context. However, I do have parameters, like the names of my ntuples or the names of my input collections, that are different for every job and that I need to pass on to my general tcl file. It used to be common to define these parameters via environment variables in the current unix shell, which is not a very good idea (and I don't want to go into that here). The better solution is to provide a short, job-specific tcl file which only defines parameters (as tcl variables) particular to the current job and then itself sources the main tcl file, which properly sets up the framework using these parameters.

In the context of a configurable job manager there is a little complication though, since it is not possible to anticipate what anyone would want to define in a tcl snippet. To get around this problem SJM (like the Task Manager) uses a tcl template and a set of 'tags' that act as placeholders for job specific information (for a list of tags see below). A user now can define any parameter in the tcl snippet template using these tags. When the jobs are created the tags get resolved and the actual tcl snippets are written out. If this is not totally clear yet, just follow the example and you will see how the job specific tcl snippets are created from the tcl snippet template.
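Mechanically, resolving the tags amounts to plain text substitution: each `<TAG>` placeholder in the template is replaced by the job-specific value before the snippet is written out. A minimal sketch in Python (the helper name resolve_tags is hypothetical; SJM's own implementation may differ):

```python
def resolve_tags(template, tags):
    """Replace every <TAG> placeholder in the template text with its
    job-specific value and return the resolved snippet."""
    snippet = template
    for tag, value in tags.items():
        snippet = snippet.replace("<" + tag + ">", value)
    return snippet
```

For example, resolving a template containing `<INPUTTCL>` with `{"INPUTTCL": "SJMPentaquarkRun1-1.tcl"}` yields a snippet referencing that input tcl file.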

The tcl snippet template in this example, as defined in the configuration file, is called SJMTestSnippet.tcl. I don't need to change anything in it, as it is already configured for this example. I would like to point out the definition of rootName (the FwkCfgVar that defines the ntuple to be written out) and the last line, which sources the main tcl file. For more details on tcl snippets see below. Also note that GamGamTo4pi.tcl uses these FwkCfgVars to set up the framework job, in particular the jobreport file and the ntuple name (there is really not much use in defining variables if they are not used later on).

See here for the tcl snippet template file and here for the GamGamTo4pi.tcl file.

Create the Jobs:

roethel@noric04> SJMPrepareJobs SJMConfigFile.txt
Running BbkDatasetTcl --tcl 250000 --basename SJMPentaquarkRun1 \
    --splitruns users-phnic-TwoPhotonPentaquarkSkim-BlackDiamond-Run1 ...
BbkDatasetTcl: wrote SJMPentaquarkRun1-1.tcl (250000 events)
...
BbkDatasetTcl: wrote SJMPentaquarkRun1-166.tcl (151236 events)
Selected 11 collections, 41401236/0 events, ~0.0/pb
done. Creating tcl snippets in directory 'prepared' now...
done!

Running this command created a subdirectory called SJMPentaquarkRun1 in my workdir. Looking at this directory you can find six subdirectories, one for each job state (prepared, submitted, done, ok, failed) and one storing the tcl files defining the input collections that were created by BbkDatasetTcl. Listing these directories you find

roethel@noric04> ls SJMPentaquarkRun1/tcl
SJMPentaquarkRun1-1.tcl    SJMPentaquarkRun1-15.tcl   SJMPentaquarkRun1-50.tcl
SJMPentaquarkRun1-10.tcl   SJMPentaquarkRun1-150.tcl  SJMPentaquarkRun1-51.tcl
SJMPentaquarkRun1-100.tcl  SJMPentaquarkRun1-151.tcl  SJMPentaquarkRun1-52.tcl
...
roethel@noric04> ls SJMPentaquarkRun1/prepared
SJMPentaquarkRun1-0001.tcl  SJMPentaquarkRun1-0084.tcl
SJMPentaquarkRun1-0002.tcl  SJMPentaquarkRun1-0085.tcl
...

If you remember the discussion on tcl snippet templates, you may want to compare the resolved tcl snippet for e.g. the first job SJMPentaquarkRun1-0001.tcl with the tcl template file. Before continuing you may want to check if your snippets in the 'prepared' directory look ok. If not, fix the snippet template and/or the configuration file and try again (delete the subdirectory tree to remove the existing configuration - see below).

You can look at the job statistics with

roethel@noric04> SJMShowJobs SJMPentaquarkRun1
       name            prepared submitted    done     failed      ok
---------------------------------------------------------------------------
 SJMPentaquarkRun1       166         0         0         0         0

If you messed up when creating jobs, you could simply remove the subdirectory tree SJMPentaquarkRun1 (e.g. with rm -rf SJMPentaquarkRun1) and start over again. Now let's run some jobs...

Submitting Jobs

Let's test our configuration by submitting 2 jobs:

roethel@noric04> SJMSubmitJobs --njobs 2 SJMPentaquarkRun1
submitting jobs
Submitting job 1
Job <150864> is submitted to queue <kanga>.
bsub -q kanga -C 0 \
  -o /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1/SJMPentaquarkRun1.log \
  /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1/wrapper-1.sh
Submitting job 2
Job <150865> is submitted to queue <kanga>.
bsub -q kanga -C 0 \
  -o /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2/SJMPentaquarkRun1.log \
  /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2/wrapper-2.sh
Submitted 2 job(s).

While preparing this example something interesting happened - the jobs crashed with (taken from the log file):

  ...
  BetaMiniApp: error while loading shared libraries: libCore_pkgid_3.10-01.so:
  cannot open shared object file: No such file or directory
  ...

and SJMShowJobs reported

roethel@noric04> SJMShowJobs SJMPentaquarkRun1
Job 1: Log file assumed done, job report file not found!
Job 2: Log file assumed done, job report file not found!
       name            prepared submitted    done     failed      ok
---------------------------------------------------------------------------
 SJMPentaquarkRun1       164         0         2         0         0

First, the error message indicates that the log file satisfies the 'done' conditions but the jobreport file for the job was not found (which is never a good sign). I fix this problem by submitting from a RH7.2 noric and want to resubmit the jobs, i.e. I need to move the jobs from the 'done' state back to the 'prepared' state. To do that I simply move the tcl snippet files for these jobs from the SJMPentaquarkRun1/done directory to the prepared directory:

roethel@noric04> ls SJMPentaquarkRun1/done
SJMPentaquarkRun1-0001.tcl  SJMPentaquarkRun1-0002.tcl
roethel@noric04> mv SJMPentaquarkRun1/done/*.tcl SJMPentaquarkRun1/prepared/.
roethel@noric04> SJMShowJobs SJMPentaquarkRun1
       name            prepared submitted    done     failed      ok
---------------------------------------------------------------------------
 SJMPentaquarkRun1       166         0         0         0         0

We're ready to resubmit the jobs now:

roethel@noric14> SJMSubmitJobs --njobs 2 SJMPentaquarkRun1
submitting jobs
Submitting job 1
Run directory /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1 exists
already. Cleaning up
Job <152456> is submitted to queue <kanga>.
bsub -q kanga -C 0 \
  -o /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1/SJMPentaquarkRun1.log \
  /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1/wrapper-1.sh
Submitting job 2
Run directory /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2 exists
already. Cleaning up
Job <152458> is submitted to queue <kanga>.
bsub -q kanga -C 0 \
  -o /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2/SJMPentaquarkRun1.log \
  /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2/wrapper-2.sh
Submitted 2 job(s).

The warnings indicate that the run directories for the two jobs in question already exist, since I submitted these jobs before. The directories will be cleaned up so the output does not conflict. We can check if the jobs are really running:

roethel@noric04> SJMShowJobs SJMPentaquarkRun1
       name            prepared submitted    done     failed      ok
---------------------------------------------------------------------------
 SJMPentaquarkRun1       164         2         0         0         0

And waiting a little more...

roethel@noric04> SJMShowJobs SJMPentaquarkRun1
       name            prepared submitted    done     failed      ok
---------------------------------------------------------------------------
 SJMPentaquarkRun1       164         0         2         0         0

We can now check the success of these jobs:

roethel@noric04> SJMCheckJobs SJMPentaquarkRun1
Checking jobs
...updating job status
...checking
job 1 ok.
job 2 ok.
Checked 2 jobs. Ok: 2   failed: 0

All fine for me - I hope for you as well... have fun!

NEW: In addition to the way of running jobs just described, SJM V1.06 and later supports running jobs in gdb (this only works if gdb is installed on the batch machines). To use this option run
> SJMSubmitJobs -g <SJMName>

Using the Job Monitor

From V1.02 on, SJM comes with the script/daemon sjm sprited (was SJMSprited) to automatically take on the management of jobs. This includes keeping a constant number of jobs in the queue, checking jobs that are done and, if requested, sending an email in case of problems. As mentioned, the script is designed to run as a daemon, i.e. it will continue to run even after you log off, but it has mainly been tested running in a terminal window. The configuration file takes the following parameters to configure sjm sprited:

Running automated job monitoring in the background adds some non-trivial complication to the simple job manager. The main (or rather the only) issue is the possibility that two commands attempt to do the same thing at the same time, e.g. sjm sprited is running sjm check in the background while you run the same command from the command line. That can lead to race conditions with unpredictable results, though the damage is pretty limited: after all, the bookkeeping is done by moving files within a unix file system, which is very safe and takes care of most of the possible race conditions (these boil down to two processes trying to do things with the same file). However, you may see unusual error messages from SJM because an expected file all of a sudden does not exist.

To avoid this, a sophisticated lock mechanism was introduced, which prevents two critical processes from running at the same time. Ok, ok - well, the sophisticated lock mechanism is simply a file called 'lock.pid', which resides in the SJM project directory and contains the process id of the process that owns the lock. The lock should only be set when a process is attempting to move files and update the job status, e.g. when running sjm submit, sjm show (with job status updates) and sjm check.

Sometimes it can happen that a process did not remove the lock, either because it was killed before it finished (better not do that), or because the sjm sprited daemon process died, or... If a lock persists for a long time you should probably check the process id in the lock file and see if that process is still alive (you can do that by logging on to the machine the process is running on and using > ps -p <process id>). If it is not, it is safe to remove the lock file (i.e. > rm lock.pid) and proceed. It is also advisable to check for a lock file before moving files from the prepared, submitted or done states (directories) to other states.
However, in practice one will move files from failed (or possibly ok) back to prepared, which is always safe.
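The staleness check described above (read lock.pid, probe the process) can be sketched as follows. This is an illustration, not SJM's actual code; it assumes a POSIX system, where os.kill(pid, 0) probes for a live process without sending a real signal:

```python
import errno
import os

def lock_is_stale(lockfile):
    """Return True if the lock file names a process that no longer exists,
    i.e. the lock can safely be removed."""
    try:
        with open(lockfile) as f:
            pid = int(f.read().strip())
    except (IOError, OSError, ValueError):
        return False  # no (readable) lock file: nothing to clean up
    try:
        os.kill(pid, 0)  # signal 0 probes the process without touching it
    except OSError as err:
        return err.errno == errno.ESRCH  # ESRCH: no such process
    return False  # process is alive; the lock is still owned
```

Note that only ESRCH ("no such process") is treated as stale; a permission error (EPERM) means the process exists but belongs to someone else, so the lock is left alone.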

sjm sprited maintains a log file which contains, besides log information, the output of the various job submit and check operations. The output is not flushed, however, so the order can be somewhat confusing. To start the daemon you just need to run
> sjm sprite --start <SJMName>
For further options see > sjm sprite -h. To run sjm sprited in a terminal window instead (in which case the output is flushed and easier to follow) run
> sjm sprited <SJMName>

Finally - when all jobs have been submitted and checked sjm sprited will terminate by itself and optionally send an email notification.


Reference

SJM is the Task Manager with every feature removed that is not absolutely essential. The result was small enough to be written in two days and still do the work. The main idea behind SJM is that the tcl snippet for each job contains enough information to run a job and do some essential bookkeeping on it. The bookkeeping itself is managed over the particular directory structure in SJM.

SJM File- and Directory Structure and Bookkeeping

As mentioned, the bookkeeping is managed via the directory structure and file names in SJM. The only parameter required to identify a job and resolve all its associated files and directories (for a given SJM project name) is the job id, which is encoded in the tcl snippet and input tcl file name convention

  <SJM project name>-<job id>.tcl

(It is not a good idea to choose an SJM name that itself uses a '-<some number>' pattern, since this may conflict with the job id extraction.) The other files SJM relies on (the log file, the job report file and the wrapper script) are all located in the run directory, which is made up of the job id and (possibly) the SJM project name and is defined by the user in the configuration file.
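As an illustration, extracting the job id from a snippet file name following this convention could look like the sketch below (the helper name is hypothetical; SJM's actual extraction code may differ in detail):

```python
import re

def job_id(snippet_name):
    """Extract the numeric job id from a '<SJM project name>-<job id>.tcl'
    file name: the trailing '-<digits>' before the .tcl suffix."""
    m = re.match(r"^(?P<project>.+)-(?P<id>\d+)\.tcl$", snippet_name)
    if m is None:
        raise ValueError("cannot extract a job id from %r" % snippet_name)
    return int(m.group("id"))
```

Leading zeros in the snippet name (e.g. 0001) do not matter, since the id is converted to an integer.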

The current job state is defined by the directory the snippet file is located in. At the beginning all snippet files are in the prepared directory. The SJMShowJobs command just counts the number of tcl files in each of these directories and displays the count. There is no other hidden behind-the-scenes bookkeeping. So, just for fun, you could move a snippet file from the prepared directory to any other job state directory and see how the output of SJMShowJobs changes (don't forget to move the file back again... and please use mv, don't cp the files!). As mentioned above, if you mess up (or don't like your setup) just delete the SJM directory structure and start over.
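Since the bookkeeping is nothing but files in state directories, the count behind SJMShowJobs can be pictured as the following sketch (illustrative, not SJM's actual code):

```python
import glob
import os

# the five job states, each a subdirectory of the SJM project directory
STATES = ("prepared", "submitted", "done", "failed", "ok")

def count_jobs(project_dir):
    """Count the tcl snippet files in each job-state subdirectory."""
    counts = {}
    for state in STATES:
        counts[state] = len(glob.glob(os.path.join(project_dir, state, "*.tcl")))
    return counts
```

Moving a snippet file from one state directory to another changes the counts and nothing else, which is exactly the experiment suggested above.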

The Run Directory

Every job managed by SJM has its own run directory. This may seem a bit inconvenient, but it simplifies the management of jobs and makes it more flexible. Instead of having to keep track of different files individually, the only variable is the run directory itself, and all other files (currently the log file, the job report file and the wrapper script used to submit a job) can be identified from it.

The uniqueness of the run directory also requires one more thing: the user has to make sure to define a unique run directory when configuring an SJM project. The simplest way to do that is to make sure the <ID> tag is part of the run directory (job creation should fail if this is not satisfied!).
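That check boils down to a one-line test, sketched below (hypothetical helper; SJM's actual validation may differ):

```python
def validate_run_dir(run_dir_template):
    """Reject a run-directory template without the <ID> tag, since all jobs
    would then share one directory and overwrite each other's files."""
    if "<ID>" not in run_dir_template:
        raise ValueError("the run directory must contain the <ID> tag "
                         "so that every job gets a unique directory")
    return run_dir_template
```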

The Configuration File

The parameters defined in the configuration file are:

The syntax used in the configuration file is of the form parameter = text, where parameter is not allowed to contain space characters (leading and trailing spaces are removed). The text is taken verbatim, i.e. all spaces besides leading and trailing spaces are preserved, so no quotes are necessary(!). Comments can be added by using the hash character '#' as the first character(!) of a line.
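A parser for this format fits in a few lines. The sketch below (function name hypothetical) mirrors the rules just stated, including the V1.08 behaviour of skipping empty lines and unidentifiable options instead of crashing:

```python
def parse_config(lines):
    """Parse 'parameter = text' configuration lines.
    '#' as the first character marks a comment; values are kept verbatim
    apart from leading/trailing whitespace, so no quoting is needed."""
    config = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # comment or 'empty' line (possibly just spaces)
        if "=" not in line:
            continue  # unidentifiable option: skip rather than crash
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config
```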

The Tcl Snippet Template File

The tcl snippet file and the use of tags were introduced in the example. Though the basic contents of the tcl snippet template are up to the user, SJM requires the following lines:

  sourceFoundFile <INPUTTCL>
  set jobReportName <JOBREPORT>

SJM could automatically add these lines to the snippet file, but adding things behind the scenes that may interfere with other user-defined settings may be more confusing than requiring certain values to be set. You also need to make sure your main tcl file contains the appropriate line to write out the jobreport file (see below).

Valid tags that can be used in the tcl snippet template are:

How to use FwkCfgVars is explained elsewhere (I don't know where) but essentially do the equivalent of the following in your main tcl file:

   FwkCfgVar jobReportName
   FwkCfgVar rootName
   ...
   jobReport filename $jobReportName

The Commands

With the two new additions there are now six commands altogether. These are simple commands and don't take a lot of command line options, but they all, with the exception of SJMSprited, do have a basic -h, --help option to remind you of all the options that they (don't) have.

sjm prepare (was SJMPrepareJobs)

sjm prepare actually does three different things:

First it reads in the configuration file, verifies some entries and creates the SJM directory structure in the current workdir. It also copies the configuration file to the subdirectory where it serves as the main configuration file for all the other SJM commands. For reference a copy of the tcl snippet is also stored in the SJM directory.

The second thing sjm prepare does is to run BbkDatasetTcl to create a list of input tcl files in the tcl subdirectory.

Finally sjm prepare reads in the list of input tcl files and creates the tcl snippets for every input tcl file in the prepared subdirectory.

sjm show (was SJMShowJobs)

sjm show basically just counts tcl snippet files in the different job status subdirectories. Before doing so, however, it checks whether jobs listed in the submitted directory have finished. Whether a job is still running or is assumed to be finished is determined by checking the last 200 lines of the log file for a given string that signals that the job has completed. The default is to look for 'Resource usage summary', but this can be overridden by defining JobFinishedString in the configuration file (e.g. for running at sites using PBS). In addition, the existence of the job report file and the stop time written in the job report file are also checked, and a warning is printed if the log file indicates a finished job but the job report file does not mirror this.
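The log-file test can be pictured as the following sketch (illustrative; the default marker string is the one stated above):

```python
def log_indicates_done(log_path, finished_string="Resource usage summary",
                       tail_lines=200):
    """Scan the last tail_lines of the log file for the marker string
    that signals the job has completed."""
    with open(log_path) as f:
        lines = f.readlines()
    return any(finished_string in line for line in lines[-tail_lines:])
```

Overriding JobFinishedString in the configuration file corresponds to passing a different finished_string here.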

sjm submit (was SJMSubmitJobs)

sjm submit submits jobs. It first creates the run directory (and cleans up old run directories if these happen to exist already) and then creates a small shell wrapper script in that directory. The wrapper script, which is necessary to provide compatibility with other job schedulers like PBS, is then submitted to the batch queue using the command defined in the configuration file.
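The role of the wrapper can be pictured as follows. The script contents here are purely illustrative (SJM's real wrapper also has to set up the job environment); the point to note is the 'exec', which replaces the shell with the framework executable so the batch system records the executable's own exit code:

```python
import os
import stat

def write_wrapper(run_dir, executable, snippet, job_id):
    """Write a minimal shell wrapper script into the run directory and
    make it executable. 'exec' hands the process over to the framework
    executable, so its exit code is what the batch log reports."""
    path = os.path.join(run_dir, "wrapper-%d.sh" % job_id)
    with open(path, "w") as f:
        f.write("#!/bin/sh\n")
        f.write("cd %s || exit 1\n" % run_dir)
        f.write("exec %s %s\n" % (executable, snippet))
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)  # user-executable
    return path
```

The wrapper script name matches the wrapper-1.sh, wrapper-2.sh files visible in the bsub output of the example above.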

sjm check (was SJMCheckJobs)

This command, finally, checks jobs that are found to be done. Similar to identifying completed jobs, a job is checked ok if the exit code of the job running in the queue was found to be 0 (note that the shell wrapper 'exec's the framework executable for this purpose instead of running it as a sub-process!). In LSF this can be done by parsing the first and last 200 lines of the log file for the string 'Successfully completed', which is the default in SJM. This default can be overridden by defining JobSuccessfulString in the configuration file. In addition, the job report file must exist and must contain the stop time.

sjm sprite (was SJMSprite)

sjm sprite is just used to start and stop the job monitoring script sjm sprited when that is run as a daemon. To keep the daemon running in an afs environment, e.g. at SLAC, you need to run klog -setpag to ensure you have a valid token after closing the terminal window.

sjm sprited (was SJMSprited)

sjm sprited is the job monitoring daemon - when run as a daemon. But it can be run just as well in a terminal window. There are no command line options for this command, except for the SJMName itself. For the most part sjm sprited sleeps (the default sleep time is 20 minutes). When it wakes up it first updates the job status, then determines how many jobs are currently in the queue (it uses SJM's own bookkeeping for this rather than a specific batch interface, and is therefore not sensitive to occasional outages of the batch system) and submits the necessary number of jobs to keep the requested number of jobs in the queue. Finally it checks done jobs. When nothing is left to be done, sjm sprited exits.

Running in PBS (or other Job Schedulers)

Since SJM uses a wrapper script to submit jobs, running on PBS (or yet another job scheduler) is not a big problem. However, you have to provide the necessary strings in the configuration file (see 1. in the list above) that identify in the log file whether a job has finished and has run successfully.

SJM tries to identify idle jobs by checking the time of the last update of the log file. If the last update occurred more than an hour before the check, a warning message is printed, but no further action is taken. It is up to the user to check the status of the job and (possibly) fix the problem. PBS typically writes log files to a private area and only renames the log file to the defined log file name when the job has finished; therefore this additional check is not possible there (redirecting the output via the shell wrapper is not a good alternative, since this would not capture the report from the job scheduler containing the job exit code).

A short description of how to run at RAL will follow shortly. The only things that need to be configured there are the batch command and the string in the log file that identifies that a job is done. (Actually, at RAL this string can be anything, because the log file only exists in the globally readable area when the job is done. Job validation is done exclusively using the job report file, not the log file.)

Miscellaneous

What is SJM

SJM was born out of the need to run some analysis while I'm still working on the Task Manager. It is not meant to replace the Task Manager though - if you are looking for a full production-type analysis framework which allows e.g. merges, imports of collections into the bookkeeping database/hpss, full bookkeeping of runs etc., SJM is not the tool to use. However, if you just want to run some jobs on a given input dataset and don't really care about all the additional features, then the heavy-weight Task Manager will not be the ideal tool and you are better off with a simple tool such as SJM... I would guess that the majority of analysis jobs fall into the latter category...

Distribution

SJM is distributed as a tar file and not as a package in cvs. The reason is that a cvs package needs a maintainer, and I don't have the time to maintain SJM. If someone volunteers to take over this job, I can unwrap the SJMBase.py file and put SJM in a package.

Bug Reports/Disclaimer

SJM is provided as is. It probably has bugs. You can send me a mail with bug reports and I will try to fix them whenever I have time.