sawyl | Of prologs and epilogs

After months of talking about it, I've finally implemented a set of LoadLeveler prolog and epilog programs. In the process I've discovered:

any output generated by the prolog gets stamped on by the job itself
the epilog has to be run as the job user in order to update the output files
information about prolog or epilog problems is written to the StarterLog of the node where the master task executed
an easy solution to the epilog output problem is to write the program in Perl or C and dup stdout and stderr to $LOADL_STEP_OUT and $LOADL_STEP_ERR, because this also redirects the output of any child processes
not all LOADL variables are available on all nodes of a multi-node job (the important ones are usually only set on the master)
the programs are run by the job starter daemon and are unaffected by environment changes in the job script
the precise details of writing prologs and epilogs don't seem to be terribly well documented
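
For what it's worth, the dup trick can be sketched in C roughly like this. It assumes that $LOADL_STEP_OUT and $LOADL_STEP_ERR hold paths the epilog's user can write to; the function names are my own invention for illustration, not anything LoadLeveler provides.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Point target_fd at the file named by the environment variable env_name.
 * The file is opened in append mode so existing job output is not truncated.
 * Returns 0 on success, -1 if the variable is unset/empty or the open/dup fails.
 * (Hypothetical helper name; not part of any LoadLeveler API.)
 */
int redirect_fd_to_env_path(const char *env_name, int target_fd)
{
    const char *path = getenv(env_name);
    int fd;

    if (path == NULL || path[0] == '\0')
        return -1;

    fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    if (fd != target_fd) {
        if (dup2(fd, target_fd) < 0) {
            close(fd);
            return -1;
        }
        close(fd); /* target_fd now refers to the file; the spare fd can go */
    }
    return 0;
}

/*
 * Redirect stdout and stderr to the job step's output and error files.
 * Returns 0 only if both redirections succeeded.
 */
int redirect_epilog_output(void)
{
    int rc = 0;

    /* Flush anything buffered so it is not written to the new files later. */
    fflush(stdout);
    fflush(stderr);

    if (redirect_fd_to_env_path("LOADL_STEP_OUT", STDOUT_FILENO) != 0)
        rc = -1;
    if (redirect_fd_to_env_path("LOADL_STEP_ERR", STDERR_FILENO) != 0)
        rc = -1;
    return rc;
}
```

An epilog's main() would call redirect_epilog_output() before doing anything else. Anything it then prints, and anything printed by commands it starts with system() or fork/exec, ends up in the job's files, because children inherit the redirected descriptors. Once stdout points at a file it is fully buffered, so it is worth calling fflush(stdout) before starting a child to keep the output in order.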

And, most importantly, that it doesn't matter if the epilog programs don't work, but if the prolog doesn't work, the system will start to trash the jobs until the problems are fixed...