Description

RJM is a set of BASH scripts that allow multiple hosts sharing an NFS directory to process inter-dependent jobs while being robust to reboots and outages. When you set up a job, RJM daemons on various hosts will try to steal it and carry it out. If a host is interrupted temporarily, it will restart the jobs it was working on when it comes back up. If a host goes down indefinitely, other hosts will miss its heartbeat, and try to restart its interrupted jobs.

Any job can be set up to block until other jobs have completed. No dependent job will start until all of its dependencies have definitely completed successfully. Failures leave the job in a state suitable for diagnosis and later restart.

Jobs can have different priorities, and hosts can be limited to perform specific kinds of job and specific priorities. Setting up a job is idempotent, so it can be retried in the event of a failure.

RJM was created to perform back-end transcoding and EDL preparation for Storisphere.

Installation

svn export http://scc-forge.lancaster.ac.uk/svn-repos/misc/rjm/branches/public rjm
cd rjm
make
sudo make install

Files are installed in PREFIX, which is /usr/local by default.

Use

Configuration and preparation

RJM commands read BASH commands from ~/.rjm/conf.sh to get local configuration. A second file ~/.rjm/tasks.sh is also read just before each job is executed. Job state is stored in ~/.rjm/jobs/, which you can inspect directly, or with rjmls.

Each job has a type, which corresponds to a command task_type, which is easiest to set up as a BASH function in ~/.rjm/tasks.sh. Alternatively, ~/.rjm/tasks.sh could set up PATH to point to the commands that implement the various job types.

If ~/.rjm/conf.sh is not found, RJM will look for PREFIX/etc/rjm/conf.sh, /usr/local/etc/rjm/conf.sh, /usr/etc/rjm/conf.sh, and finally /etc/rjm/conf.sh. By default, tasks.sh will be read from the same location, but you can override this by setting ${rjm_taskconf} in conf.sh. The location of the job-state directory is always ~/.rjm/jobs/ (regardless of where conf.sh was found) unless overridden by setting rjm_jobdir. If you want to run jobs across multiple hosts, point ${rjm_jobdir} at an NFS-mounted directory accessible by all hosts.

A job's working directory is ${rjm_wd}, which is /tmp by default.

Setting up jobs

You must choose a type and a nonce for each job, which are used to form the job's id. The type and nonce must not contain dots (U+002E). For example, a job with id billing.simpsons-8761230871234 has the type billing and the nonce simpsons-8761230871234. The type means that RJM will execute the job by invoking task_billing.

Set up a job with rjmsetup, giving one argument which will be the job id. The command reads the job's configuration from standard input, in the form of BASH variable assignments. for example:

$ rjmsetup billing.simpsons-8761230871234
userid=simpsons
invoice=8761230871234
^D

These variables will be available to your task_type command when the job executes. You can use almost any variable names you want, but avoid those beginning with rjm_.

Normally, you wouldn't set up a job directly from the command line, but in a script instead. For example:

(
  printf 'userid="%s"\n' "$user"
  printf 'order="%s"\n' "$orderno"
) | rjmsetup "billing.$user-$orderno"

A job will not start unless it has been released. Release a job with rjmgo:

rjmgo billing.simpsons-8761230871234

Each job has a priority, which is just an alphanumeric string, with a being a higher priority than b, for example. The default is n. Specify a priority with rjmsetup -p prio.

Inter-dependent jobs

Jobs can depend on each other, i.e., one job uses the output of another, completed job, so it must wait until that job completes successfully before it starts. RJM can cope with any discrete acyclic directed graph, with jobs as vertices, and dependencies as edges. An edge runs from the parent job to the child job, indicating that the child depends on the parent completing first. A job may have any number of parents and children. The roots of the graph are jobs that have no parents. The leaves of the graph are jobs with no children.

Graphs must be built from the leaves up, with one rjmsetup command per job. You always set up the child before the parent, as the job's children must be listed in its configuration. For example, the following configures the job to be a parent of shipping.simpsons-8761230871234 and notify.simpsons-8761230871234:

userid=simpsons
invoice=8761230871234
rjm_blocks=()
rjm_blocks=("${rjm_blocks[@]}" shipping.simpsons-8761230871234)
rjm_blocks=("${rjm_blocks[@]}" notify.simpsons-8761230871234)

So, if this is the configuration for billing.simpsons-8761230871234, shipping.simpsons-8761230871234 and notify.simpsons-8761230871234 will not start until billing.simpsons-8761230871234 succeeds.

You cannot reliably make a new job the parent of an already released job, because it might have already started before you can add the dependency, so only specify unreleased jobs in the rjm_blocks array. Also, releasing a job automatically releases all its children recursively. This normally means that you have to build the entire graph before releasing the roots.

However, if you anticipate that another parent is required, but you don't have enough details to configure it yet, you could specify a dummy parent and keep it unreleased. When you know how to create the real parent, make it a parent of the dummy, and release it then (thus releasing the dummy).

It's also possible to create a whole new graph from within a job. This approach might be appropriate when, for example, the nonces of jobs in the new graph are only determined during the creating job.

Cleaning up

You can list files to be deleted with the variable rjm_delete. For example, you might want to delete the inputs or intermediate results of a job.

Listed files will only be deleted if the job completes successfully. If it fails, they remain to assist diagnosis and restart.

For example, suppose that your job operates on the file billing.simpsons-8761230871234.in. If it can be discarded on completion, configure the job with:

userid=simpsons
invoice=8761230871234
rjm_delete=(billing.simpsons-8761230871234.in)

Sharing results

The output of one job might be a file to be used by several child jobs. You can't afford to delete it until all children are finished with it, but the parent job won't be around to determine when that has happened. Nor are any of the children in a position to determine that their siblings have finished with it.

In this case, you could create hard links to the file, one per child, using a name specific to each child. Then list the original file in the rjm_delete array of the parent, and list each alias in the corresponding arrays of the children. The file will be properly deleted only when the parent and all children have actually finished with the file. You can do normal copies too, but hard links won't use up much additional space. You might also consider using cp --reflink=auto, which has similar space-saving abilities.

For example, suppose that your job generates billing.simpsons-8761230871234.out, and that is to be used by the shipping and notify jobs. Tell your job to hard-link a specified list of aliases to the file, and indicate that the original is to be deleted on completion:

userid=simpsons
invoice=8761230871234
copies=()
copies=("${copies[@]}" shipping.simpsons-8761230871234.in)
copies=("${copies[@]}" notify.simpsons-8761230871234.in)
rjm_blocks=()
rjm_blocks=("${rjm_blocks[@]}" shipping.simpsons-8761230871234)
rjm_blocks=("${rjm_blocks[@]}" notify.simpsons-8761230871234)
rjm_delete=(billing.simpsons-8761230871234.out)

Here, we assume that the job type recognizes copies as an array of filenames to hard-link to its output.

Operation

You keep everything running with a few cronjobs:

# One each of these
*/3 * * * * /usr/local/sbin/rjmd > /dev/null 2>&1
0   0 * * * /usr/local/bin/rjmflush > /dev/null 2>&1

# One of these for each active task
*/4 * * * * /usr/local/sbin/rjmproc -i id1 > /dev/null 2>&1
*/4 * * * * /usr/local/sbin/rjmproc -i id2 > /dev/null 2>&1
*/4 * * * * /usr/local/sbin/rjmproc -i id3 > /dev/null 2>&1

Each rjmproc processes one job at a time, and each needs an identifier unique within the host it's running on. You can restrict it to certain job types with -t regexp, and to certain priorities with -p regexp, where regexp is a regular expression compatible with egrep.

rjmd maintains a heartbeat, and prevents rjmprocs from starting until it's ready. It detects when a reboot has just happened, and resets any incomplete jobs running on this host. rjmflush just clears out the records for jobs that completed several days ago.

rjmls can be used to monitor the state of jobs. By default, it lists all jobs in states wait, ready, run or failed. To include other states, use -s state. To restrict to a job type, use -t type; use it multiple times to list multiple types.

Error handling

If a job fails, its files will hang around so you can find out what's wrong. This command will print the exit code of a job:

rjmexit billing.simpsons-8761230871234

You can also use -q to stop it from printing the code, and instead exit with it. -w causes it to wait until the job has stopped, checking every couple of seconds.

The following command prints the standard output of a job:

rjmout billing.simpsons-8761230871234

Use -e to get the error output instead. Use -t to follow the output as it's being generated, with tail -f.

How it works

Job state

Job state is kept in $rjm_jobdir/. A job's configuration is found in job.type.nonce. This file remains unchanged during the lifetime of the job.

A newly set-up job is represented by an empty file wait.type.nonce. rjmgo renames this to ready.type.nonce. To claim/steal a job, rjmproc renames ready… to run.type.nonce.host, where host is an identifier derived from the hostname by default. If the job completes successfully, run… is renamed to done.type.nonce. Alternatively, on a non-zero exit status of rc, run… is renamed to failed.type.nonce.rc.

To clean up, either rjmproc or rjmd will rename done… to dusting…. Then the files indicated by rjm_delete will be removed, before dusting… is renamed to old….

A job's priority prio is stored in the name of an empty file, prio.type.nonce.prio.

Reboot detection

Each host must run one instance of rjmd, usually with crontab. Other invocations of rjmd on the same host will detect that it is already running, and quit.

A successful invocation assumes that the system has recently rebooted. It will first wait until the job state directory $rjm_jobdir exists, as this acts as the main communication channel between jobs on different hosts.

The next step is to reset the removal of files used by completed jobs. All files called dusting… are renamed to done…. These should be clean-ups that were interrupted by the reboot, and renaming the files will trigger the clean-ups to be done again. They are idempotent, so it shouldn't matter if some were already partially or fully complete.

The next step is to check if any jobs appear to be running on this host. There should be none at this stage, as rjmproc processes should be blocked so far. This means that any files of the form run.type.nonce.host must be jobs that were running on this host, but were interrupted before completion by the reboot. These are renamed back to ready.type.nonce, which means that any rjmproc, even on another host, could start working on them.

Now rjmproc processes can safely begin claiming jobs. rjmd signals this readiness by touching $rjm_tmpdir/rjm.flag.$rjm_flag, which is normally a local file, i.e., not visible to other hosts. ($rjm_flag is default by default.) rjmproc processes do not start any work unless this file is present.

Finally, rjmd periodically cleans up done… jobs, touches $rjm_jobdir/beat.$rjm_hostid (which is visible to other hosts), and looks for the same files on other hosts that haven't been touched for a long time. These are assumed to be dead hosts that might have claimed some jobs, and then been switched off. Files matching run.type.nonce.host indicate those incomplete jobs, which get reset back to ready….

Dependencies

For every dependency, a file $rjm_jobdir/block.type1.id1.type2.id2 is created, where 1 refers to the child job, and 2 refers to the parent. When the parent completes successfully, all files matching block.*.*.type2.id2 are deleted. Before moving a job from ready to run, no files matching block.type1.id1.*.* may exist, as their presence indicates that some parent is still running.

Optimizations

To relieve the burden of searching for ready… files that build up as rjmprocs become saturated, a ready… is only checked if a corresponding try… file can be removed without error. This file is created whenever a job might have just become ready to start, i.e., it has just been moved (back) to ready…, or one of its parent jobs has just completed. Being able to remove it without error indicates that it hasn't been checked since some part of the conditions that prevent it from running were last changed.

Software updates

Once they have actually got running, rjmd and rjmproc will keep testing to see if they have been replaced. They do this by comparing the timestamps of their own executable scripts with $rjm_tmpdir/rjm.flag.$rjm_flag. If that file is older, the scripts must have been replaced, and so they exec themselves with almost identical arguments.