Pet Peeves: run-parts(8), asynchronous execution, and logging


I run Slackware on two of my systems. I have always admired Slackware's simplicity and honesty. Some people accuse it of being overly complex, but I've found that what others see as complexity, I see as the ability to control and fine-tune. Other distributions may be great for those who are looking for a Unix system with a Microsoft feel, but I prefer my OS not to hide its controls, not to tell me that, for my own good, it won't do what I want it to do.

Slackware's run-parts(8) is one of those simple tools that works well. Built as a simple /bin/sh script, using basic Linux tools, Slackware's run-parts does the basic work of "batch" processing for cron. But this simplicity has its price: target scripts are run synchronously, one after the other. If one gets stuck, the remainder can't run. And there's no easy way to determine whether a script has gotten "stuck"; there's no conveniently accessible logging of the start or termination of each script, nor is there any consistent way to gather messages from the scripts into one place.

And, so, my pet peeve: run-parts(8) doesn't do everything that I need in a batch processor. So, I went and built my own add-on solution.

How it started

Originally, I had a number of scripts that performed backups of specific directories. Each script would ensure that a specific USB-attached hard drive was mounted, create its backup on that hard drive, and terminate. There were scripts for daily backups, scripts for weekly backups, and scripts for monthly backups. They seemed an ideal fit for the /etc/cron.daily, /etc/cron.weekly and /etc/cron.monthly directories that Slackware's root crontab processes with run-parts(8).
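At their core, the scripts were nothing more than this (a sketch; the paths are illustrative, not from my actual scripts):

  #!/bin/sh
  # One of the original daily backup jobs, in essence: a couple of
  # tar(1) commands writing to the (already mounted) USB backup drive.
  tar -czf /var/backup/daily/etc.tar.gz    /etc
  tar -czf /var/backup/daily/htdocs.tar.gz /var/www/htdocs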

But, there were some hiccups.

  • Because of the synchronous design of run-parts(8), each script had to complete before the next one would start, and some of these scripts took a long time to complete.
  • run-parts(8) had no convenient way to indicate which script it was currently waiting on; you either receive an email containing all the script startup notifications once the last script has completed, or you receive no email at all. When things ran long, there was no easy way to determine what had been done and what was still waiting to be done.
  • I wanted the USB hard drive to remain unmounted until needed for the backups. Because the daily, weekly, and monthly scripts started at different times, the USB hard drive had to be mounted independently of, and before, the first script.
  • I wanted the USB hard drive to be unmounted once all the backups were complete. Because the daily, weekly, and monthly scripts ended at different times, the USB hard drive had to be unmounted independently of, and after, the last script.

To make the scripts asynchronous, I added wrapper logic to each script, so that the working part of the script ran in the background, while the wrapper terminated and let run-parts(8) proceed to the next script. And so, the scripts got complex.
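The wrapper pattern looked roughly like this (a sketch with hypothetical names and paths, not my exact code):

  #!/bin/sh
  # Push the real work into a backgrounded subshell so this wrapper can
  # exit immediately and run-parts(8) can move on to the next script.
  do_the_work() {
      tar -czf /var/backup/daily/etc.tar.gz /etc   # the original "real" script body
  }
  # Detach the work and redirect its output so nothing holds cron's
  # output pipe open after the wrapper exits.
  ( do_the_work >/tmp/backup.etc.log 2>&1 ) &
  exit 0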

Additionally, I put lots of logging logic (using logger(1) to log to syslog) into the scripts. This helped me track the execution of each script, with the assistance of grep and awk. But this logging was inconsistent in format, did nothing to solve the synchronous-execution and resource management problems, and the scripts got even more complex.
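A sketch of that logging block (the tag, facility, and paths here are illustrative; the originals weren't this tidy):

  #!/bin/sh
  TAG="daily.backup.etc"
  logger -t "$TAG" -p local1.info "JOB Begins"
  if tar -czf /var/backup/daily/etc.tar.gz /etc 2>/tmp/backup.etc.err
  then
      logger -t "$TAG" -p local1.info "JOB Status 0"
  else
      RC=$?                                                  # tar's exit status
      logger -t "$TAG" -p local1.err -f /tmp/backup.etc.err  # forward tar's complaints
      logger -t "$TAG" -p local1.err "JOB Status $RC"
  fi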

Finally, to solve the resource dependency issue (the mounting and unmounting of the USB drive), I added even more common logic to each script: this time to check whether the drive was mounted and mount it if not, and (in the last script to finish) to determine that the drive was no longer needed and to unmount and fsck it. And the scripts got even more complex.
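In outline, the common resource logic bolted onto the top and bottom of every script looked something like this (a sketch; the lock, counter, and device names are hypothetical):

  LOCK=/var/lock/usb_backup.lock
  COUNT=/var/run/usb_backup.users

  (   flock 9                                  # serialize with sibling scripts
      grep -q ' /var/backup ' /proc/mounts || mount /dev/sdh1 /var/backup
      echo $(( $(cat "$COUNT" 2>/dev/null || echo 0) + 1 )) > "$COUNT"
  ) 9>"$LOCK"

  # ... the actual tar(1) commands ran here ...

  (   flock 9
      USERS=$(( $(cat "$COUNT") - 1 ))         # one fewer script still using the drive
      echo "$USERS" > "$COUNT"
      if [ "$USERS" -le 0 ]; then              # last script out unmounts and checks
          umount /var/backup && fsck -a /dev/sdh1
      fi
  ) 9>"$LOCK"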

So, my simple backup scripts, which had started out as little more than a bunch of tar(1) commands, now contained logic to lock sentinel files, mount drives, log to syslog, daemonize, umount drives, and unlock sentinel files, as well as run individual tar(1) commands.

Gawd, what a mess. There had to be a better way.

The genesis of RMonitor

I have no qualms about, for my own Slackware installations, enhancing the scripts and tools that Pat V. supplies in the Slackware distribution package. I did take a brief look at the run-parts(8) script to see how I could enhance it, but quickly abandoned that as a bad idea. run-parts(8) is simple, works well, and is crucial to a healthy system. You don't fsck with the things that work, when you can just add stuff to make it work better.

So, I decided to take all my custom script code (the syslogging of start and finish, the asynchronous execution, the resource manipulation and locking) and build it all into a program that would handle everything for me. A local program that could be used in conjunction with run-parts(8), to give me the features that I was looking for.

What I came up with is an "interpreter" that you use as the hashbang of a script. This interpreter (called rmonitor) launches the "real" script in an asynchronous child process. The child manipulates the resource-management and sentinel/lock files to ensure that named resources are made available before the first target runs and cleaned up after the last target completes. It also generates, to syslog, a fixed-format set of messages indicating the start and completion (with exit status) of each target process, and intercepts any stdout and/or stderr text written by the target scripts and logs that to syslog in a standard format.

Now, each script that embeds this interpreter as its hashbang runs asynchronously, has standard start and termination messages logged to syslog in a consistent and identifiable format, has its own informational and error messages automatically logged to syslog in the same consistent and identifiable format, and manipulates required resources on demand. My run-parts(8) target scripts are now back to the simple scripts they were, taking care of just the things they need to take care of.
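For illustration, a target script now looks roughly like this (the UsbBackup resource name is from my setup; the script body and paths are illustrative):

  #!/usr/local/bin/rmonitor UsbBackup
  # Just the work: rmonitor backgrounds this script, logs its start,
  # finish, and output to syslog, and runs the UsbBackup exit to mount
  # /var/backup before the first job and unmount it after the last.
  tar -czf /var/backup/daily/etc.tar.gz /etc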

RMonitor Block Overview


>-----,
       \         ,-->{STDERR}
        |       ,--->{SYSLOG}
        v      /
   +----------+
   | rmonitor |
   |  gparent |- - - - ->+----------+
   +----------+          | rmonitor |- - - - ->+----------+
        |                |  parent  |          |   PREP   |+
        |                |  ~ ~ ~   |          |   task   ||+
       /                 |          |<- - - - -+----------+||
 <----'                  |          |           +----------+|
                         |          |            +----------+
                         |          |
                         |          |- - - - ->+----------+
                         |          |          |  target  |
                         |  ~ ~ ~   |          |   task   |
                         |          |<- - - - -+----------+
                         |          |
                         |          |- - - - ->+----------+
                         |  ~ ~ ~   |          |   POST   |+
                         |          |          |   task   ||+
                         |          |<- - - - -+----------+||
                         +----------+           +----------+|
                              |      \           +----------+
                              |       \
                             /         '-->{SYSLOG}
 <--------------------------'

RMonitor Manpage

RMONITOR(1)                                                        RMONITOR(1)

NAME
       rmonitor - run script with on-demand pre and post processing

SYNOPSIS
       rmonitor resourcename[,resourcename...]  [/path/to/interpreter] /path/to/script

DESCRIPTION
       rmonitor is intended to be used as the binary named by a script's hash-
       bang  interpreter line.  A resourcename names a single resource that
       must be prepared for use before the target script runs, and cleaned  up
       after  the  target  script completes. A resource list consists of one or
       more resourcenames, separated by commas, permitting a script to involve
       one or more resources.

       The  resourcename  is  used to tie a number of different target scripts
       together, in that every script that uses a specific  resourcename  uses
       the  same specific resources.  rmonitor uses the resourcename to select
       a specific helper script to perform resource management  prior  to  the
       first use, and following the last use. Typical resource management pro-
       cesses would mount external storage before the first process writes  to
       it, and unmount it after the last process finishes with it.

       The  optional /path/to/interpreter names the script interpreter program
       to be used to execute the target script in which rmonitor is  embedded.
       If  the  hashbang  omits the interpreter path, rmonitor uses /bin/sh as
       the target interpreter program.

OPTIONS
       As a hashbang, the rmonitor commandline will take the format
         #!/path/to/rmonitor resourcelist /path/to/interpreter
       or
         #!/path/to/rmonitor resourcelist

       Where
       - resourcelist is a list of resourcename values, separated by commas
          Example: resourcename1,resourcename2,resourcename3
               or: resourcename

       - resourcename is a 7 to 20 character long name consisting only of
         upper or lower case alphabetics, digits, dash and/or underscore

       - /path/to/interpreter is an optional  path  to  a  script  interpreter
       (defaults to /bin/sh)

FILES
       /var/lock/rmonitor.*.lock  acts as both a serialization lockfile, and a
       semaphore file for the named  resource.   Each  resource  will  have  a
       unique  lock/semaphore file. This file is built and managed by rmonitor
       as it executes.

       /usr/local/lib/rmonitor/*.exit provides the PREProcessing  and  POST-
       processing  logic  for the named resource.  These executable files will
       accept a single argument which will be set to "PREP" for preprocessing,
       and "POST" for postprocessing. The executable will use this argument to
       determine and execute the correct processing.

       The following is a trivial example exit

         #!/bin/sh
         echo ::: $0 Begins :::
         RC=0
         case $1 in
           PREP)
             echo ::: PREProcessing exit invoked :::
             # perform resource allocation here
             ;;
           POST)
             echo ::: POSTprocessing exit invoked :::
             # perform resource deallocation here
             ;;
           *)
             echo ::: $1 option invalid ::: >&2
             RC=1
             ;;
         esac
         echo ::: $0 Terminates :::
         exit $RC

       /bin/sh is used as the default interpreter should none  be  specified
       in commandline arguments.

DIAGNOSTICS
       rmonitor   emits   diagnostic   messages   to   stderr   and/or  syslog
       (LOG_LOCAL1).  The program logs according to its phase of operation.

       The grandparent process parses commandline arguments and  prepares  the
       internal  processing  controls.   While  validating the commandline, it
       logs only to stderr. After validation, the  grandparent  logs  to  both
       stderr and syslog. Under normal use, all grandparent messages will show
       up in the stderr log (i.e. email from cron) of the initiating script.

       The parent process logs all diagnostic  messages  to  syslog.  Informa-
       tional  messages,  including  parent  generated messages, preprocessing
       stdout, target  stdout,  and  postprocessing  stdout  are  directed  to
       LOG_INFO,  while critical messages including parent generated messages,
       preprocessing stderr, target  stderr,  and  postprocessing  stderr  are
       directed to LOG_ERR.

       Diagnostics  will  contain  values  derived  from  execution parameters
       and/or script-generated data
        - <PgmName> is the name of the rmonitor binary, stripped of its path
        - </Path/To/JobBase/JobScript> is the path to the hashbanged script
        - <JobBase> is the name of  the  directory  in  which  the  hashbanged
       script is found
        -  <JobScript> is the name of the hashbanged script within the JobBase
       directory
        - <Resource> is the name of one of the resources listed in  the  hash-
       banged script
        - <Interpreter> is the interpreter specified by the hashbanged script
        -  <Pid> is the process id of the rmonitor child process for the hash-
       banged script
        - <text> is arbitrary text generated by one of the invoked scripts

Example Syslogs

    Mar  9 04:40:03 merlin cron.daily: JOB 21.daily.backup.etc[10200] Launched
    Mar  9 04:40:03 merlin cron.daily: JOB 21.daily.backup.htdocs[10226] Launched
    Mar  9 04:40:03 merlin cron.daily: JOB 21.daily.backup.mysql[10246] Launched
    Mar  9 04:40:03 merlin cron.daily: JOB 21.daily.backup.userdirs[10266] Launched
    Mar  9 04:40:03 merlin 21.daily.backup.etc[10200]: JOB Begins
    Mar  9 04:40:03 merlin 21.daily.backup.etc[10200]: PREP UsbBackup: EXIT Begins
    Mar  9 04:40:03 merlin 21.daily.backup.etc[10200]: PREP UsbBackup: Will fsck filesystem and mount to /var/backup
    Mar  9 04:40:03 merlin 21.daily.backup.htdocs[10226]: JOB Begins
    Mar  9 04:40:03 merlin 21.daily.backup.mysql[10246]: JOB Begins
    Mar  9 04:40:03 merlin 21.daily.backup.userdirs[10266]: JOB Begins
    Mar  9 04:48:29 merlin 21.daily.backup.etc[10200]: PREP UsbBackup: usb_backup: 9233/39075840 files (3.9% non-contiguous), 31231875/78142160 blocks
    Mar  9 04:48:31 merlin 21.daily.backup.etc[10200]: PREP UsbBackup: /sbin/fsck returned 1
    Mar  9 04:48:31 merlin 21.daily.backup.etc[10200]: PREP UsbBackup: Filesystem     1K-blocks      Used Available Use% Mounted on
    Mar  9 04:48:31 merlin 21.daily.backup.etc[10200]: PREP UsbBackup: /dev/sdh1      307663800 120022660 172012708  42% /var/backup
    Mar  9 04:48:31 merlin 21.daily.backup.etc[10200]: PREP UsbBackup: EXIT Status  0
    Mar  9 04:48:31 merlin 21.daily.backup.htdocs[10226]: PROCESSING Begins
    Mar  9 04:48:31 merlin 21.daily.backup.etc[10200]: PROCESSING Begins
    Mar  9 04:48:31 merlin 21.daily.backup.mysql[10246]: PROCESSING Begins
    Mar  9 04:48:31 merlin 21.daily.backup.userdirs[10266]: PROCESSING Begins
    Mar  9 04:48:52 merlin 21.daily.backup.mysql[10246]: PROCESSING Status 0
    Mar  9 04:48:52 merlin 21.daily.backup.mysql[10246]: JOB Status 0
    Mar  9 04:48:54 merlin 21.daily.backup.etc[10200]: PROCESSING Status 0
    Mar  9 04:48:54 merlin 21.daily.backup.etc[10200]: JOB Status 0
    Mar  9 04:49:29 merlin 21.daily.backup.htdocs[10226]: PROCESSING Status 0
    Mar  9 04:49:29 merlin 21.daily.backup.htdocs[10226]: JOB Status 0
    Mar  9 05:09:39 merlin 21.daily.backup.userdirs[10266]: PROCESSING Status 0
    Mar  9 05:09:39 merlin 21.daily.backup.userdirs[10266]: POST UsbBackup: EXIT Begins
    Mar  9 05:09:39 merlin 21.daily.backup.userdirs[10266]: POST UsbBackup: Will umount and fsck /var/backup
    Mar  9 05:09:39 merlin 21.daily.backup.userdirs[10266]: POST UsbBackup: Filesystem     1K-blocks      Used Available Use% Mounted on
    Mar  9 05:09:39 merlin 21.daily.backup.userdirs[10266]: POST UsbBackup: /dev/sdh1      307663800 120023924 172011444  42% /var/backup
    Mar  9 05:18:00 merlin 21.daily.backup.userdirs[10266]: POST UsbBackup: usb_backup: 9233/39075840 files (3.9% non-contiguous), 31232191/78142160 blocks
    Mar  9 05:18:01 merlin 21.daily.backup.userdirs[10266]: POST UsbBackup: /sbin/fsck returned 1
    Mar  9 05:18:01 merlin 21.daily.backup.userdirs[10266]: POST UsbBackup: EXIT Status  0
    Mar  9 05:18:01 merlin 21.daily.backup.userdirs[10266]: JOB Status 0

Comments

I'm working on rmonitor v2.0.1, to satisfy a request for an alteration of the syslog ID used by the rmonitor "parent" process. With v2.0.0, rmonitor logs all messages from, or intercepted by, the parent, with an ID that identifies the target script. I have a request to enhance this ID so that it selectively includes the name of the monitor program as well. In this way, "the rmanager job management is easily distinguished from the processes of the script itself".

This means that syslog lines that currently look like

Mar 9 04:40:03 merlin cron.daily: JOB 21.daily.backup.htdocs[10226] Launched
Mar 9 04:40:03 merlin 21.daily.backup.htdocs[10226]: JOB Begins
Mar 9 04:48:31 merlin 21.daily.backup.htdocs[10226]: PROCESSING Begins
Mar 9 04:48:31 merlin 21.daily.backup.htdocs[10226]: message from script
Mar 9 04:49:29 merlin 21.daily.backup.htdocs[10226]: PROCESSING Status 0
Mar 9 04:49:29 merlin 21.daily.backup.htdocs[10226]: JOB Status 0

would, with a commandline option, look like

Mar 9 04:40:03 merlin cron.daily: JOB 21.daily.backup.htdocs[10226] Launched
Mar 9 04:40:03 merlin rmonitor: 21.daily.backup.htdocs[10226]: JOB Begins
Mar 9 04:48:31 merlin rmonitor: 21.daily.backup.htdocs[10226]: PROCESSING Begins
Mar 9 04:48:31 merlin 21.daily.backup.htdocs[10226]: message from script
Mar 9 04:49:29 merlin rmonitor: 21.daily.backup.htdocs[10226]: PROCESSING Status 0
Mar 9 04:49:29 merlin rmonitor: 21.daily.backup.htdocs[10226]: JOB Status 0

It's not a difficult change, but it means that I have to refactor the commandline parsing logic (because it was complicated, and not conducive to change), and add in an option to select this "Long Name" logging.

I've got the coding complete, and an initial test looks good, but I'm going to try it out on my systems for a while before releasing the code into the wild.

OK, I've taken some time, and tested the revisions to RMonitor v2.0.0 that provide an alternate logging identifier. Everything seems to work well, so I've added the v2.0.1 tarball to my original posting.

Please note that the v2.0.1 codebase is the one I will go forward with.

To activate the alternate logging identifier, include a "-L" option before the resourcelist. For example:

Without the alternate logging identifier:

  #!/usr/local/bin/rmonitor resource1,resource2,resource3 /bin/bash

With the alternate logging identifier:

  #!/usr/local/bin/rmonitor -L resource1,resource2,resource3 /bin/bash

That's all there is to it.

Here's a short script that provides simple "management reporting" of the rmonitor jobstream:


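# Pull one rotation of the archived syslog plus the live log, find each
# "JOB ... Launched" line, then re-scan the same logs for every line
# tagged with that job's name[pid] to group each job's records together.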
(zcat /var/log/archive/messages.1.gz ; cat /var/log/messages)                 |
  awk '$6 == "JOB" && $8 == "Launched" && NF == 8 {print $7, $1, $2, $3 }'    |
  while read jobname mon day time
  do
    echo "JOB $jobname launched $mon $day $time"
    (zcat /var/log/archive/messages.1.gz ; cat /var/log/messages)             |
      awk -v jobname=${jobname}":" '$5 == jobname { print }'
    echo
  done

The output looks something like this:


JOB 00.daily.clean.usb_media[22061] launched Mar 29 04:40:01
Mar 29 04:40:01 merlin 00.daily.clean.usb_media[22061]: JOB Begins
Mar 29 04:40:01 merlin 00.daily.clean.usb_media[22061]: PREP UsbMedia: EXIT Begins
Mar 29 04:40:01 merlin 00.daily.clean.usb_media[22061]: PREP UsbMedia: Filesystem     1K-blocks      Used Available Use% Mounted on
Mar 29 04:40:01 merlin 00.daily.clean.usb_media[22061]: PREP UsbMedia: /dev/sdb1      721075720 308072844 376374276  46% /var/media
Mar 29 04:40:01 merlin 00.daily.clean.usb_media[22061]: PREP UsbMedia: EXIT Status  0
Mar 29 04:40:01 merlin 00.daily.clean.usb_media[22061]: PROCESSING Begins
Mar 29 04:40:01 merlin 00.daily.clean.usb_media[22061]: PROCESSING Status 0
Mar 29 04:40:01 merlin 00.daily.clean.usb_media[22061]: POST UsbMedia: EXIT Begins
Mar 29 05:01:18 merlin 00.daily.clean.usb_media[22061]: POST UsbMedia: usb_media: 23803/91586560 files (12.7% non-contiguous), 79892281/183143000 blocks
Mar 29 05:01:23 merlin 00.daily.clean.usb_media[22061]: POST UsbMedia: fsck returned 1
Mar 29 05:01:24 merlin 00.daily.clean.usb_media[22061]: POST UsbMedia: Starting NFS server daemons:
Mar 29 05:01:24 merlin 00.daily.clean.usb_media[22061]: POST UsbMedia:   /usr/sbin/exportfs -r
Mar 29 05:01:24 merlin 00.daily.clean.usb_media[22061]: POST UsbMedia:   /usr/sbin/rpc.rquotad
Mar 29 05:01:24 merlin 00.daily.clean.usb_media[22061]: POST UsbMedia:   /usr/sbin/rpc.nfsd 8
Mar 29 05:01:24 merlin 00.daily.clean.usb_media[22061]: POST UsbMedia:   /usr/sbin/rpc.mountd
Mar 29 05:01:24 merlin 00.daily.clean.usb_media[22061]: POST UsbMedia: EXIT Status  0
Mar 29 05:01:24 merlin 00.daily.clean.usb_media[22061]: JOB Status 0

JOB 20.daily.backup.etc[22090] launched Mar 29 04:40:01
Mar 29 04:40:01 merlin 20.daily.backup.etc[22090]: JOB Begins
Mar 29 04:40:01 merlin 20.daily.backup.etc[22090]: PREP UsbBackup: EXIT Begins
Mar 29 04:40:01 merlin 20.daily.backup.etc[22090]: PREP UsbBackup: Will fsck filesystem and mount to /var/backup
Mar 29 04:53:52 merlin 20.daily.backup.etc[22090]: PREP UsbBackup: usb_backup: 9256/39075840 files (3.9% non-contiguous), 31204891/78142160 blocks
Mar 29 04:53:54 merlin 20.daily.backup.etc[22090]: PREP UsbBackup: /sbin/fsck returned 1
Mar 29 04:53:55 merlin 20.daily.backup.etc[22090]: PREP UsbBackup: Filesystem     1K-blocks      Used Available Use% Mounted on
Mar 29 04:53:55 merlin 20.daily.backup.etc[22090]: PREP UsbBackup: /dev/sdh1      307663800 119914724 172120644  42% /var/backup
Mar 29 04:53:55 merlin 20.daily.backup.etc[22090]: PREP UsbBackup: EXIT Status  0
Mar 29 04:53:55 merlin 20.daily.backup.etc[22090]: PREP DailyBackups: EXIT Begins
Mar 29 04:53:55 merlin 20.daily.backup.etc[22090]: PREP DailyBackups: backup frequency: daily
Mar 29 04:53:55 merlin 20.daily.backup.etc[22090]: PREP DailyBackups: backup retention: 7 days
Mar 29 04:53:55 merlin 20.daily.backup.etc[22090]: PREP DailyBackups: Initializing daily backup directory /var/backup/merlin.puter.lan/backups/daily/2013-03-29
Mar 29 04:53:55 merlin 20.daily.backup.etc[22090]: PREP DailyBackups: EXIT Status  0
Mar 29 04:53:55 merlin 20.daily.backup.etc[22090]: PROCESSING Begins
Mar 29 04:54:12 merlin 20.daily.backup.etc[22090]: PROCESSING Status 0
Mar 29 04:54:12 merlin 20.daily.backup.etc[22090]: JOB Status 0

JOB 20.daily.backup.htdocs[22115] launched Mar 29 04:40:01
Mar 29 04:40:01 merlin 20.daily.backup.htdocs[22115]: JOB Begins
Mar 29 04:53:55 merlin 20.daily.backup.htdocs[22115]: PROCESSING Begins
Mar 29 04:54:55 merlin 20.daily.backup.htdocs[22115]: PROCESSING Status 0
Mar 29 04:54:55 merlin 20.daily.backup.htdocs[22115]: JOB Status 0

JOB 20.daily.backup.mysql[22137] launched Mar 29 04:40:02
Mar 29 04:40:01 merlin 20.daily.backup.mysql[22137]: JOB Begins
Mar 29 04:53:55 merlin 20.daily.backup.mysql[22137]: PROCESSING Begins
Mar 29 04:54:30 merlin 20.daily.backup.mysql[22137]: PROCESSING Status 0
Mar 29 04:54:30 merlin 20.daily.backup.mysql[22137]: JOB Status 0

JOB 20.daily.backup.userdirs[22159] launched Mar 29 04:40:02
Mar 29 04:40:02 merlin 20.daily.backup.userdirs[22159]: JOB Begins
Mar 29 04:53:55 merlin 20.daily.backup.userdirs[22159]: PROCESSING Begins
Mar 29 05:16:36 merlin 20.daily.backup.userdirs[22159]: PROCESSING Status 0
Mar 29 05:16:36 merlin 20.daily.backup.userdirs[22159]: JOB Status 0

JOB 21.daily.backup.userdirs[22251] launched Mar 29 04:40:02
Mar 29 04:40:02 merlin 21.daily.backup.userdirs[22251]: JOB Begins
Mar 29 04:53:55 merlin 21.daily.backup.userdirs[22251]: PROCESSING Begins
Mar 29 05:16:40 merlin 21.daily.backup.userdirs[22251]: PROCESSING Status 0
Mar 29 05:16:40 merlin 21.daily.backup.userdirs[22251]: POST DailyBackups: EXIT Begins
Mar 29 05:16:40 merlin 21.daily.backup.userdirs[22251]: POST DailyBackups: backup frequency: daily
Mar 29 05:16:40 merlin 21.daily.backup.userdirs[22251]: POST DailyBackups: backup retention: 7 days
Mar 29 05:16:40 merlin 21.daily.backup.userdirs[22251]: POST DailyBackups: Cleaning daily backup directory /var/backup/merlin.puter.lan/backups/daily
Mar 29 05:16:40 merlin 21.daily.backup.userdirs[22251]: POST DailyBackups: Deleting daily backup 2013-03-22
Mar 29 05:16:40 merlin 21.daily.backup.userdirs[22251]: POST DailyBackups: EXIT Status  0
Mar 29 05:16:40 merlin 21.daily.backup.userdirs[22251]: POST UsbBackup: EXIT Begins
Mar 29 05:16:40 merlin 21.daily.backup.userdirs[22251]: POST UsbBackup: Will umount and fsck /var/backup
Mar 29 05:16:40 merlin 21.daily.backup.userdirs[22251]: POST UsbBackup: Filesystem     1K-blocks      Used Available Use% Mounted on
Mar 29 05:16:40 merlin 21.daily.backup.userdirs[22251]: POST UsbBackup: /dev/sdh1      307663800 122118072 169917296  42% /var/backup
Mar 29 05:25:04 merlin 21.daily.backup.userdirs[22251]: POST UsbBackup: usb_backup: 9283/39075840 files (3.9% non-contiguous), 31755728/78142160 blocks
Mar 29 05:25:07 merlin 21.daily.backup.userdirs[22251]: POST UsbBackup: /sbin/fsck returned 1
Mar 29 05:25:07 merlin 21.daily.backup.userdirs[22251]: POST UsbBackup: EXIT Status  0
Mar 29 05:25:07 merlin 21.daily.backup.userdirs[22251]: JOB Status 0

I built rmonitor to take some of the "heavy lifting" out of scripts that I run using run-parts(8). One of the problems I meant rmonitor to address was the management and reporting of long-running scripts, and what I came up with handled that requirement adequately. Until it didn't.

A while back, I encountered a situation in which I had an extraordinarily long-running rmonitor job that I needed to kill and restart. Unfortunately, while the existing rmonitor code adequately handled the splitting out and reporting of this job, it didn't give me any way to properly kill the job. Specifically, I didn't know which of the processes to kill, and when I killed the "target" process, I didn't kill the "monitor" process.

I immediately recognized the error of my implementation: not only should rmonitor fork off child processes to run the target tasks, it should also take the role of session leader, in order to properly manage and distribute signals to the target tasks. With rmonitor as session leader, I could SIGHUP or SIGTERM the rmonitor process and have it propagate those signals to its children, giving me a way to initiate a controlled termination. And so, I made some changes.

With rmonitor-2.0.4, the rmonitor process intercepts signals sent to it and propagates them forward to the child processes. Once all the children have terminated, the rmonitor process terminates cleanly. Because the rmonitor process uses its own PID when it logs both its own messages and child messages to syslog, the external user knows exactly which process ID to kill: the one logged with the messages.
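The effect is roughly this (a sketch of the idea only, not rmonitor's actual source; a single backgrounded child stands in for the real set of child processes):

  #!/bin/sh
  /path/to/target/script &            # hypothetical target task
  CHILD=$!

  # Forward any HUP or TERM we receive to the child instead of dying.
  trap 'kill -HUP  "$CHILD" 2>/dev/null' HUP
  trap 'kill -TERM "$CHILD" 2>/dev/null' TERM

  # wait returns early when a trapped signal arrives, so keep waiting
  # until the child has really terminated, then exit with its status.
  while kill -0 "$CHILD" 2>/dev/null; do
      wait "$CHILD"
      RC=$?
  done
  exit "${RC:-0}"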