IP Monitoring & Diagnostics With Command Line Tools: Part 9 - Continuous Monitoring

Scheduling a continuous monitoring process will detect problems at the earliest opportunity. If the diagnostic tools run often enough, they can forecast a server outage before a mission-critical failure happens. Pre-emptive diagnosis and automatic correction let you fix a fault before it interrupts service.


Continuous monitoring is a powerful tool for predicting failures when the system exhibits symptoms that are difficult to spot with a manual inspection. Some observations need to be made more often than others to detect a pattern. A flexible solution that is easy to maintain and extend can be built using operating system services as a foundation.

Why continuous monitoring is a good idea

A manual monitoring approach is useful when diagnosing specific problems in a single machine. In a large and increasingly complex network, automation is necessary to avoid being overwhelmed.

In high availability scenarios that support live broadcasting, a problem may arise that will eventually crash the machine if it is not rectified. Detecting this as soon as the symptoms are evident can alert the support team well in advance. They can pre-emptively correct the issue before it becomes critical.

An operating system is composed of many individual processes, and there is a strict limit on how many can exist at once. A server process might spawn child processes to deal with incoming requests. When a child finishes, it stays in the process table as a defunct (zombie) entry until its parent collects the exit status. If the parent has stalled and never collects it, that entry is never released. If this is caused by a systemic problem, other processes will stall too. Eventually, all of the process slots will be allocated and new processes cannot be created. That will halt a server completely, and at that point a forced reboot is usually the only remedy.

Count the processes that are prone to this happening and compare the historical values. If the count increases above a nominal threshold, remedial action can remove the cause of the failures and dispose of the defunct processes in an orderly fashion so the system can resume normal operation.
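The counting step can be sketched in a few lines of shell. This is a minimal example, assuming a ps command that accepts the -e and -o stat= options; the function names and the threshold of 20 are illustrative and not taken from any particular system.

```shell
#!/bin/sh
# Count defunct (zombie) processes in a list of ps state codes read from
# standard input, one code per line. A state beginning with Z is defunct.
count_defunct() {
    # grep -c prints 0 and returns a failure status when nothing matches,
    # so the status is forced to success to keep callers simple.
    grep -c '^[[:space:]]*Z' || true
}

# Hypothetical threshold check: returns success (0) when the number of
# defunct processes reported by ps exceeds the limit in the first argument.
too_many_defunct() {
    limit="$1"
    count=$(ps -e -o stat= | count_defunct)
    [ "$count" -gt "$limit" ]
}

# Example use inside a monitoring task (the threshold is an assumption):
#   if too_many_defunct 20; then echo "Defunct process count is high"; fi
```

Keeping the counting separate from the ps call means the logic can be checked against a canned listing before it is trusted on a live server.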

The corrective action could be invoked automatically with self-healing code. This is an additional layer of pre-emptive support over and above the defensive coding that we have already discussed.

What is cron?

UNIX includes a versatile and powerful scheduler called cron. Tasks are added to a cron-table file, and each task calls a tool or script into action according to a rule called a Time-spec. For example, gather information daily, then collate it and email a report every Monday morning.

The cron daemon checks the task list every minute and will execute anything whose Time-spec matches the current date and time.

About the cron tables

The configuration for the cron scheduler is maintained via a table of tasks. Each one has a Time-spec that describes when it should run. This is the cron-table (called crontab). There are two variants of the crontab files:

  • System wide
  • Per-user

The system wide crontab is used for various housekeeping and background tasks that the OS needs to run. We should leave it alone.

The per-user cron-tables are owned by the individual accounts. The cron tasks will run under the user account to which they belong. You cannot view or alter the crontab for another user account unless you have super-user privileges.

Avoid running tasks with the root account. If the task requires elevated privileges, grant them to a special user account and use that instead.

Using the crontab command

Scheduled execution is a feature of all operating systems, but the implementations differ. There are several alternative cron-table files and their paths have changed from time to time. On macOS, Apple favours its own launchd process for scheduling, although the crontab command is still provided. The crontab command hides these complexities from you and is easier to use than manually finding and editing the config files.

Confusingly, crontab is the name of both the command and the file that it operates on.

Use the crontab -e command to edit the per-user crontab files. It knows where they live and can find the right one. Opening the crontab will create a new and empty file if it does not already exist:

crontab -e

The crontab will be opened with the default text editor. Use a different editor by adding this special variable export instruction to your login profile:

export EDITOR={path-to-your-preferred-editor}

List your own crontab to see the changes with the listing-flag (small letter L):

crontab -l

Beware: Do not run the crontab command without parameters. On many systems it then reads a replacement crontab from standard input, and ending that input without typing anything replaces your personal crontab with an empty one, so your tasks are removed. If you do this accidentally, press [CONTROL] + [C] to leave without overwriting the file.

When you exit and save the changes, the crontab -e command should signal the cron daemon to reload the configuration to activate the new tasks. If this does not happen automatically, reload it manually like this:

kill -HUP {cron-process-PID-value}

Use command substitution to build a signalling instruction (line-breaks added for clarity):

kill -HUP $(ps aux |
   grep "/crond" |
   grep -v grep |
   tr -s ' ' |
   cut -d ' ' -f 2)

The grep commands filter the ps listing to extract the line we need. The second is needed to discard the first grep command from the list. The tr and cut commands return the PID number from the result. The substitution passes the PID number to the kill command.
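The same extraction can be sketched as a single awk filter wrapped in a function. This assumes the column layout of a ps aux listing (PID in column 2, command in column 11) and a daemon called either cron or crond, because the name varies between distributions. The function name is hypothetical.

```shell
#!/bin/sh
# Print the PID of the cron daemon from a "ps aux" style listing on
# standard input. Only the command name itself is tested, so a grep or awk
# process that merely mentions cron in its arguments is not matched.
cron_pid_from_listing() {
    awk '$11 ~ /(^|\/)crond?$/ { print $2 }'
}

# Example use (left as a comment so that loading this file signals nothing):
#   kill -HUP $(ps aux | cron_pid_from_listing)
```

Matching on the command column removes the need for the second grep, because the filter never sees its own name in that column.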

Despite its name, the kill command does not necessarily terminate anything. It sends a signal to a process, and daemons conventionally treat the HUP signal as a request to reload their configuration.

Configure the run-time environment

The run-time environment can be altered with optional special variables at the head of the crontab:

Definition             Description
SHELL=/bin/bash        Override the default shell that is used to run the tasks.
MAILTO=anotheruser     All output from the task is sent by email unless it is redirected. Define the recipient here.
CRON_TZ=Europe/London  Localise the tasks to run with a different time-zone setting (not supported by every cron implementation).

Note: This environment will apply to all tasks described in the crontab.

Crontab task entries

The format of a crontab line is very simple. Five space-separated values form the Time-spec that describes when the task will be called into action. The rest of the line is the command to be run:

{time-spec} {task-command-line}

Tasks are deactivated with a hash character (#) prefix. This prevents the task from being scheduled but keeps it intact for later use.

#{time-spec} {task-command-line}

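Embedded percent signs (%) represent newline characters. Everything before the first percent sign is the command, and the second and subsequent virtual lines are passed to its standard input. To use a literal percent sign in a command, for example in a date format, escape it with a backslash (\%).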

{time-spec} {task-command-line}%{redirected-to-stdin}
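The percent-sign rule is easier to see with a small emulation. The function below is a simplified sketch of the splitting, not cron's own code, and its name is hypothetical. It prints the command on the first line and the text that would arrive on standard input on the lines that follow.

```shell
#!/bin/sh
# Emulate how the command part of a crontab line is split at percent signs.
# An unescaped % becomes a newline, and \% becomes a literal %.
split_cron_command() {
    printf '%s\n' "$1" | awk '{
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "\\" && substr($0, i + 1, 1) == "%") {
                out = out "%"   # escaped percent sign stays literal
                i++
            } else if (c == "%") {
                out = out "\n"  # unescaped percent sign becomes a newline
            } else {
                out = out c
            }
        }
        print out
    }'
}

# Example:
#   split_cron_command 'mail -s Report ops%Line one%Line two'
# prints the command "mail -s Report ops" followed by two lines of input.
```

A command such as date +%Y is a common casualty of this rule, which is why the escaped form date +\%Y is needed inside a crontab.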

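Redirecting the standard output of the command to /dev/null (or any other file) inhibits the mail message containing the task output. Error output is still mailed unless it is redirected as well, for example by appending 2>&1 after the redirect.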

{time-spec} {task-command-line} > /dev/null
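A redirect with > covers standard output only, so error messages would still reach the mailbox. Where both streams should be kept for later inspection instead, a small wrapper can append them to a log file. This is a sketch; the function name and the paths in the example are hypothetical.

```shell
#!/bin/sh
# Run a command with standard output and standard error both appended to
# a log file, so that cron has nothing left to mail.
# Usage: run_logged {log-file} {command} [arguments...]
run_logged() {
    log_file="$1"
    shift
    # The order matters: point stdout at the file first, then send stderr
    # to the same place.
    "$@" >> "$log_file" 2>&1
}

# Example use from a script called by cron (paths are hypothetical):
#   run_logged /var/tmp/weekly_report.log /my_tools/run_weekly_report.sh
```

The wrapper returns the exit status of the command it ran, so a calling script can still react to a failure.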

Time-spec format

The space-separated Time-spec describes when a task is scheduled to run:

{minute} {hour} {day-of-month} {month} {day-of-week}

Field           Value range
{minute}        0 to 59
{hour}          0 to 23
{day-of-month}  1 to 31, depending on the month.
{month}         1 to 12, or a three-letter abbreviation.
{day-of-week}   0 to 6 (Sunday to Saturday), or a three-letter abbreviation.

Use a wildcard asterisk (*) to match all possible values. A range of values can be specified with a dash character (-) and a comma (,) can be used to separate a list of values or ranges.

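The {month} field must always match. If both the {day-of-month} and the {day-of-week} fields are restricted (neither is an asterisk), the task runs on any day where either of them matches. For example, 0 0 13 * 5 runs on every Friday as well as on the 13th of each month.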
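The matching rules for wildcards, lists, ranges and the two day fields can be sketched as a pair of shell functions. This is an illustration of the rules, not the scheduler's own code. Step values and month or day names are deliberately left out, and both function names are hypothetical.

```shell
#!/bin/sh
# Return success (0) if a numeric value matches one Time-spec field.
# Handles the wildcard (*), comma-separated lists and dash ranges only.
field_matches() {
    awk -v field="$1" -v value="$2" 'BEGIN {
        if (field == "*") exit 0
        n = split(field, items, ",")
        for (i = 1; i <= n; i++) {
            m = split(items[i], ends, "-")
            low = ends[1] + 0
            high = (m == 2) ? ends[2] + 0 : low
            if (value + 0 >= low && value + 0 <= high) exit 0
        }
        exit 1
    }'
}

# Apply the day rule: when both day fields are restricted, either one may
# match; when one is a wildcard, only the other is tested.
# Usage: day_matches {dom-field} {dow-field} {day-of-month} {day-of-week}
day_matches() {
    if [ "$1" = "*" ]; then
        field_matches "$2" "$4"
    elif [ "$2" = "*" ]; then
        field_matches "$1" "$3"
    else
        field_matches "$1" "$3" || field_matches "$2" "$4"
    fi
}

# Example: does "13 * 5" fire on Friday (5) the 20th?
#   day_matches 13 5 20 5 && echo "runs"
```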

Here are some Time-spec examples:

Time-spec         Description and example purpose
0 8 * * 1         8:00 AM Monday - Deliver a weekly report.
0 4 * * *         4:00 AM every morning - Run a garbage collection task.
* * * * *         Every minute - Measure disk space, count processes, check workflow queues for stalled jobs or run intrusion checks.
0 * * * *         Once an hour, on the hour - Database backups.
0 0 * * *         Midnight every day - Rotate the log files.
0 0 * * 0         Midnight every Sunday - Analyse data for reports.
0 0 1 * *         Midnight on the first day of every month - Housekeeping tasks.
0 0 1 3,6,9,12 *  Every 3 months - Compile reports.
0 0 1 1 *         Midnight on New Year's Day - Big garbage collection.

The complete crontab line for delivering a weekly report looks like this. Email is inhibited here because that would be handled inside the script:

0 8 * * 1 /my_tools/run_weekly_report.sh > /dev/null

Deploying tasks

The crontab tool is easy to use but accessing it from a dashboard implemented in PHP is difficult.

Adding a layer of abstraction can simplify your architecture at the expense of a little extra coding. Using data-driven techniques to let the file-system work for you results in more flexible designs.

Implement a task manager as a shell script. The task manager is called by cron but loads plug-in tasks from a folder. These are listed with an ls command and passed to a while loop that executes them one by one. Tasks can be added or removed without needing to rebuild the crontab. We will explore this idea in more detail soon.
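As a first impression of the idea, the loop might look something like this. It is a sketch under stated assumptions: the plug-ins are shell scripts with a .sh suffix, they run in the order that ls lists them, and the function name and folder paths are hypothetical.

```shell
#!/bin/sh
# Run every plug-in task found in a folder, one by one.
# Usage: run_plugin_tasks {task-folder}
run_plugin_tasks() {
    task_dir="$1"
    ls "$task_dir" | while read -r task_name; do
        # Only files with a .sh suffix are treated as tasks, so a task can
        # be deactivated by renaming it.
        case "$task_name" in
            *.sh) ;;
            *) continue ;;
        esac
        # Standard input is taken from /dev/null so that a task cannot
        # swallow the rest of the listing that feeds the loop.
        # A failing task is reported but does not stop the remaining tasks.
        sh "$task_dir/$task_name" < /dev/null ||
            echo "Task failed: $task_name" >&2
    done
}

# Example crontab entry for a script that calls this function (paths are
# hypothetical):
#   * * * * * /my_tools/task_manager.sh /my_tools/tasks > /dev/null
```

Numeric prefixes on the file names, such as 10_ and 20_, give a simple way to control the running order.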

Conclusion

Build monitoring tasks with simple components and defensive coding techniques. Implement self-healing code to fix problems automatically. Almost no maintenance is required after deployment unless you alter something the tasks depend on. Strive for elegant simplicity.
