IP Monitoring & Diagnostics With Command Line Tools: Part 9 - Continuous Monitoring
Scheduling a continuous monitoring process will detect problems at the earliest opportunity. If the diagnostic tools run often enough, they can forecast a server outage before a mission critical failure happens. Pre-emptive diagnosis and automatic corrections are a very good thing.
Continuous monitoring is a powerful tool for predicting failures when the system exhibits symptoms that are difficult to spot with a manual inspection. Some observations need to be made more often than others to detect a pattern. A flexible solution that is easy to maintain and extend can be built using operating system services as a foundation.
Why continuous monitoring is a good idea
A manual monitoring approach is useful when diagnosing specific problems in a single machine. In a large and increasingly complex network, automation is necessary to avoid being overwhelmed.
In high availability scenarios that support live broadcasting, a problem may arise that will eventually crash the machine if it is not rectified. Detecting this as soon as the symptoms are evident can alert the support team well in advance. They can pre-emptively correct the issue before it becomes critical.
An operating system is composed of many individual processes. There is a strict limit on how many of these can run simultaneously. A server process might spawn child processes to deal with incoming requests. If a child process loses contact with its parent, the relationship is deadlocked. The parent process waits for a response that will never arrive and the child will not quit because it cannot pass back the exit status. If this is caused by a systemic problem, other processes will stall too. Eventually, all of the process slots will be allocated and new processes cannot be created. That will halt a server completely. A forced server reboot is the only solution.
Count the processes that are prone to this problem and compare the count with historical values. If the count rises above a nominal threshold, remedial action can remove the cause of the failures and dispose of the defunct processes in an orderly fashion so the system can resume normal operation.
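The counting and threshold comparison described above might be sketched as follows. The function names, the limit of 20 and the ps -A -o stat= listing format are assumptions for illustration, not values from a real deployment. The state column is assumed to begin with Z for a defunct process, as it does on Linux and macOS.

```shell
#!/bin/sh
# Sketch only: count defunct (zombie) processes and flag a threshold breach.

# Read one process state per line on stdin and print how many begin with Z.
count_defunct() {
    grep -c '^ *Z' || true
}

# Print ALERT when the count ($1) is above the limit ($2), otherwise OK.
check_threshold() {
    if [ "$1" -gt "$2" ]; then
        echo "ALERT"
    else
        echo "OK"
    fi
}

ZOMBIE_LIMIT=20   # hypothetical nominal threshold
current=$(ps -A -o stat= 2>/dev/null | count_defunct)
echo "$(date '+%Y-%m-%d %H:%M') defunct=$current $(check_threshold "$current" "$ZOMBIE_LIMIT")"
```

Appending that output line to a log file on every run would build the history against which later counts can be compared.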
The corrective action could be invoked automatically with self-healing code. This is an additional layer of pre-emptive support over and above the defensive coding that we have already discussed.
What is cron?
There is a versatile and powerful scheduler called cron built into UNIX. Add tasks to a cron-table file to run tools and scripts on a schedule. Each task runs according to a set of rules (its Time-spec). For example, gather information daily, then collate it and email a report every Monday morning.
The cron daemon checks the task list every minute and will execute anything whose Time-spec matches the current date and time.
About the cron tables
The configuration for the cron scheduler is maintained via a table of tasks. Each one has a Time-spec that describes when it should run. This is the cron-table (called crontab). There are two variants of the crontab files:
- System wide
- Per-user
The system wide crontab is used for various housekeeping and background tasks that the OS needs to run. We should leave it alone.
The per-user cron-tables are owned by the individual accounts. The cron tasks will run under the user account to which they belong. You cannot view or alter the crontab for another user account unless you have super-user privileges.
Avoid running tasks with the root account. If the task requires elevated privileges, grant them to a special user account and use that instead.
Using the crontab command
Scheduled execution is a feature of all operating systems but it may be implemented differently on some. There are several alternative cron-table files and their paths have changed from time to time. On macOS, Apple favours its own launchd process over cron. The crontab command hides these complexities from you and is easier to use than manually finding and editing the config files.
Confusingly, the name crontab refers to both the command and the file that it operates on.
Use the crontab -e command to edit the per-user crontab files. It knows where they live and can find the right one. Opening the crontab will create a new and empty file if it does not already exist:
crontab -e
The crontab will be opened with the default text editor. To use a different editor, export the EDITOR variable from your login profile:
export EDITOR={path-to-your-preferred-editor}
List your own crontab with the -l flag (a lowercase letter L) to see the changes:
crontab -l
Beware: Do not use the crontab command without parameters. On many systems it reads a replacement crontab from standard input, so ending the input straight away replaces your personal crontab with an empty file and removes your tasks. If you do this accidentally, abort with a [CONTROL] + [C] keystroke to leave without overwriting the file.
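As a precaution against that kind of accident, the listing flag can be used to keep a copy of the crontab before any change. This is a minimal sketch. The function name backup_crontab and the backup file naming are hypothetical.

```shell
#!/bin/sh
# Sketch only: save the current per-user crontab to a dated file.
# backup_crontab DIRECTORY  -> prints the path of the backup file.
backup_crontab() {
    backup_file="$1/crontab.$(date '+%Y%m%d-%H%M%S').bak"
    # crontab -l fails when no crontab exists yet; keep an empty backup then.
    crontab -l > "$backup_file" 2>/dev/null || : > "$backup_file"
    echo "$backup_file"
}
```

Calling backup_crontab "$HOME" before an editing session leaves a file that can be reinstalled later with crontab {path-to-backup-file}.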
When you exit and save the changes, the crontab -e command should signal the cron daemon to reload the configuration to activate the new tasks. If this does not happen automatically, reload it manually like this:
kill -HUP {cron-process-PID-value}
Use command substitution to build a signalling instruction (line-breaks added for clarity):
kill -HUP $(ps aux |
grep -i "/crond" |
grep -v grep |
tr -s ' ' |
cut -d ' ' -f 2)
The grep commands filter the ps listing to extract the line we need. The second grep discards the first grep command itself from the list. The tr and cut commands return the PID number from the result, and the substitution passes it to the kill command. On systems where the daemon is named cron rather than crond, adjust the search pattern to suit.
Although the command is named kill, its real job is to send signals to processes, so a more benign name would suit it better. Here the HUP signal asks the daemon to reload, not to quit.
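To see what each stage contributes, the filtering part of the pipeline can be run against a captured listing instead of a live one. This is a sketch. The function name extract_cron_pid and both sample lines are invented for illustration.

```shell
#!/bin/sh
# Sketch only: the PID-extraction stages of the pipeline, reading a
# ps-style listing on stdin instead of calling ps directly.
extract_cron_pid() {
    grep -i "/crond" |
    grep -v grep |
    tr -s ' ' |
    cut -d ' ' -f 2
}

# Two invented sample lines: the daemon and the grep that is looking for it.
printf '%s\n' \
    'root       812  0.0  0.1  12345  2345 ?  Ss  09:00  0:00 /usr/sbin/crond -n' \
    'editor    4310  0.0  0.0   6789  1234 pts/0 S+ 09:05  0:00 grep -i /crond' |
    extract_cron_pid
```

With these two sample lines the pipeline prints 812. The second line is dropped because it contains the word grep, and the PID is the second field once the repeated spaces are squeezed.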
Configure the run-time environment
The run-time environment can be altered with optional special variables at the head of the crontab:
Definition | Description |
---|---|
SHELL=/bin/bash | Override the shell that cron uses to run the tasks (usually /bin/sh by default). |
MAILTO=anotheruser | All output from the task is sent by email unless it is redirected. Define the recipient here. |
CRON_TZ=Europe/London | Localise the task to run with a different time-zone setting, where the cron implementation supports it. |
Note: This environment will apply to all tasks described in the crontab.
Crontab task entries
The format of a crontab line is very simple. Five space-separated values form the Time-spec that describes when the task will run. The rest of the line is the command to be run:
{time-spec} {task-command-line}
Tasks are deactivated with a hash character (#) prefix. This prevents the task from being scheduled but keeps it intact for later use.
#{time-spec} {task-command-line}
Unescaped percent signs (%) in the command represent newline characters. The text after the first percent sign is passed to the standard input of the command that precedes it, with each further percent sign starting a new line. Escape a literal percent sign with a backslash (\%).
{time-spec} {task-command-line}%{redirected-to-stdin}
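Redirecting the standard output of the command to /dev/null (or any other file) keeps it out of the mail message containing the task output. Error output is still mailed unless it is redirected as well, for example by appending 2>&1.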
{time-spec} {task-command-line} > /dev/null
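The entry forms above can be collected into one candidate crontab. This sketch only writes the text to a temporary file for review and installs nothing. Every task, path and file name in it is hypothetical, and the Time-spec values it uses are explained in the next section.

```shell
#!/bin/sh
# Sketch only: write a candidate crontab to a temporary file for review.
# Nothing is installed; every path and task below is hypothetical.
candidate=$(mktemp)

cat > "$candidate" <<'EOF'
SHELL=/bin/bash
MAILTO=anotheruser

# Active task: each literal percent sign in the command is escaped.
0 4 * * * df -k > "$HOME/disk_$(date +\%Y\%m\%d).txt"

# Text after the first unescaped percent sign becomes standard input.
0 9 * * 1 cat > "$HOME/cron_note.txt"%First line%Second line

# Deactivated task, kept intact for later use.
#0 0 * * * /my_tools/rotate_logs.sh > /dev/null
EOF

cat "$candidate"
```

After review, a file like this could be installed with crontab {path-to-candidate-file}, which replaces the existing per-user crontab.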
Time-spec format
The space-separated Time-spec describes when a task is scheduled to run:
{minute} {hour} {day-of-month} {month} {day-of-week}
Field | Value range |
---|---|
{minute} | 0 to 59 |
{hour} | 0 to 23 |
{day-of-month} | 1 to 31 depending on the month. |
{month} | 1 to 12 or a three-letter abbreviation. |
{day-of-week} | 0 to 6 (Sunday to Saturday) or a three-letter abbreviation. |
Use a wildcard asterisk (*) to match all possible values. A range of values can be specified with a dash character (-) and a comma (,) can be used to separate a list of values or ranges.
The {month} field must always match. When both {day-of-month} and {day-of-week} are restricted (neither is an asterisk), the task will run if either of them matches the current day; otherwise both must match.
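The matching rules for a single field, and the special treatment of the two day fields, can be sketched in a few lines of shell. This is an illustration of the rules and not the code that cron itself runs. The function names field_matches and day_matches are hypothetical, and only numeric values are handled (no three-letter abbreviations).

```shell
#!/bin/sh
# Sketch only: numeric Time-spec field matching.

# field_matches PATTERN VALUE
# Succeeds when VALUE satisfies a field written as *, a list or a range.
field_matches() {
    fm_value=$2
    [ "$1" = "*" ] && return 0
    # Turn the comma-separated list into separate items.
    for fm_item in $(printf '%s' "$1" | tr ',' ' '); do
        case $fm_item in
            *-*)
                fm_low=${fm_item%-*}
                fm_high=${fm_item#*-}
                if [ "$fm_value" -ge "$fm_low" ] && [ "$fm_value" -le "$fm_high" ]; then
                    return 0
                fi
                ;;
            *)
                [ "$fm_value" -eq "$fm_item" ] && return 0
                ;;
        esac
    done
    return 1
}

# day_matches DOM_PATTERN DOW_PATTERN DAY_OF_MONTH DAY_OF_WEEK
day_matches() {
    if [ "$1" = "*" ] || [ "$2" = "*" ]; then
        # At least one day field is a wildcard, so both must match.
        field_matches "$1" "$3" && field_matches "$2" "$4"
    else
        # Both day fields are restricted: either one is enough.
        field_matches "$1" "$3" || field_matches "$2" "$4"
    fi
}

# Day fields "13 5" mean the 13th of the month or any Friday.
day_matches 13 5 13 2 && echo "13th on a Tuesday: run"
day_matches 13 5 14 2 || echo "14th on a Tuesday: skip"
```

Both echo lines are printed. The first date matches on the day of the month alone, and the second matches neither day field.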
Here are some Time-spec examples:
Time-spec | Description and example purpose |
---|---|
0 8 * * 1 | 8:00 AM Monday - Deliver a weekly report. |
0 4 * * * | 4:00 AM every morning - Run a garbage collection task. |
* * * * * | Run every minute - Measure disk space, count processes or check workflow queues for stalled jobs, intrusion checks. |
0 * * * * | Run once an hour - Database backups. |
0 0 * * * | At midnight - Rotate the log files. |
0 0 * * 0 | Every week at midnight on Sunday - Analyse data for reports. |
0 0 1 * * | At midnight on the first day of every month - Housekeeping tasks. |
0 0 1 3,6,9,12 * | Every 3 months - Compile reports. |
0 0 1 1 * | New Year's Day - Big garbage collection. |
The complete crontab line for delivering a weekly report looks like this. Email is inhibited here because that would be handled inside the script:
0 8 * * 1 /my_tools/run_weekly_report.sh > /dev/null
Deploying tasks
The crontab tool is easy to use but accessing it from a dashboard implemented in PHP is difficult.
Adding a layer of abstraction can simplify your architecture at the expense of a little extra coding. Using data-driven techniques to let the file-system work for you results in more flexible designs.
Implement a task manager written as a shell-script. The task manager is called by cron but loads plug-in tasks from a folder. These are picked up with an ls command and passed to a while loop that executes them one-by-one. Tasks can be added or removed without needing to rebuild the crontab. We will explore this idea in more detail soon.
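A first rough sketch of such a task manager is shown here, under some stated assumptions. The function name run_tasks is hypothetical, and each plug-in task is assumed to be a shell script whose name ends in .sh. A real design might use a different convention.

```shell
#!/bin/sh
# Sketch only: run every plug-in task found in a folder, one by one.
# Assumes (for illustration) that each task is a shell script named *.sh.
run_tasks() {
    task_dir=$1
    ls "$task_dir" | while read -r task_name; do
        case $task_name in
            *.sh)
                # Detach stdin so a task cannot swallow the file listing.
                sh "$task_dir/$task_name" < /dev/null ||
                    echo "Task failed: $task_name" >&2
                ;;
        esac
    done
}

# When called by cron, the plug-in folder is passed as the first argument.
if [ "$#" -gt 0 ] && [ -d "$1" ]; then
    run_tasks "$1"
fi
```

A single crontab line such as * * * * * /my_tools/task_manager.sh /my_tools/tasks.d > /dev/null (both paths are hypothetical) would then drive every task in the folder. Because ls sorts the names, a numeric prefix on each file name controls the running order.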
Conclusion
Build monitoring tasks with simple components and defensive coding techniques. Implement self-healing code to fix problems automatically. Almost no maintenance is required after deployment unless you alter something the tasks depend on. Strive for elegant simplicity.