IP Monitoring & Diagnostics With Command Line Tools: Part 10 - Example Monitoring Probes
A server will experience problems when the processing demands hit a resource limit. Observing trends by measuring and comparing results periodically can alert you before that happens.
Servers are designed to be resilient, but a few things can go so badly wrong that they kill a server. Reaching a resource limit is the most likely cause. Disk space, available memory and CPU usage must all be carefully capacity planned. Rogue processes and network connectivity issues are also culprits. Capturing key measurements and checking them against nominal values indicates whether the server is running optimally.
Managing Potential Problems
Predicting early on that a problem might escalate and bring down a server allows sufficient time to solve it in advance. If important services cannot respond to incoming connections, this leads to client-side problems. A failure in one machine may cascade to others. Consuming most or all of a particular resource is a likely cause:
• Available CPU capacity.
• Available physical memory.
• Available memory swap space.
• Memory leaks per process.
• Disk space filling up.
• Disks or shared file systems going offline.
• Process counts.
• Zombie processes.
• File buffer limits per process.
• Number of files per file system.
• Network latency.
• Network bandwidth/throughput.
Define sensible measurement thresholds for these characteristics and watch for resource allocations rising to meet them. Resource limits are managed under these categories:
• System-wide hard maximum limits are often configured into the kernel. Only the ops team should alter them and then reboot the server.
• User limits apply to an individual login session. The ops team will define maximum values. The ulimit command tunes your session within them (see the sketch after this list). Only the ops team can increase the upper limits.
• Application specific limits are imposed by services such as database managers. Alter these with their config files. There may be maximum values that require ops team intervention. In rare cases, the limits may be defined in the application source code, which requires a rebuild to change.
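As a minimal sketch, the ulimit built-in in a bash login shell reports and adjusts per-session limits such as the maximum number of open file descriptors:

# Report all current soft limits for this session
ulimit -a

# Report the soft and hard limits for open file descriptors
ulimit -Sn
ulimit -Hn

# Raise the soft limit for open files (4096 is an arbitrary example),
# up to the hard limit defined by the ops team
ulimit -n 4096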
Designing Your Own Measuring Probes
Design your monitoring system in a disciplined way. Consider these decision points for each probe:
• What is being measured?
• How can it be measured?
• Which machines will deploy this test?
• How often will the measurement be taken?
• Will the probe capture and display a single value for a live status check, or a time-based series for a trend?
• Cache the results in a file or database table?
• Is real-time feedback needed? Database caching is optimal for real-time data.
• Are the results pushed from the remote system or pulled by the central cortex?
• Is an HTTP web server end-point helpful for fetching results?
• Is file sharing useful for delivering results?
The examples demonstrate different measurement solutions. Dismantle them and reuse the ideas as a starting point for your own monitoring probes.
Don't forget to put a shebang at the top of every script so that it runs in the correct shell interpreter. These examples are all based on the bash shell.
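A minimal probe skeleton, assuming a hypothetical script name and log file, might look like this:

#!/bin/bash
# probe_example.sh - hypothetical skeleton for a monitoring probe

# Record a timestamped status line in a local log file
MY_STATUS="OK"
echo "$(date '+%Y-%m-%d %H:%M:%S') $(hostname) ${MY_STATUS}" >> ./PROBE_STATUS.log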
Ping Testing For Reachability
If the target node is unreachable, the ping command will time out and return a message that describes the packet loss. Filter the result to detect that outcome.
MY_PACKET_LOSS=$(
ping -c 1 {host-name-or-ip} |
tr ',' '\n' |
grep "packet loss" |
cut -d ' ' -f 2
)
Split the result based on comma characters (,) by converting them to line-breaks (\n) with a tr command. Use grep to isolate the line with the "packet loss" message. Then cut will split the line using spaces as the delimiter. Take field 2 because there is a leading space on the line.
Now use an if test to check for 100% packet loss and record the target node as being unreachable.
if [ "${MY_PACKET_LOSS}" = "100.0%" ]
then
echo "Unreachable" >> ./MACHINE_STATUS.log
else
echo "Online" >> ./MACHINE_STATUS.log
fi
Use an equals character (=) to compare text strings. The -eq test expects integer values and would be inappropriate in this case.
Checking For Closed Network Ports
Use the nmap or netcat (nc) commands to check whether a port is open on a remote system.
These tools may need to be installed first because they are not always available by default.
The nc command is very easy to use:
nc -zv {host-name-or-ip} {port}
The -z flag tests whether the port is listening without transmitting any data once the connection succeeds. This avoids waking up remote daemons and triggering spurious activity on the remote machine. The -v flag provides the verbose output needed for the result. Filter the result for the "Connection refused" message.
MY_PORT_TEST=$(nc -zv {host-name-or-ip} {port} 2>&1 |
tr ':' '\n' |
grep "Connection refused" |
sort -u |
wc -l |
tr -d ' ')
if [ "${MY_PORT_TEST}" -eq "1" ]
then
echo "Port closed"
else
echo "Port open"
fi
The useful part of the result is an error message delivered on the STDERR stream. This is not passed to the rest of the toolchain when commands are piped together with a vertical bar (|). The STDERR stream must be redirected into the STDOUT stream (2>&1) first in order to access its content. The tr command adds line-breaks so the grep command can filter the result. The sort -u command removes duplicate results. Then the wc command yields an integer 1 or 0 as a result. The final tr removes the whitespace introduced by the wc command. The result is tested with the -eq option which compares integer values:
• 1 = Port closed
• 0 = Port open
Add a logging line to note the hostname, symbolic name and timestamp with the port status.
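A minimal sketch of such a logging line, assuming a hypothetical MY_PORT_STATUS variable set inside the if test above and a PORT_STATUS.log file:

MY_PORT_STATUS="Port open"    # Hypothetical - set to "Port open" or "Port closed" inside the if test

echo "$(date '+%Y-%m-%d %H:%M:%S') $(hostname) {host-name-or-ip} {port} ${MY_PORT_STATUS}" >> ./PORT_STATUS.log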
Detecting Missing Disks
Entire disks may vanish if they are shared from another system that suffers a network disconnect or shuts down. Hard drives can fail to spin up at boot time or be unmounted when they fail or overheat.
Filter the output of the df command and check whether the 'Mounted on' column lists the volume you expect to be there. A missing shared volume might indicate that one of your other machines is down.
df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 2451064 1027040 1321624 44% /
none 512652 0 512652 0% /dev
/tmp 516844 812 516032 1% /tmp
/volume1 516844 6108 510736 2% /volume1
/dev/shm 516844 4 516840 1% /dev/shm
Test for a missing mount point like this:
df | grep "/volume1" | wc -l
The result counts will be:
• 1 = Disk mounted - everything OK
• 0 = Missing disk
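Wrapped in a command substitution, the same test can drive an if statement. A minimal sketch, assuming the /volume1 mount point from the example above and a hypothetical DISK_STATUS.log file:

MY_DISK_TEST=$(df | grep "/volume1" | wc -l | tr -d ' ')

if [ "${MY_DISK_TEST}" -eq "1" ]
then
echo "Disk mounted" >> ./DISK_STATUS.log
else
echo "Missing disk" >> ./DISK_STATUS.log
fi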
Measuring The Available Disk Space
Use grep to isolate the line you want, collapse the runs of space characters with a tr command, then use cut to extract column 5, which holds the percentage of the allocated space already in use.
MY_VOLUME1=$(df |
grep "/volume1" |
tr -s ' ' |
cut -d ' ' -f 5)
Trigger a warning when the usage threshold is exceeded.
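A minimal sketch of that warning, assuming a hypothetical 90% threshold and the same DISK_STATUS.log file, stripping the trailing % sign before the integer comparison:

MY_USED_PERCENT=$(echo "${MY_VOLUME1}" | tr -d '%')

if [ "${MY_USED_PERCENT}" -ge "90" ]
then
echo "Warning: /volume1 is ${MY_VOLUME1} full" >> ./DISK_STATUS.log
fi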
Spotting Zombie Processes
Use a ps command to display a user-defined format. This example presents just the process state, PID number and the command:
ps -eo stat,pid,comm
The stat column will contain a letter 'Z' if the process is Zombified.
STAT PID COMMAND
Ss 11679 mobileassetd
Z 13192 networkserviceproxy
S 19596 nsurlsessiond
S 30572 nsurlstoraged
S 32727 opendirectoryd
Use grep to detect a letter 'Z' in the first column. The circumflex character (^) represents the start of the line. Tell grep to ignore upper/lower case (-i) when matching:
grep -i "^Z"
Add a word-count and wrap this in a command substitution to assign the result to a variable:
MY_ZOMBIE_COUNT=$(ps -eo stat,pid,comm |
grep -i "^Z" |
wc -l)
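A minimal sketch of a follow-up check, assuming a hypothetical ZOMBIE_STATUS.log file:

if [ "${MY_ZOMBIE_COUNT}" -gt "0" ]
then
echo "$(date '+%Y-%m-%d %H:%M:%S') ${MY_ZOMBIE_COUNT} zombie processes detected" >> ./ZOMBIE_STATUS.log
fi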
Checking Physical Vs Swap Memory Space
The free -m command shows the available RAM and swap space. Install this add-on command if it is not already available by default. The swapon command is also useful but it provides less information.
free -m
total used free shared buff/cache available
Mem: 1009 215 15 56 778 654
Swap: 2047 0 2047
Roughly a fifth of the available physical memory is in use. There has been no need to use any swap space so far.
Extract the most useful items like this:
TOTAL_MEMORY=$(free -m | grep "Mem:" | tr -s ' ' | cut -d ' ' -f 2)
USED_MEMORY=$(free -m | grep "Mem:" | tr -s ' ' | cut -d ' ' -f 3)
TOTAL_SWAP=$(free -m | grep "Swap:" | tr -s ' ' | cut -d ' ' -f 2)
USED_SWAP=$(free -m | grep "Swap:" | tr -s ' ' | cut -d ' ' -f 3)
If ${USED_MEMORY} is less than ${TOTAL_MEMORY} and ${USED_SWAP} is 0, you have sufficient physical memory. If ${USED_SWAP} increases significantly, add more physical memory.
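A minimal sketch of that check, assuming the variables captured above and a hypothetical MEMORY_STATUS.log file:

if [ "${USED_SWAP}" -eq "0" ]
then
echo "Physical memory OK (${USED_MEMORY}MB of ${TOTAL_MEMORY}MB used)" >> ./MEMORY_STATUS.log
else
echo "Swap in use (${USED_SWAP}MB of ${TOTAL_SWAP}MB) - consider adding RAM" >> ./MEMORY_STATUS.log
fi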
Detecting An Overloaded CPU
Use the top command to reveal CPU loading as well as memory usage.
top -b -n 1
PID USER PR NI VIRT RES %CPU %MEM TIME+ S COMMAND
1 root 20 0 4.3m 2.6m 0.0 0.3 0:12.78 S /sbin/init
2 root 20 0 0.0m 0.0m 0.0 0.0 0:00.00 S [kthreadd]
Column 7 indicates how much of the CPU capacity is consumed by each process. The memory usage is in column 8.
A process might occasionally require 100% of the CPU for a few moments before going back to zero. Processes that hog the CPU capacity continuously need to be inspected to find out why.
The ps command is also useful for observing CPU usage but it uses a different method to calculate the percentage used. This example ranks the processes numerically by CPU usage and lists only the top 10 culprits with their process IDs:
ps -eo %cpu=,pid= | sort -rn | head -10
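A minimal sketch that logs the heaviest consumer, assuming a hypothetical CPU_STATUS.log file:

MY_TOP_PROCESS=$(ps -eo %cpu=,pid=,comm= | sort -rn | head -1)

echo "$(date '+%Y-%m-%d %H:%M:%S') Top CPU consumer: ${MY_TOP_PROCESS}" >> ./CPU_STATUS.log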
Detecting Potential Memory Leaks
Observing an increasing memory usage over time for a constantly running process suggests a potential memory leak in that application. Fix the code to remove the leak.
In the meantime, regularly stopping and restarting the process may help avoid using up all the memory.
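A minimal sketch that records the resident memory of one process over time so the trend can be plotted, assuming a hypothetical PID and MEMORY_TREND.log file:

MY_PID=12345    # Hypothetical process ID of the long-running service being watched

MY_RSS=$(ps -o rss= -p ${MY_PID} | tr -d ' ')

echo "$(date '+%Y-%m-%d %H:%M:%S') PID ${MY_PID} RSS ${MY_RSS}KB" >> ./MEMORY_TREND.log

Run this from a scheduler such as cron and plot the logged values to see whether the memory footprint keeps growing.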
Conclusion
Be aware of the differences in command output for each operating system and alter these examples accordingly.
Infer problems in other machines by attempting a connection or by detecting that a shared disk they provide is no longer mounted.
Drill down into the results and track them over a time period. Measuring disk capacity on a daily basis allows you to predict when a disk will become full by observing the trend. Schedule a disk upgrade or clean-up to remove unnecessary content before that happens.
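As a sketch, assuming a hypothetical probe script path, a crontab entry like this would take the disk measurement once a day at 6am:

0 6 * * * /usr/local/bin/probe_disk_space.sh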
Predicting failures and preventing them is much easier than waiting for them to happen and rectifying the damage afterwards. It is much kinder to your end-users and much less hassle for you.