If It Ain’t Broke Still Fix It: Part 1 - Reliability
IP is an enabling technology that provides access to the massive compute and GPU resource available both on- and off-prem. However, the old broadcasting adage ‘if it ain’t broke, don’t fix it’ is no longer relevant, and is potentially highly dangerous, especially when we consider the consequences of failed hardware.
Hardware reliability has been a concept at the heart of every broadcast facility design, as traditional SDI/AES workflows almost exclusively used hardware, often with 100% inbuilt redundancy. But this was at a time when change was measured in years, not months or even weeks as it is now. One of the greatest achievements of IP is that it has allowed many broadcasters to take advantage of COTS-type equipment, but this also has the potential to be a weakness if we don’t treat IT equipment with the respect it deserves.
Why Servers Fail
Computer servers have only a few moving parts: fans to cool the electronics, and switches to turn the server on and off (which are only occasionally used). So, the question is, why should we be worried about their reliability? Just like SDI/AES equipment, servers suffer the same potential catastrophic failures due to hardware malfunction. Although very rare, servers do suffer from dry joints, component failure and environmental impact. A stress fracture may occur when a server is dropped off the loading ramp of a cargo plane, yet remain hidden until the right combination of heating and cooling cycles causes the circuit board fracture to manifest itself. Servers are reliable, but they are still prone to failure.
MTBF (Mean Time Between Failures) should help us determine how reliable a piece of hardware is, but we soon get into trouble with this measurement (assuming the vendor supplies it) as there are so many factors outside of the immediate hardware that have the potential to affect its reliability. For example, the temperature of the operating environment, pressure, vibration and heat from other equipment, air purity and humidity, and power quality, to name but a few, all affect the server’s reliability. Spikes in the power supply can have a significant impact on the longevity of a server’s power supplies, even if overvoltage suppressors are installed.
One of the challenging aspects of relying on MTBF as an absolute measure is that adding hardware appears to increase the rate of failure. For example, a hard disk drive may have an MTBF of 750,000 hours; in other words, on average an HDD will fail once every 750,000 hours. If we double the number of HDDs, then the expected number of failures increases to two in 750,000 hours, and if we triple the number of HDDs it becomes three failures in 750,000 hours, or on average, one failure every 250,000 hours. From this simple example, we can see that increasing the number of HDDs appears to reduce the reliability of the system, even though each individual drive is no less reliable.
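This scaling is simple enough to sketch in a few lines of code. The Python below is an illustration only, using the 750,000 hour figure from the example and assuming identical drives that fail independently.

```python
# Minimal sketch (illustrative figures only): how the aggregate failure rate
# scales with the number of identical drives, assuming independent failures.

MTBF_HOURS = 750_000  # per-drive MTBF quoted in the example above

def fleet_mtbf(num_drives: int, mtbf_hours: float = MTBF_HOURS) -> float:
    """Expected hours between failures anywhere in a fleet of identical drives."""
    failure_rate_per_drive = 1.0 / mtbf_hours       # failures per hour, per drive
    fleet_failure_rate = failure_rate_per_drive * num_drives
    return 1.0 / fleet_failure_rate

for n in (1, 2, 3):
    print(f"{n} drive(s): one failure on average every {fleet_mtbf(n):,.0f} hours")
# 1 drive(s): one failure on average every 750,000 hours
# 2 drive(s): one failure on average every 375,000 hours
# 3 drive(s): one failure on average every 250,000 hours
```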
Resource – Reliability Compromise
RAID storage solves this conundrum by trading storage capacity for reliability. RAID5 uses disk striping with distributed parity to greatly improve data redundancy, such that should a disk drive fail, data storage and retrieval continues reliably, and even better, the faulty drive can be replaced without switching the system off. The RAID5 controller then rebuilds the missing data onto the newly installed disk from the striped data and parity held on the other four drives, restoring the RAID’s original resilience. Here, we have traded storage capacity for data integrity and reliability, as a five-drive RAID5 array only provides the useable capacity of four individual drives.
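The rebuild described above relies on the XOR parity at the heart of RAID5: the parity block in each stripe is the XOR of the data blocks, so any one missing block can be recomputed from the survivors. The sketch below is a toy illustration of that principle only; real controllers work on fixed-size blocks and rotate the parity across all five drives.

```python
# Toy sketch of the XOR parity idea behind a RAID5 rebuild.
# Each "drive" here is just an equal-length byte string.

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# Four data "drives" and one parity "drive" (parity = XOR of the data).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(*data)

# Simulate losing drive 2: its contents can be rebuilt from the survivors.
rebuilt = xor_blocks(data[0], data[1], data[3], parity)
assert rebuilt == data[2]
print("rebuilt drive 2:", rebuilt)
```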
Does this then mean that a RAID5 system is infinitely reliable? No, is the quick answer. Statistically, for a second HDD to fail before the first failed drive has been replaced and rebuilt is highly improbable, but it’s not impossible! If we assume that the common factors of all five HDDs can be discounted, such as the power rails and driver chips connected to the common data and address busses, and that the failure of one HDD doesn’t pop another HDD by spiking the power rail, then each HDD is in effect an independent statistical event. Just because each of the five HDDs has an MTBF of 750,000 hours, it doesn’t follow that if HDD0 fails, then HDD1-4 will not. There is a statistical probability, albeit very small, that enough HDDs could fail within the same time window to render the complete RAID5 device useless. To reiterate, the HDDs are independent systems with their own event probability spaces.
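To put a rough figure on “very small”, the sketch below estimates the chance that a second, independent drive fails during a hypothetical 24-hour rebuild window, assuming exponentially distributed failures with the same 750,000 hour MTBF. The window length and the exponential model are illustrative assumptions, not figures from the article.

```python
import math

MTBF_HOURS = 750_000          # per-drive MTBF from the example above
REBUILD_WINDOW_HOURS = 24     # hypothetical rebuild window, for illustration
SURVIVING_DRIVES = 4          # drives left after one failure in a 5-drive RAID5

# Assuming independent, exponentially distributed failures:
# P(a given drive fails within T) = 1 - exp(-T / MTBF)
p_one = 1 - math.exp(-REBUILD_WINDOW_HOURS / MTBF_HOURS)

# P(at least one of the survivors fails during the rebuild window)
p_any = 1 - (1 - p_one) ** SURVIVING_DRIVES

print(f"P(second failure during rebuild) ~ {p_any:.6%}")
# Roughly 0.0128% -- small, but emphatically not zero.
```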
Approaching 100% Reliability
Have we therefore concluded that MTBF is completely useless? No, not at all, as it acts as a very good aid for design engineers developing systems such as a RAID5 array or a computer server. So how far do we take this? If the RAID5 isn’t 100% reliable, then how can we be sure that our high-value media is 100% safe? We can’t! But we can back up the RAID5 system to another onsite storage facility, which can then be backed up to a public cloud provider, which can then be backed up to an underground bunker facility, and on it goes. The same can be said of servers and other infrastructure equipment. Where do we stop with our backups to backups? Usually when the Finance Director says NO!
It could be argued that a computer server is not as reliable as a RAID5 network storage device as it doesn’t have inbuilt redundancy at a CPU level. Although we do have to be very careful with this analogy, as the RAID5 network storage device probably only has one main CPU controller, so the reliability of the whole system is potentially compromised, even though the data integrity is not. Even if the RAID controller fails, the striped and parity-protected data across all five HDDs should still be valid. This highlights the need for server resilience.
Assuming Failure
An alternative way of thinking about workflow reliability is to assume that equipment will fail, and then develop strategies to deal with this. The concept goes way beyond the dual A-B redundancy many engineers have become accustomed to in traditional SDI/AES workflows. Instead of trying to make servers more and more reliable, we assume that one in every x servers will fail over a given time window, such as a month.
Modern high-end servers can process 30+ HD transport streams for playout simultaneously. This is a fantastically efficient use of resource, but is it the most reliable method of operation? If the server fails, then all 30 services simultaneously fail. But if multiple servers are used and the transport streams are distributed between them, then a more reliable system can be built. By taking this approach, we move away from purely engineering decisions to business-engineering decisions, where the overall solution becomes a compromise between the amount of available money and the risks the business wants to take. It’s important to note that a broadcaster (or anybody else) can never achieve 100% certainty, and after a while the returns achieved in terms of reliability become ever smaller with the increasing amounts of resource, and hence money, added to the system. This is the concept of diminishing returns.
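A minimal illustration of that business-engineering trade-off, using the 30-service figure above and assuming the services are spread evenly across the servers: the more servers the streams are distributed over, the fewer services a single failure takes off air, but each additional server adds cost.

```python
# Minimal sketch of the consolidation vs distribution trade-off.
# The 30-service figure comes from the text; the server counts are illustrative.

TOTAL_SERVICES = 30

def services_lost_per_failure(num_servers: int) -> float:
    """Average number of services lost when any one server fails,
    assuming services are spread evenly across the servers."""
    return TOTAL_SERVICES / num_servers

for servers in (1, 3, 6, 10):
    print(f"{servers:>2} server(s): a single failure takes down "
          f"~{services_lost_per_failure(servers):.0f} services")
```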
Resource Divergence & Microservices
Virtualization has contributed greatly to server redundancy and resilience to provide greater reliability. Multiple host clusters can be configured so that a failed server will have little or no impact on the business, including broadcast workflows. But an even better approach is that of distributed services. By dividing the processes into smaller and smaller functions and distributing them over more and more compute resource, including network diversity, we have the concept of microservices.
With suitable software management strategies, microservices can greatly improve a workflow’s reliability in the same way that RAID has done for data storage and retrieval. The whole workflow can be separated over geographically dispersed areas so that resilience is significantly improved. Furthermore, this divergence can be dynamic, so that extra resource can be dialed in when more resilience is required. For example, the live Super Bowl transmission will need much more resilience than a 3am sitcom re-run, thus providing an effective compromise between cost and reliability.
To be truly reliable, the entire workflow must be thoroughly tested, and this means regular, active and continuous failure testing. Even if a server has an MTBF of 750,000 hours, a broadcast facility employing 1,000 servers throughout its workflows could, on average, see one server fail every month (750,000 hours ÷ 1,000 servers ÷ 24 hours ≈ 31 days). What we don’t know with any certainty is which server will fail, and when. If we build a fully resilient infrastructure that can tolerate two simultaneous server failures, then we should be looking at randomly switching one server off per week and watching to see how the infrastructure recovers and self-heals. Some may be shouting “if it ain’t broke, don’t fix it”; I’m saying “it will break, so make sure you’ve properly tested and prepared for failure”.
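A hedged sketch of what such a weekly drill might look like in practice; the server names and the shut_down and is_workflow_healthy functions are hypothetical placeholders for whatever orchestration and monitoring tooling a facility actually uses, not any real API.

```python
import random

# Scheduled failure drill: pick one server at random, take it out of service,
# and confirm that the workflow self-heals.

SERVERS = [f"playout-{n:03d}" for n in range(1, 1001)]  # hypothetical fleet of 1,000

def shut_down(server: str) -> None:
    print(f"[drill] powering off {server}")       # placeholder for real orchestration

def is_workflow_healthy() -> bool:
    return True                                   # placeholder for real monitoring

def weekly_failure_drill() -> None:
    victim = random.choice(SERVERS)
    shut_down(victim)
    if is_workflow_healthy():
        print(f"[drill] workflow self-healed after losing {victim}")
    else:
        print(f"[drill] ALERT: workflow did not recover after losing {victim}")

weekly_failure_drill()
```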
In part 2 we move on to how security now demands regular scheduled maintenance.