Cloud Best Practices - Part 2
Part 1 of this series looked at how availability zones improve reliability and business continuity. This part looks at improving systems further using chaos engineering and SLAs.
Microservice Implementation
The physical constraints of monolithic designs make them very difficult to scale, especially in the highly dynamic systems that are now prevalent in broadcasting. Microservices continue to advance, and not only do they deliver exceptional scalability and resilience, but they can also be cloud service provider agnostic, allowing broadcasters to diversify their infrastructures much more easily.
Cloud vendors do provide microservice and container specific components, but they do not have to be used. If the broadcaster is willing to invest in the microservices learning curve, then they can deploy their own virtual machines and build the required containerized infrastructure on them. This further allows broadcasters to distribute the servers either across multiple cloud regions or across entirely different cloud service providers. Other than learning how to deploy dynamic microservice architectures, one of the other significant challenges is latency.
Using internationally distributed cloud regions allows broadcasters to position the physical datacenter hardware closer to their clients, and this has the potential to significantly reduce latency. Also, interconnected regional datacenters from the same vendor tend to be connected with high speed and dedicated networks, and this allows data to be moved with relatively low latency, much lower than could be achieved with internet connectivity.
However, to achieve the best results for container and microservice architectures, system architects must design for distributed processing, low latency, and security from the very beginning of the build. Doing so also makes it possible to build high levels of redundancy into the broadcast system at all levels of the workflows.
As viewer requirements change, broadcasters can see where and how the systems come under stress, so they can increase capacity as required. And as systems progress and broadcasters learn more about microservice architectures, they can even automate much of the scaling to further improve resilience and flexibility.
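As an illustration of what automated scaling can look like, the following is a minimal sketch of a proportional scaling rule: the pool of service instances is resized so that the average load per instance moves towards a target. The function name, the 50% target, and the instance limits are assumptions chosen for illustration and are not taken from any particular vendor's autoscaler.

```python
def desired_instances(current: int, cpu_percent: int, target_percent: int = 50,
                      min_instances: int = 1, max_instances: int = 20) -> int:
    """Return the instance count that would bring average CPU load to the target.

    All names and default values here are illustrative assumptions.
    """
    if current < 1 or target_percent < 1:
        raise ValueError("current and target_percent must be at least 1")
    # Ceiling division using integers only, so there are no rounding surprises.
    wanted = -(-current * cpu_percent // target_percent)
    # Clamp to the limits so a metric spike cannot scale without bound.
    return max(min_instances, min(max_instances, wanted))


if __name__ == "__main__":
    # Four instances running at 75% average CPU, aiming for 50%.
    print(desired_instances(current=4, cpu_percent=75))
```

The upper limit matters as much as the rule itself, as it caps the cost that a faulty metric or unexpected traffic surge can incur.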
Chaos Engineering
It might seem a contradiction that, having spent so much time and effort making broadcast workflows operate efficiently and reliably, we should then purposely try to break them. But this is exactly how chaos engineering works.
The principle is to introduce controlled chaos into a system so that weaknesses can be quickly identified, allowing DevOps teams to find strategies to strengthen them and improve overall resilience. Although this is a powerful tool for improving reliability, it should only be conducted as part of a planned process. Deliberately deleting the microservice system that plays the opening heads of the six o’clock news with two minutes to transmission, with no planning, is clearly a bad idea.
Figure 2 – Chaos engineering is a continuous practice that aims to improve resilience by testing and stressing infrastructures for unexpected events.
However, this does demonstrate the power of distributed systems such as microservice architectures that have been built from the ground up with resilience and backup in mind. With adequate planning, it should be possible to remove network cables, switch off routers and servers, and delete applications. But no matter how resilient a designer thinks the architecture should be, it’s only when it’s tested using a chaos engineering approach that the true validity of the design becomes apparent.
Chaos engineering isn’t just a one-off test. Instead, it’s an integral part of the system, with “chaos” being injected into the architecture on a regular basis (with adequate planning). And this isn’t restricted to the microservice components but extends to the whole broadcast system. For example, what would happen if the electrical power supply to the datacenter was switched off? Would the UPS carry the load for long enough for the stand-by generator to be switched on and become stable? It’s better to find where the failures are during a planned test than in the middle of the night when a real fault occurs.
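The shape of a planned chaos experiment can be sketched in a few lines. The example below is a toy simulation only: it does not touch real infrastructure, and the class names, the "steady state" rule, and the replica names are all assumptions made for illustration. It checks that the system is healthy before starting, injects a controlled failure, records whether the service still meets its minimum requirement, and then restores what it broke.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str
    healthy: bool = True


@dataclass
class Service:
    replicas: List[Replica]
    min_healthy: int  # steady-state hypothesis: the service works with this many

    def healthy_count(self) -> int:
        return sum(1 for r in self.replicas if r.healthy)

    def steady_state_ok(self) -> bool:
        return self.healthy_count() >= self.min_healthy


def run_experiment(service: Service, rng: random.Random, failures: int = 1) -> bool:
    """Fail `failures` replicas, report whether steady state held, then restore."""
    if not service.steady_state_ok():
        # Never inject chaos into a system that is already degraded.
        raise RuntimeError("aborted: system not in steady state before injection")
    healthy = [r for r in service.replicas if r.healthy]
    victims = rng.sample(healthy, failures)
    for victim in victims:
        victim.healthy = False
    survived = service.steady_state_ok()
    for victim in victims:
        victim.healthy = True  # roll back the injected fault
    return survived


if __name__ == "__main__":
    playout = Service([Replica(f"playout-{i}") for i in range(3)], min_healthy=2)
    rng = random.Random(42)  # fixed seed so the planned test is repeatable
    print("survives one failure:", run_experiment(playout, rng, failures=1))
    print("survives two failures:", run_experiment(playout, rng, failures=2))
```

The pre-check and the rollback are the parts that make the chaos "controlled": the experiment refuses to run on a degraded system and always leaves it as it found it.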
Just like agile processes that govern DevOps, chaos engineering is a way of life that should be embraced as it will continuously improve performance and resilience of an IP broadcast infrastructure.
Continuous System Performance Monitoring and SLAs
Monitoring IT systems goes way beyond the video and audio signal monitoring broadcasters have become accustomed to. Although this is still important and plays a big role for broadcasters, monitoring IP networks, the resources they’re using, and the costs being incurred is equally important.
With dynamic infrastructures, especially those using public cloud services, new resource is added as required, which incurs extra cost. Not only is it important to know when more resource is being allocated, but the effectiveness and efficiency of the algorithms deciding on the new allocation must also be continuously scrutinized so that they’re not over- or under-provisioning cloud services. Also, DevOps teams can learn a lot from how the system is behaving overall and pre-empt any problems that may be about to occur.
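One simple way to scrutinize an allocation algorithm is to compare what it provisioned with what the workload actually needed over a series of monitoring intervals. The sketch below assumes a hypothetical record of (allocated, demanded) capacity units per interval and an arbitrary 25% headroom allowance; both are illustrative choices rather than recommended values.

```python
from typing import Dict, Iterable, NamedTuple


class Sample(NamedTuple):
    allocated: int  # capacity units provisioned in the interval
    demanded: int   # capacity units the workload actually needed


def provisioning_report(samples: Iterable[Sample],
                        headroom_percent: int = 25) -> Dict[str, float]:
    """Return the fraction of intervals that were under, over, or well provisioned."""
    samples = list(samples)
    if not samples:
        raise ValueError("at least one sample is required")
    # Under-provisioned: demand exceeded what was allocated.
    under = sum(1 for s in samples if s.demanded > s.allocated)
    # Over-provisioned: allocation exceeded demand plus the allowed headroom.
    over = sum(1 for s in samples
               if s.allocated * 100 > s.demanded * (100 + headroom_percent))
    total = len(samples)
    return {"under": under / total,
            "over": over / total,
            "ok": (total - under - over) / total}


if __name__ == "__main__":
    history = [Sample(10, 8), Sample(10, 12), Sample(20, 8), Sample(10, 10)]
    print(provisioning_report(history))
```

A rising "over" fraction points to wasted spend, while a rising "under" fraction points to a risk to service quality.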
One example is a slow API response time. This could be due to an unusually long database query or network congestion, and the monitoring will give a more focused indication of where the latency is occurring. Research has demonstrated that users can adapt to reasonable amounts of constant latency, but variable latency is difficult to work with and problematic. Due to the dynamic nature of the infrastructure, latencies can develop and accumulate without warning, hence the need for continuous monitoring.
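The difference between constant and variable latency is easy to miss if only the average is monitored. The following sketch, using made-up sample values and an assumed 20 ms jitter limit, summarizes two sets of response times that share the same mean but behave very differently.

```python
import statistics
from typing import Dict, Sequence, Union


def latency_summary(samples_ms: Sequence[float],
                    jitter_limit_ms: float = 20.0) -> Dict[str, Union[float, bool]]:
    """Summarize response times; jitter is the population standard deviation."""
    if len(samples_ms) < 2:
        raise ValueError("at least two samples are required")
    jitter = statistics.pstdev(samples_ms)
    return {
        "mean_ms": statistics.mean(samples_ms),
        "jitter_ms": jitter,
        "worst_ms": max(samples_ms),
        # Flag the series when its spread exceeds the (assumed) limit.
        "variable": jitter > jitter_limit_ms,
    }


if __name__ == "__main__":
    steady = [100.0, 100.0, 100.0, 100.0, 100.0]  # illustrative values
    bursty = [20.0, 180.0, 40.0, 160.0, 100.0]    # same mean, large spread
    print("steady:", latency_summary(steady))
    print("bursty:", latency_summary(bursty))
```

Both series average 100 ms, but only the second would be flagged, which is why monitoring needs to track the spread of the latency as well as its mean.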
As well as providing deep insight into the operation of the system, continuous monitoring highlights the areas of the infrastructure that would benefit most from service level agreements (SLAs) and helps determine the extent of their cover. Providing blanket SLAs for every piece of equipment or service is costly and inefficient, especially as many vendors now provide varying levels of SLA. Being able to allocate the best SLA to each part of the infrastructure will deliver optimal resilience and efficiency.
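Some simple availability arithmetic shows why SLA levels are worth allocating component by component. The sketch below assumes failures are independent, which is a simplification, and the availability figures in the usage example are invented for illustration. A workflow that needs every component in a chain is only as available as the product of the parts, whereas redundant copies of a component raise its effective availability.

```python
from typing import Iterable

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def serial_availability(availabilities: Iterable[float]) -> float:
    """Workflow needs every component: multiply the individual availabilities."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def redundant_availability(availability: float, copies: int) -> float:
    """Workflow needs any one of `copies` identical, independent components."""
    return 1.0 - (1.0 - availability) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours in a year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical chain: storage, network link, playout service.
    chain = [0.9999, 0.995, 0.999]
    print("chain availability:", serial_availability(chain))
    # Duplicating only the weakest link (the 0.995 network) improves the chain.
    improved = [0.9999, redundant_availability(0.995, 2), 0.999]
    print("with redundant link:", serial_availability(improved))
    print("downtime h/yr at 0.999:", downtime_hours_per_year(0.999))
```

Under these assumptions, spending on the weakest link improves the whole chain more than raising the SLA on a component that is already highly available.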
Conclusion
Best practices for building hybrid on- and off-prem cloud and virtualized infrastructures must be considered at the very beginning of the design. These practices develop further as the functional aspects of the workflows expand, and therefore need to be under constant review. Due to their dynamic nature, best practices, like so many other aspects of broadcast IP infrastructure design and maintenance, should be considered an ongoing process that is always open to development and improvement.