A system with an availability of 99.999% can be down for a maximum of just over 5 minutes per year.

Most of us grew up being told that “no one is perfect.” The legendary football coach Vince Lombardi said, “Perfection is not attainable, but if we chase perfection, we can catch excellence.” When it comes to typical enterprise network uptime, most IT professionals would agree that true perfection is unattainable, at least for now. While IT admins may not shoot for the stars, by and large we aim for the next best thing: the “five-nines.”

The five-nines is the nirvana we strive for.

It represents 99.999 percent uptime, which translates to just over 5 minutes of downtime a year. Drop one of those nines (99.99%), and you are looking at a little over 52 minutes of downtime per year. Reduce it to 99.9%, and we start talking about being down for more than an entire 8-hour shift. Because 99.999% is so close to perfection, only about 10% of enterprises achieve it. It takes a lot of work and dedication to wear the crown of the five-nines, but it’s something that customers, employees, partners, and management have come to expect in today’s digitally transformed world.
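The arithmetic behind these figures is simple: multiply the minutes in a year by the fraction of time the system is permitted to be down. A minimal sketch in Python (the function name is illustrative):

```python
# Minutes in a non-leap year: 365 days * 24 hours * 60 minutes = 525,600.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes(availability_pct: float) -> float:
    """Maximum annual downtime, in minutes, for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% uptime allows {downtime_minutes(pct):.2f} minutes of downtime/year")
```

Running this reproduces the figures above: roughly 526 minutes (an 8-hour-plus shift), 52.6 minutes, and 5.3 minutes, respectively.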

How to Achieve the Five-Nines


Obtaining a near-faultless state of reliability starts with purchasing high-quality equipment and bug-free software to run on top of it. It requires coordinated, detailed management of hardware maintenance, upgrades, migrations, updates, and required reboots of servers and infrastructure devices. Strict adherence to best practices in implementation and configuration is essential as well. Above all, though, the pursuit of minimal downtime involves the two R’s: redundancy and resiliency.

Learn More: NetFoundry on Why Zero Trust Networking Is a Business Imperative 

Defining Redundancy and Resiliency 


While both principles are critical to the design and deployment of your data center and enterprise infrastructure, the two terms are sometimes muddled together. Let’s start by defining these two essential terms of network architecture.

    • Redundancy is the concept of duplicating system components to ensure a system’s dependability. 
    • Resiliency is a system’s ability to recover from some type of failure or continue operating despite it, thus avoiding any disruption to normal operation.

While the two are defined differently, they naturally intertwine. Resiliency requires redundancy, and redundant components need to be resilient to maximize the effectiveness of redundant designs. It’s important to remember that the redundant path must perform adequately for as long as it takes to repair the primary one. To use a car analogy, while a temporary spare may get you as far as the repair shop, it may not prove robust enough to get you to your final destination by a given deadline. Uptime doesn’t mean simple connectivity; you must keep the user experience alive throughout.

Stretching the Limits of Redundancy


So here’s a simple equation to remember: greater redundancy = greater resiliency. To illustrate, take the core switch in an average data center, which is a single point of failure when it lacks any sort of redundancy. At the very least, you want two power supplies so the switch stays up if one of them fails. While this is a good start, the resiliency coverage of redundant power supplies is restricted to the components themselves, so let’s go down the rabbit hole a little further.

Plugging each power supply into a separate UPS expands the scope of redundancy in the event of a power disruption. Now let’s plug each UPS into its own dedicated power circuit. We can carry it further still by ensuring that each circuit draws power from a separate power grid if need be.
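A little probability shows why layering redundancy like this pays off: if the two power paths fail independently, the system is down only when both are down at once. A sketch of the math, assuming an illustrative 99.9% availability per path (real paths share failure modes, which is exactly why separate circuits and grids matter):

```python
def combined_availability(path_availability: float, paths: int = 2) -> float:
    """Availability of N redundant, independently failing paths.

    The system is unavailable only when every path fails simultaneously,
    so the combined failure probability is the product of the individual ones.
    """
    failure = 1 - path_availability
    return 1 - failure ** paths

# Two independent 99.9% power paths combine to roughly six nines.
print(combined_availability(0.999))
```

The independence assumption is the whole game: two power supplies on the same circuit fail together when that circuit does, so the math only holds once the paths are truly separated.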

While we have fully scaled out the resiliency architecture powering this core switch, we have yet to address its other fault avenues. Doing so requires examining how the switch operates across the various layers of the OSI model. For instance, creating redundant links with the Link Aggregation Control Protocol (LACP) is recommended for critical connections. If the switch also handles basic routing functions, you have to consider layer 3 as well. At the very least, redundant gateway paths are necessary, which may involve dynamic routing protocols.

Learn More: Edge Data Centers: The Cure for Latency 

Different Types of Redundancy 


What if the switch OS becomes corrupt after a firmware upgrade? Failure of this magnitude requires box-level redundancy, which can be achieved by implementing a second physical box or utilizing stackable switches. Running the exact same software version on both boxes may seem like the obvious choice, but it isn’t always best practice.

For instance, what if your data center became infected with a malware strain that targeted a software exploit found in a specific OS version? In that case, both boxes could go down in identical fashion.

This is why there are multiple types of redundancy designs.

    • Active redundancy – In this case, the paired components operate simultaneously. In addition to its redundant design, active redundancy can also distribute traffic evenly across shared paths, thus speeding performance. This type of design is commonly used for perimeter firewalls.
    • Passive redundancy – While recovery may not be as instantaneous as with active redundancy, it provides a simpler design and is ideal for application or storage components.
    • Homogeneous redundancy – This involves pairing a designated device with the exact brand and model to create a uniform, identical redundant pair. This is prevalent for switches, access points and other network infrastructure devices.
    • Diverse redundancy – This produces perhaps the optimum level of redundancy in that each paired device may not be susceptible to a single cause of failure.
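The difference between the active and passive designs above can be reduced to a few lines of routing logic. A minimal sketch with hypothetical node names; real HA pairs and load balancers add health checks, state synchronization, and failback policies:

```python
class Node:
    """A redundant component with a simple health flag."""
    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

def route_active(nodes: list, request_id: int) -> Node:
    """Active redundancy: all members serve traffic simultaneously,
    and requests are spread across whichever nodes remain healthy."""
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("total outage: no healthy nodes")
    return healthy[request_id % len(healthy)]

def route_passive(primary: Node, standby: Node) -> Node:
    """Passive redundancy: the standby carries no traffic until the primary fails."""
    return primary if primary.healthy else standby

pair = [Node("fw-1"), Node("fw-2")]
print(route_active(pair, 0).name, route_active(pair, 1).name)  # traffic is shared
pair[0].healthy = False
print(route_active(pair, 0).name)  # the surviving node absorbs all traffic
```

Note the trade-off the list describes: the active pair also doubles as a load-sharing design, while the passive pair is simpler but serves nothing from the standby until failover.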

Learn More: The Importance of Intent-Based Networking for Distributed Enterprises 

Redundant and Resilient Data Centers


Up to now, we have limited the scope of our discussion to the device and component levels. For some enterprises today, such as MSPs, cloud providers, and global companies, redundancy must be implemented beyond the component level. In most cases, the data center itself is the biggest single point of failure, and recovering from a total failure caused by a natural disaster cannot be compared to replacing a failed power supply. Elaborate disaster recovery strategies incorporate redundant systems across multiple geographic locations, so that secondary backup systems can provide a quick and seamless transition from a failed primary.

Final Thoughts 


The ancient Greek philosopher Socrates said that “True perfection is a bold quest to seek.” Striving to achieve the hallowed five-nines is no easy course. Then again, near perfection never is. Integrating the building blocks of redundancy and resiliency can, however, make it a whole lot easier. In the end, the hundreds of hours of preparedness will more than justify a window of barely five minutes of downtime.

How do you plan to minimize the impact of downtime in 2021? Comment below or let us know on LinkedIn, Twitter, or Facebook. We’d love to hear from you!

Which SNMP version requires authentication and validation between managed devices and the network management console before messages can be exchanged?

The SNMP version 3 protocol introduces authentication, validation, and encryption for messages exchanged between devices and the network management console.

What mode allows a NIC to see all network traffic passing through a network switch?

Setting a NIC to run in promiscuous mode will allow it to see all network traffic passing through a network switch.

What statement regarding the use of a network attached storage device is accurate?

A NAS can be easily expanded without interrupting service.

What type of backup scheme only covers data that has changed since the last backup?

Incremental backups capture only the data that has changed since the last backup, whatever type of backup that was. This option consumes less storage space and time, but it also makes the restore process more difficult.
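The selection rule for an incremental backup (copy only what changed since the previous backup of any kind) can be sketched with file modification times. A simplified illustration; real backup tools track archive bits, change journals, or snapshots rather than bare mtimes:

```python
import os

def incremental_candidates(root: str, last_backup_time: float) -> set:
    """Return the files under `root` modified since the previous backup,
    regardless of whether that backup was full or incremental."""
    changed = set()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_backup_time:
                changed.add(path)
    return changed
```

The flip side is visible here too: restoring requires the last full backup plus every incremental taken since, which is exactly the harder restore process noted above.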