Campus HA and Resiliency
You want your network to be resilient, but resilience comes at a cost. While it would be great to have many levels of redundancy, it’s not always practical. Part of our job is to identify how much high availability we need, and what the business can do without.
Remember that the finance people are the ones to convince. I know that you don’t want to hear this, but this is where you need a business case.
In the business case, outline the worst-case scenario. Point out the amount of downtime that could occur without redundancy. Show what an outage would cost the company. Present the options, even if you’re convinced that they won’t be approved. If you don’t present the options and things go bad, it’s not their fault, it’s yours.
But that’s enough about business. On to the network design!
There are two parts to high availability: device redundancy, and resilient protocols.
Have you ever heard the saying ‘two is one and one is none’? This is about device redundancy: avoiding a single point of failure. For example, use dual switches, with dual power supplies, connected to dual power feeds…
But this is pointless without HA in the protocols that run on the hardware. For example, think about HSRP, OSPF, LACP, and so on.
Layers of Resilience
Network-level resiliency includes redundancy in the topology (including physical redundancy) and control-plane resiliency. This means using the hardware for failure detection, prevention, and recovery; for example, stacking, multiple links, and so on.
This is where a Defence in Depth approach helps: use several layers of resilience. As an example, you may have several ECMP routed links, and also enable UDLD on those links to detect unidirectional links caused by layer-1 failures.
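As a sketch, UDLD can be enabled like this on a Cisco IOS switch (the interface name is illustrative):

```text
! Enable UDLD in aggressive mode on all fibre ports
udld aggressive
!
interface TenGigabitEthernet1/0/1
 description Uplink to distribution switch (illustrative)
 udld port aggressive
```

Aggressive mode err-disables the port when the link becomes unidirectional, rather than only logging it.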
Use a modular design in the control plane. One example of this is to use route summarization. Throttling can prevent overwhelming the control plane. The goal is to isolate failures to a single area.
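For example, on a Cisco IOS router, OSPF can summarize an area and throttle SPF recalculation. The process ID, area, prefix, and timer values below are illustrative, not recommendations:

```text
router ospf 1
 ! Advertise a single summary for area 1, instead of every individual prefix
 area 1 range 10.10.0.0 255.255.0.0
 ! Throttle SPF runs (initial delay, hold time, maximum wait, in milliseconds)
 timers throttle spf 50 200 5000
```

Summarization keeps a flapping prefix in one area from triggering recalculation everywhere else.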
Device-level resiliency includes dual power supplies, dual supervisors, SSO/NSF, and so on.
It also includes software resilience, such as security features and control-plane hardening. Overlooking this can result in high CPU load, TCAM exhaustion, and similar problems.
Consider using Control Plane Policing (CoPP), limiting flooding, and hardening spanning-tree. Also consider using QoS and Storm Control to prevent overwhelming the data plane.
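A sketch of CoPP and storm control in Cisco IOS. The class names, police rate, and interface are placeholders, not recommended values:

```text
! Police ICMP traffic destined to the CPU
ip access-list extended ICMP-TO-CPU
 permit icmp any any
class-map match-all CM-ICMP
 match access-group name ICMP-TO-CPU
policy-map PM-COPP
 class CM-ICMP
  police 64000 conform-action transmit exceed-action drop
control-plane
 service-policy input PM-COPP
!
interface GigabitEthernet1/0/10
 ! Drop broadcast frames above 1% of link bandwidth
 storm-control broadcast level 1.00
```

Real CoPP policies need classes for routing protocols, management traffic, and a default class; this fragment shows the structure only.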
Operational resiliency is about how you manage the network. In particular, think about change management and change windows.
Software updates also fall into this category. Some platforms support ISSU (In-Service Software Upgrade) or similar for non-disruptive updates.
Measuring and Monitoring
One way to measure availability is as the percentage of time that the network is available. If you’ve heard of people talking about ‘four nines’ or ‘five nines’ of uptime, this is what they’re referring to.
To put this in perspective, four nines, or 99.99% uptime, allows the network to be down for about 52 minutes per year. 99.999% uptime allows for about 5 minutes of downtime per year. If you’re ambitious, you can aim for 99.9999% uptime, which allows for only about 30 seconds of downtime per year.
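These figures are easy to check. A quick sketch in Python, assuming a 365-day year:

```python
def downtime_per_year(availability: float) -> float:
    """Minutes of allowed downtime per 365-day year for a given availability."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return (1 - availability) * minutes_per_year

for a in (0.9999, 0.99999, 0.999999):
    print(f"{a:.4%} uptime -> {downtime_per_year(a):.2f} minutes of downtime/year")
```

Four nines works out to about 52.56 minutes, five nines to about 5.26 minutes, and six nines to roughly 31.5 seconds.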
Another form of measurement is Defects Per Million, or DPM. This measures the number of defects per million hours of runtime, and is more often used in large networks.
The way to count the million hours can vary. A simple method is a million hours of runtime for a single device. Another is to add the runtime of all network devices together. Or, for a more user-centric perspective, count a million hours of user connection time.
Resilience in the Hierarchical Model
In the access layer, host devices are singly connected. Your workstation, for example, has one NIC and connects to one switch. At this layer, it’s best to aim for high availability within each switch.
Using redundant power supplies is an obvious choice. Also, use separate power feeds if possible.
Use multiple uplinks from the access layer to two distribution switches. This avoids switch isolation if a link fails. You will need spanning-tree or dynamic routing in this topology.
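A minimal sketch of this on a Cisco IOS access switch, assuming a layer-2 access design (interface names are illustrative):

```text
! Rapid PVST+ converges far faster than legacy spanning-tree
spanning-tree mode rapid-pvst
!
interface TenGigabitEthernet1/0/1
 description Uplink to distribution switch A (illustrative)
!
interface TenGigabitEthernet1/0/2
 description Uplink to distribution switch B (illustrative)
```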
It may be surprising that redundant supervisors are most useful at the access layer. Running SSO and NSF reduces the risk of total switch failure.
An alternative to a chassis is a stack. For example, you may decide to have a two switch stack in the wiring closet. You can put half of your devices in one switch, and half in the other. If one switch fails, it only affects half the connected devices. Move any critical devices to the other switch until you find a resolution.
Redundant supervisors are an option. SSO (Stateful Switchover) synchronizes state, including the MAC address table, between supervisors. If one supervisor fails, the other already has a full MAC table. NSF (Non-Stop Forwarding) is similar, but for routing. If a supervisor fails, the routes aren’t dropped while the neighbour relationships reform. This reduces route flapping, and the need to relearn every route.
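On platforms that support it, this is often just a few lines of configuration. A sketch; exact syntax varies by platform and the OSPF process ID is illustrative:

```text
redundancy
 mode sso
!
router ospf 1
 ! Enable Cisco NSF so neighbours keep forwarding during a switchover
 nsf
```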
Surprisingly, though, SSO and NSF aren’t always a good option at the distribution layer. SSO takes 1–3 seconds to recover, which also slows down NSF, as the switch needs the MAC address of the next hop before forwarding resumes. It may be better to use ECMP and tuned dynamic routing.
If the access layer is not routed, the distribution layer needs an FHRP (First Hop Redundancy Protocol), such as HSRP. Remember to tune it so end hosts don’t feel the impact of a distribution switch failure.
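As an example, HSRP timers can be tuned for sub-second failover. The addresses, group number, priority, and timer values here are illustrative:

```text
interface Vlan10
 ip address 10.10.10.2 255.255.255.0
 standby version 2
 standby 10 ip 10.10.10.1
 standby 10 priority 110
 standby 10 preempt
 ! Sub-second hello and hold timers for fast failover
 standby 10 timers msec 250 msec 750
```

Preemption ensures the higher-priority switch takes the active role back once it recovers.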
The core is all about high-speed transport. You may want to avoid SSO/NSF, and use many high-speed links with ECMP.
Along with ECMP, use physically diverse paths. This matters most when the network runs between buildings: a single fibre cut is disastrous if all links follow the same path.
For the most part, though, the same principles apply to the core and distribution layers.