Resilience - when it matters most!
Event Date: 03 September 2018 : By Richard Auld
The recent events at Gatwick saw staff responding brilliantly to the network outage that left them with no departure boards. Gatwick Airport Limited took a lot of criticism for having to revert to a person with a walkie-talkie and a whiteboard. I bet there wasn't a plan for this, but the staff managed to cope. However, why should they have to do this? Shouldn't a global infrastructure company like the one that owns Gatwick have a resilient network to support a 24x7 operation?
It has been reported that the departure boards need only 3Mbps of bandwidth to work, which makes it even more criminal that they went down. This amount of bandwidth could have been provided by a 4G SIM or an ADSL broadband line.
At Next Connex, we use something called "Intelligent Design" to create truly resilient networks for our customers. This could be two diversely routed leased line access circuits, a leased line and a broadband link, or a leased line and a 4G SIM. Resilient designs rely on some serious investigation of access routes, local carriers and duplication of hardware and power supplies. For a truly resilient solution, you need to have:
- At least two of everything;
- Physically separate routes to different core nodes; and
- An efficient switching protocol in the network to re-route traffic when there is a problem.
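To make the third point concrete, here is a minimal sketch of the decision a failover mechanism has to make: pick the preferred link that is still up and can carry the traffic. It is a toy illustration, not a real routing protocol and not a description of how any particular network does it. The `Link` class, the link names, the priorities and the capacity figures are all hypothetical; only the 3Mbps requirement comes from the reports mentioned above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Link:
    name: str             # hypothetical label for the access circuit
    priority: int         # lower number = preferred
    healthy: bool         # result of some health check (assumed to exist)
    capacity_mbps: float  # illustrative figure, not a measured value


def select_active_link(links: Iterable[Link], required_mbps: float) -> Optional[Link]:
    """Return the preferred healthy link with enough capacity, or None."""
    candidates = [
        link for link in links
        if link.healthy and link.capacity_mbps >= required_mbps
    ]
    # min() with default=None copes with the "everything is down" case.
    return min(candidates, key=lambda link: link.priority, default=None)


# Primary leased line has failed; the 4G backup is still up.
links = [
    Link("leased-line", priority=1, healthy=False, capacity_mbps=100.0),
    Link("4g-sim", priority=2, healthy=True, capacity_mbps=10.0),
]
active = select_active_link(links, required_mbps=3.0)
print(active.name if active else "no usable link")
```

With the leased line marked as failed, the sketch falls back to the 4G link, which under these assumed figures is enough for a 3Mbps service. In a real network this decision is made by the routing or switching protocol rather than by application code.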
Here are some of the less obvious failures in resilient design that we have experienced:
- Two routes, resilient equipment - but the ISP has only one Tier 1 IP transit provider.
- Hardware with two power supplies - but plugged into the same Power Distribution Unit (PDU).
- Multiple routes - but all at Layer 2 with network complexity eventually leading to a major network crash.
- Multiple routes - but both terminating in the same rack in the core node data centre.
- Two different carriers - but they happen to share the same dark fibre provider or the same duct for a section of the route - one cable cut and they both go down.
- A "cloud" provider that is really a single data centre - clouds should have multiple locations and N+1 resilience.
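Most of the failures in this list come down to two supposedly independent paths quietly sharing a component. One way to picture the check is as a set intersection: list everything each path depends on, then look for overlaps. The sketch below does only that. The function name, the path names and the component labels are invented for illustration, and a real survey would need far more data than a Python dictionary.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def shared_components(paths: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each pair of path names to the components they have in common."""
    findings = {}
    for (name_a, comps_a), (name_b, comps_b) in combinations(paths.items(), 2):
        common = comps_a & comps_b
        if common:
            findings[(name_a, name_b)] = common
    return findings


# Hypothetical inventory: two carriers that happen to share one duct.
paths = {
    "carrier-a": {"duct-17", "fibre-provider-x", "rack-a1", "pdu-1"},
    "carrier-b": {"duct-17", "fibre-provider-y", "rack-b4", "pdu-2"},
}
pinch_points = shared_components(paths)
for pair, common in pinch_points.items():
    print(pair, "share", sorted(common))
```

Any non-empty result is a pinch point: a single failure of that component takes out both paths, however resilient the rest of the design looks.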
Next Connex has a triple-resilient core network using three independent core nodes, three global Tier 1 IP transit providers and a self-healing MPLS network that runs over at least three routes between the sites. If two is N+1, then three is better.
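A rough piece of arithmetic shows why a third path helps. If each path is available a fraction a of the time and the paths fail independently, the chance that all n are down at once is (1 - a)^n. The sketch below uses a made-up figure of 99% per path purely for illustration; it is not a measured or quoted availability for any network. The independence assumption is also the important one, because hidden shared components like those listed earlier are exactly what breaks it.

```python
def combined_availability(per_path: float, n_paths: int) -> float:
    """Availability of n parallel paths, assuming independent failures."""
    return 1 - (1 - per_path) ** n_paths


PER_PATH = 0.99  # hypothetical per-path availability, for illustration only

two_paths = combined_availability(PER_PATH, 2)    # about 0.9999
three_paths = combined_availability(PER_PATH, 3)  # about 0.999999

print(f"two paths:   {two_paths:.6f}")
print(f"three paths: {three_paths:.6f}")
```

Under these assumptions, each extra independent path cuts the expected downtime by a further factor of one hundred.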
We also produce a diversity report showing customers where their access circuits will run and identifying pinch points and options to remove them. This involves feet on the street (often in the rain) and some clever desktop planning tools.