This is a summary of the outage of our Manchester Cloud Servers on 15th January, its causes, and the follow-up action we're taking. We're not happy with the stability of our Cloud Servers platform at the moment, and both our Networking & Engineering teams are working on it as a priority.
We provide Cloud Servers in both our York and Manchester data centres. The outage affected all Cloud Servers in our Manchester data centre. Affected servers were inaccessible over the internet for approximately 45 minutes, though they remained powered on.
Other services in Manchester, such as premium servers, firewalls and load balancers, were unaffected. All services in York were unaffected (though the Bytemark Panel wasn't fully functional during the outage).
At approximately 15:10 on 15th January 2018, all network connectivity was lost to Cloud Servers in Manchester. Staff had also lost access to the out-of-band management systems for the racks holding our Cloud infrastructure, while all other racks appeared to be fine, so we initially suspected a power failure. We contacted our data centre provider, TeleData, and were told that there hadn't been any power-related incidents. A further visual inspection by TeleData staff confirmed shortly afterwards that there were no power problems unique to these racks.
It became evident that the issue was limited to the access network switches that serve the Cloud services racks. By 15:40, our Network Engineers had identified that these switches had simultaneously placed their uplink interfaces into an error state in response to unexpected network conditions. We then worked on bringing the switches back online, and by 15:53 we had restored network connectivity to all services.
The outage was around 45 minutes in total.
We are still investigating whether the network problem that initiated the outages on 21st and 22nd December 2017 was caused by similar network conditions within our Manchester network.
We've been working closely with the vendor of these network switches to pinpoint the root cause of their behaviour, which we believe is the most important issue in stabilising the Manchester network. Following yesterday's outage, we now have some more useful information to pass along to them.
We expect that remedial work will be required to the affected network devices, and we will make a further announcement if this work is disruptive in any way.
The access switches that disabled their own uplinks did so in this specific situation to protect themselves and the network against worse potential outcomes (e.g. a broadcast storm).
These switches are each uplinked to a pair of switches using a technique called MLAG (or MC-LAG), which is designed to protect against a complete forwarding failure by masquerading as a single LACP partner whilst synchronising MAC reachability information.
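To illustrate the MLAG concept, here's a minimal Python sketch; the `MlagPeer` class and all names are our own illustrative inventions, not vendor software. The two upstream switches advertise one shared LACP system ID, so a downstream switch bundles both uplinks as if it were talking to a single device, while the pair replicate learned MAC addresses to each other over their peer link.

```python
# Minimal sketch of the MLAG idea, assuming two upstream peers that
# advertise a shared LACP system ID and synchronise MAC reachability.
from dataclasses import dataclass, field

@dataclass
class MlagPeer:
    name: str
    lacp_system_id: str                    # both peers advertise the same ID
    mac_table: dict = field(default_factory=dict)

    def learn_mac(self, mac: str, port: str, peer: "MlagPeer") -> None:
        """Learn a MAC locally and replicate it to the MLAG peer."""
        self.mac_table[mac] = port
        # The peer records that this MAC is reachable via the peer link.
        peer.mac_table[mac] = f"peer-link->{self.name}"

# Both peers present the same LACP partner ID, so the downstream access
# switch's LACP state machine sees one logical device across two uplinks.
sw_a = MlagPeer("upstream-a", lacp_system_id="02:00:00:00:00:01")
sw_b = MlagPeer("upstream-b", lacp_system_id="02:00:00:00:00:01")

sw_a.learn_mac("aa:bb:cc:dd:ee:ff", port="eth1", peer=sw_b)
print(sw_b.mac_table)  # {'aa:bb:cc:dd:ee:ff': 'peer-link->upstream-a'}
```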
Reaching the situation we encountered took two separate events happening in concert (see the sketch after this list):

1. The upstream switches experienced significant disruption to their synchronisation, to the point where both the primary and secondary methods were compromised.
2. Access switches running Cisco IOS, in particular, responded by disabling their uplinks completely, lengthening the interruption to service.
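As a rough illustration of how these two conditions combined, here's a hedged Python sketch. `SyncState` and `uplinks_after_event` are illustrative names of our own rather than the switch software's actual logic, but they capture the decision described above: once both synchronisation methods are compromised, every uplink gets disabled.

```python
# Sketch of the protective response (condition #2), assuming the switch
# err-disables all uplinks when it can no longer trust the upstream MLAG
# pair's synchronisation; names are illustrative, not vendor code.
from enum import Enum

class SyncState(Enum):
    HEALTHY = "healthy"
    COMPROMISED = "compromised"

def uplinks_after_event(primary: SyncState, secondary: SyncState,
                        uplinks: list[str]) -> list[str]:
    """Return the uplinks left enabled after checking upstream sync health."""
    if primary is SyncState.COMPROMISED and secondary is SyncState.COMPROMISED:
        # Condition #1 has occurred: with no trustworthy synchronisation,
        # the switch disables every uplink (condition #2) to avoid, for
        # example, a broadcast storm, at the cost of taking itself offline.
        return []
    return uplinks

# Both methods failing together is what turned a fault into a full outage:
print(uplinks_after_event(SyncState.COMPROMISED, SyncState.COMPROMISED,
                          ["uplink-1", "uplink-2"]))  # -> []
```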
Condition #1 precipitated the entire event, and this is where the focus of our ongoing diagnosis with the software vendor lies.
With regards to condition #2, we're hoping that we may be able to relax the response from these particular switches so that each one retains at least one of its uplinks regardless of the upstream network conditions. Once we've been able to research, test and deploy this mitigation for condition #2, it may help to reduce the disruption to Cloud Servers should condition #1 occur again.
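To make the intended change concrete, here's a small Python sketch of the relaxed policy. This describes our intent, under the assumption that such behaviour can be configured at all; it is not a confirmed vendor feature.

```python
# Sketch of the relaxed response we hope to test: when both synchronisation
# methods are compromised, keep one uplink up instead of disabling them all,
# so service degrades rather than stopping outright. Assumed behaviour,
# not a confirmed vendor feature or configuration.
def uplinks_after_event_relaxed(primary_sync_ok: bool, secondary_sync_ok: bool,
                                uplinks: list[str]) -> list[str]:
    """Relaxed policy: never drop below one active uplink."""
    if not primary_sync_ok and not secondary_sync_ok:
        return uplinks[:1]  # retain at least one uplink regardless
    return uplinks

print(uplinks_after_event_relaxed(False, False, ["uplink-1", "uplink-2"]))
# -> ['uplink-1']: degraded, but still reachable
```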
As the full root cause of the disruption caused by condition #1 hasn't yet been ascertained, we will endeavour to update this notice as soon as we obtain more information.
2018-01-15 15:10 Staff alerted by monitoring system to problems with Bytemark Cloud in Manchester; status post published.
2018-01-15 15:35 Power failure ruled out.
2018-01-15 15:40 Problem with network switches identified.
2018-01-15 15:53 Network connectivity restored.