This post summarises the causes of the partial outage of our cloud servers platform on 21st and 22nd December 2017, and the follow-up actions we're taking.
This outage affected a significant number of servers in our Manchester location, which lost access to their discs and were effectively "down" for the duration. Restarting the affected servers during the outage would also have failed.
On 21st December at 1pm we saw reports of servers in Manchester going offline. Our team spent most of the afternoon investigating a temporary network issue affecting our Cloud Servers' storage network. It turned out that this short outage triggered a software problem that kept servers offline even after the network had recovered. This combination kept the affected servers down for up to 6 hours, because we were concentrating on fixing what looked like a complex network problem. Once we'd identified that there was a more urgent software issue, we were able to fix the affected servers quickly.
The network problem recurred at around the same time of day on 22nd December. It had the same effect, but this time we were able to fix the affected servers quickly, now that we understood what was happening.
To date we've not seen the network issue recur, but we have follow-up work planned to reduce the risk.
We are investigating the theory that there was a fault in the software, but this is not yet conclusive, and we're working with our vendor to gather more information. Since the issue has only occurred twice, we haven't been able to implement a workaround at the time of writing. To be clear, our storage network is vital to the reliability of our cloud servers, and is designed with a target of 100% availability.
However, our BigV software is fully under our control, and needs only a small fix to minimise the knock-on effects. We'll be making our storage servers more resilient to network outages, and will prioritise a round of software fixes so that any future short outage of the storage network has far less impact.
We're also now fully aware of the effects of this kind of outage, and could bring the same problem under control very quickly if it recurred.
How could the outage of the service last longer than the outage on the network? A brief technical overview
Our Cloud Servers connect over a storage network to their discs. They use the NBD (Network Block Device) protocol, which runs over TCP.
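In outline, each disc is reached over its own long-lived TCP connection. Here's a minimal sketch of that arrangement; the hostnames, the endpoint table and the attach_disc helper are hypothetical, the standard NBD port (10809) is shown only for illustration, and the real NBD handshake is omitted.

```python
import socket

# Hypothetical storage endpoints: in reality each disc is an NBD export
# served over TCP on the storage network (placeholder hostnames).
DISC_ENDPOINTS = {
    "vda": ("storage-host-1.example", 10809),  # 10809 is the standard NBD port
    "vdb": ("storage-host-2.example", 10809),
}

def attach_disc(name):
    """Open one long-lived TCP connection per disc (NBD handshake omitted)."""
    host, port = DISC_ENDPOINTS[name]
    return socket.create_connection((host, port), timeout=10)

# One connection per disc, held open for the life of the virtual machine.
connections = {name: attach_disc(name) for name in DISC_ENDPOINTS}
```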
This network problem caused the TCP connections to stay open, but in a useless state. The cloud servers try to reconnect after a failure, but the failure state persisted for long enough that the servers ended up with many useless connections, with more arriving all the time. At some point, each disc hits a hard limit on the number of TCP connections it will accept, and starts to reject new connections from the servers.
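A toy model of that failure mode is sketched below. It is not our real code: the per-disc connection limit, the DiscServer class and the retry loop are all hypothetical, but they show how stale connections pile up until new ones are refused.

```python
# Toy model: the server caps TCP connections per disc, and the client's
# retry loop never tears down the old, dead connection, so stale sockets
# accumulate until the cap is reached.
MAX_CONNECTIONS_PER_DISC = 8          # hypothetical limit

class DiscServer:
    def __init__(self):
        # includes half-open connections the server still believes are alive
        self.connections = []

    def accept(self, conn_id):
        if len(self.connections) >= MAX_CONNECTIONS_PER_DISC:
            raise ConnectionRefusedError("too many connections for this disc")
        self.connections.append(conn_id)

server = DiscServer()
server.accept("original")             # the connection that went dead but stayed open

# After the network blip, the cloud server keeps retrying...
for attempt in range(1, 12):
    try:
        server.accept(f"retry-{attempt}")   # old connections are never reaped
    except ConnectionRefusedError:
        print(f"attempt {attempt}: rejected - over the per-disc connection limit")
```

Once that limit is reached, even a perfectly healthy reconnection attempt is refused, which is why servers stayed down after the network itself had recovered.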
There are several possible fixes to this issue. Since Bytemark are in full control of how NBD connections are made, we will put time into researching, replicating and working around this new failure mode. Our goal is to ensure that servers reconnect reliably to their discs after a network outage.
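One possible shape of such a fix, purely as an illustration rather than a description of what we'll ship, is to always tear down the stale connection before reconnecting, and to enable TCP keepalive so dead peers are noticed; the reconnect_with_cleanup helper below is hypothetical.

```python
import socket
import time

def reconnect_with_cleanup(host, port, old_sock=None, retry_delay=5):
    """Sketch of a safer reconnect: drop the previous connection first,
    then retry with a timeout and TCP keepalive enabled."""
    if old_sock is not None:
        try:
            old_sock.close()          # never leave the stale connection behind
        except OSError:
            pass
    while True:
        try:
            sock = socket.create_connection((host, port), timeout=10)
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
            return sock
        except OSError:
            time.sleep(retry_delay)   # back off before the next attempt
```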
Again, this is a completely separate issue from why the network failed for a brief period, but it's one we can solve more quickly and confidently while we continue to look at our network implementation.
Timeline & links to issues