This post summarises a series of outages we've had over the last month. We're deeply committed to the reliability of our Cloud Servers platform, and we understand that these recent outages will have caused you real business problems. We've identified the technical fault and have started to roll out a fix. We're confident that this will restore the levels of reliability that our Cloud customers have come to expect from the service's last five years of operation.
On 11th April and 8th May, a software fault in our Cloud Servers platform caused servers in York to reboot. The resulting demand on the platform meant the servers took a long time to restart, and they were offline for that period. The two incidents were logged on our status site:
Both relate to a similar issue in November, which we thought we had fully addressed, but our fix needed further work. We didn't put all the pieces together until the April and May outages brought the root cause to our attention.
We're currently implementing a software fix across our whole cluster.
What this outage means
As Bytemark's MD, I want you to know that this isn't good enough. After personally answering calls at the time, I'm mortified at the trouble it has caused you and your business. I know that you build your online systems on the reliability of our Cloud Servers platform. I'm really sorry that we've let you down with these outages.
I've tried to build our business over 15 years on long-term reliability and transparent communication around outages. I'm very proud of our team, who have kept customers informed, returned many phone calls and emails, and responded to people individually. We'll continue to do that.
Our whole company has rallied around not just fixing one software bug, but also reviewing our outage management and communications under pressure. I'm proud of our handling of things thus far, and we're coming together to build an even more sustainable outage management plan.
I'm happy to answer further questions here on the forum or Twitter.
If you're into networking and distributed systems, here's the longer explanation, followed by some longer-term concerns around the growth of our platform.
We've identified the root cause of the outages as a subtle software bug in Ruby. This bug was always present, but we believe some code changes in November caused it to trigger more often. It wasn't until 11th April that we saw the effect, and we weren't able to track it down and patch it until 8th May.
The patch is done, but will take a few days to roll out, and we're confident that it will restore reliability to the level you expect.
The BigV architecture paper goes into more depth on some terms below.
The continued uptime of individual Cloud Servers depends on a long-lived TCP connection between the head servers (where VMs are running) and the brain (which supervises them, and fulfils the control panel's requests).
These connections are designed to reset themselves in the face of short network outages without taking down any customer servers. Whatever the cause of an outage between cluster members, a resetting connection causes only a few seconds' loss of control for the affected systems. These resets have been invisible to most users most of the time. This was an intended and acceptable design trade-off in a distributed system.
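The intent of that design can be sketched roughly as follows. This is a hypothetical Ruby illustration, not the actual BigV code: the `ControlLink` class, its retry limit, and its backoff timings are all invented for this example. The key property is that a broken link is retried with backoff, and the caller's own state (here standing in for the running VMs) is never touched by a transient failure.

```ruby
# Hypothetical sketch of a self-resetting control connection.
# A transient failure triggers a retry with exponential backoff;
# only after repeated failures does the error propagate. The caller's
# state (the running VMs, in the real system) is left untouched.
class ControlLink
  MAX_ATTEMPTS = 5

  # connector: a block that opens (or re-opens) the underlying connection
  def initialize(&connector)
    @connector = connector
  end

  # Yield a live connection to the caller, re-establishing it on
  # IO/socket errors. Raises only once all retries are exhausted.
  def with_connection
    attempts = 0
    begin
      yield @connector.call
    rescue IOError, SystemCallError
      attempts += 1
      raise if attempts >= MAX_ATTEMPTS
      sleep(0.01 * (2**attempts)) # back off before reconnecting
      retry
    end
  end
end

# Simulate a link that resets twice before coming back up.
failures = 2
link = ControlLink.new do
  failures -= 1
  raise IOError, "link reset" if failures >= 0
  :socket # stand-in for a real TCPSocket
end

result = link.with_connection { |conn| conn }
```

In this sketch `result` ends up as `:socket` after two simulated resets: the connection recovered transparently, and nothing on the caller's side was restarted.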
Unfortunately, since some code and networking changes during 2016, these connection resets often triggered a bug in Ruby (the language that we use to build much of our platform). This bug sometimes caused supervisor processes on the heads to crash with a segmentation fault, bringing VMs down with them. Its effects were unexpectedly exacerbated by some code changes in November.
On both 11th April and 8th May, the connections were broken for long enough to trigger this bug across the whole cluster, such that all VMs needed resetting. This process takes about two hours for all the servers in York, which is a frustratingly long time.
We spent some time looking at the issue but weren't able to diagnose it as a segfault until the crash on 8th May. Once we'd spotted that this hadn't been logged well enough, and what the issue was, we were able to identify a known Ruby bug and work out a patch for it.
The fix is being rolled out across the cluster over the next couple of weeks, and should make this condition vanish. We believe it will bring the cluster back to earlier levels of reliability, letting us carry out network maintenance without worrying that it will bring down all our Cloud Servers again.
I'll confirm here when this work is complete.
Longer term concerns
In our previous explanation (under "Cloud Outage") we stated that our fix was to tear out the connection mechanism and rewrite it on the back of message queues, as we have with many of our other internal systems. This is still ongoing, but the repeats of this issue have allowed us to pick apart and fix the problem for the short term.
We're also rebuilding the heads' supervisor processes with "decoupling" in mind, aiming to achieve consistency while keeping the whole system robust in the case of connection failures.
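One way to picture that decoupling is as a reconciliation step. The following is a hypothetical Ruby sketch, not our actual supervisor code: the `Supervisor` class and its `reconcile` method are invented for illustration. The idea is that a restarted supervisor re-adopts whatever VMs are already running rather than assuming it must start everything from scratch, so a supervisor crash no longer takes the VMs down with it.

```ruby
# Hypothetical sketch of a "decoupled" supervisor: on startup it
# discovers what is already running and reconciles that against the
# desired state, instead of treating its own restart as a reason to
# restart everything it supervises.
class Supervisor
  # running: VMs discovered as already up (e.g. queried from the hypervisor)
  def initialize(running)
    @running = running
  end

  # Start what's missing, stop what's no longer wanted,
  # and leave everything already up alone.
  def reconcile(desired)
    to_start  = desired - @running   # wanted but not running
    to_stop   = @running - desired   # running but no longer wanted
    untouched = desired & @running   # survives the supervisor restart
    @running  = desired
    { started: to_start, stopped: to_stop, untouched: untouched }
  end
end

# A freshly restarted supervisor finds vm1..vm3 already running;
# the brain says vm2..vm4 should be up.
sup    = Supervisor.new([:vm1, :vm2, :vm3])
result = sup.reconcile([:vm2, :vm3, :vm4])
# vm2 and vm3 keep running untouched; only vm4 starts and vm1 stops.
```

The point of the sketch is the `untouched` set: in a tightly coupled design it would be empty after every supervisor crash, which is exactly the failure mode described above.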
All of these changes should help us better cope with larger numbers of servers, improve their reboot times, and help us maintain our systems through live migration of running servers.
Again I'm happy to answer any further questions here or on Twitter on behalf of the team.