Post-mortem from recent outages, and an apology


#1

This post summarises a series of outages we’ve had over the last month. We’re deeply committed to the reliability of our Cloud Servers platform, and we understand that these recent outages will have caused you real business problems. We’ve identified the technical fault already and have started to roll out a fix. We’re confident that this will restore the levels of reliability that our Cloud customers have come to expect from the service’s last five years of operation.


What happened

On 11th April and 8th May a software fault in our Cloud Servers platform caused servers in York to reboot. The resulting demand on the platform meant those servers took a long time to reboot, and they were offline for that period. Both incidents were logged on our status site.

Both relate to a similar issue in November, which we thought we had fully addressed, but our fix needed further work. We didn’t put all the pieces together until the April and May outages brought the root cause to our attention.

We’re currently implementing a software fix across our whole cluster.

What this outage means

As Bytemark’s MD, I want you to know that this isn’t good enough. After personally answering calls at the time, I’m mortified at the trouble it has caused you and your business. I know that you build your online systems on the reliability of our Cloud Servers platform. I’m really sorry that we’ve let you down with these outages.

I’ve tried to build our business over 15 years on long-term reliability and transparent communication around outages. I’m very proud of our team, who have kept customers informed, returned a great many phone calls and emails, and responded to people individually. We’ll continue to do that.

Our whole company has rallied around not just fixing one software bug, but looking at our outage management and communications under pressure. I’m proud of our handling of things thus far, and we’re coming together to build an even more sustainable outage management plan.

I’m happy to answer further questions here on the forum or Twitter.

Technical explanation

If you’re into networking and distributed systems, here’s the longer explanation, followed by some longer-term concerns around the growth of our platform.

We’ve identified the root cause of the outages as a subtle software bug in Ruby. This bug was always present, but we believe some code changes in November caused it to trigger more often. It wasn’t until April 11th that we saw the effect, and we weren’t able to track it down and patch it until May 8th.

The patch is done, but will take a few days to roll out, and we’re confident that it will restore reliability to the level you expect.

The BigV architecture paper goes into more depth on some of the terms used below.

The continued uptime of individual Cloud Servers depends on a long-lived TCP connection between the head servers (where VMs are running) and the brain (which supervises them, and fulfils the control panel’s requests).

These connections are designed to reset themselves in the face of short network outages without taking down any customer servers. Whatever the cause of an outage between cluster members, a resetting connection causes only a few seconds’ loss of control for the affected systems. These resets have been invisible to most users most of the time. This was an intended and acceptable design trade-off in a distributed system.
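To make that concrete, here’s a minimal Ruby sketch of the reconnect-and-carry-on idea. It assumes a hypothetical brain hostname and port and a toy command handler, and it illustrates the pattern rather than our actual supervisor code:

```ruby
require "socket"

# Simplified sketch only: each head keeps a long-lived TCP connection to the
# brain and, if the link drops, reconnects with a short backoff instead of
# touching any running VMs.
BRAIN_HOST = "brain.example.internal"   # hypothetical hostname
BRAIN_PORT = 5000                       # hypothetical port

def handle_command(line)
  # In the real system this would act on the brain's instructions;
  # here we just log the message.
  puts "received: #{line}"
end

backoff = 1
loop do
  begin
    sock = TCPSocket.new(BRAIN_HOST, BRAIN_PORT)
    backoff = 1                          # reset the backoff once connected
    while (line = sock.gets)             # read control messages until EOF
      handle_command(line.chomp)
    end
    warn "brain closed the connection"
  rescue SystemCallError, IOError => e
    warn "lost connection to brain (#{e.class}: #{e.message})"
  ensure
    sock&.close
  end
  sleep backoff                          # wait briefly, then reconnect
  backoff = [backoff * 2, 30].min        # cap the retry delay at 30 seconds
end
```

The important property is that a dropped connection only delays control messages for a short while; it never touches the running VMs themselves.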

Unfortunately, since some code and networking changes during 2016, these connection resets often triggered a bug in Ruby (the language that we use to build much of our platform). This bug sometimes caused supervisor processes on the heads to crash with a segmentation fault, bringing VMs down with them. Its effects were unexpectedly exacerbated by some code changes in November.

On both 11th April and 8th May the connections were broken for long enough to trigger this bug across the whole cluster, such that all VMs needed resetting :sob: This process takes about 2 hours for all the servers in York, which is a frustratingly long time.

We spent some time looking at the issue but weren’t able to diagnose it as a segfault until the crash on 8th May. Once we’d spotted that these crashes hadn’t been logged well enough, and what the issue was, we were able to identify a known Ruby bug and work out a patch for it.
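One hedged illustration of the logging point: if the supervisor runs under a small watchdog that records how the process exited, a signal-induced crash such as a segfault shows up explicitly in the logs rather than looking like a silent disappearance. This is a sketch rather than our production tooling, and the supervisor path is hypothetical:

```ruby
# Illustrative watchdog sketch: restart a child process and log *how* it
# exited, so a crash caused by a signal (e.g. SIGSEGV from an interpreter
# bug) is obvious in the logs.
SUPERVISOR_CMD = "/usr/local/bin/head-supervisor"  # hypothetical path

loop do
  pid = Process.spawn(SUPERVISOR_CMD)
  _, status = Process.wait2(pid)

  if status.signaled?
    sig = Signal.signame(status.termsig)
    warn "supervisor #{pid} killed by SIG#{sig} (core dumped: #{status.coredump?})"
  else
    warn "supervisor #{pid} exited with status #{status.exitstatus}"
  end

  sleep 5  # brief pause before restarting it
end
```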

The fix is being rolled out across the cluster over the next couple of weeks and should make this condition vanish. We believe it will bring the cluster back to its earlier level of reliability, and let us work on network maintenance without worrying that it will bring down all our Cloud Servers again.

I’ll confirm here when this work is complete.

Longer term concerns

In our previous explanation (under “Cloud Outage”) we stated that our fix was to tear out the connection mechanism and rewrite it on the back of message queues, as we have with many of our other internal systems. That work is still ongoing, but the recurrence of this issue has allowed us to pick apart and fix the problem for the short term.
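As a rough sketch of the message-queue pattern, using RabbitMQ via the Bunny gem purely as an example (the broker host and queue names are illustrative, not a statement about our actual stack), the brain publishes commands to a per-head queue and the head consumes them whenever it’s connected:

```ruby
require "bunny"  # RabbitMQ client, used here only as an example broker

# Minimal sketch: commands go through a durable queue rather than over a
# single long-lived TCP connection between brain and head.
conn = Bunny.new(host: "mq.example.internal")  # hypothetical broker host
conn.start

channel = conn.create_channel
queue   = channel.queue("head01.commands", durable: true)  # hypothetical queue name

# Brain side: publish a command message.
channel.default_exchange.publish('{"action":"start_vm","vm":"vm1234"}',
                                 routing_key: queue.name,
                                 persistent: true)

# Head side: consume commands; the broker holds them if the head is briefly away.
queue.subscribe(block: false) do |_delivery_info, _properties, body|
  puts "received command: #{body}"
end

sleep 1     # give the consumer a moment in this toy example
conn.close
```

The design point is that the broker buffers messages across short disconnections, so neither side depends on one TCP connection staying up.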

We’re also rebuilding the heads’ supervisor processes with “decoupling” in mind, aiming to achieve consistency while keeping the whole system robust in the face of connection failures.

All of these changes should help us better cope with larger numbers of servers, improve their reboot times, and help us maintain our systems through live migration of running servers.

Any questions?

Again I’m happy to answer any further questions here or on Twitter on behalf of the team.


#2

While I am sure many customers will be a little upset, it really does highlight how reliable things have been up until now. We are so used to things just working, with none of the hassles of years gone by. Before the Coppermine project was kindly hosted by Bytemark, I spent many long nights fixing things; now the system has been running without fault for over a year.

With technology getting ever more complex, I am amazed everything connects up and runs so well. I work in a large datacentre containing systems which are part of the national infrastructure. Despite the massive number of highly talented staff, hundreds of millions of pounds of taxpayers’ money and some state-of-the-art kit, we experience far more issues than the few hours per year you do, so do not beat yourself up too hard.


#6

This is technology - things break!

Hats off to you guys for your complete transparency. I got a notification from my monitoring provider that my servers were down; before I had even finished my email to support there was a status post. My email was then replied to within minutes. This is what differentiates Bytemark from companies we have worked with previously.

Keep it up :slight_smile:


#7

+1

I really appreciate getting all these details. There is a lot to learn here, although the problem is all too well known. It’s the price you pay for automating stuff. But automation is still good for both business and product quality. You just have to figure out the right balance.

Keep up the good work!


#8

Outages are always annoying, and repeat ones even more so, but personally, if I get a full explanation and can see that positive, logical steps are being taken, then I’m happy.

The alternative is like the outage I had today with another cloud company: my cloud server went offline, so, as I always do, I duly checked their status page. Nothing; everything was green and there were no incidents. I held off logging a support call and kept an eye on the status page; 30 mins later, still nothing. Then the cloud server came back up but had no network, and I could not ping the gateway. Finally I logged a support call, and rather than do something logical they did a network rebuild on my server, which, due to a bug in OnApp, trashed my IPv4 network settings.

They didn’t see fit to update their status page to say ‘The HV needed an emergency reboot and all the VMs have been booted back’.

As well as corrupting my network config, they tried to log on to my server, and then I got told off for having a ‘custom config’ which doesn’t allow random addresses to ping or ssh to my server.

“Network rebuild is the first set we perform, if we find a server not pinging, when we see it is rebooted correctly and up.”

I’d much rather have a responsive company running my cloud servers, who are open about issues, update their status page, do a post-mortem and don’t have support staff who blindly follow pointless support procedures.


#9

This bug fix has now been rolled out across the York servers; I’ll confirm when Manchester is finished too.