Over the last four months we have not provided the level of reliability on our Cloud platform that we have wanted to, and that customers have come to expect. There have been a number of factors involved and, now that we have got to the bottom of them, we feel it is the right time to post an explanation and status update. This write-up covers outages from 21st December 2017 until the start of April 2018.
Cloud server: A customer’s virtualised server running on Bytemark’s Cloud platform.
Head: A physical server which runs a number of Cloud servers.
Tail: A physical server that provides disk storage for the Cloud servers.
Brain: The service that provides the public API and database controlling where Cloud servers are running, starting/stopping, changing their specifications, etc.
At the end of 2017 we completed a move of data centres in Manchester, in order to focus the provisioning of Dedicated servers on the data centres we own (York) and to make the Manchester location primarily an instance of our Cloud platform. This work involved provisioning a completely new data centre network with a new vendor and network operating system. Secondly, we launched a new Cloud backup feature last year; behind the scenes, this involves taking snapshots of Cloud disks and migrating those snapshots to separate hardware. Finally, at the start of 2018, the Meltdown and Spectre security vulnerabilities (affecting a wide range of systems) were announced, which meant we had to make rapid changes to our development plans in order to mitigate them.
Review of individual availability-affecting issues:
There are three main issues that have caused the outages of the last few months. I’m going to address each issue separately, rather than going chronologically through each incident listed on https://status.bytemark.org.
Just before Christmas last year we had two outages within a couple of days. They appeared primarily to be a problem with a feature of the operating system on the new switches, related to the way it handles load balancing of the links to the storage network on the servers running the Cloud platform. Our Network and Platform teams spent a large amount of time trying to diagnose the failures, working closely with the network operating system vendor and hardware vendors to identify whether it was a design issue, a bug, or a hardware fault.
After that particular outage was over, we left additional diagnostics switched on in our data centre network in the hope of gathering more information should the problem reoccur. Over the next couple of months there were occasional fleeting recurrences, but nothing that provided sufficient diagnostic information. In the middle of March we had a long and sustained recurrence, starting on the 20th and lasting until resolution on the 23rd. We spent a long time escalating the issue with vendors, and while the recurrence “felt” the same as the previous ones, the symptoms didn’t match exactly and we couldn’t explain it.
Eventually, the cause came to light. We had noted that the new data centre was colder than the one we moved from (or indeed our own data centre in York). While diagnosing the issue, we visited the new data centre and noticed how unusually cold it was. Looking at temperature graphs from some of our equipment there, we saw a correlation between the temperature of the data centre and the times when the network issues were happening. We immediately worked with the data centre supplier to alter their cooling configuration and raise the temperature to a more normal level, and the problems stopped occurring. The optical transceivers (the lasers that light the fibre optic cabling between switches and servers) are specified to work between 0 and 70 degrees Celsius, but it appears that at the lower temperatures we saw (from 20 degrees down to around 13 degrees) they start generating errors.
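To give a flavour of the kind of analysis that identified this, here is a minimal sketch of the sort of correlation we looked for. It is illustrative only: the file names, column names and sampling interval are assumptions for the example, not the tooling we actually used.

```python
# Illustrative only: correlate cage temperature with optical link errors.
# The file names and column names below are made up for this sketch:
#   temps.csv  -> timestamp,celsius     (data centre temperature samples)
#   errors.csv -> timestamp,crc_errors  (per-interval link error counts)
import pandas as pd

temps = pd.read_csv("temps.csv", parse_dates=["timestamp"]).set_index("timestamp")
errors = pd.read_csv("errors.csv", parse_dates=["timestamp"]).set_index("timestamp")

# Resample both series onto a common 5-minute grid so they line up.
joined = pd.concat(
    [temps["celsius"].resample("5min").mean(),
     errors["crc_errors"].resample("5min").sum()],
    axis=1,
).dropna()

# A strong negative correlation supports "colder cage -> more link errors".
print(joined.corr())

# Inspect the colder intervals; in our case errors appeared as the temperature
# fell from around 20 degrees down to around 13 degrees.
print(joined[joined["celsius"] < 20])
```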
The above is a very short and greatly simplified account of what happened. Given the eventual and apparently simple cause and fix, it is worth mentioning that we also have a much longer and more complete RFO for this issue, covering all of the debugging work that was undertaken. This will be available here by the 20th April.
The impact of this issue was:
* Periods of connectivity loss or packet loss to some or all Cloud servers hosted in our Manchester data centre.
* Inability to access the Cloud server API (and therefore the Bytemark Control Panel and Bytemark Client), limiting the ability to start, stop or change Cloud servers at both sites.
We are taking the following actions as a result:
* Replace the optics with ones from a different manufacturer: key optics have already been replaced, and the remaining optics will be replaced by the 20th April.
* Install environmental monitoring and alerting in the cage which houses our equipment (a rough sketch of the kind of check involved follows this list). Target date of 27th April.
* Work with the data centre provider to understand further why the temperature in the data centre fluctuated as much as it did. Target date of 1st May.
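The sketch below, referred to in the monitoring item above, shows the sort of simple temperature check that environmental monitoring could run periodically. The sensor endpoint, thresholds and exit-code convention are assumptions for the example, not a description of the system we will deploy.

```python
# Hypothetical temperature check for the cage, run periodically (e.g. from cron).
# The sensor URL and thresholds are placeholders for this sketch.
import sys
import urllib.request

SENSOR_URL = "http://cage-sensor.example.invalid/temperature"  # hypothetical endpoint
LOW_WARN, HIGH_WARN = 18.0, 27.0  # alert outside the range we expect the cage to sit in

def read_celsius(url: str) -> float:
    """Fetch the current cage temperature, in Celsius, from the sensor."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return float(resp.read().decode().strip())

def main() -> int:
    celsius = read_celsius(SENSOR_URL)
    if celsius < LOW_WARN or celsius > HIGH_WARN:
        print(f"ALERT: cage temperature {celsius:.1f}C outside {LOW_WARN}-{HIGH_WARN}C")
        return 2  # non-zero exit so the monitoring system raises an alert
    print(f"OK: cage temperature {celsius:.1f}C")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```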
We have uncovered two bugs in the software running on our tail servers that, together, can leave them unable to serve disks to newly created Cloud servers, or to existing Cloud servers that are restarted.
- We found that the disk process manager was not correctly restarting failed disk migrations, leaving them in a stalled state and preventing further migrations from starting.
- Additionally, we found that this disk process manager could then get stuck if it received a command to remove one of these stalled disk migrations, after which it ignored further updates from the brain. This has the effect of preventing Cloud servers with disks on that tail from starting.
Prior to the launch of the backup feature these issues were rarely triggered, as disk migrations were infrequent and manually initiated. Now that automatic migrations are happening all the time, the bugs are being triggered more often.
The problem typically affects one tail at a time. When it does, it prevents that tail from making disks available to newly started servers. Servers whose disks are already being served by the tail in question remain unaffected.
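To make the two failure modes concrete, here is a deliberately simplified sketch of the behaviour a disk migration manager needs: failed migrations are retried rather than left stalled, and removing a stalled migration never stops the manager from handling further commands. This is not the code running on our tails; all names and the simulated "copy" are invented for illustration.

```python
# Simplified, hypothetical model of the failure modes described above -- not the
# real tail/flexnbd code.
import random

class MigrationManager:
    def __init__(self):
        self.pending = {}  # disk_id -> number of attempts so far

    def request_migration(self, disk_id):
        """Record a migration requested by the brain."""
        self.pending.setdefault(disk_id, 0)

    def remove_migration(self, disk_id):
        """Removing a stalled migration must never wedge the manager (bug 2)."""
        self.pending.pop(disk_id, None)

    def tick(self):
        """One pass over outstanding migrations; failures are retried (bug 1)."""
        for disk_id in list(self.pending):
            self.pending[disk_id] += 1
            if random.random() > 0.3:  # pretend the disk copy succeeded
                print(f"{disk_id}: migrated after {self.pending[disk_id]} attempt(s)")
                del self.pending[disk_id]
            else:
                print(f"{disk_id}: attempt {self.pending[disk_id]} failed, will retry")

if __name__ == "__main__":
    mgr = MigrationManager()
    mgr.request_migration("disk-a")
    mgr.request_migration("disk-b")
    mgr.remove_migration("disk-b")  # removal stays non-blocking even if stalled
    while mgr.pending:
        mgr.tick()
```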
We are taking the following actions:
* Complete deployment of the updated flexnbd around the clusters, which should limit the number of queued migrations on tails. Currently in progress, target date of 20th April.
* Complete diagnosis of the second bug that causes tails to become unresponsive to further requests. Currently in progress, target date of 4th May.
As part of addressing the Meltdown and Spectre vulnerabilities revealed in January, we accelerated the already-planned upgrade of all the heads to the most recent stable release of Debian GNU/Linux, along with new code to supervise the user-accessible SSH servers that give customers access to the consoles of their Cloud servers. This new code moved the supervision of the SSH processes to an external process supervision daemon instead of using our own code. We believe it is this change that has occasionally triggered a problem on heads where the failure of one Cloud server can impact the running of other Cloud servers and make it hard to start servers back up again.
The impact of this issue is:
* Individual heads are affected at any one time, where a random Cloud server may be stopped following a failure when starting another Cloud server.
We are taking the following action:
* We are rolling back the code on the heads to move the process supervision back to our own code, with the aim that the failure of one Cloud server can no longer impact others (a rough sketch of that kind of per-server isolation follows below). We expect this work to be completed in the next two weeks.
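As a rough illustration of the property we want from console supervision, here is a minimal sketch in which each Cloud server's console process is monitored and restarted independently, so one process exiting cannot take the others with it. The commands and names are placeholders; this is not the supervision code that runs on our heads.

```python
# Illustrative per-server supervision: each console process is watched and
# restarted on its own, so one failure does not cascade to other servers.
import subprocess
import time

# Placeholder commands; in reality each entry would be the SSH console
# process for one Cloud server on the head.
CONSOLES = {
    "vm-1": ["sleep", "3600"],
    "vm-2": ["sleep", "3600"],
}

def supervise(consoles: dict) -> None:
    """Run until interrupted, restarting any console process that exits."""
    procs = {name: subprocess.Popen(cmd) for name, cmd in consoles.items()}
    while True:
        for name, proc in procs.items():
            if proc.poll() is not None:  # this console has exited
                print(f"{name}: console exited ({proc.returncode}), restarting")
                procs[name] = subprocess.Popen(consoles[name])
        time.sleep(1)

if __name__ == "__main__":
    supervise(CONSOLES)
```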
There have been a couple of other issues over this period that have not been as severe. We have had some large Distributed Denial-of-Service (DDoS) attacks (the most recent being in excess of 30 Gbps) which affected the network as a whole. We react to these as swiftly as we can in order to mitigate them, and are always looking for better ways to do so.
There have also been performance issues with Cloud servers running OpenBSD since the mitigations for the Meltdown vulnerability were put in place. We are working on rolling out new virtual CPU profiles, new versions of qemu and microcode patches to our host machines, which will fix the specific issue affecting OpenBSD as well as mitigate the Spectre vulnerabilities. The new CPU profiles should also offer increased encryption performance, as they will expose the AES-NI instruction set from the host CPU, and should make live migration more reliable, as they guarantee exact virtual CPU matches between any two heads in the cluster; previously we had been running a generic 64-bit x86 virtual CPU profile.
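To illustrate why exact virtual CPU matches matter for live migration, here is a small sketch that compares the CPU feature flags visible on two heads: if a guest has been given a virtual CPU with a flag (such as aes) that the destination head lacks, it cannot safely be moved there. The second file path is a hypothetical saved copy of another head's /proc/cpuinfo, used only for this example.

```python
# Illustrative only: compare CPU feature flags between two heads to judge
# whether a guest exposed with those flags could be live-migrated safely.
def cpu_flags(cpuinfo_path: str = "/proc/cpuinfo") -> frozenset:
    """Return the CPU feature flags from a Linux /proc/cpuinfo dump."""
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                return frozenset(line.split(":", 1)[1].split())
    return frozenset()

if __name__ == "__main__":
    # In practice these would come from two different heads; here we compare
    # the local machine with a saved dump from another host (hypothetical file).
    head_a = cpu_flags("/proc/cpuinfo")
    head_b = cpu_flags("cpuinfo-head-b.txt")

    print("aes (AES-NI) on head A:", "aes" in head_a)
    only_a = head_a - head_b
    only_b = head_b - head_a
    if only_a or only_b:
        print("Heads differ; a guest using these flags may not migrate cleanly:")
        print("  only on A:", sorted(only_a))
        print("  only on B:", sorted(only_b))
    else:
        print("Feature flags match; exact virtual CPU profiles can be satisfied on both.")
```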
The start of 2018 has been challenging for us, with a number of different issues, both internal and external, coming together at the same time. This has resulted in the platform not running to the high standard that we set ourselves and that you have come to expect of us. I now believe that we understand, and have either addressed or are in the process of addressing, all of the issues encountered. I would like to apologise for the interruptions in service you may have experienced and to thank you for continuing to work with us.
We have more detailed information about each of these issues. If you would like specific details, we are more than happy to provide them; drop us an email at firstname.lastname@example.org.