This incident report is a little late, so we apologise for that!
Around 150 of our Cloud customers were affected by a significant outage on Saturday, 22nd April 2017. For around 15 hours, these Cloud Servers were mostly offline/unavailable.
While trying to fix the problem, other Cloud Servers whose virtual drives were running on the same hardware also faced several reboots and degraded performance.
We sincerely apologise again to all of you that were affected.
Issue #1: Virtual drives for some Cloud Servers became unavailable, meaning those Cloud Servers were essentially offline/unusable.
Cause: A critical failure in the filesystem technology (Btrfs) that we use to store your virtual drives.
When: Approximately 12:00-03:00.

Issue #2: While trying to fix the above problem, some Cloud Servers were rebooted several times and faced degraded disk performance.
Cause: Manually triggered reboots of the hardware.
When: At least 3 reboots between 12:00-23:00, and around 30-60 minutes of degraded performance after each reboot.
We encountered a novel failure case, which we now fully understand, and we've put appropriate monitoring in place across our infrastructure to prevent it from recurring.
If you're into Linux system internals, here's a longer explanation.
Your Cloud Servers run on a combination of hardware called "heads" (CPU and RAM) and "tails" (discs/storage). Tails have multi-terabyte RAID 5 arrays of SSD drives for our standard-grade storage, and RAID 1 arrays of hard drives for archive-grade storage.
tail21 holds two standard-grade storage pools (i.e. two RAID 5 arrays of SSDs). Each pool is formatted with a Btrfs filesystem and holds many virtual drives.
One of the storage pools on tail21 became read-only at 12:00; we didn't initially know why. The other storage pool was unaffected. In an attempt to restore service as fast as possible, we rebooted the tail. This would cause a brief outage for the good pool, but we hoped it would restore service for both pools within minutes.
Unfortunately, disk performance was severely degraded for around 30-60 minutes due to the massive load on the tail after the reboot, but at least the bad pool was online again. We attempted to live-migrate virtual drives away from the bad pool, but it became read-only again not long after.
The filesystem was behaving as if it were full, even though it didn't appear full. At this point we noticed that Btrfs metadata use was almost at its limit. Btrfs allocates most of the available space for data, but reserves some space for metadata. If the metadata space fills up, Btrfs becomes very unhappy: writes start failing with "no space" errors, even though plenty of data space remains free.
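To make that failure mode concrete: ordinary tools like `df` report overall free space, while `btrfs filesystem df` breaks the allocation down by type. Here's a minimal sketch of spotting exhausted metadata in that per-type breakdown (the pool figures below are hypothetical, not tail21's real numbers):

```python
import re

# Example output in the shape `btrfs filesystem df <mountpoint>` prints.
# The figures are hypothetical: data looks comfortable, metadata is nearly full.
SAMPLE = """\
Data, RAID5: total=10.50TiB, used=9.80TiB
System, RAID5: total=32.00MiB, used=1.20MiB
Metadata, RAID5: total=48.00GiB, used=47.95GiB
"""

UNITS = {"KiB": 1 << 10, "MiB": 1 << 20, "GiB": 1 << 30, "TiB": 1 << 40}

def to_bytes(size):
    """Convert a size like '48.00GiB' into bytes."""
    value, unit = float(size[:-3]), size[-3:]
    return value * UNITS[unit]

def usage_by_type(report):
    """Parse per-type output into {allocation type: used/total ratio}."""
    usage = {}
    for line in report.splitlines():
        m = re.match(r"(\w+), \w+: total=([\d.]+\wiB), used=([\d.]+\wiB)", line)
        if m:
            kind, total, used = m.groups()
            usage[kind] = to_bytes(used) / to_bytes(total)
    return usage

usage = usage_by_type(SAMPLE)
# Data is ~93% used, but metadata is ~99.9% used -- writes will soon fail
# with ENOSPC even though `df` still shows terabytes free.
print({kind: round(ratio, 3) for kind, ratio in usage.items()})
```

In the real incident the symptom was exactly this mismatch: the pool "looked" like it had space, but the metadata allocation had nothing left.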
Unfortunately, Btrfs rebalance operations (to free up metadata space) failed and wouldn't complete before the filesystem became read-only again.
We monitor all filesystems in our internal infrastructure, and an on-call engineer is paged if any become too full (or run out of inodes). However, we hadn't been monitoring Btrfs metadata use, as it hadn't crossed our minds as a possible failure case.
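The check we've since added is conceptually simple: page someone well before metadata allocation gets anywhere near full. A hedged sketch of that kind of alert (not our actual monitoring code; the threshold and pool paths are illustrative, and the used/total ratios would come from parsing `btrfs filesystem df` on each tail):

```python
# Sketch of a Btrfs metadata-usage alert. Threshold and pool names are
# illustrative assumptions, not our production configuration.
METADATA_ALERT_THRESHOLD = 0.80  # page long before Btrfs hits the wall

def check_pools(pool_ratios, threshold=METADATA_ALERT_THRESHOLD):
    """Return the pools whose metadata used/total ratio exceeds the threshold."""
    return [pool for pool, ratio in pool_ratios.items() if ratio > threshold]

# Hypothetical readings: one healthy pool, one heading for trouble.
readings = {"/srv/pool0": 0.41, "/srv/pool1": 0.97}
for pool in check_pools(readings):
    print(f"PAGE: Btrfs metadata on {pool} is {readings[pool]:.0%} full")
```

The point of the low threshold is that a rebalance to free metadata space needs room to work; by the time metadata is actually full (as it was on tail21), the rebalance itself can fail.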
Our problems were compounded when the cluster interaction between the central database and tail21 failed, due to another failure case we hadn't prepared for. Frustratingly, this prevented migrations from working, and it took a few hours to resolve. Once we'd decided the bad pool couldn't be fixed, and had resolved the migration issues, we started to migrate drives away from the good pool. After a few more hours, the good pool was empty.
Finally, we created a fresh Btrfs filesystem on the good pool, copied all of the virtual drives from the bad pool onto the good pool, and brought those virtual drives back online by around 03:00.
A few of the customers affected also had subsequent problems with their filesystems, most likely due to a combination of the bad pool running out of metadata space and the unexpected tail reboots. Most were resolvable with a forced filesystem check/repair. One customer had to perform a partial restore from backups, and another customer had to perform a full restore from backups.
We encountered a novel failure case in the filesystem technology that we use. While troubleshooting the outage, we developed an understanding of this failure case. Armed with this knowledge, we've rolled out additional automated monitoring (of Btrfs-specific filesystem statistics) across areas of our infrastructure that use Btrfs to prevent this failure case from happening again.