Bytemark had a series of incidents on our software infrastructure, network, hosting facilities and ancillary systems last Tuesday, 11th April. These caused significant loss of service for many of our customers, including down time for both Symbiosis email and much of our Cloud Servers (BigV) platform.
There was a combination of major and minor lapses in service for our customers, and fixing the systems that caused them is currently the sole focus of the relevant members of staff.
There were five incidents which affected each other, so we summarise them all together, but each has its own set of remedial actions. I'm happy to answer any questions on any of these issues, and will reflect on our overall handling in a blog post later in the week.
Symbiosis email outage
Summary Customers running Bytemark's Symbiosis Linux distribution were unable to check email
Cause Debian security update for Dovecot
A new Debian security update was released for Dovecot, the IMAP & POP3 server which Symbiosis uses for email mailboxes. Debian has a history of stability, so it's very rare for routine software updates to cause any issues. Unfortunately, on this occasion there was a problem with the update that Debian provided, which denied all mail logins on Symbiosis systems. While waiting for Debian to release a fix for their security update, we released instructions on how to revert the software update. Later in the day, Debian released a fix and we published instructions on how to apply it immediately without having to wait for automatic system updates: https://forum.bytemark.co.uk/t/fixing-the-symbiosis-imap-login-issue/2602
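For illustration, reverting a broken Debian update generally amounts to installing the previous version from the archive and holding the package until a fixed update lands. The sketch below is a hypothetical dry run, not Bytemark's published instructions: the version string is a placeholder, and the function only prints the commands it would execute.

```shell
# Hypothetical sketch of reverting a Debian package update and holding it.
# Dry run: the function prints the commands it would run; remove the echoes
# and run as root to apply for real.
revert_and_hold() {
    pkg=$1
    prev=$2   # previous version string, e.g. taken from 'apt-cache policy'
    echo "apt-get install -y ${pkg}=${prev}"
    echo "apt-mark hold ${pkg}"
}

# Placeholder version: list the real candidates with 'apt-cache policy dovecot-imapd'
revert_and_hold dovecot-imapd "PREVIOUS_VERSION"
```

Once a fixed update is published, `apt-mark unhold dovecot-imapd` lets automatic updates resume.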
Remedial action None - it is not practical or desirable to delay the automatic installation of security updates.
Network outage in Manchester
Summary A proportion of Bytemark's internal and external connectivity was interrupted for four minutes.
Cause Router fault
Between 0855 and 0859, some customers reported a loss of network connectivity, specifically affecting data centre services in Manchester. On further investigation by our network engineers, the outage was traced to a failure in one of the line cards of one of the two Manchester-based core routers (cr4.man, in Kilburn House). The router automatically reloaded the line card and service resumed without intervention.
Remedial action An emergency software upgrade for the router was scheduled for later in the evening, as a precaution against future occurrences of the same fault. This work was completed without any further effect on services. The detailed diagnostic monitoring for the routers was found to be working, but not alerting. This has been rectified, and a new monitoring canary will be added to avoid silent monitoring failures in the future. This diagnostic monitoring would not have prevented or shortened the outage, but would have allowed more rapid diagnosis of the cause after the event. The network team are also looking at whether configuration changes could be made to the core network to enable faster failover during similar failures, while not introducing instability into the network design.
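As a sketch of the kind of monitoring canary described above (the heartbeat file path and staleness threshold are our own illustrative choices, not Bytemark's implementation): the monitoring system touches a heartbeat file on every successful poll, and a separate periodic job alerts if that file goes stale, so a silently-dead monitor is itself noticed.

```shell
# Hypothetical monitoring canary: alert when the heartbeat file that the
# monitoring system is expected to touch on every poll goes stale.
check_canary() {
    file=$1
    max_age=$2   # maximum acceptable age in seconds
    now=$(date +%s)
    mtime=$(stat -c %Y "$file" 2>/dev/null || echo 0)
    age=$((now - mtime))
    if [ "$age" -gt "$max_age" ]; then
        echo "ALERT: monitoring canary is stale (${age}s old)"
        return 1
    fi
    echo "OK: canary is ${age}s old"
}

# Demo against a freshly-created file (real path and interval are assumptions)
heartbeat=$(mktemp)
check_canary "$heartbeat" 300
rm -f "$heartbeat"
```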
Cloud Server (BigV) outage in York
Summary Initial failure of some Cloud Servers in York, followed by a halt and restart of all Cloud Servers in York
Cause Supervision software dealing inappropriately with an interruption to network connectivity.
NB: Our BigV architecture document might help with some of the terms in this report.
The network issue at 0900 interrupted traffic between the York heads and the BigV brain in Manchester. Three of the heads saw issues immediately: they either disconnected from the brain or some of their Cloud Servers stopped running. At the same time, the brain started logging errors when trying to speak to the York heads. As part of a routine process to recover connectivity to the affected heads, the brain process was restarted. At this point all the head processes in the York cluster restarted and all the Cloud Servers in York crashed. Over the next 80 minutes or so, the servers restarted under the control of the brain, until at 1140 all servers were running again.
Remedial action The code that controls the connections between the brain and the rest of the components is already being re-written, since similar connection-related issues have been seen before. In addition, the code that runs on the head is being refactored so that the head control code can crash completely without affecting the machines that are running. These projects will not be completed in the next couple of weeks, so additional code has been added to the brain, heads and tails to catch exceptions that were not previously handled well - this should be rolled out across the cluster by the end of next week (22nd April).
Status page outage
Summary Customers were sporadically unable to view the status page, and staff were unable to update it as often as they needed to
Cause Unprecedented traffic to the status page application
Due to the series of problems that our customers faced above (particularly the outage affecting all Cloud Servers in York), traffic to https://status.bytemark.org reached unprecedented levels. The application was unable to cope with the load, which meant we couldn't make any new posts and customers couldn't access the site - it frequently showed 503 errors to visitors. This made it more difficult for us to communicate with affected customers, leaving them in the dark. We used our Twitter account (@Bytemark) to communicate while fixing the status page application, and made some configuration changes to get the status page back up, such that we could then use it to keep everyone informed of the remaining issues.
Remedial action Increase the number of concurrent connections the status page can handle, and add caching in front of the application.
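As an illustration of the caching change: a reverse-proxy cache with a short TTL, plus serving stale copies on errors, absorbs most of a traffic spike while keeping the page near-fresh. The fragment below is hypothetical - we are assuming an nginx reverse proxy here, and the addresses, paths and timings are placeholders rather than the actual status page configuration.

```nginx
# Hypothetical nginx fragment: cache the status page briefly and serve
# stale copies when the application errors, so a traffic spike hits the
# cache rather than the application itself.
proxy_cache_path /var/cache/nginx/status levels=1:2
                 keys_zone=statuscache:10m max_size=100m;

server {
    listen 80;
    server_name status.bytemark.org;        # illustrative only

    location / {
        proxy_pass http://127.0.0.1:8080;   # the status application
        proxy_cache statuscache;
        proxy_cache_valid 200 30s;          # short TTL keeps updates visible
        proxy_cache_use_stale error timeout updating http_503;
        proxy_cache_lock on;                # collapse concurrent cache misses
    }
}
```

A short `proxy_cache_valid` keeps the trade-off sensible: during an incident the page stays reachable under load, but new posts still appear within seconds.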
Support ticketing system outage and delays
Summary Intermittent and then full outage of the support system; customer emails queued and not responded to.
Cause Only partially diagnosed - emails not responded to due to a change applied previously to incoming email routing
Our ticketing system's web interface became intermittently unavailable during the day. The cause remains unknown despite investigation during the day; further investigation is ongoing. What exacerbated the issue was that while the support system was down, incoming email was queued but not retried frequently enough, meaning approximately 100 tickets went unanswered until late in the afternoon, when the queue was discovered. A change had recently been made to the incoming email behaviour to deliberately queue mail rather than reject it if the support system was down. When this change was made, the retry timers were not adjusted, so they defaulted to retrying slowly, and no alerting was set up to check for messages on the incoming mail queues that were destined for our support system.
Remedial action We are looking to accelerate the upgrade to a newer version of RT, deployed in a different fashion from the current version; this will enable easier debugging in a similar situation. We will also tune the retry timers and set up alerting on our inbound mail exchangers for messages that are waiting to be delivered to the support system.
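As a sketch of the queue alerting described above (the threshold and the way the count is obtained are our assumptions, not the final implementation): a periodic job counts messages waiting for the support system and pages when the count climbs.

```shell
# Hypothetical alerting sketch: page when too many messages destined for
# the support system are sitting in the inbound mail queue. On an Exim MX
# the count might come from something like:
#   exim -bpr | grep -c 'support@'     (address pattern is a placeholder)
alert_if_queued() {
    count=$1
    threshold=$2
    if [ "$count" -gt "$threshold" ]; then
        echo "ALERT: ${count} messages queued for the support system"
        return 1
    fi
    echo "OK: ${count} messages queued"
}

# Demo with illustrative numbers
alert_if_queued 3 20
```

Run from cron every few minutes, this would have surfaced the ~100 queued tickets within minutes rather than hours.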