Downtime
Incident Report for Agency Revolution
Postmortem

Extended Downtime Friday Morning (US)

Late Friday night - at least for me in Vietnam - I was jet-lagged and already asleep when I heard my phone ring. Not fully awake, I didn't think much of it, but then a second ring told me it was urgent, so I grabbed my phone to check our systems. To my surprise, all of our customer-facing websites were down.

First line of defense

We aim for 99.95% uptime on all of our web properties, so when something goes down we want to know ASAP so we can remedy the situation. To help us stay on top of things we use the Pingdom service, which has hundreds of servers around the globe checking your sites constantly. Should one of the sites go down, Pingdom waits for a couple more of its servers to confirm the downtime, then alerts our team.
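To illustrate the idea, here's a minimal sketch in Python of that confirm-before-alerting pattern. This is not Pingdom's actual implementation; the URL, probe count, and alert step are illustrative assumptions.

```python
# A minimal sketch of confirm-before-alerting uptime monitoring.
# NOT Pingdom's implementation; URL and thresholds are assumptions.
import urllib.request
import urllib.error

SITE_URL = "https://example.com/"   # hypothetical monitored site
CONFIRMATIONS_NEEDED = 3            # assumed number of probes that must agree

def probe(url: str, timeout: float = 10.0) -> bool:
    """Return True if the site answers with an HTTP 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False

def check_and_alert(url: str) -> None:
    # One failed probe could be a network blip; require several
    # independent failures before declaring the site down.
    failures = sum(1 for _ in range(CONFIRMATIONS_NEEDED) if not probe(url))
    if failures == CONFIRMATIONS_NEEDED:
        print(f"ALERT: {url} appears down ({failures} probes failed)")
    else:
        print(f"OK: {url} is up")

if __name__ == "__main__":
    check_and_alert(SITE_URL)
```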

Several team members, myself included, have a special alarm in the Pingdom app to alert us to any downtime.

What surprised me wasn't the phone call late at night. I'm used to that! Rather, I was surprised that my Pingdom alarm didn't go off. In fact, when I checked Pingdom, it showed all sites as live even though they were not.

This time was different

While I don't know exactly what was different, Pingdom said the issue was due to a configuration problem. This problem hadn't occurred in the past, so perhaps something on their end had changed.

As you may have noticed, when your website is down we give you a link to this Status site so you can see what's going on. While our engineers work on the situation, we document changes there so you have an idea of when we may be back.

The issue was that this page was returning a status code 200 OK, which (as of recently) told Pingdom the site was live, not down.

We've adjusted this so that any time we're down and you're redirected to the Status message, we return a 500 Server Error instead. This tells Pingdom that we are, in fact, down.
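For the curious, here's a rough Python sketch of what that change amounts to. It is not our production code; the site_is_down() check and the page body are illustrative assumptions.

```python
# A minimal sketch of the status-code fix, not production code.
# The maintenance check and page body are illustrative assumptions.
from http.server import BaseHTTPRequestHandler, HTTPServer

def site_is_down() -> bool:
    """Hypothetical check; production would inspect real health state."""
    return True  # pretend we are mid-incident

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"We're experiencing an outage. See our status page for updates."
        # Returning 500 (not 200) tells uptime monitors the site is down,
        # even though we still serve a human-readable status message.
        self.send_response(500 if site_is_down() else 200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), StatusHandler).serve_forever()
```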

OK, that explains the slow response. What caused the issue?

Once we discovered the downtime, we quickly realized that one of our clustered drives had gone down. The short-term solution was relatively simple: get the clustered drives back online. We were able to do that quickly, restoring all websites.

Until we find the root cause, the medium-term solution is to move the data off the clustered array that went down onto a new array. We've already done this for the websites and are doing the rest for the integration services now.

Long term fix?

We're working with our clustered drive vendor to find out what happened and why the high availability didn't kick in. You see, hardware failure is to be expected in high-performance computing. Our systems are designed to absorb many failures and still maintain performance and availability. But in this case the vendor's software failed to fail over automatically.
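Conceptually, automatic failover looks something like the Python sketch below. It is not the vendor's software; the node health check, timing, and promotion step are illustrative assumptions about how a cluster manager might behave.

```python
# A minimal sketch of the automatic-failover idea, not the vendor's
# software. Health check, timing, and promote step are assumptions.
import time

HEARTBEAT_INTERVAL = 1.0    # seconds between health checks (assumed)
MISSES_BEFORE_FAILOVER = 3  # tolerate brief blips before acting (assumed)

def primary_is_healthy() -> bool:
    """Hypothetical check; a real cluster manager probes the disk/node."""
    return False  # simulate the failed clustered drive

def fail_over_to_standby() -> None:
    print("Promoting standby node to primary and remounting the array")

def monitor() -> None:
    misses = 0
    while True:
        if primary_is_healthy():
            misses = 0
        else:
            misses += 1
            # Only fail over after repeated misses, to avoid flapping
            # on a transient glitch.
            if misses >= MISSES_BEFORE_FAILOVER:
                fail_over_to_standby()
                return
        time.sleep(HEARTBEAT_INTERVAL)

if __name__ == "__main__":
    monitor()
```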

We are actively working with the vendor to accurately trace the root cause. We will update this postmortem when we find the cause and plan a solution.

Posted Dec 22, 2014 - 01:35 PST

Resolved
We're back. The issue was with our cluster manager, which ensures the system can stay up through hardware problems. It did not properly detect a disk issue and fail over to the other node. We'll work with the vendor over the next few days to identify the cause of this issue so it doesn't happen again.
Posted Dec 19, 2014 - 09:02 PST
Identified
We've found the cause and will have a resolution shortly. Sorry for the delay; our uptime reporting software failed us in this case. We'll identify why it didn't detect the issue and resolve that for the future as well.
Posted Dec 19, 2014 - 07:54 PST
Investigating
All websites appear down to me, but not to others. Sorry for the delayed response; our uptime monitoring service didn't detect this. We're investigating now.
Posted Dec 19, 2014 - 07:32 PST