On January 31st, during upgrades at our Los Angeles data center, we pulled the wiring to the wrong network switch, taking 75 sites offline. These are sites that our clients’ businesses and users rely on.
We had planned carefully and tested frequently throughout the upgrade process, but the data center had mislabeled two switches, and we pulled the wrong one.
As you can imagine, this was extremely stressful for everyone. We know how important uptime is to each of you, and the situation required very careful handling to make sure we prevented data loss while bringing everything back online.
To top it off, Peter and Brian, who were both onsite, had already worked through the night to keep disruption to a minimum. The outage happened at the very end of the maintenance window, which meant they had to work through the next day on very little sleep.
In the end, we were able to start bringing sites back online within four hours, with the last ones restored at the 11.5-hour mark, all without any data loss.
What we learned in the process is that while prevention is important, things will go wrong on occasion. (For us, it’s been more than 10 years since we’ve had something go this wrong.) When that happens, it can feel catastrophic. But in the end, it’s your response that matters most.
1. Communication is key
A situation like this deserves our full attention with the goal of getting it fixed ASAP. In the midst of that type of focus, it can be easy to lose sight of the need to keep everyone updated on what is happening, what steps are being taken, and when we expect to have it fixed.
Lack of communication can exacerbate an already stressful situation, though, so we had team members updating our status feed on Twitter as well as replying to messages through a variety of channels throughout the day.
Proactive communication is key, and we tried really hard to provide regular updates. But responding to questions is also important. Emails, tweets, and Facebook messages can’t be ignored until later because that just increases the level of frustration that customers and clients experience.
2. Don’t shift the blame
When something goes wrong, our first response is often defensiveness. There’s a temptation to point out why the problem isn’t really your fault to protect your reputation and ego.
Rarely does this make things better!
In this situation, the outage was caused by switches that were mislabeled by the data center where our servers are located. This is factually correct and helps explain what went wrong so that we can prevent it from happening again in the future, but we still took (and continue to take) responsibility for the outage.
Our clients hire us, not the data center, to host their sites, so we won’t shift the blame onto the data center. What we learned was that we need to double-check every detail of each other’s work because that responsibility ultimately lies with us.
3. People value transparency
It’s tempting to gloss over mistakes or issues and focus only on the fix, but people appreciate knowing what went wrong, what steps you’ve taken to fix it, and how you’ll prevent it from happening again.
Taking the time to share those details, whether in social media updates, an email to those affected, or a blog post like this one, helps restore the confidence of your customers and clients.
4. Make it right
A hosting credit doesn’t undo the time that our clients’ sites were offline, but it does demonstrate that we understand the seriousness of an outage and that we’re committed to making it right.
How you “make it right” can vary from situation to situation depending on your business model, industry, the seriousness of the problem, etc. The important part is to focus more on the needs of your clients and customers than on minimizing the immediate impact to your business.
5. Learn from the situation
When things go wrong, we have an opportunity to learn from the situation to do better in the future.
For us, this meant identifying what had gone wrong (a mislabeled switch) and brainstorming ways to prevent that in the future (double-checking the labels rather than assuming they’re correct).
We discovered that we need a better system for communicating updates so that the engineers on site can stay focused on the fix rather than composing tweets to keep our clients updated.
During our internal debrief, we also took the time to highlight what went well during a less-than-ideal situation, both as encouragement to our team and so we’d have a record of those things if a situation like this ever happens again.
No one likes it when things go wrong. In addition to the stress of trying to find a fix, there’s anxiety around what it will mean for your reputation and the future of your business.
But experience shows us that clear communication, taking responsibility, being transparent, and looking for ways to make it right can turn a difficult situation into an opportunity to build trust with your clients and customers. And there’s always an opportunity to learn from both the things that went wrong and the things that went well!