Apr 25 2018
Sitting On an Outage Bridge? Here’s How to Get Off the Call
Regardless of the root cause, outage bridges are a fact of life in every organization. But they shouldn’t have to be.
It’s Friday at 4 p.m. and the call comes in. You’ve just lost production apps, and it’s your customer letting you know. The clock starts ticking, the adrenaline is flowing, and you begin the process of dealing with a Sev1 outage on a Friday afternoon. No happy hour this week.
We’ve all been there: stuck for hours on an outage bridge with no end in sight. Whether it’s due to a storage issue, a server failure, or, worst of all, a full facility power outage, we gather multiple teams of people (application owners, engineers, vendors, and executives) and then slowly pore over the logs, sometimes all night, searching and waiting.
“What are you seeing?” and “Can you get this person/team to join?” are the two most frequently asked questions, until a frustrated executive asks, “Do we know anything yet?”
Hours (and sometimes days) go by looking for the issue, assessing the impact, and figuring out how to recover critical services. It’s tedious, time-consuming, and frustrating. And in today’s on-demand world, it can ruin consumer perception of your brand or service.
So why do we keep doing it this way? Partly because that’s how it’s always been done. More importantly, it’s because of a lack of overall planning and understanding of how applications and architectures interrelate, or because we’ve become dependent on firefighting rather than developing appropriate incident handling plans.
Outage bridges are a result of the reactive response to a tactical issue. A problem arises, most often through end user reports, and we try to figure out what is going on and resolve it. The larger the enterprise, the more people get involved, complicating and dragging out the recovery and resolution.
Whether you run traditional infrastructure, public/hybrid cloud instances, or IaaS/SaaS, unplanned outages do occur. Rather than waiting for one to happen and then rallying the troops to try to figure out what to do about it, why not take a proactive approach?
Planning, visibility, practice and redundancy
Information security teams have been refining their incident handling skills for over a decade as the threat landscape has evolved. The sheer number of attacks has driven automation and robust planning for enterprises and managed security service providers.
The SANS Institute has been instrumental in helping to establish the proactive security team model through training and certification. It has developed incident handling guidelines that the broader IT organization could use to build better incident response plans and more robust operations management. The methodology and its phases can be applied across other areas of IT: planning/preparation, identification, containment, eradication, and recovery.
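As a rough illustration of how those phases could be tracked for a general IT outage, here is a minimal Python sketch. The Phase and Incident names are hypothetical and are not taken from any SANS material; the only thing the sketch assumes is that an incident moves through the five phases listed above in order.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The five phases named above, in order."""
    PREPARATION = 1
    IDENTIFICATION = 2
    CONTAINMENT = 3
    ERADICATION = 4
    RECOVERY = 5


class Incident:
    """Hypothetical incident record that moves forward one phase at a time."""

    def __init__(self, summary):
        self.summary = summary
        self.phase = Phase.PREPARATION
        self.history = [Phase.PREPARATION]

    def advance(self):
        """Move to the next phase; raise if the incident is already in recovery."""
        if self.phase is Phase.RECOVERY:
            raise ValueError("incident is already in the final phase")
        self.phase = Phase(self.phase + 1)
        self.history.append(self.phase)
        return self.phase
```

Keeping a history of phase changes gives the team something concrete to review once the incident is closed.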
Planning is the first step to getting ahead of any outage, but few companies invest adequately in building an incident plan. Poor planning results in excessive guesswork and response teams that are too thinly staffed or outsourced.
Planning is typically a task that falls to the operations team; the missing element is the partnership with application, architecture, and domain owners. The larger the organization, the more disparate the teams. Documentation is static and proactive sharing of end-to-end architecture is less likely. With many legacy systems, there must be a deep understanding of application anatomy and how it relates to infrastructure.
A good plan involves not only implementing the right technology, but putting the right people and processes in place, as well as defined policies, communication plans, checklists, and training. Documenting applications, systems, and architectures is paramount to understanding how to solve a problem anywhere in the ecosystem.
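One way to make that documentation actionable is to record which components depend on which, so that a failure can be mapped to the applications it affects. The sketch below is only an illustration under that assumption: the component names and the DEPENDS_ON map are invented, and a real inventory would come from your own documentation or configuration management data.

```python
from collections import deque

# Hypothetical dependency map: each key depends on the components in its list.
DEPENDS_ON = {
    "checkout-app": ["orders-db", "payment-gateway"],
    "orders-db": ["san-array-1"],
    "reporting-app": ["orders-db"],
    "payment-gateway": [],
    "san-array-1": [],
}


def impacted_by(failed, depends_on):
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a component to its dependents.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk outward from the failed component.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


print(sorted(impacted_by("san-array-1", DEPENDS_ON)))
# prints ['checkout-app', 'orders-db', 'reporting-app']
```

With a map like this, the first question on the bridge ("what is affected?") can be answered from the documentation rather than from guesswork.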
Next comes visibility. You might have a great plan, but if you can’t see what’s happening, it’s extremely difficult to formulate a response and recover. Most organizations have multiple tools that each perform an individual task, and alerts fire only after something has gone sideways. There are a number of tools available that monitor performance system-wide; they not only help to identify the issue and solve the problem, but also help IT managers mitigate the impact of an outage.
Firefighters are good at fighting fires because that’s what they do all day long, but there are all kinds of things you need to know as a firefighter before you can get the water flowing. Training, tribal knowledge, and hours of real-world experience help firefighters be safe, efficient, and good at what they do. Netflix’s Chaos Monkey is perhaps one of the most well-known and refined training and testing tools for outages in the industry. It intentionally causes a failure in the company’s infrastructure or application stack to test how teams will respond and to understand how the remaining systems will cope.
Continually practicing how they would respond to an outage also enabled Netflix, and others that use the tool, to build in the redundancy they need to minimize impact to production. Their knowledge, skills, processes, and tools are regularly tested to quickly address service issues or design better resiliency in the architecture.
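To make the idea of a practice drill concrete, here is a minimal simulation in Python. It is not Chaos Monkey and does not use its API; the run_drill function, the instance names, and the required_healthy threshold are all hypothetical, and nothing real is terminated. It only shows the question such a drill asks: if one instance disappears at random, is there still enough capacity left?

```python
import random


def run_drill(instances, required_healthy, rng=None):
    """Simulate losing one random instance and report whether the service survives.

    `instances` is a list of hypothetical instance names; nothing real is touched.
    """
    if rng is None:
        rng = random.Random()
    victim = rng.choice(instances)
    survivors = [name for name in instances if name != victim]
    return {
        "terminated": victim,
        "survivors": survivors,
        # The service is considered healthy if enough instances remain.
        "service_ok": len(survivors) >= required_healthy,
    }


result = run_drill(["web-1", "web-2", "web-3"], required_healthy=2)
print(result["terminated"], result["service_ok"])
```

A fleet of three instances with two required survives the loss of any one of them; a fleet of two with two required does not, which is the kind of gap a drill is meant to surface before a real outage does.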
Once recovery is initiated, the pressure is lessened, but until the cause is known, the issue could recur. The lessons learned throughout the outage need to be examined for areas that need improvement.
Traditional solutions won’t work in the cloud
Netflix and other cloud-first companies built in a framework from the beginning that would give them the visibility they needed, a plan they could practice, and adequate redundancy to keep them online even in the midst of an outage. As companies juggle the cloud, traditional on-premises datacenters, next-generation applications, and legacy platforms, building resiliency and addressing outages is a challenge. These processes must be aligned before you move forward with any cloud migration, adopt new resiliency models, or present new applications to end customers. Legacy processes and managed operations simply won’t suffice or scale in a cloud model.
Uptime is paramount. IT has become a utility to all users, internal and external. If you’re down, you’re losing something: money, customers, transactions, or worse, credibility. At Edge Solutions, we’ve sat in the war rooms working to solve major outages with customers in the Southeast. We have seen firsthand what works, where the blind spots are, and how organizations of all sizes can benefit from a fresh perspective.
The solution to any outage is not throwing more money or bodies at the problem; it’s developing an incident response methodology that works for your individual organization. Being able to continually practice and test that methodology as systems grow and evolve builds a stronger IT organization. Over time, those incidents, and those bridges, will become fewer and farther between.
Look for Part II of this conversation, where we will cover more on planning, identification, isolation, recovery, and root cause analysis, along with how Edge Solutions is working with customers on this methodology. Tell us what you have seen, what works or doesn’t, and whether you’ve found any unique ways to close this long-running gap.