Tearing Down the Outage Bridge: A Deep Dive

Outages occur, but knowing what to do and learning from the problem can make the difference.

In Part I of this conversation, I discussed the frustrating experience of dealing with a Sev1 outage on a Friday afternoon, which is no fun for anyone. Outages will likely never be avoided entirely, but there are a number of things you can do to reduce how often they happen and how long it takes to recover. The payoff is more uptime, whether your goal is protecting your brand image or creating a better customer experience that increases earnings.

Taking Inventory

The only way to effectively manage your IT environment is to take inventory of every aspect of the system. Too often, this is only done as part of mandated disaster recovery planning. Many companies large and small do not have an accurate picture of their IT system infrastructure and dependencies.

With smaller clients, we will sit down and create a map including all hardware, software, and applications, whether purchased or homegrown. Taking inventory of your IT system means more than listing what you have; it also means recording where those things are located and how they relate to and communicate with one another through applications or operating systems.

The bigger the enterprise, the more complex and sprawling the environment, and that is when a Configuration Management Database (CMDB) becomes critical. The CMDB tracks what you have, where it is, and which version it is running, in real time, so that you can understand how the various parts of your system interrelate and which ones can affect others. That is valuable information in the event of an outage. Running management systems that continually update the CMDB will ensure this source of truth stays current and reliable.
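To make the dependency idea concrete, here is a minimal sketch in Python of the kind of question a dependency map lets you answer: if one component fails, what else is affected? The component names and the DEPENDS_ON structure are hypothetical and chosen only for illustration. A real CMDB would hold far more detail and would be populated by discovery tools rather than by hand.

```python
from collections import deque

# Hypothetical inventory: each component maps to the components it depends on.
DEPENDS_ON = {
    "web-frontend": ["order-api"],
    "order-api": ["orders-db", "auth-service"],
    "auth-service": ["ldap"],
    "reporting": ["orders-db"],
    "orders-db": [],
    "ldap": [],
}


def impacted_by(failed, depends_on):
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a component to the things that use it.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk upward through the dependents.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)


if __name__ == "__main__":
    print(impacted_by("orders-db", DEPENDS_ON))
```

In this made-up inventory, a failure of orders-db reaches order-api, reporting, and web-frontend, which is exactly the sort of blast-radius answer you want to have ready before the outage bridge opens.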

Taking an accurate inventory of your system isn’t only about hardware and software; it’s also about knowing who knows what. There have been plenty of cases where a single architect of the system is the only one who truly knows how it works. If that person retires or otherwise leaves the company, it creates a significant gap in tribal knowledge.

As any company grows, creating an archive of individual knowledge is the only way to ensure it gets passed on. Living repositories of information built from Slack or other archived communication channels are a good starting point, but it is important to make time to document and organize that information.

Minding the Canaries

In most cases, there are indications that an outage is imminent. They can range from a small but steady increase in transaction round-trip time on a web-based application to a row of machines in the data center that suddenly lose power. The key is knowing what those indications are, how to read them, and what to do when they occur.
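A slow creep in round-trip time is easy to miss by eye, so here is a minimal sketch of one way to flag it: fit a straight line to the most recent samples and alert when the slope crosses a threshold. The function names, the sample values, and the threshold of 1 ms per sample are assumptions made for illustration, not recommendations. A production monitoring tool would handle noise and seasonality far more carefully.

```python
def slope_per_sample(values):
    """Least-squares slope of `values` against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def is_creeping(round_trip_ms, threshold_ms_per_sample=1.0):
    """True when round-trip time is rising at or above the (assumed) threshold."""
    return slope_per_sample(round_trip_ms) >= threshold_ms_per_sample


if __name__ == "__main__":
    steady_rise = [100, 102, 104, 106, 108]   # hypothetical samples, in ms
    normal_jitter = [100, 101, 99, 100, 100]  # hypothetical samples, in ms
    print(is_creeping(steady_rise), is_creeping(normal_jitter))
```

The point of using a trend rather than a fixed ceiling is that each individual sample in the rising series would still look healthy on its own.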

In many organizations, there are multiple tools used to monitor various aspects of the system, reporting and sending alerts to a kaleidoscope of consoles. Without the proper context and aggregation of data, it is very easy to miss warning signs of pending or ongoing issues. Add in subjective human analysis of each console or tool, and the result can be unpredictable or unreliable.
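As a rough sketch of what aggregation buys you, the Python below takes alerts from several tools and rolls them up per host, so that three separate consoles each showing one alert become a single line saying one machine is in trouble. The alert fields, tool names, and severity levels are hypothetical. Real tools each have their own formats, which would first need to be normalized into a common shape like this one.

```python
from collections import defaultdict

# Assumed severity ordering, lowest to highest.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}

# Hypothetical alerts, already normalized from four different tools.
ALERTS = [
    {"source": "network-monitor", "host": "db01", "severity": "warning"},
    {"source": "storage-console", "host": "db01", "severity": "critical"},
    {"source": "apm", "host": "web01", "severity": "warning"},
    {"source": "facilities", "host": "db01", "severity": "info"},
]


def summarize_by_host(alerts):
    """Group alerts by host and report the worst severity seen on each."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[alert["host"]].append(alert)

    summary = []
    for host, items in grouped.items():
        worst = max(items, key=lambda a: SEVERITY_RANK[a["severity"]])
        summary.append({
            "host": host,
            "worst_severity": worst["severity"],
            "count": len(items),
            "sources": sorted({a["source"] for a in items}),
        })

    # Most severe first, then the hosts with the most alerts.
    summary.sort(key=lambda row: (-SEVERITY_RANK[row["worst_severity"]],
                                  -row["count"], row["host"]))
    return summary


if __name__ == "__main__":
    for row in summarize_by_host(ALERTS):
        print(row)
```

Seen separately, the warning, the critical, and the info alert on db01 might each be shrugged off by whoever watches that console. Seen together, they point at one host that needs attention now.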

A centralized management system is the foundation for effective system visibility, monitoring, and alerting. It's what you need to build if you want to get ahead of a potential outage before it occurs. Start with facilities management of power, cooling, and other environmentals. If this is outsourced to a colo provider, great, but at least have a way to receive this information as close to real time as possible. Hardware management is also important, both for catching individual failures and for recognizing when dominoes are beginning to fall. Finally, management up the stack through the OS and application layers should be as granular as possible and should pull KPIs.

One of the things we often do for clients is to look at the tools they’re using to monitor their systems and find the gaps and the blind spots. We then create a cohesive end-to-end system plan so that the client can make more intelligent, informed decisions before, during, and after an outage.

Building the Resiliency

A plan is not a plan without robust testing, and the only way to know if your monitoring system is effective is to simulate controlled failure. Based on what your tools tell you during the test, you'll know how quickly your team can react to and recover from a problem. It is also important to determine whether you can isolate the issue so that it doesn't create problems in other systems. The end goal is enough redundancy across the overall ecosystem to ride out an outage without it impacting your business.
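One simple way to turn a failure drill into numbers is to record three timestamps (when the failure was injected, when monitoring raised it, and when service was restored) and compute the gaps between them. The sketch below does only that. The function name, the timestamp format, and the example times are all hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format for drill notes


def drill_metrics(injected_at, detected_at, recovered_at):
    """Return time-to-detect and time-to-recover, in seconds, for one drill."""
    injected = datetime.strptime(injected_at, TIMESTAMP_FORMAT)
    detected = datetime.strptime(detected_at, TIMESTAMP_FORMAT)
    recovered = datetime.strptime(recovered_at, TIMESTAMP_FORMAT)
    return {
        # How long the failure went unnoticed.
        "time_to_detect_s": (detected - injected).total_seconds(),
        # How long the service was impaired in total.
        "time_to_recover_s": (recovered - injected).total_seconds(),
    }


if __name__ == "__main__":
    # Hypothetical drill: failure injected at 14:00, alert at 14:03:30,
    # service restored at 14:21.
    print(drill_metrics("2024-01-12 14:00:00",
                        "2024-01-12 14:03:30",
                        "2024-01-12 14:21:00"))
```

Tracking these two figures from drill to drill gives you a trend line for whether your monitoring and your runbooks are actually improving.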

Resiliency isn’t only about how you recover; it’s also about how you prevent a particular failure from happening again. Once you’ve recovered, even temporarily, the pressure is lessened. Teams can sit back and take a deep breath, but then you need to dig in and perform root cause analysis.

This is when the process gets interesting. As we get into root cause analysis, we are in many ways coming full circle: starting the planning stage anew and taking inventory of what we have and what we know about what works and what doesn't.

How Edge Can Help

We like to think of what we do as controlling the chaos: creating an operations methodology rather than haphazardly running around in circles with silos of tools.

Sometimes, that’s as simple as opening up the flows of communication between groups. Other times, it requires strategically implementing technology to make a particular IT environment less complex, such as Avi Networks for better application monitoring and delivery or BigPanda for aggregating events. There are many tools on the market for operations monitoring and management, but there are no silver bullets. We want to give each customer the power to intelligently manage their systems on their own, on or off premises, and leverage technology to deliver operational excellence.

Simply outsourcing your operations isn’t the answer, and moving blind spots to the cloud won’t solve the issue on its own. Whether you have six applications or 600, it’s hard to be successful if you don’t have an accurate picture of your total IT environment, the ability to test your system, and built-in resiliency to maximize uptime.

For many executives, managers, and salespeople, the IT world is a headache, but what it all comes back to is keeping your end customers happy. We want to help make life easier for our customers and provide a better experience for the consumers they serve. For more information on how Edge Solutions can help, contact us today.