From Good to Great: The Path to Improved System and Application Uptime

Mark Mincin, EVP & CIO, Epicor Software

We can thank SaaS providers for raising the bar on system and application availability. Today, 99.5 percent system and application uptime is considered “standard” by most cloud providers, yet that still allows roughly 3.6 hours of downtime per month, a significant speed bump in today’s always-on business environment. Just ask any unlucky airline passenger stranded for hours by a system outage!

Increasingly, this minimum bar is being raised to 99.99 percent, a figure that equates to just over four minutes of downtime per month. The question many IT organizations face is how to meet this uptime imperative without significantly increasing costs. Ensuring the right disciplines (governing people, process, and tools) are in place is a good place to start.
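
To put these percentages in perspective, a quick back-of-the-envelope calculation (sketched below in Python, assuming a 30-day month for simplicity) shows how the downtime budget shrinks as the SLA rises.

```python
# Monthly downtime budgets for common SLA tiers (30-day month assumed).

HOURS_PER_MONTH = 30 * 24  # 720 hours

def downtime_budget_minutes(uptime_percent: float) -> float:
    """Return the allowed downtime per month, in minutes, for a given SLA."""
    return HOURS_PER_MONTH * 60 * (1 - uptime_percent / 100)

for sla in (99.5, 99.9, 99.99):
    print(f"{sla}% uptime -> {downtime_budget_minutes(sla):.1f} minutes of downtime per month")
# 99.5%  -> 216.0 minutes (about 3.6 hours)
# 99.9%  ->  43.2 minutes
# 99.99% ->   4.3 minutes
```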

Start by Selecting the Right SLA

Higher-level SLAs translate into higher costs, and not all workloads require the same SLA level; to this end, it’s important to balance business requirements with fiscal responsibility.

When and how do you need your system or application to be available? Keep in mind that availability requirements may vary with the type of application, the time of day, month, or year, and whether access spans multiple geographies and overlapping time zones.

What are the financial and compliance implications of availability? Are there dependencies on upstream or downstream systems?

If your company is a provider of SaaS or hosting services, group your applications by criticality based on these considerations. If you rely on SaaS providers for your internal needs, make sure you know what the SLAs are and how they are calculated, as well as how penalties are assessed.
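
Penalty and credit schedules differ from provider to provider, so the sketch below is purely illustrative: the tiers and credit percentages are hypothetical, but they show how measured uptime typically maps to a service credit against the monthly fee.

```python
# Hypothetical SLA credit calculation. Real providers define their own
# measurement windows and credit schedules; these tiers are illustrative only.

def measured_uptime_percent(total_minutes: int, downtime_minutes: float) -> float:
    """Uptime actually delivered during the measurement window."""
    return 100 * (total_minutes - downtime_minutes) / total_minutes

def service_credit_percent(uptime: float) -> int:
    """Map measured uptime to a service credit (percent of monthly fee)."""
    if uptime >= 99.99:
        return 0      # SLA met, no credit owed
    elif uptime >= 99.0:
        return 10
    elif uptime >= 95.0:
        return 25
    else:
        return 100

# Example: 50 minutes of downtime in a 30-day (43,200-minute) month.
uptime = measured_uptime_percent(43_200, 50)
print(f"Measured uptime: {uptime:.3f}% -> credit: {service_credit_percent(uptime)}% of monthly fee")
```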

Unify Monitoring for Streamlined Detection and Troubleshooting

Operating in a 99.99 percent uptime world, where time is of the essence, places tremendous importance on problem identification and alerting capabilities that enable agile support and system and application optimization. Monitoring is about finding issues before they become problems and fixing them before users are impacted.

In a 2015 survey, industry analyst firm Forrester found that 91 percent of senior IT decision makers at large North American firms responsible for application, network, and/or service monitoring technology cited problem identification as the primary area needing improvement. Half of these respondents reported that 90 percent of their IT issues take more than 24 hours to resolve. For a provider of IT services, there is nothing worse than an internal or external customer calling the help desk to tell you the system is down when you didn’t know it.

Monitoring systems provide a view of the health, availability, and overall performance of applications and the underlying infrastructure they depend on, along with detailed reports and analytics that can help with troubleshooting. However, IT organizations are often hampered by a lack of end-to-end visibility of their production environments, especially when those environments span different data centers, vendors, and even internal IT teams. Case in point: in the survey referenced above, nearly three-quarters of respondents cited using more than 10 monitoring tools, encompassing Network Performance Management (NPM), Application Performance Management (APM), and log data analytics, to discover issues.

An unwieldy number of silo-specific tools inhibits service triage. The modern computing environment needs centralized monitoring that provides a “single pane of glass” view spanning applications, infrastructure, and security. Fortunately, many application monitoring tools exist today, e.g., AppDynamics, that can fold infrastructure monitoring into a unified application monitoring dashboard and deliver this integrated, single-pane-of-glass view.
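
To illustrate the mechanics behind a single pane of glass, the sketch below merges alerts from hypothetical NPM and APM feeds into one chronologically ordered stream. The field names, sources, and severity levels are assumptions for illustration only, not any particular vendor’s API.

```python
from dataclasses import dataclass
from datetime import datetime
from operator import attrgetter

@dataclass
class Alert:
    """Normalized alert record shared by all monitoring sources."""
    timestamp: datetime
    source: str      # e.g. "NPM", "APM", "logs"
    service: str
    severity: str    # "info", "warning", "critical"
    message: str

def normalize_npm(event: dict) -> Alert:
    # Hypothetical network-monitoring payload -> normalized alert.
    return Alert(datetime.fromisoformat(event["time"]), "NPM",
                 event["device"], event["level"], event["summary"])

def normalize_apm(event: dict) -> Alert:
    # Hypothetical application-monitoring payload -> normalized alert.
    return Alert(datetime.fromisoformat(event["ts"]), "APM",
                 event["app"], event["severity"], event["detail"])

def unified_view(npm_events: list[dict], apm_events: list[dict]) -> list[Alert]:
    """Merge alerts from all sources into one chronologically ordered stream."""
    alerts = [normalize_npm(e) for e in npm_events] + [normalize_apm(e) for e in apm_events]
    return sorted(alerts, key=attrgetter("timestamp"))
```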

The Role of Automation and Machine Learning

The triage-to-fix process in this new world must move faster than humans alone can manage. The key to meeting 99.99 percent uptime is therefore to reduce reliance on manual intervention: build higher resiliency into the infrastructure and lean on machine learning and sophisticated monitoring and alerting to identify and mitigate issues that precipitate downtime before they happen.

Monitoring tools generate an extensive amount of data–but how does IT separate what’s really critical and what’s just a false alarm? This is where today’s leading-edge technologies, such as automation and machine learning, can serve to improve system performance and drive down costs.

Today it’s not uncommon to have hundreds or thousands of systems and applications generating millions of log events per day. Big Data analytics approaches can detect anomalies and exceptions and isolate data that deviates from the model. This automation allows organizations to scale operations with fewer people, creates a sensing, responsive, autonomic fabric that proactively detects performance anomalies, and saves troubleshooting time by guiding engineers toward probable root causes.
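
As a simplified illustration of the idea (not any specific product’s algorithm), the sketch below flags hours whose log-event volume deviates sharply from a trailing baseline. Production systems apply far richer models, but the principle of isolating data that deviates from the norm is the same.

```python
from statistics import mean, stdev

def anomalous_hours(hourly_event_counts: list[int], window: int = 24,
                    threshold: float = 3.0) -> list[int]:
    """Return indices of hours whose event volume is more than `threshold`
    standard deviations away from the mean of the preceding `window` hours."""
    flagged = []
    for i in range(window, len(hourly_event_counts)):
        baseline = hourly_event_counts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(hourly_event_counts[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

# Example: a steady baseline with a sudden burst of log events in the final hour.
counts = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98,
          101, 99, 100, 102, 98, 101, 99, 103, 97, 100,
          102, 98, 101, 99, 5000]
print(anomalous_hours(counts, window=24))  # -> [24]
```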

In today’s business environment, downtime is lost opportunity and lost revenue. Putting the right people, processes, and tools in place provides the service assurance organizations need to build information technology that is highly resilient and delivers the computing power and uptime businesses require today.