18 August 2016
In this world of instant gratification, where waiting more than a few milliseconds for a webpage to load makes us huff and puff, system uptime is paramount to the success of any cloud business. But how is it achieved? Can we make the business so resilient that even the smallest glitches go unnoticed by the consumer? Consumers don't think about the infrastructure that keeps a system running. When they can't access a service, they blame the brand -- not the company powering the website or service.

Here is a great example. Have you ever had your flight delayed because the airline's computer systems went offline? On August 8, passengers on Delta's international and domestic flights were grounded because of a system failure. My flight was grounded for the same reason a few weeks prior to Delta's worldwide outage. 

Being in the business of IT operations, I can accept that these things happen, but other passengers aren't so accepting. It got me thinking that if a system failure like this can happen to a major airline, it can happen to any business. No company is immune to system glitches and unplanned downtime. What it really comes down to is how efficiently companies insulate themselves from these unplanned system issues.

I work for a B2B company that's an omnichannel provider. Our customers largely service consumers (B2C), and I often put myself in their shoes. I do this because I need to understand the consumer experience. My goal as a service provider is to ensure that consumers have the best experience possible from their brand (my customer). Key things that I focus on are the consumer interactions, whether via voice, chat or email. I have to look at who the consumer is, the service I'm providing and the customer experience for that product -- brand loyalty is ultimately what's at stake. 

The main challenge is to prevent customers from experiencing system glitches. Early detection using standard monitoring solutions, self-healing capabilities, and automated system or application updates are critical to maintaining uptime, but are these enough? Not always. The human factor has to be considered. No matter how much procedure and governance is in place, there is still the risk of someone "fat fingering" a change or simply entering the wrong command. We have all had this unfortunate experience, and even the best of the best make this mistake. That is exactly why adherence to change procedures and practices, like service restorations, must be commonplace.

Another component, which I like to call the "illusion of uptime," is orchestrating the infrastructure, applications and supporting systems to give the appearance of uptime. Even the smallest system glitches are still logged and alerted on, but recovery happens so quickly that the customer/consumer never notices.

Think of the experience when you're watching something on Netflix, for example. You're streaming a movie, and all of a sudden, the sound stops. Most of us would assume that our ISP is having an issue. How do we know for sure? Well, we don't. Netflix could very well be having an issue, but they have engineered their systems in such a way so that they can apply thousands of changes without consumers ever knowing -- this is the illusion of uptime, my friends. 

Many companies outsource the management of their critical business applications, phone lines, chat, email and so on so that they can focus on their core business instead. Since a company has to be there for its customers around the clock, cloud companies are best equipped to keep up with uptime. Here's how they do that.

Redundancy
More and more, we're moving to this environment of no downtime or the "Always On" component. Having zero downtime as you do maintenance to your environment is almost impossible, but you can create the illusion of uptime through redundancy. If you have a highly scalable system, you can take one component out of service without interrupting the entire system or flow of data. 

There are challenges with redundancy, especially when data synchronization is required. Replicating data fast enough to a redundant system and site is not impossible, but you should expect some delay, and most often, the data won't be 100% synced. Failover to the redundant system must be seamless and have as little impact on your customer as possible. Having system redundancy, building the redundant paths and having a solid disaster recovery site help to prevent a single point of failure at any time. If your system does go down, you want to be able to seamlessly bring up your secondary site as soon as possible.
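As a rough illustration of seamless failover, here is a minimal Python sketch. The function name call_with_failover and the primary and secondary stand-ins are hypothetical, invented for this example, and a real deployment would usually do this at the load balancer or DNS layer rather than in application code.

```python
from typing import Callable, Sequence


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            # Keep the failure for logging and alerting, then try the next replica.
            errors.append(exc)
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


# Hypothetical stand-ins for a primary site and a disaster recovery site.
def primary(request: str) -> str:
    raise ConnectionError("primary site is down for maintenance")


def secondary(request: str) -> str:
    return f"secondary handled {request}"


print(call_with_failover([primary, secondary], "GET /booking/42"))
```

The caller gets an answer from the secondary site and never sees the primary's failure, while the error is still captured so that operations can be alerted.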

These days, many traditional SaaS and Cloud service companies are unable to scale big and fast enough to keep up with the pace of growth. Building redundancy can be challenging at best. It's become commonplace to use another cloud hosting company, like Microsoft Azure, Amazon or Oracle, as the backbone for a platform, virtual machines, databases and storage. 

When I look at these cloud hosting companies, I know that they will allow me to expand or contract my system within minutes while guaranteeing system availability.

Cloud Hosting
Managing the hardware and software where the system resides is the biggest capital expense for most SaaS or cloud companies. Cloud hosting has become quite popular in the past five to 10 years. These services aren't cheap, but you're paying for the convenience of being able to add services to your virtual cloud infrastructure in five minutes or less, for example, while being able to focus on your core competency of providing service to your customers through your application or offering. In addition, using a cloud hosting provider significantly reduces your total cost of ownership and maintenance overhead because you don't need to employ people who must be at the data center at any given time.

Consumers don't ever see the cloud service provider though. They just want to be able to access the brand at any time, and if that brand's system goes offline, they complain to the brand. If our cloud hosting company that's behind the scenes goes offline, we're on the hook -- not the cloud hosting company.

From a company perspective, I'm closely partnered with my cloud hosting provider. We're constantly striving to improve the service together by finding and breaking down problems. I have to be completely aligned with this company since they're providing the components that run our software. Collaborating with the cloud hosting company is important since the cloud service business has to meet certain uptime thresholds. A brand may lose money if its system goes down, and cloud service companies may have to pay significant fines back to their customers if they can't maintain enough uptime.
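To make those thresholds concrete, the short Python sketch below converts an uptime percentage into the downtime it allows over a period. The percentages and the 30-day period are illustrative assumptions, not figures from any particular contract.

```python
def downtime_budget_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed over `days` at the given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


# Illustrative targets only; real thresholds come from the contract.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows about {budget:.1f} minutes of downtime per 30 days")
```

At 99.9%, for instance, the budget is roughly 43 minutes over 30 days, which is why a single bad change can consume a month's allowance.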

Procedures
I look at our redundancy and whether we can expand and contract, as well as where we can build a disaster recovery site. I have to keep the business running no matter what. The human aspect comes in because solid procedures are necessary.

Whether just for one particular customer or an entire site, you have to be very nimble about pushing changes. When you're pushing thousands of changes per month, week or day, if one of those changes causes an issue or an unscheduled event like downtime, it has to be caught prior to any impact being realized. 

To successfully push out changes, you need a replicated copy of your environment in a nonproduction state so that you can test and vet changes before they're pushed to the live system. We want to think that anything we test in a nonproduction world will work in the production world, but having an exact replica of the production environment is nearly impossible. You can't replicate every customization in your production environment, so you include only the most common customizations in your test environment. We want to test the system in various stages as well, like development, QA and staging.
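One way to picture vetting a change through those stages is the Python sketch below. The stage names come from the paragraph above; the vet_change function and the per-stage checks are hypothetical placeholders for whatever test suites each environment actually runs.

```python
from typing import Callable, Dict

NONPROD_STAGES = ("development", "qa", "staging")


def vet_change(change: str, checks: Dict[str, Callable[[str], bool]]) -> bool:
    """Run the change through each nonproduction stage; stop at the first failure."""
    for stage in NONPROD_STAGES:
        if not checks[stage](change):
            print(f"{change}: failed in {stage}, not promoted to production")
            return False
        print(f"{change}: passed {stage}")
    return True


# Hypothetical checks; the QA check rejects anything labeled as risky.
checks = {
    "development": lambda change: True,
    "qa": lambda change: "risky" not in change,
    "staging": lambda change: True,
}

for change in ("update-routing-rule", "risky-schema-change"):
    if vet_change(change, checks):
        print(f"{change}: cleared for production")
```

Passing every stage does not guarantee success in production, since the test environment only covers the most common customizations, but it stops known failures before they reach the live system.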

These customizations can seem endless when you think of the different paths a customer can take in the system. You don't want an upgrade to break a business process, which could create a catastrophic event for the customer's system.

Self-Healing Capabilities
Building intelligence into your system makes detecting and categorizing certain issues possible. If there's a space issue and you're low on capacity, for example, you can build artificial intelligence (AI) so that the system extends storage automatically instead of waiting for a human to do it. This way, we're able to deal with repetitive issues in a way that takes the human out of the process.

A human does have to write the AI code that looks for certain conditions, but the self-healing approach gives us that continued illusion of uptime. We're able to save time and resources by having an intelligent program look for and fix problems in about a minute without service interruption, which is much less time than the 20 or more minutes a human might take.
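The storage example can be sketched as a simple rule in Python. Everything here is an assumption made for illustration: the Volume class simulates storage in memory, and the 85% threshold, 50 GB step and 1,000 GB ceiling are arbitrary. A real system would call its storage provider's API where this sketch just changes a number.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    capacity_gb: float
    used_gb: float

    @property
    def usage(self) -> float:
        return self.used_gb / self.capacity_gb


def heal(volume: Volume, threshold: float = 0.85,
         step_gb: float = 50, max_gb: float = 1000) -> bool:
    """Extend the volume while usage is above the threshold; return True if extended."""
    extended = False
    while volume.usage > threshold and volume.capacity_gb + step_gb <= max_gb:
        volume.capacity_gb += step_gb  # simulated; no real storage is touched
        extended = True
    if volume.usage > threshold:
        # The rule has hit its ceiling, so this one goes to a human.
        print(f"{volume.name}: still above threshold at {max_gb} GB limit, escalating")
    return extended


volume = Volume(name="db-data", capacity_gb=100, used_gb=95)
if heal(volume):
    print(f"{volume.name}: extended to {volume.capacity_gb} GB, usage {volume.usage:.0%}")
```

The ceiling is a deliberate design choice: the program handles the repetitive case on its own, but it hands off to a person rather than growing storage without limit.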

Discipline
When you look at change procedures, while painful to do, you need the proper checks and balances in place so that one minor error doesn't take the entire system down. Ultimately, you want to build a seamless DevOps infrastructure to automate all of your changes. This eliminates the possibility of human error. You want all the checks and balances built into your automated scripts.

Keeping up with uptime requires solid processes, redundancy, a great cloud provider, self-healing capabilities and automation. Transitioning into a DevOps culture is the ultimate path to an interruption-less environment. 

DevOps is a race to automate everything, which raises the question: How do you create that real-life easy button?
 