SAP’s offerings today include so-called On Premise (OP) applications, meant to run inside enterprises’ datacenters, and On Demand (OD) applications, which are SAP’s Software as a Service (SaaS) cloud solutions. The great opportunity ahead of SAP’s customers is to integrate their existing OP solutions with new, cloud-based functional extensions. A common concern about cloud usage is High Availability (HA) and whether and how it will be specified through service level agreements (SLAs). Unplanned outages at big public cloud providers always attract a lot of attention. Therefore, I’d like to make some simple comparisons of the HA of cloud solutions vs. the HA of enterprise datacenter applications.

Unplanned failure situations are hopefully very rare, random occurrences in either case. They are looked at with statistical methods. The ratio of unplanned downtime to total time per year is the typical definition of availability. The availability of a certain component, step or other element might be 99%, meaning that on average the application runs on 99 days out of 100. Sometimes, availability is even promised to be 99.999%. For such “five nines” availability the unplanned downtime would be only 86.4 seconds within 100 days of operation. It takes some effort and budget to make applications really highly available, so HA and Total Cost of Ownership (TCO) are a tradeoff decision. Each unplanned downtime causes losses to your business, and therefore the higher TCO of higher availability should be balanced against the losses expected from unplanned downtimes.
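Just to make the arithmetic tangible, here is a minimal Python sketch (my own illustration, not from any SAP tool) that converts an availability figure into the expected unplanned downtime per 100 days of operation:

# Convert an availability figure into expected unplanned downtime
# over a 100-day window, matching the numbers quoted above.

def downtime_seconds(availability, period_days=100.0):
    """Expected unplanned downtime in seconds for a given availability (0..1)."""
    period_seconds = period_days * 24 * 60 * 60
    return (1.0 - availability) * period_seconds

print(f"99%     -> {downtime_seconds(0.99) / 86400:.1f} day(s) down per 100 days")   # 1.0
print(f"99.999% -> {downtime_seconds(0.99999):.1f} seconds down per 100 days")       # 86.4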

Now, how do OP and OD solutions compare with regard to HA? Here is a first example:

  • Let’s assume there are 500 On Premise customers. Each of them made the investments to have their applications 99% available.
  • Then assume 500 customers are using an On Demand solution, which is also specified to be 99% available.

Do both cases have the same level of availability? The answer is yes, but there are still differences if you ask some follow-up questions:

How likely is it that all 500 OP customer systems are up and running well at the same time?

To answer this question one needs to know that individual probabilities have to be multiplied to calculate the overall probability. So the overall probability that all 500 OP customers are running well is 0.99^500 ≈ 0.0066, i.e. about 0.66%.

0.66% might strike you as surprisingly low, but trust me, that is the correct number. It means that practically never are all 500 customers running well simultaneously, despite their individual 99% availability. How does this make sense? If availability is 99%, it means that 1% of the 500 customers, i.e. 5 customers, are down on average, while 495 customers on average are running well.
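If you’d like to verify the numbers yourself, a few lines of Python (again just my illustration) reproduce both the 0.66% figure and the average of 5 customers being down at any time:

# Probability that all 500 independent OP systems, each 99% available,
# are up at the same moment, plus the expected number of systems down.

availability = 0.99
customers = 500

all_up = availability ** customers            # joint probability that all are up
expected_down = (1 - availability) * customers

print(f"All {customers} up simultaneously: {all_up:.2%}")        # ~0.66%
print(f"Average number down at any time:  {expected_down:.1f}")  # 5.0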

In contrast, 99% of the time all 500 OD (cloud) customers are simultaneously up and running well! That sounds much better than the 0.66% in the OP case, but it isn’t really. When the 1% OD downtime hits, all 500 OD customers experience the downtime simultaneously, far more than the average 5 out of 500 customers in the OP case.

So the real difference between OP and OD solutions is that OP downtimes are usually a trickle that never makes the news, whereas an OD outage is a short, sharp pain for everybody, which, since it is rare, draws attention and makes it into the news headlines. In terms of damage avoidance, both cases in this example are equal.

In a second example I’d like to look at a really complex business scenario, which might consist of many “Things” that can go wrong:

  • There are numerous hardware components like servers, network routers, firewalls, load balancers, storage …
  • There are multiple software components: applications like ERP and CRM …, middleware components like PI, databases …
  • Not least, there are multiple processing steps in an overall business process, which all need to function all the time.

A large number of “Things” all need to function simultaneously to complete a business process. So that we can reuse the math from above, assume there is a total of 500 “Things” involved in a complex business scenario. Each individual thing is a potential “single point of failure”: if it breaks, the business process breaks as a whole. Therefore, each thing needs to be considered with regard to the availability of the business process.

If you had 99% availability for each of the 500 things, the calculation above applies in the same way as before. The chance to run through a business scenario with 500 things made up of 99% available components and steps is only 0.66%. So practically, performing the whole scenario without at least one error almost never happens! And this is true for OP and OD solutions alike, because I only assumed 500 things involved and made no assumption about whether they are provided OP or OD. Having complex business scenarios almost never succeed is obviously a big problem, and there are two solutions for this problem:

First, you could increase the individual component availability and step up HA from 99% to, let’s say, 99.99%. Then the overall success rate becomes 0.9999^500 ≈ 0.951, i.e. about 95%.

This would be much better. If an almost 5% failure rate is still too much, just add another 9 to get “five nines” (99.999%) availability, but remember that this drives up your TCO even further. The High Availability way to get to 99.99% availability when individual things have 99% availability is to provide a redundant setup as shown in the picture below. Since availability is 99%, the un-availability of one component is 1%. For 2 redundant components the un-availability probabilities have to be multiplied to get the overall un-availability: 0.01 × 0.01 = 0.0001, i.e. 0.01% un-availability or 99.99% availability.

So with the HA redundant set-up you can increase the availability of things tremendously: two 99% available components together reach 99.99% availability. However, you’d need to double everything, and therefore your TCO will roughly double as well.

Figure 1: High Availability is usually achieved through a redundant set-up of each component such that any single point of failure is avoided. Special attention needs to be paid to the x-shaped inter-connectivity of components A and B so that every single failure of component A or B can be bypassed without loss of functionality. High Availability roughly doubles TCO compared to non-HA systems. A resiliency set-up avoids doubling TCO but needs more investment into built-in error recovery mechanisms, see text.
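The following small Python sketch puts the two calculations together, assuming as above that components fail independently: the availability of a redundant pair of 99% components, and the end-to-end success rate of a 500-thing scenario at different per-thing availability levels:

# A redundant pair is only down when both copies are down at the same time,
# so the un-availabilities multiply.

def redundant_availability(component_availability, copies=2):
    return 1.0 - (1.0 - component_availability) ** copies

print(f"Redundant pair of 99% components: {redundant_availability(0.99):.4%}")  # 99.9900%

# End-to-end success of a 500-"Thing" scenario for different per-thing availabilities.
things = 500
for availability in (0.99, 0.9999, 0.99999):
    print(f"{availability} per thing -> {availability ** things:.2%} of scenario runs succeed")
# 0.99    ->  0.66%
# 0.9999  -> 95.12%
# 0.99999 -> 99.50%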

So what is the second alternative? It is “Resiliency”, the ability to recover from temporary failures or through some explicit error handling and error correction. As before, in the 99% availability case only a small number of steps will fail on average when performing a business scenario. You’d pass on average 495 “Things” successfully and only 5 would go wrong.

Let’s look at one failed step: how likely is it that it would fail twice when executed two times in a row? Since the failures are independent, the probability is 0.01 × 0.01 = 0.0001, i.e. only 0.01%.

As before, if you perform a step a second time when you experience an error the first time, overall availability rises from 99% to 99.99%. A minor catch is that about 1% of the things need to be re-done, which increases overall processing times by roughly 1% on average. In most cases this is far too little to matter at all.
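To double-check the retry effect, here is a small Monte Carlo sketch (illustrative only, with failures drawn independently at random): one retry per failed step lifts the step availability from 99% to roughly 99.99%, at the price of about 1% extra executions.

import random

random.seed(42)                 # fixed seed just to make the run repeatable
trials = 1_000_000
fail_rate = 0.01                # each attempt of a step fails 1% of the time

failures_after_retry = 0
retries = 0
for _ in range(trials):
    if random.random() < fail_rate:         # first attempt failed
        retries += 1
        if random.random() < fail_rate:     # the retry failed as well
            failures_after_retry += 1

print(f"Step availability with one retry:   {1 - failures_after_retry / trials:.4%}")  # ~99.99%
print(f"Share of steps that needed a retry: {retries / trials:.2%}")                   # ~1%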

A bigger issue is that for each thing you’d need to implement re-try capabilities, and that has some prerequisites. The first prerequisite is that you link things together into the whole business process in a “loosely coupled” way. By that I mean that you build in mechanisms to re-try a failed linkage at some later time. If a link partner is not reachable, then its 99% availability tells you that it should be available again after a short wait. If the things are network routers connected to each other, some network protocols do the re-try automatically for you. An example in the application-to-application (A2A) integration world would be so-called “asynchronous” communications, which are expressed in the following official BBA guideline (see http://wiki.sdn.sap.com/wiki/display/BBA/Loose+Coupling+Through+Web+Services ):

SOA-WS-1

SAP recommends implementing remote consumption of business functionality using loosely coupled, asynchronous, stateless communication using web services. ….

You’d need asynchronous communications to de-couple 2 subsequent steps from each other so that individual re-tries can be done without the previous step needing to be redone as well. Once you implement loose coupling there are implicit prerequisites: asynchronous communications need buffers or queues on both the sending and the receiving end, which is more effort to program and drives up operating costs through higher server main memory demands as well. But compared to doubling everything in the HA setup case, there seems to be a smaller price tag on resiliency, right?
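As a rough illustration of what such loose coupling can look like, here is a deliberately simplified Python sketch; the outbound queue, the flaky receiver and all names are hypothetical and not an SAP API. The sender only appends to a local queue and never waits for the receiver; a separate delivery loop re-tries failed transmissions in later rounds instead of reporting an error back to the sender.

import random
from collections import deque

random.seed(1)

outbound_queue = deque()        # buffer on the sending side

def send_async(message):
    """The sender never waits for the receiver; it only enqueues."""
    outbound_queue.append(message)

def flaky_receiver(message):
    """Stand-in for the remote system, reachable about 99% of the time."""
    return random.random() < 0.99

def delivery_loop(max_rounds=10):
    """Try to deliver everything; whatever fails stays queued for a later round."""
    for _ in range(max_rounds):
        if not outbound_queue:
            return
        for _ in range(len(outbound_queue)):
            message = outbound_queue.popleft()
            if not flaky_receiver(message):
                outbound_queue.append(message)   # re-try in a later round

for i in range(500):
    send_async(f"expense report {i}")
delivery_loop()
print(f"Undelivered after re-tries: {len(outbound_queue)}")   # almost always 0

In a real A2A integration the buffer would of course be persistent middleware (PI, for instance) rather than an in-memory queue, but the decoupling principle is the same: the sender’s success does not depend on the receiver being reachable at that very moment.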

High Availability and Resiliency are two different methods to reach the same goal, let’s call it high “Reliability” of business process execution.

Which one is better depends on your total cost of development (TCD) vs. total cost of ownership. As said, loose coupling needs more effort on the application development side. Tight coupling, like synchronous ABAP RFC calls, is just so much easier for programmers to use. If you can afford the higher development costs of loose coupling for integrations, then the resiliency approach is much better. If hardware and operation costs are cheap, it might be better to focus on an end-to-end high availability setup and save on development costs.

In traditional enterprise datacenter OP systems, High Availability is often preferred. The number of things to double is usually small. Also, TCD and TCO are decided on by different corporations: TCD matters to the software vendor, whereas TCO matters to its customers. This makes it a bit more difficult to find the right TCD/TCO trade-off point.

For OD or SaaS providers, TCD and TCO are both on their side and folded together into their customers’ subscription fees. Therefore a SaaS vendor has a better opportunity to decide on the best TCD vs. TCO trade-off and can better decide between Resiliency and High Availability approaches. The mixed case of integrating On Premise with On Demand applications for overarching business scenarios is again a bit more complicated. Hybrid, cloud-spanning business processes tend to be more complex and therefore more error prone in the first place, and they involve distributed ownership of “things” between the OP customer and the OD vendor. What would be particularly difficult is to implement support services like holistic end-to-end monitoring across OP and OD systems. Therefore I’d think any steps in between OP and OD systems should be loosely coupled and designed for Resiliency. Imagine you wanted High Availability between OP and OD applications: strictly speaking you’d need two SaaS vendors for the same application to guarantee fail-over capabilities, which is not very feasible.

Is resiliency good enough to satisfy the business users?

I’d think this is the case in most instances. Let’s look at this last example: expense reporting. In the overall process of expense reporting, an employee might enter his or her expense data in an OD cloud application. Then the OD application sends the expense data asynchronously to an OP finance system. If the communication between the OD and OP side fails, the end user would not even notice it. He or she would still be able to enter expense data. Only if the OP-OD link were broken permanently would they wonder when their expenses get reimbursed into their bank account. But if the link is broken only temporarily, the process can be completed eventually. Even a temporary outage of a whole day wouldn’t matter to most end users in this use case.

In summary, I’d recommend considering the different approaches of High Availability and Resiliency with regard to total cost of development and total cost of ownership. In complex business scenarios you’ll likely have a large number of potential single points of failure, and you might decide on a mix of HA and Resiliency approaches. For OP-to-OD integrations the resiliency approach should be chosen, due to the very high cost of, or lack of, OD provider options for High Availability.