In episode 5 of “The UC Architects”, Steve Goodman brought up the topic of the (non)sense of Active/Active multi-site DAGs. That discussion – which in essence is about High Availability – was carried forward into episode 6, where I had an interesting talk with fellow UC Architects John Cook and Mahmoud Magdy.

I was actually somewhat surprised to hear that most of them were facing the same issues when designing a highly available environment for Exchange, or any other product for that matter. And (for once) I’m not talking about technical issues or difficulties. Instead, I’m pointing to the numerous discussions with the business where you try to explain what high availability is or, even better, how to approach their demands.

Therefore, I wanted to create this article in which I express my personal take on the concept of “high availability”, more specifically in the area of Exchange deployments, simply because that is my main occupation. Nonetheless, I’m sure that some parts will also apply to the broader picture.

This article represents some of my darkest inner thoughts (frustrations?) as well as some real-life experiences, interspersed with what you’ll hopefully find to be useful tips.

Note   I understand that, in the real world, you’ll be faced with things that don’t always relate to what I’m writing. Hell, I even encounter that regularly! My ramblings below are rather meant as theoretical insight which you can use to start your work in the battlefield!

What is High Availability?

According to Wikipedia, High availability is…

…a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.

Before reading further, re-read the Wikipedia definition!

I do not always agree with Wikipedia, but there’s an important truth in this definition: measurability. Although a highly available system is built from (individual) (technical) components, the effectiveness of a design/implementation can only be proven if it’s measurable. For it to be measured, you at least need a reference (a Service Level Agreement) to compare the measured value against (and determine whether you were effective or not), as well as a common agreement on what you are going to measure and how.

Usually, the effectiveness is expressed in uptime. This uptime is represented by a percentage of a given (pre-defined) timeframe. For instance: 99% within one year. Of course, that timeframe could just as well be a day, a week, a month, …
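To make that measurability a bit more tangible: once you and the business agree on a timeframe, turning recorded downtime into an uptime percentage is simple arithmetic. Below is a minimal Python sketch; the outage figures are made-up examples, not real measurements.

```python
# Minimal sketch: turn recorded outage time into an uptime percentage
# for an agreed measurement period (here: one year).

MEASUREMENT_PERIOD_HOURS = 365 * 24   # agreed timeframe: one year

# Hypothetical outages recorded during the period, in hours
outages = [0.5, 2.0, 1.25]

downtime = sum(outages)
uptime_pct = (MEASUREMENT_PERIOD_HOURS - downtime) / MEASUREMENT_PERIOD_HOURS * 100

print(f"Total downtime: {downtime:.2f} hours")
print(f"Measured uptime: {uptime_pct:.3f} % of the agreed period")
# -> Measured uptime: 99.957 % of the agreed period
```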

Nowadays, even the smallest deployments tend to have some form of high availability. Why? Simply because they can. Ever since Exchange Server 2007 (and even more so with 2010), highly available messaging infrastructures have become a viable and affordable choice for most enterprises. I believe that is a good thing. At least to some degree.

Email – let’s face it – has become (or is in the process of becoming) a commodity. Having to live/work without it becomes increasingly difficult: some companies rely (entirely) on email for their business to keep running. Now, if that isn’t a compelling reason to make sure that your design is highly available, I don’t know what you’ve been smoking…

On the other hand, the question you should ask yourself is: is email really that business critical? By default, most enterprises will answer that question with “yes”. As a consultant, you could happily accept that answer and move on… But that wouldn’t get you far, would it?

Instead, try asking how critical email is to their business. You’d be surprised by the variety of answers you might get. It gets even worse when asking what the uptime of the system(s) should be. I guarantee you that when you ask how much a system must be up and running, you’ll hear “as much as possible” or even “all the time” over and over again. While these answers might seem perfectly logical to the business, they prove rather useless when designing a (messaging) solution. They’re useless because none of these statements is really measurable. Question: have you ever tried measuring “as much as possible”?

I did… And surprisingly enough, according to myself, I always meet the expectations. Simply because I always make sure that a system is available as much as possible. If not, I wouldn’t be trying hard enough. Even if a system failed every single day, over and over (e.g. due to crappy hardware), I would still consider that it was available “as much as possible”, because I did what was expected: I did everything I could (up to a certain degree) to keep the system up and running. Except, I’m pretty sure that the business owner(s) see it otherwise…

I think that, by now, you get where I’m going. The example above just pointed out that perception is reality and that perceptions (IT vs. business) might differ (hello Matrix!). So, in a way, it’s always a good thing to try to get a “number” from the business, even if that might not always be possible…

Now, before heading off with the idea of “selling” 100% uptime, consider the following:

Uptime         Outage duration per timeframe (e.g. 1 year)
99%            3.65 days
99.5%          1.83 days
99.9%          8.76 hours
99.95%         4.38 hours
99.99%         52.56 minutes
99.999%        5.26 minutes
99.9999%       31.5 seconds
99.99999%      3.15 seconds
100%           0 seconds

When sitting down with the business to define the uptime, always keep these numbers in mind! It’s your duty to explain what the consequences of a given choice are: costs tend to grow exponentially the closer you try to get to 100% availability.
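If you want to reproduce or extend the numbers in the table above, the conversion from a target uptime percentage to an allowed outage budget is straightforward. Here’s a small Python sketch (the list of percentages simply mirrors the table):

```python
# Sketch: convert a target uptime percentage into the maximum allowed
# outage per year, as listed in the table above.

SECONDS_PER_YEAR = 365 * 24 * 3600

def outage_budget(uptime_pct: float) -> float:
    """Allowed downtime per year (in seconds) for a given uptime percentage."""
    return SECONDS_PER_YEAR * (1 - uptime_pct / 100)

for pct in (99, 99.5, 99.9, 99.95, 99.99, 99.999, 99.9999, 99.99999, 100):
    seconds = outage_budget(pct)
    if seconds >= 86400:
        print(f"{pct:>9} %  ->  {seconds / 86400:.2f} days")
    elif seconds >= 3600:
        print(f"{pct:>9} %  ->  {seconds / 3600:.2f} hours")
    elif seconds >= 60:
        print(f"{pct:>9} %  ->  {seconds / 60:.2f} minutes")
    else:
        print(f"{pct:>9} %  ->  {seconds:.2f} seconds")
```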


The question that now arises is: “what should be measured?”. For instance, measuring the uptime of a server doesn’t seem like a good idea. While that information might prove useful for statistics, a single server’s uptime doesn’t necessarily say anything about the availability of a given functionality. And that’s exactly what we want: to measure the availability of a functionality offered to the business.

When defining the SLAs, more specifically the uptime of a system, try disconnecting what should be measured from the (technical) layer below: stop thinking in components (e.g. “Mailbox”, “CAS”, …); start thinking “Sending/Receiving Emails” or “Internal access to emails”. At least for now. In a later stage of the process it’s your task to translate these requirements into the correct technical architecture.
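To give an idea of what measuring a functionality (rather than a server) could look like, below is a minimal, hypothetical Python probe that periodically checks whether “sending email” is available by verifying that an SMTP endpoint responds. The host name and interval are assumptions for the sake of the example; a real measurement would of course need to cover more scenarios (end-to-end delivery, client access, …) and feed into proper monitoring.

```python
# Hypothetical probe: measure the availability of a *functionality*
# ("can we reach the SMTP service?") instead of the uptime of one server.
# The endpoint below is an assumption; substitute your own.

import smtplib
import time

SMTP_HOST = "mail.example.com"   # hypothetical endpoint
SMTP_PORT = 25
CHECK_INTERVAL = 60              # seconds between probes

def smtp_available(host: str, port: int) -> bool:
    """Return True if the SMTP endpoint answers a NOOP within 10 seconds."""
    try:
        with smtplib.SMTP(host, port, timeout=10) as smtp:
            code, _ = smtp.noop()
            return code == 250
    except (OSError, smtplib.SMTPException):
        return False

checks = failures = 0
try:
    while True:
        checks += 1
        if not smtp_available(SMTP_HOST, SMTP_PORT):
            failures += 1
        print(f"Availability so far: {(checks - failures) / checks * 100:.2f} %")
        time.sleep(CHECK_INTERVAL)
except KeyboardInterrupt:
    pass
```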

Conclusion

That’s it for part 1 of this article. In the second part, I’ll continue with the process of defining the (business) requirements and translating them into technical requirements and, ultimately, an architecture.