Perhaps the Exchange developers were unaware of the law of unintended consequences when they decided to change Exchange’s load balancing requirement from layer 7 to layer 4 of the network stack. For the most part, the change is wonderful. For some, especially those who have to manage systems that cater for large numbers of incoming connections, it creates an interesting question about protocol handling that deserves some attention in planning for the deployment of Exchange 2013. Such was the question addressed by Greg Taylor when he talked about load balancing options at The Experts Conference event in Barcelona last month. It’s taken me a while to decipher the scrawled notes that I took when Greg spoke…
Anyone dealing with high-end deployments based on Exchange 2007 and Exchange 2010 will be all too aware of the need to manage incoming connections carefully. Typically, the solution involved a hardware-based load balancer (running on physical or virtual hardware) that terminated the incoming SSL connection and then sent it on to one of an array of Client Access Servers that then processed the connection and directed it to the correct mailbox server. Horror stories about attempting to use software-based solutions such as the late but not-at-all lamented ISA Server to handle connections and the utter failure that ensued because of various limitations (not the least being that ISA is a 32-bit application) drove deployment teams to implement hardware-based systems such as F5 Networks’ BIG-IP, a solution for which I have a lot of respect. Since then we’ve seen the advent of virtualized load balancers suitable for low- to medium-scale deployments. Those made by Kemp Technologies seem to be quite popular among Exchange 2010 administrators.
Exchange 2013 greatly simplifies the area of load balancing. You will still need to deploy hardware-based load balancers in situations where high availability is required, but Exchange 2013 supports solutions such as Windows NLB and round-robin balancing to cater for lower-end deployments. All of this is because the Exchange 2013 CAS does not do the rendering and protocol handling that its predecessors did. Instead, the Exchange 2013 CAS simply proxies connections to the appropriate mailbox server, which does all the real work. The idea here is to break the version linkage that previously existed between CAS and mailbox, insofar as you couldn’t upgrade one without the other; version independence is a big theme for future versions of Exchange and if all goes well, you’ll be able to upgrade different parts of the infrastructure in the knowledge that the new components won’t break anything running on the old bits.
Simplification is always good in computer technology as complexity invariably leads to additional cost, confusion, and potentially poorer results. However, any change has consequences and one of those that flows from the move to L4 is the loss of protocol awareness. When a load balancer terminates an incoming SSL connection at L7, it is able to sniff the packets and figure out what protocol the connection is directed to. Exchange has a rich set of protocols including Exchange Web Services (EWS), Outlook Web App (OWA), ActiveSync (EAS), the Offline Address Book (OAB), and Exchange Administration (ECP), each with an endpoint that is represented as an IIS virtual directory. But when an L4 load balancer handles a connection, all it sees is traffic going to TCP port 443 on the IP address for the external connectivity point (such as mail.contoso.com). Later on the CAS will sort out the connection and get it to the right place, but that’s too late for the load balancer to have any notion of protocol awareness.
The problem is that a target CAS might be sick. Worse again, it might be sick for only one protocol. Exchange 2013 managed availability attempts to automatically resolve issues like this by taking actions such as recycling an application pool or even rebooting a server. But an L4 load balancer sees a CAS as a whole rather than having the ability to deal with different protocols, some of which are healthy and some of which might not be so good. With L7, the load balancer would be aware that OWA is up but EAS is down on a specific target CAS and be able to take action to redirect traffic as individual protocols changed their status.
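To make the difference concrete, here is a minimal Python sketch (not something that Exchange or any load balancer vendor ships) that contrasts the two views. The server names, the PROTOCOL_PATHS table, and the stubbed fake_probe function are all assumptions made for illustration; a real probe would issue an HTTPS request against each virtual directory.

```python
from typing import Callable, Dict, List

# Assumed IIS virtual directory paths for the protocols discussed above.
PROTOCOL_PATHS: Dict[str, str] = {
    "owa": "/owa",
    "eas": "/Microsoft-Server-ActiveSync",
    "ews": "/ews",
    "oab": "/oab",
    "ecp": "/ecp",
}

Probe = Callable[[str, str], bool]  # (server, path) -> healthy?


def whole_server_pool(servers: List[str], probe: Probe) -> List[str]:
    """L4-style view: one check per server, so a CAS is simply in or out."""
    return [s for s in servers if probe(s, "/")]


def per_protocol_pools(servers: List[str], probe: Probe) -> Dict[str, List[str]]:
    """L7-style view: one check per server per protocol."""
    return {
        proto: [s for s in servers if probe(s, path)]
        for proto, path in PROTOCOL_PATHS.items()
    }


# Hypothetical state: IIS answers on both servers, but EAS is down on cas2.
BROKEN = {("cas2", "/Microsoft-Server-ActiveSync")}


def fake_probe(server: str, path: str) -> bool:
    """Stand-in for a real HTTPS health check."""
    return (server, path) not in BROKEN


servers = ["cas1", "cas2"]
print(whole_server_pool(servers, fake_probe))          # ['cas1', 'cas2']
print(per_protocol_pools(servers, fake_probe)["eas"])  # ['cas1']
print(per_protocol_pools(servers, fake_probe)["owa"])  # ['cas1', 'cas2']
```

In this toy scenario the whole-server check keeps cas2 in rotation even though ActiveSync is broken on it, while the per-protocol view removes cas2 for EAS only and leaves it serving OWA.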
You might not be too worried about this at all, as you think that an essentially stateless CAS (for this is what the Exchange 2013 version is) won’t fail too often and anyway, if one protocol fails it’s likely to reflect a server-wide problem. There’s a certain logic in this position, but at the higher end you might be in the position where it becomes important to be able to exercise selective control over individual connections going to specific protocols.
One way of achieving selective control is to publish specific connectivity points for each protocol as part of your external namespace. In other words, instead of having the catch-all mail.contoso.com, you’d have a set of endpoints such as eas.contoso.com, ecp.contoso.com, owa.contoso.com, and so on. The advantage here is that the L4 load balancer now sees protocol-specific inbound connections that it can handle with separate virtual IPs (VIPs). The load balancer can also monitor the health of the different services that it attaches to the VIPs and make sure that each protocol is handled as effectively as possible. The disadvantage is that you have more complexity in the namespace, particularly in terms of communication to users, and you have to make sure that the different endpoints all feature as alternate names on the SSL certificates that are used to secure connections. None of this is difficult, but it’s different than before. What you gain from the work done to transition from L7 to L4, you lose (a little) on extra work and perhaps the cost of some extra certificates.
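As a rough sketch of the bookkeeping this involves, the Python below maps each protocol-specific name to its own VIP and checks that a certificate’s alternate names cover every published endpoint. The hostnames come from the example above; the VIP addresses (taken from the 192.0.2.0/24 documentation range), the sample certificate contents, and the missing_from_certificate helper are hypothetical.

```python
from typing import Dict, Iterable, List

# Protocol-specific namespace: hostname -> VIP on the L4 load balancer.
# Addresses are placeholders from the 192.0.2.0/24 documentation range.
ENDPOINT_VIPS: Dict[str, str] = {
    "owa.contoso.com": "192.0.2.10",
    "eas.contoso.com": "192.0.2.11",
    "ecp.contoso.com": "192.0.2.12",
}


def missing_from_certificate(
    endpoints: Iterable[str], san_names: Iterable[str]
) -> List[str]:
    """Return the published hostnames that the certificate does not list.

    Only exact, case-insensitive matches are considered; wildcard entries
    are deliberately ignored to keep the sketch short.
    """
    covered = {name.lower() for name in san_names}
    return sorted(h for h in endpoints if h.lower() not in covered)


# Hypothetical certificate issued before ecp.contoso.com was published.
cert_sans = ["mail.contoso.com", "owa.contoso.com", "eas.contoso.com"]
print(missing_from_certificate(ENDPOINT_VIPS, cert_sans))  # ['ecp.contoso.com']
```

A check along these lines is cheap to run whenever a new protocol-specific name is added to the namespace, which is exactly when a forgotten alternate name is most likely.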
We haven’t yet seen much advice published by the vendors of load balancers to provide platform-specific guidance on this issue. I’m sure that it will provoke an interesting debate when the advice arrives!
Follow Tony @12Knocksinna
Another great post, although I was aware of this information. We used F5 load balancers with our Exchange 2007 deployment and after working with the network guys we got that configuration to work well. However, one of the things we never got to work is creating a good F5 health check of the CAS server. Currently, the F5 health check is a simple check to verify that the IIS web server is running on the CAS server, which isn’t a very good health check, as you describe. And yes, we have encountered a situation where one CAS server had the ActiveSync IIS pool down and traffic kept being sent to that CAS because the F5 thought the CAS server was OK because IIS responded. The final solution was to do separate monitoring of IIS pools within each CAS server, and that works fine, but a better and automatic solution would be for the F5 to know and stop sending traffic to that CAS. I’d love to know what others have done in terms of creating F5 health check scripts to do a more intelligent health check of the server from the F5 rather than a simple GET check of IIS (which tells you that IIS is working on this server but not whether the individual service pools for OWA, EAS, and so on are working).