Understanding how Exchange Online runs


Understanding how Office 365 operates is always an interesting challenge because Microsoft usually doesn’t say too much about how things work within the service. But the Exchange Online folks are pretty good at telling us what they are up to, which brings us to the “Behind the Curtain: How we run Exchange Online” session at Microsoft Ignite in Chicago, featuring the talents of Vivek Sharma (Director of Office 365 product management) and Perry Clarke (VP of Exchange development). The session updated a similar one given at MEC in 2014 and is available on Microsoft’s Channel 9 service.

Both speakers are interesting people in their own right. In the past, Vivek had a lot to do with the implementation of PowerShell in Exchange 2007 and since then has focused on bringing Exchange Online from its beginnings, through BPOS, to where it is today. Perry is one of Microsoft’s deep thinkers. A conversation with him is likely to explore what we’ll all be doing in five years’ time and it’s obvious that he has some pretty solid ideas on that point. He wrote the foreword for our just-published “Office 365 for Exchange Professionals” eBook and said some nice things about us, much to the amusement of some of the members of the Exchange development group.

The session began with some comments from Perry about how cloud services are changing the way people think about technology. Typically, companies look at three factors to assess a technology: cost, risk, and user experience (or functionality). Perry maintains that the cost of an Exchange Online mailbox is at a point that no on-premises implementation can match, provided that costs are accurately calculated and everything is included. Part of the reason is that Microsoft’s massive buying power for datacenters, storage, servers, and network achieves price points that even the largest on-premises customer can only dream about.

A solid SLA track record (the most recent result was 99.99% for the first quarter of 2015) means that the perceived risk of companies putting their most important work on Office 365 is much less than it was four years ago when Microsoft launched the service. Finally, the functionality that can be delivered by a cloud service is so far ahead of what is possible for on-premises deployments because of the direct involvement of the engineering group (and some functionality, like Delve and Clutter, is only available in the cloud). In a nutshell, Perry advanced a case that cloud services are the only way to achieve the desired combination of cost, risk, and functionality for a technology like email.
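To put the 99.99% figure in perspective, the arithmetic below converts an availability percentage into the downtime it permits. This is only a sketch: it assumes a 90-day calendar quarter and treats the published figure as exact, whereas the reported number is rounded and Microsoft’s own SLA calculation may count downtime differently. The function name is my own.

```python
# Illustrative arithmetic only: convert an availability percentage into
# the downtime it permits over a period of whole days.

def allowed_downtime_minutes(availability_pct: float, days: int) -> float:
    """Minutes of downtime permitted in `days` days at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)

# Assumption: January to March 2015 is treated as a 90-day quarter.
q1_minutes = allowed_downtime_minutes(99.99, 90)
print(f"{q1_minutes:.2f} minutes")  # 12.96 minutes
```

In other words, a 99.99% quarter leaves room for roughly 13 minutes of unavailability across three months.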

Returning to the size and scale of Exchange Online, some data was offered to illustrate what Microsoft manages. It’s obvious from Microsoft’s financial results that they are enjoying growing revenue from commercial cloud services (the last quarter reported an annual growth of 106%). This growth is reflected in a 1350% increase in Exchange Online servers from 1 August 2012 to 23 April 2015 compared to the 600% increase reported in 2014. The massive increase in servers is required to maintain capacity and to allow Microsoft to absorb new tenants who move to Office 365.

Interestingly, Exchange Online distributes new software through a series of rings, each reaching a wider audience than the last. The rings are the developers, the Office 365 team, Microsoft in general, First Release Office 365 tenants, and finally, general availability. A similar approach is followed with the current Windows 10 insider program and is due to be used in the Windows Update for Business program announced at Monday’s Ignite keynote.
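The ring idea can be sketched in a few lines of Python. The ring names come from the session, but everything else here is hypothetical: the `roll_out` function, the `is_healthy` gate, and the build name are invented for illustration, and nothing was said about how Microsoft actually decides when a build may move to the next ring.

```python
from typing import Callable, List

# Ring order as described in the session, smallest audience first.
RINGS: List[str] = [
    "Developers",
    "Office 365 team",
    "Microsoft",
    "First Release tenants",
    "General availability",
]

def roll_out(build: str, is_healthy: Callable[[str, str], bool]) -> List[str]:
    """Deploy `build` ring by ring and stop at the first ring that fails.

    `is_healthy(build, ring)` is a hypothetical gate standing in for
    whatever validation the service really performs. The return value is
    the list of rings in which the build passed its check.
    """
    passed: List[str] = []
    for ring in RINGS:
        if not is_healthy(build, ring):
            break  # halt so that later, larger rings never see the build
        passed.append(ring)
    return passed

# Example: a build that misbehaves once it reaches all of Microsoft.
result = roll_out("build-123", lambda build, ring: ring != "Microsoft")
print(result)  # ['Developers', 'Office 365 team']
```

The point of the ordering is that a bad build is caught while its audience is still small, long before it gets near customer tenants.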

The increase in users means that Office 365 now deals with 55 billion client authentications annually. That kind of transactional volume cannot be handled unless the infrastructure scales efficiently.

Exchange Online uses 150 petabytes of storage, most of which is taken up by the four copies of the 1.2 million mailbox databases. The standard Office 365 mailbox quota is 50 GB, but naturally it takes time for users (maybe 90 million; Microsoft isn’t saying) to use this quota. The thought went through my mind of how many of the 8 TB 7200 rpm standard JBOD drives used by Office 365 fail daily and how Microsoft tracks and fixes all the failures. The answer to “how many” is “a lot”, and the management is done through a mixture of a very sophisticated service fabric and human intervention (to remove and replace the failed drives).
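Some back-of-the-envelope arithmetic helps to make these storage numbers tangible. The sketch below rests on assumptions of my own: decimal units (1 PB = 10^15 bytes), 1.2 million distinct databases each with four copies (rather than 1.2 million copies in total), all 150 PB devoted to database copies, and every drive being an 8 TB unit filled to capacity. The results are therefore an upper bound on average copy size and a lower bound on drive count, not figures quoted by Microsoft.

```python
# Decimal storage units, in bytes.
PB = 10**15
TB = 10**12
GB = 10**9

total_storage = 150 * PB
databases = 1_200_000        # assumed to be distinct databases
copies_per_database = 4
drive_capacity = 8 * TB

database_copies = databases * copies_per_database
# Upper bound: assumes every byte of the 150 PB holds database copies.
avg_copy_size_gb = total_storage / database_copies / GB
# Lower bound: ceiling division, assuming completely full 8 TB drives.
min_drives = -(-total_storage // drive_capacity)

print(database_copies)   # 4800000
print(avg_copy_size_gb)  # 31.25
print(min_drives)        # 18750
```

Even the lower bound of nearly nineteen thousand drives explains why “a lot” is the honest answer to the question about daily failures.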

The service fabric controls and manages operations flowing across the service and deals with more than 500 million events that are collected hourly. In addition, 250 million synthetic test transactions are generated daily to validate that the Exchange Online service is working properly. The signals gathered by the transactions are analyzed by computers to detect and fix problems, just like the Managed Availability system in Exchange 2013. There’s no surprise here because Managed Availability is an obvious example of technology transfer from the cloud to on-premises (even more technology is being transferred in Exchange 2016). Machine learning is applied to correlate signals and compare them against known sets (that represent a satisfactory condition) to allow engineers to triangulate and identify the particular root problem.
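The idea of comparing signals against a known-good set can be illustrated with a deliberately simple stand-in. The sketch below is not Microsoft’s method and is not machine learning: it applies a plain z-score test, flagging any signal whose current reading sits more than a chosen number of standard deviations from the mean of its known-good samples. The function, signal names, and values are all invented for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List

def find_anomalies(
    baseline: Dict[str, List[float]],
    current: Dict[str, float],
    threshold: float = 3.0,
) -> List[str]:
    """Return the names of signals whose current value is more than
    `threshold` standard deviations from the mean of their baseline."""
    anomalies = []
    for name, samples in baseline.items():
        mu = mean(samples)
        sigma = stdev(samples)
        value = current.get(name)
        if value is None or sigma == 0:
            continue  # no reading, or no variation to compare against
        if abs(value - mu) / sigma > threshold:
            anomalies.append(name)
    return anomalies

# Hypothetical signals and values, invented for illustration.
baseline = {
    "logon_latency_ms": [110, 120, 115, 118, 112],
    "mail_delivery_s": [2.0, 2.2, 1.9, 2.1, 2.0],
}
current = {"logon_latency_ms": 480, "mail_delivery_s": 2.1}
flagged = find_anomalies(baseline, current)
print(flagged)  # ['logon_latency_ms']
```

A real system would need to correlate many such signals to find a root cause, which is where the machine learning described in the session comes in.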

An automation and orchestration workflow engine is used to maintain servers. The most common problems are hardware (disks and controllers), network, and software bugs. Problems can be automatically fixed or left to engineers, who can set off workflow items to address issues. Processes such as server deployments and upgrades are also dealt with through workflow in a way that allows Exchange Online to bring new capacity online within days of deciding that it’s needed. In this respect, new capacity means something like an additional 40 Database Availability Groups rather than a single server.

A DevOps model is used to run Exchange Online. In other words, development engineers don’t simply throw code over the wall to operations and then switch off. Instead, members of the development group up to and including VP level are on call to handle problems that arise in the service. This ensures that engineers take responsibility for the code that they write for the service. If they get it right, happiness and undisturbed nights. But if they get it wrong…

I find sessions that provide an insight into the trials and tribulations of operating a massive multi-tenant environment very interesting and worthwhile. Although you can follow them later online, there’s nothing quite like hearing someone speak in person. This session reinforced my view that something very special occurs to make Office 365 operate. Take the time to view it once the video is posted and make your own mind up.

Follow Tony @12Knocksinna

About Tony Redmond

Lead author for the Office 365 for IT Pros eBook and writer about all aspects of the Office 365 ecosystem.