Microsoft recently posted the Q1 2016 performance against SLA for Office 365 and reported a 99.98% outcome, the same number that they posted for the two previous quarters. Overall, Office 365 service has been pretty consistent recently.
That’s not to say that Office 365 has been without its problems. Looking at the Service Health Dashboard (SHD) for any tenant is likely to turn up some issues for any given period. It’s the nature of a very complex infrastructure in a state of perpetual software and hardware updates that some glitches will occur.
However, the sheer size of Office 365 and the number of tenants and users it now supports mean that any single support incident or outage is unlikely to dent performance against SLA. At their Q3 FY16 analyst briefing, Microsoft said that Office 365 has 70 million active users, so the available time during a 91-day quarter is 9,172,800,000,000 user-minutes (70 million users x 91 days x 1,440 minutes per day). A two-hour outage that affects 1% of the user base (700,000 users) consumes 84 million available minutes and impacts the quarterly SLA by 0.00092%. Even a 12-hour outage affecting 5% of the user base (3.5 million users) only reduces the SLA by 0.02747%. Size definitely matters when it comes to SLA calculation.
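The arithmetic above can be reproduced in a few lines. What follows is a minimal Python sketch, assuming the same simplified model as the text: every user has every minute of the quarter available, and an outage removes minutes only for the affected users. The function names are my own, chosen for illustration, and are not part of any Microsoft tool or API.

```python
MINUTES_PER_DAY = 24 * 60


def available_minutes(users, days):
    """Total user-minutes available in the period (simplified model)."""
    return users * days * MINUTES_PER_DAY


def sla_impact_percent(users, days, outage_hours, affected_fraction):
    """Percentage points of SLA lost to one outage.

    affected_fraction is the share of users hit by the outage,
    for example 0.01 for 1% of the user base.
    """
    lost_minutes = users * affected_fraction * outage_hours * 60
    return 100 * lost_minutes / available_minutes(users, days)


if __name__ == "__main__":
    users, days = 70_000_000, 91  # figures quoted in the post
    print(f"{available_minutes(users, days):,}")  # 9,172,800,000,000
    # Two-hour outage hitting 1% of users
    print(f"{sla_impact_percent(users, days, 2, 0.01):.5f}%")  # 0.00092%
    # Twelve-hour outage hitting 5% of users
    print(f"{sla_impact_percent(users, days, 12, 0.05):.5f}%")  # 0.02747%
```

Because the denominator grows with the user count while a localized outage does not, the same incident costs proportionally less SLA as the service gets bigger.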
Despite the media hype that invariably occurs when an Office 365 outage is publicized, the reality is that most Office 365 issues are highly localized. A software update might fail to “flight” some functionality to some tenants or introduce a bug to a set of servers in a particular datacenter. The problem is bad for the affected tenants, but the vast majority of other tenants, including those with workloads running in the same datacenter, will be blissfully unaware that the support team is working on a problem.
Other utility services exhibit the same characteristics. A transformer failure cuts power to the houses and businesses that share the same circuit, but the overall network keeps on running. A burst water pipe reduces pressure to the places it serves, but taps everywhere else keep flowing.
It takes some time to get your head around what it means to run operations on a massive shared IT infrastructure. It’s a very different environment to a traditional on-premises infrastructure where a break in an essential component can stop service to everyone. Losing a network circuit because someone dug up a cable is a classic example of such a problem. Generally speaking, cloud services have multiple layers of redundancy built in to avoid the risk that problems introduced by the failure of a single component can spread. The structured and highly automated nature of operations, which is mandatory to manage hundreds of thousands of servers, also helps by eliminating issues that can be introduced through sloppy administration.
Microsoft offers a financially backed guarantee that Office 365 will attain an SLA performance of 99.9%. Soon after the launch of Office 365 in June 2011, they ran into some issues that caused major outages and had to pay out. However, as time goes on, the financial guarantee for the SLA looks like a pretty safe bet for Microsoft. The more users that Office 365 has, the less impact a single incident can have. It’s a nice position for them to be in.
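To get a rough feel for how much headroom the 99.9% guarantee leaves, here is a small Python sketch. It rests on the simplified user-minutes model used earlier in the post, with 70 million users and a 91-day quarter, and it treats the quarter as starting from 100% with a single incident. Those are assumptions of mine for illustration; the way Microsoft actually measures SLA performance and issues credits may differ. The function name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def max_outage_minutes(days, allowed_downtime_fraction, affected_fraction):
    """Longest single outage (in minutes) that stays within the SLA.

    allowed_downtime_fraction: 0.001 for a 99.9% target.
    affected_fraction: share of the user base hit by the outage.
    The user count cancels out in this simplified model, so it is
    not needed as a parameter.
    """
    period_minutes = days * MINUTES_PER_DAY
    return period_minutes * allowed_downtime_fraction / affected_fraction


if __name__ == "__main__":
    # Outage affecting every user: roughly 131 minutes in a 91-day quarter
    print(round(max_outage_minutes(91, 0.001, 1.0), 2))
    # Outage affecting 5% of users could last 20 times longer
    print(round(max_outage_minutes(91, 0.001, 0.05), 2))
```

In this model, only an incident that takes down everyone for more than about two hours, or a localized problem that drags on for days, would push a quarter below 99.9% on its own.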
Stay connected: Follow Tony @12Knocksinna