Exchange 2010 SP1: Protecting data with a BSOD

In conversation with an administrator of one of the world’s largest Exchange 2010 deployments, we talked about new Store behaviour in Exchange 2010 SP1 that can lead to Exchange forcing an abrupt termination of the server (a blue screen of death with a CRITICAL_OBJECT_TERMINATION).

According to MSDN, the CRITICAL_OBJECT_TERMINATION bug check “has a value of 0x000000F4. This indicates that a process or thread crucial to system operation has unexpectedly exited or been terminated.” It then goes on to emphasize that “If you have received a blue screen error, or stop code, the computer has shut down abruptly to protect itself from data loss. A hardware device, its driver, or related software might have caused this error.”

Why might Exchange 2010 SP1 force such a dramatic termination of a server? The answer is that new code introduced in SP1 monitors “hanging I/Os”. These I/Os occur when disks are too busy to handle the I/O load generated on the server, either because the storage is simply not capable of handling the load or because a suboptimal configuration slows I/O in some conditions. Exchange looks for hanging I/Os because it wants to protect databases against potential data loss. An I/O that has not completed is a concern because it may involve essential data, and a hanging I/O may never complete at all: it might be in that charming condition called “hung”, lost in cyber never-land.

It’s worth noting that Exchange is not the only component that can force a Windows 2008 R2 server into a BSOD, as Failover Clustering will also force a server bugcheck under certain conditions that Windows considers unrecoverable without a reboot.

Exchange 2010 SP1 applies a threshold of sixty seconds when it looks for hanging I/Os. In other words, if an I/O has not completed in sixty seconds, it becomes a concern. A failure item is generated and is reported through the crimson (high priority critical incidents) channel to the Active Manager, which then takes whatever action seems appropriate. Hopefully the crisis will pass and the problem I/O will complete to allow Active Manager to cease worrying. However, if the problem persists and the I/O is still hung after four minutes (four times the threshold), Active Manager will force a restart of the server.
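To make the two time limits concrete, here is a minimal Python sketch of how a single I/O's age maps onto the stages described above. It is only an illustration of the numbers in this post, not Exchange's actual code: the function name, the state labels, and the choice to treat an I/O as hung at exactly sixty seconds are all my own assumptions.

```python
# Illustrative only: models the thresholds described in the post,
# not the real Store/Active Manager implementation.
HUNG_IO_THRESHOLD_S = 60   # an I/O older than this becomes a concern
RESTART_MULTIPLIER = 4     # restart is forced at four times the threshold


def classify_io(age_seconds: float) -> str:
    """Return a hypothetical state label for one outstanding I/O."""
    if age_seconds >= HUNG_IO_THRESHOLD_S * RESTART_MULTIPLIER:
        return "restart"        # Active Manager forces the server restart
    if age_seconds >= HUNG_IO_THRESHOLD_S:
        return "failure-item"   # failure item raised, Active Manager watching
    return "ok"                 # still within normal limits


if __name__ == "__main__":
    for age in (30, 60, 239, 240):
        print(age, classify_io(age))
```

The point of the sketch is simply that there are three minutes between the first warning and the forced restart for a single hung I/O.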

Again, the intention is to protect against data loss and that’s a good thing. The problem is the guillotine nature of the action. A BSOD is not a good thing in any administrator’s mind and Exchange provides no evidence to assure the administrator that the action was taken for good reason. ESE is supposed to write events in the application event log when it encounters hanging I/Os, and any indication of ESE flagging a delayed I/O is worthy of investigation as it may be evidence of a hardware problem that’s about to become worse. However, the administrators I have spoken to say that they’ve not seen ESE events in the application event log preceding the bugchecks. Microsoft says that the application event log should be updated after the problem is detected and before the failure item that causes the bugcheck is issued. It’s entirely possible that this is how things normally flow and that something in this situation prevented the events from being logged. A contact in the Exchange engineering group backs up this feeling by pointing out that the vast majority of problems that lead to a BSOD result from a complete outage in the storage stack, so it’s logical (but unsettling) that events cannot be written into the log.

Even more interesting is how Active Manager interprets the criteria for when it forces a restart if multiple hanging I/Os are detected. We’ve already seen the simple condition when a single I/O is hung for four minutes – but if two I/Os are deemed problematic, then the restart will happen after two minutes. The bugcheck is provoked by the MSExchangeRepl process, which terminates Wininit.exe.

My original information, based on observations from some very experienced administrators, was that the threshold for Exchange to invoke a BSOD changed in line with the number of hung I/Os. In other words, if four I/Os are in difficulty, the restart happens after a minute. And if your storage system is completely hosed and you have ten hung I/Os caused by something like a buggy storage driver, you would have the opportunity to watch a server reboot after 24 seconds. According to experience in the field, the real threshold for a server BSOD seemed to be 240 seconds (four minutes) divided by the number of hung I/Os. However, shortly after the original post appeared, the Exchange team got in touch with me to say that the threshold is altered by the number of hung I/Os, but only up to two. As stated above, you can see a BSOD after two minutes if you have multiple hanging I/Os.
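The difference between the two accounts is easy to see in a few lines of Python. This is a sketch under stated assumptions, not anything taken from Exchange itself: the function names are hypothetical, and I have read “only up to two” as meaning that the divisor is capped at two.

```python
# Illustrative only: compares the two descriptions of the restart threshold.
BASE_RESTART_S = 240  # four minutes for a single hung I/O


def field_observed_deadline(hung_ios: int) -> float:
    """Threshold as first reported from the field: 240 s / number of hung I/Os."""
    if hung_ios < 1:
        raise ValueError("need at least one hung I/O")
    return BASE_RESTART_S / hung_ios


def corrected_deadline(hung_ios: int) -> float:
    """Threshold per the Exchange team's correction (assumed: divisor capped at 2)."""
    if hung_ios < 1:
        raise ValueError("need at least one hung I/O")
    return BASE_RESTART_S / min(hung_ios, 2)


if __name__ == "__main__":
    for n in (1, 2, 4, 10):
        print(n, field_observed_deadline(n), corrected_deadline(n))
```

Under the corrected description, ten hung I/Os give you the same two minutes as two hung I/Os, rather than the 24 seconds that the field observations suggested.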

It’s absolutely correct that software should include automatic protection to ensure that data is not affected by hardware problems. It’s also correct that any storage which exhibits so many problematic I/Os that it provokes Active Manager to force a bug check is not a good candidate to support an enterprise application such as Exchange. However, the problem is that a BSOD is a drastic way of cleaning up the hung I/Os that doesn’t leave much evidence behind for the administrator to understand what caused the bug check. It’s kind of like cutting an arm off because it has developed gangrene and immediately burning the offending limb so that doctors can’t see the rotting flesh (OK, I know this is not a nice analogy).

The word is that Microsoft is looking at the situation with a view to making the behaviour more administrator-friendly. This is a good thing and I look forward to seeing the result of their work. I wonder if the upshot will be to replace the BSOD with a big red warning label saying something like:

“Duh! Active Manager thoughtfully crashed your server because your storage failed. Don’t panic! Exchange knows what it is doing and has protected all your data.”

Suitably equipped with skull-and-crossbones, this would be a more interesting error message than a bug check.

– Tony

Want to read more about the changes Microsoft has made to the Information Store in Exchange 2010 SP1? See chapter 7 of Microsoft Exchange Server 2010 Inside Out, also available at Amazon.co.uk or in a Kindle edition. Other e-book formats for the book are available from the O’Reilly web site.

5 responses to “Exchange 2010 SP1: Protecting data with a BSOD”

Brian Desmond

February 10, 2011

There’s a bunch of things that can cause that same crash, but, in this case, Exchange (the replication service) is killing wininit which leads to the crash.

Tony Redmond (“Thoughts of an Idle Mind”)

February 10, 2011

Absolutely correct. It is the MSExchangeRepl service (the Microsoft Exchange Replication service, not to be confused with the Microsoft Exchange Mailbox Replication service, or MRS) that forces the crash.

TR

Isaac

November 28, 2011

Rollup 6 was supposed to have ‘fixed’ this issue but we are still having it on one site. The server is sitting on XenServer 6 and I have not been able to locate any issue with the storage… what a PITA.
  