Mailbox quarantining in Exchange 2010 and Exchange 2013


This is another article that was never published by WindowsITPro.com (possibly because I contribute so many posts to my “Exchange Unwashed” blog on the site), so I make it available here with the caveat that this text has not been tech-edited or checked over in the same way that an article would be. Enjoy!

Follow Tony @12Knocksinna

Automatic mailbox quarantine

Software developers always want to do the best thing for their product. The Exchange developers follow this truism and usually do a fine job of creating new functionality that’s easy to understand and useful in practice. The bad old days when features were implemented only when an administrator applied an undocumented and known-to-just-a-select-few registry hack are thankfully long gone. And then you meet automatic quarantining of mailboxes as implemented in Exchange 2010 SP1 onward, including Exchange 2013.

The idea behind automatic quarantining of mailboxes is excellent. Essentially, it’s designed to detect clients that are taking up too much of the Store’s attention because something is going wrong. MAPI clients like Outlook use multiple threads within the Store process when they connect to mailboxes. If one or more of these threads “freeze” for some reason, they can cause the Store to consume more CPU than it should in an attempt to service the thread. The problem might be caused by corrupt mailbox data or a software bug in either the client or Store process or some other reason such as network failure. In any case, it’s a bad thing if threads freeze or terminate abnormally.

Automatic processing of problematic mailboxes

Software being software, a thread can go bad from time to time and the Store isn’t too concerned if a solitary thread wanders away into nothingness occasionally. However, if more than five threads connected to a mailbox freeze for more than 60 seconds, or three threads crash within a two-hour period, then the Store considers the mailbox to be in an abnormal state.
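To make the two conditions concrete, here is a minimal Python sketch of the check as it is described above. It is not Exchange code: the function name, its parameters, and the constants are hypothetical and simply mirror the numbers quoted in this article.

```python
from datetime import datetime, timedelta

# Numbers taken from the description above; the names are illustrative only.
FROZEN_THREAD_LIMIT = 5                   # more than five frozen threads
FREEZE_DURATION = timedelta(seconds=60)   # each frozen for more than 60 seconds
CRASH_LIMIT = 3                           # three thread crashes
CRASH_WINDOW = timedelta(hours=2)         # within two hours


def is_abnormal(frozen_for, crash_times, now):
    """Return True if a mailbox meets either condition from the text.

    frozen_for  -- list of timedeltas: how long each thread has been frozen
    crash_times -- list of datetimes at which threads crashed
    now         -- the current time
    """
    long_freezes = sum(1 for d in frozen_for if d > FREEZE_DURATION)
    recent_crashes = sum(1 for t in crash_times if now - t <= CRASH_WINDOW)
    return long_freezes > FROZEN_THREAD_LIMIT or recent_crashes >= CRASH_LIMIT
```

Three crashes spread over ninety minutes trip the check, while the same three crashes from three hours ago do not.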

Quarantining is performed by a background thread that runs every two hours within the Store to check the number of crashes experienced by mailboxes. If a mailbox exceeds the crash threshold it is deemed to be a threat to the overall stability of the Store and is therefore put into quarantine. A 10018 event is logged to record this action. Of course, administrators are constantly monitoring their servers in case such an event should arise and will immediately swing into action to resolve the underlying problem that caused the Store to take such drastic action.

You can modify the quarantine thresholds on a per-server basis by creating the following values in the system registry (if the values aren’t present, Exchange uses the default thresholds):

Key:

HKLM\SYSTEM\CCS\Services\MSexchangeIS\ParameterSystem\Servername\Private-dbguid\Quarantined Mailboxes

Values:

  • MailboxQuarantineCrashThreshold: The number of thread crashes that a mailbox can experience before the Store considers it a candidate for quarantine. The default is 3.
  • MailboxQuarantineDurationInSeconds: The number of seconds a mailbox remains in quarantine before it is released. The default is 21,600 seconds, or six hours.

Note that the Store may create a separate registry key for every mailbox database on a server. The entries corresponding to a specific database are identified with the GUID for the database inserted in the “Private-dbguid” element of the registry key.

When a mailbox is quarantined, the Store writes an entry into the system registry at:

HKLM\SYSTEM\CCS\Services\MSexchangeIS\Servername\Private-dbguid\Quarantined Mailboxes\{Mailbox GUID}

Again, a separate registry key is maintained for each database on the server, so you’ll find a list of quarantined mailboxes for each database, assuming that at least one mailbox has experienced problems in a database to force the Store to create the registry key. If a mailbox has never crashed on a server, you won’t find any entries in the registry. On the other hand, if a database is having a lot of problems, you might find many mailboxes in its “crash list”.

The last part of a mailbox’s registry key is the GUID that uniquely identifies the mailbox. Thus, if several mailboxes are quarantined, you’ll find a set of GUIDs listed. The Store checks the registry when it mounts a database to discover whether any mailboxes are quarantined. Two values that are used by the background thread to decide whether to quarantine a mailbox are stored under the entry for each mailbox:

  • CrashCount: the number of times that this mailbox has crashed a thread within the Store
  • LastCrashTime: the last time that a mailbox thread crashed within the Store

The processing performed by the background thread is as follows:

  • If a mailbox in a database’s “crash list” has a CrashCount value less than MailboxQuarantineCrashThreshold (default 3) in the last two hours, the background thread considers the mailbox to no longer be a threat and removes its registry entry from the list for the database.
  • If a mailbox in a database’s “crash list” has a CrashCount greater than MailboxQuarantineCrashThreshold and the mailbox is not already quarantined, the background thread immediately puts the mailbox into quarantine and logs the 10018 event.
  • If a mailbox’s quarantine period has expired, the background thread releases it from quarantine and logs event 10019 to report the mailbox’s release. Otherwise the mailbox remains quarantined until the next check.
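The three rules above can be sketched in Python as follows. This is an illustration only, not the Store’s implementation. The class, the function, and the quarantined_at field are hypothetical. Two assumptions go beyond the text: quarantined mailboxes are checked only for release, and a mailbox whose last crash is more than two hours old is treated as having no recent crashes. The text does not say what happens when CrashCount equals the threshold, so the sketch leaves such an entry alone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

CRASH_THRESHOLD = 3                              # default quoted in the article
CRASH_WINDOW = timedelta(hours=2)                # interval between checks
QUARANTINE_DURATION = timedelta(seconds=21600)   # default of six hours


@dataclass
class CrashEntry:
    crash_count: int                             # mirrors CrashCount
    last_crash_time: datetime                    # mirrors LastCrashTime
    quarantined_at: Optional[datetime] = None    # hypothetical bookkeeping field


def check_mailbox(entry, now):
    """Return 'release', 'quarantine', 'remove' or 'keep' for one entry."""
    if entry.quarantined_at is not None:
        # Third rule: release after the quarantine period (event 10019).
        if now - entry.quarantined_at >= QUARANTINE_DURATION:
            return "release"
        return "keep"
    # Assumption: crashes older than the window no longer count.
    recent = entry.crash_count if now - entry.last_crash_time <= CRASH_WINDOW else 0
    if recent > CRASH_THRESHOLD:
        # Second rule: over the threshold, not yet quarantined (event 10018).
        return "quarantine"
    if recent < CRASH_THRESHOLD:
        # First rule: no longer a threat, so drop the entry from the crash list.
        return "remove"
    # Equal to the threshold: not covered by the text, so leave the entry.
    return "keep"
```

For example, an entry with four crashes in the last five minutes is quarantined, while a mailbox quarantined an hour ago stays where it is until the six hours are up.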

Practical problems with the implementation

All of this sounds wonderful in theory and it does indeed work. However, from a practical perspective, the problem is that Exchange puts mailboxes into quarantine in almost silent mode. By this I mean that the first indication of a problem is when the user notices that they can’t connect to their mailbox (Figure 1) and so complains bitterly to the help desk, which then creates a ticket to let the administrator know that something’s up. At this point the 10018 event might be discovered because it contains something like this:

The mailbox for user /o=TonyR/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=TRedmond has been quarantined. Access to this mailbox will be restricted to administrative logons for the next 6 hours

Figure 1: Outlook hits a quarantined mailbox

In a nutshell, the user is not getting into this mailbox until at least six hours have elapsed since the mailbox was quarantined. No incoming mail will be delivered to the mailbox while it is in quarantine.
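If you collect 10018 events in a monitoring tool, a small parser can pull out the mailbox and the quarantine period. The Python sketch below is based solely on the sample message quoted above. The function name is hypothetical, and the real event text may vary between versions and languages, so treat the pattern as an assumption.

```python
import re

# Pattern derived from the single sample message in this article.
PATTERN = re.compile(
    r"The mailbox for user (?P<dn>.+?) has been quarantined\."
    r".*?for the next (?P<hours>\d+) hours?",
    re.DOTALL,
)


def parse_quarantine_event(message):
    """Return (legacy DN, hours) from a 10018 message, or None if no match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    return match.group("dn"), int(match.group("hours"))
```

Fed the sample text, the function returns the legacy distinguished name of the mailbox together with the number 6. Unrelated messages return None.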

An administrator can access a quarantined mailbox by logging into it with MFCMAPI. Once connected, they might be able to find some problem in the mailbox, provided that they knew where to look or what to look for, which isn’t usually the case. No indication is given in the event log as to the location in the mailbox (folder and/or particular item) that’s caused a problem.

No user interface to help

Exchange 2010 provides no user interface for administrators to find out whether any mailboxes are quarantined (I don’t regard the registry editor as a user interface in this instance). Assuming that you probably can’t resolve the GUIDs contained in the system registry to determine the actual mailbox names (human brains seldom have the code necessary to do this translation), you could run the following command in the Exchange Management Shell (EMS), replacing ServerName with the name of the mailbox server:

Get-MailboxStatistics -Server ServerName | Where {$_.IsQuarantined -eq $True}

The first part of this command uses the Get-MailboxStatistics cmdlet to interrogate all of the databases on a server to fetch interesting information about the mailboxes contained in the databases. The second part filters the output to return a list of quarantined mailboxes.

Knowing that some mailboxes are quarantined is one thing. Knowing why they ended up there or how to get the mailboxes out of quarantine is quite another. There are myriad reasons why Store threads might crash. For instance, some reports indicate that transient disk errors on virtualized mailbox servers can generate crashes that cause mailboxes to be quarantined. The solution in those cases was to move the guest server to another host node.

Other reasons exist that might cause quarantining to occur. For example, the Exchange team blog reported that a change made to Service Pack 2 of the Exchange Management Pack for Microsoft System Center Operations Manager (SCOM) caused mailboxes to be quarantined because a high rate of transaction log generation was experienced on the drive holding the database that hosts the mailboxes. A bug in an Information Store Performance Monitor counter caused invalid data to be provided to the “Information Store TroubleShooter” script (Troubleshoot-DatabaseSpace.ps1) that’s installed on mailbox servers from Exchange 2010 SP1 onward. SCOM uses the script to monitor available free space on drives that host mailbox databases. If the data indicates that a drive will exhaust available disk space in the next twelve hours (the default threshold) and the available disk space is less than 25% of the drive’s capacity, the settings used by SCOM cause the script to quarantine the mailboxes that generate the heaviest load, which brings the log generation rate down and allows an administrator to step in and free space on the disk. This all sounds perfectly logical until Performance Monitor passes back invalid data and SCOM requests that mailboxes be quarantined when they do not need to be.
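The condition that the script applies can be expressed in a few lines. The Python sketch below only illustrates the two thresholds described above. It is not taken from Troubleshoot-DatabaseSpace.ps1, and the function and parameter names are hypothetical.

```python
def should_quarantine_heavy_users(free_bytes, capacity_bytes, growth_bytes_per_hour,
                                  hours_threshold=12.0, free_fraction_threshold=0.25):
    """Return True when both conditions described in the text are met."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    # Condition 1: less than 25% of the drive's capacity is free.
    low_on_space = free_bytes / capacity_bytes < free_fraction_threshold
    if growth_bytes_per_hour <= 0:
        # No log growth means the drive never fills at the current rate.
        return False
    # Condition 2: at the current growth rate the drive fills within 12 hours.
    hours_left = free_bytes / growth_bytes_per_hour
    return low_on_space and hours_left < hours_threshold
```

The sketch also shows why bad counter data is so damaging: with 10% of the drive free, a true growth rate that leaves twenty hours of headroom returns False, but an inflated rate for the same drive flips the answer to True.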

I have no reports of similar problems being encountered with Exchange 2013.

It might be possible for an administrator to associate an external event (network outage, disk crash, power failure) with problems for some mailboxes by collating data relating to those events with reports of mailbox quarantining. On the other hand, as was the case with the SCOM Exchange Management Pack, a bug might cause quarantining to happen out of the blue with literally no apparent reason for mailboxes to be isolated.

Release from quarantine

Releasing quarantined mailboxes is another challenge. You can, of course, wait for the quarantine interval to elapse (perhaps after reducing the default interval by updating the registry). Or you could practice brain surgery and delete the registry entries that identify the problematic mailboxes. After removing the registry entries, you’ll need to dismount and remount the database to force the mailbox out of quarantine. Although dismounting and remounting a database usually takes less than a minute, this action will affect other users who are connected to the database so it’s not something to rush into.

Another point to consider is that there’s a reason why the Store quarantined a mailbox in the first place. If you leap into action and update the registry to release the mailbox, you run the risk that some lurking corruption still exists that might cause the mailbox to be re-quarantined very soon or, even worse, that the mailbox might affect the Store and perhaps cause a server to become unstable. On the other hand, anecdotal evidence from some deployments indicates that some thread crashes don’t recur and mailboxes function perfectly happily after they come out of quarantine.

If you think that a mailbox has some problems, moving it to another database is a good way to have the Mailbox Replication Service (MRS) validate the mailbox content and remove any corruptions. Essentially, when MRS moves a mailbox, it rebuilds the mailbox in the target database. When you create the mailbox move request, you should specify a reasonable bad item limit (in the 10-20 range) to allow MRS to drop any corrupted items that it comes across when it moves the mailbox. You can later review the mailbox move report to determine how many bad items MRS dropped and whether these items contained any important information.

Summary

It’s great that the Exchange developers built automatic mailbox quarantining into the Exchange Information Store. In concept, the feature protects the Store against the effect of data corruption and software bugs. It’s just a pity that the implementation smacks of the bad old days of registry hacks. Some user interface to control quarantined mailboxes and to understand the reason why mailboxes are put into this state would be appreciated.

About Tony Redmond ("Thoughts of an Idle Mind")

Exchange MVP, author, and rugby referee