Exchange’s underappreciated single-page patching capability


I really like single page patching, the facility first introduced in Exchange 2010 to enable a database to detect that a database page is corrupt and to retrieve replacement data from another database copy. It’s one of those elegant pieces of functionality that have been introduced as Microsoft improves levels of automated management and self-monitoring throughout Exchange.

Single page patching is not something you would necessarily notice. Unless, of course, you are in the habit of scanning the application event log or have configured some software to do the job for you. Being invisible yet effective is a major advantage. What you don’t know won’t worry you – and why should you be worried if holes appear in database pages that can be automatically fixed by Exchange?

Before single page patching came along, administrators lived in fear of the dreaded -1018 error, which indicated that a database page had failed checksum verification and that corruption had therefore appeared in the database. Corruption arose from many sources, with software bugs and faulty storage controllers among the most likely culprits. When a -1018 error struck, the only solution was to restore the database from backup and bring it up to date by replaying transaction logs, an activity that was long, tiresome, and prone to error. No administrator ever got out of bed singing the praises of having to do a database restore to fix an ailing Exchange server.

The advent of the Database Availability Group (DAG) created a different environment. Sure, Exchange 2007 had introduced multiple database copies (well, two…) with CCR and SCR cluster configurations, but CCR/SCR operations were manual and complex.

A DAG is usually a more complex configuration than any CCR/SCR cluster. However, Microsoft did a good job of hiding the complexity that exists in the Windows Failover Cluster underpinnings of the DAG as well as in most DAG operations (restoring a lagged database copy remains something that could be more automated). And because a DAG supports up to sixteen copies of a database (more like a maximum of four in practice), it is logical to assume that even if corruption were to strike one copy, an uncorrupted version of the affected data should exist and be available elsewhere within the DAG. This is the central notion that lies behind single page patching.

When a corrupted page is detected in the active copy of a database, the Replication service is able to issue a form of “all-points” bulletin that goes to servers holding the other (passive) database copies to request the data necessary to patch the active copy. The signal (the “page patch request list”) is recognized when it arrives on the servers hosting the passive copies and the required data is transmitted back to the requesting server, which uses the inbound data to patch the active database. Any updates with the same data that arrive from other servers are ignored. Simple yet effective.

Different processing occurs when a corrupt page is detected in a passive database copy. A request for data is sent to the server hosting the active database, which returns a copy of the good page that is then applied to the passive copy. Again, simple and effective.
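
The two flows can be sketched as a toy model in Python. This is only an illustration of the idea, not how the Replication service is implemented: the DatabaseCopy class, the patch_active and patch_passive functions, the server names, and the use of a CRC32 checksum to detect a bad page are all assumptions made for the example.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class DatabaseCopy:
    """Hypothetical stand-in for one copy of a database in a DAG."""
    server: str
    pages: dict = field(default_factory=dict)      # page number -> bytes
    checksums: dict = field(default_factory=dict)  # page number -> CRC32

    def write(self, page_no, data):
        self.pages[page_no] = data
        self.checksums[page_no] = zlib.crc32(data)

    def is_corrupt(self, page_no):
        # A page counts as corrupt when its data no longer matches its checksum.
        return zlib.crc32(self.pages[page_no]) != self.checksums[page_no]


def patch_active(active, passives, page_no):
    """Ask every passive copy for the page; apply the first good reply."""
    applied_from = None
    for copy in passives:
        if copy.is_corrupt(page_no):
            continue  # this passive copy cannot help
        if applied_from is None:
            active.write(page_no, copy.pages[page_no])
            applied_from = copy.server
        # Later replies carrying the same page are simply ignored.
    return applied_from


def patch_passive(passive, active, page_no):
    """A passive copy asks the active copy for a good page."""
    if active.is_corrupt(page_no):
        return False
    passive.write(page_no, active.pages[page_no])
    return True


def make_copy(server):
    copy = DatabaseCopy(server)
    copy.write(1, b"mailbox data")
    return copy


active, passive_a, passive_b = (make_copy(s) for s in ("EX1", "EX2", "EX3"))
active.pages[1] = b"garbage"  # simulate corruption in the active copy
source = patch_active(active, [passive_a, passive_b], 1)
print(source, active.is_corrupt(1))
```

The point of the sketch is the asymmetry: a damaged active copy can take its replacement page from any healthy passive copy, while a damaged passive copy always goes back to the active copy.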

Most of the problems fixed by single page patching are in the -1018 category. Other issues (such as another infamous bugbear, the -1022 problem) can also be fixed using this method.

The best thing about this form of patching is that you will probably remain blissfully unaware that it has occurred. Unless, of course, you scan the application event log and find events like this:

Log Name:      Application
Source:        ExchangeStoreDB
Date:          11/12/2014 7:24:56 PM
Event ID:      129
Task Category: Database recovery
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      xxx
Description:
At '11/5/2014 7:24:54 PM' the Exchange store database 'DB1' copy on this server encountered an error. For more detail about this failure, consult the Event log on the server for other "ExchangeStoreDb" or "msexchangerepl" events. Page patching was initiated to restore the page.

Log Name:      Application
Source:        ExchangeStoreDB
Date:          11/12/2014 7:25:00 PM
Event ID:      130
Task Category: Database recovery
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      xxx
Description:
At '11/12/2014 7:24:58 PM', the copy of database 'DB1' on this server encountered an error that it was able to repair. For specific information that may help identify the failure, consult the Event log on the server for other "ExchangeStoreDb" or "MSExchangeRepl" events. The Microsoft Exchange Replication service will automatically attempt to retry the operation.

Log Name:      Application
Source:        ExchangeStoreDB
Date:          11/13/2014 9:03:25 PM
Event ID:      104
Task Category: Database recovery
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      xxx
Description:
At '11/13/2014 9:03:24 PM' the Exchange store database DB2 copy on this server experienced an I/O error that it may be able to repair. For more detail about this failure, consult the Event log on the server for other storage and "ExchangeStoreDb" events. Page patching was initiated to restore the page.
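
If you do want to keep an eye on these events, one option is to export them as text and tally them per database. What follows is a minimal Python sketch, assuming the events have been saved in the same plain-text layout as the samples in this post. The function name count_patch_events is made up for the example, the set of event IDs (104, 129, 130) is taken only from those samples and is not a complete list, and the sample text is trimmed to the fields the code reads.

```python
import re
from collections import Counter

# Event IDs seen in the samples in this post; not an exhaustive list.
PATCH_EVENT_IDS = {"104", "129", "130"}


def count_patch_events(text):
    """Count page patching events per (database, event ID) in exported text."""
    counts = Counter()
    # Every record in the export starts with a "Log Name:" line.
    for record in text.split("Log Name:")[1:]:
        event_id = re.search(r"Event ID:\s*(\d+)", record)
        if event_id is None or event_id.group(1) not in PATCH_EVENT_IDS:
            continue
        # The description names the database with or without quotes,
        # for example: database 'DB1' or database DB2.
        database = re.search(r"database\s+['\u2018\u2019]?(\w+)", record)
        name = database.group(1) if database else "unknown"
        counts[(name, event_id.group(1))] += 1
    return counts


sample = """Log Name:      Application
Source:        ExchangeStoreDB
Event ID:      129
Description:
the Exchange store database 'DB1' copy on this server encountered an error.

Log Name:      Application
Source:        ExchangeStoreDB
Event ID:      104
Description:
the Exchange store database DB2 copy on this server experienced an I/O error.
"""

counts = count_patch_events(sample)
for (database, event_id), total in sorted(counts.items()):
    print(database, event_id, total)
```

A tally that keeps climbing for one database or one server is a hint to go and look at the storage underneath it.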

How often do these events occur? Hopefully, not often. A lot depends on the stability and capability of your storage infrastructure. High-end enterprise-class storage systems generally create fewer corruption events than low-end JBOD systems. Generally – not always.

I once asked some of the team that run the Exchange Online servers inside Office 365 how often they see events that result in single page patching. Their enigmatic answer was “often enough for the feature to be valuable”. Make of that what you will. All I’ll say is that the high availability story around Exchange would be very different if single page patching did not exist. And can you imagine how disruptive it would be within Office 365 if a database suddenly went bad and required human intervention to fix a corrupt page? That wouldn’t be good at all… and increasing automation is the reason why features like the Replay Lag Manager exist inside Exchange.

As Tim McMichael, Microsoft’s broken-cluster-and-high-availability fix-up guru (that’s not his real title, but I like it) observed at IT/DEV Connections in September 2014, just about the only downside of single page patching is that it can disguise some underlying hardware problem that hasn’t quite failed yet but will do so soon. Exchange is so good at patching that it can make a faltering storage controller look good… but not for long. After all, that would be a case of applying lipstick to a pig and the pig is still liable to burp. Or something like that.

Follow Tony @12Knocksinna

About Tony Redmond ("Thoughts of an Idle Mind")

Exchange MVP, author, and rugby referee
This entry was posted in Email, Exchange, Exchange 2010, Exchange 2013, Office 365.