One of the nice things about writing about technology is that you often unearth gems that developers have built into software and that many other folks overlook. One such gem is the ability of Exchange 2010 to patch individual pages within a database.
In previous versions of Exchange, especially those prior to Exchange 2007, administrators hated seeing events 1018 or 1022 logged in the application event log because these events indicate that the Store had detected a page-level corruption in a database. Corruptions usually occurred as the result of some hardware problem, often in storage controllers. A database would continue to run with the corruption, but the only way to fix it was to restore from the last good backup. Given that some databases were well over 100GB, this wasn’t a popular option with administrators.
A page corruption in a single-copy database is a severe problem. Replicating the corruption to many database copies creates a problem with a whole new dimension. For this reason, Exchange 2010 is able to detect and fix page-level corruptions that occur in active or passive database copies.
If the Store detects a problem page in the active database (usually when it verifies the checksum of a page that it has just read into memory), it places a marker in the log stream (in the current transaction log) that acts as a request for a valid copy of the corrupted page. The marker travels with the log to every database copy, where it is inspected and processed along with other log content. When the Information Store replays data for a passive copy, it notices the marker and responds to the request by invoking a “replication service callback” to ship a copy of the page to the server that hosts the active database. When this server receives the replicated page, the Store patches it back into the active database to remove the corruption. Other servers that host passive copies may also respond with pages, but these are ignored once the active database has been restored to good health.
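The flow for the active copy can be sketched as a toy simulation. This is only an illustration of the sequence described above, not how Exchange implements it: the names `Page`, `DatabaseCopy`, and `patch_active` are hypothetical, CRC32 stands in for the real ESE page checksum, and log positions are ignored so every passive copy is assumed to hold the page in the state the active copy needs.

```python
import zlib
from dataclasses import dataclass


def checksum(data: bytes) -> int:
    # Assumption: CRC32 is a stand-in for the real ESE page checksum.
    return zlib.crc32(data)


@dataclass
class Page:
    data: bytes
    stored_checksum: int

    def is_valid(self) -> bool:
        # A page is corrupt when its contents no longer match its checksum.
        return checksum(self.data) == self.stored_checksum


def make_page(data: bytes) -> Page:
    return Page(data, checksum(data))


@dataclass
class DatabaseCopy:
    name: str
    pages: dict  # page number -> Page

    def read_page(self, page_no: int) -> Page:
        return self.pages[page_no]


def patch_active(active: DatabaseCopy, passives: list, page_no: int) -> list:
    """Repair one page on the active copy and return a list of event strings."""
    events = []
    if active.read_page(page_no).is_valid():
        return events  # checksum matches, nothing to repair

    # The active copy places a marker in the log stream asking for the page.
    marker = {"type": "page_request", "page": page_no}
    events.append(f"{active.name}: marker written for page {page_no}")

    # Each passive copy meets the marker during replay and may respond.
    for passive in passives:
        candidate = passive.read_page(marker["page"])
        if not candidate.is_valid():
            events.append(f"{passive.name}: own page is bad, no response")
        elif active.read_page(page_no).is_valid():
            # A good page has already arrived, so later responses are dropped.
            events.append(f"{passive.name}: response ignored, already patched")
        else:
            active.pages[page_no] = Page(candidate.data, candidate.stored_checksum)
            events.append(f"{passive.name}: page shipped, active copy patched")
    return events


if __name__ == "__main__":
    good = b"mailbox data"
    active = DatabaseCopy("active", {7: Page(b"mailbox dXta", checksum(good))})
    passives = [
        DatabaseCopy("copy2", {7: make_page(good)}),
        DatabaseCopy("copy3", {7: make_page(good)}),
    ]
    for event in patch_active(active, passives, 7):
        print(event)
```

The point of the sketch is the ordering: the request rides in the log stream rather than in a separate channel, so any healthy passive copy can answer it, and the first good answer wins.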
The process to fix a corrupted page in a passive database copy is slightly different. In this case, the server that hosts the passive copy immediately pauses log replay. Log copying continues to ensure that all of the transaction logs that will eventually be required to bring the database completely up to date are available on the server. The server then requests a copy of the corrupted page from the server that hosts the active database using the internal ESE seeding mechanism. The active server responds with the page data. The passive server then waits until all the log files necessary to bring it up to date past the point where the active server provided the page (as indicated by the maximum required generation) have been copied and inspected. When it is sure that all the required data is available, the passive server then restores the corrupt page and resumes log replay to clear the backlog of transaction logs that have accumulated since the corruption was first detected.
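The passive-copy sequence is easier to follow as a small state machine. The sketch below is a simplified model under stated assumptions, not Exchange code: `PassiveCopy`, `patch_passive`, and `fetch_from_active` are hypothetical names, log generations are plain integers, and "copied and inspected" is collapsed into a single counter.

```python
from dataclasses import dataclass, field


@dataclass
class PassiveCopy:
    pages: dict                # page number -> bytes
    copied_generation: int     # highest log generation copied and inspected
    replayed_generation: int   # highest log generation replayed into the database
    replay_paused: bool = False
    history: list = field(default_factory=list)

    def copy_log(self) -> None:
        # Log copying carries on even while replay is paused.
        self.copied_generation += 1

    def replay_available_logs(self) -> None:
        if not self.replay_paused:
            self.replayed_generation = self.copied_generation


def patch_passive(passive, page_no, fetch_from_active, logs_arriving):
    """Try to patch one page; return True once the page has been restored.

    fetch_from_active is a hypothetical callable that returns
    (page_bytes, max_required_generation) for the requested page.
    logs_arriving is how many further logs show up during this attempt.
    """
    # Step 1: stop replaying logs into the damaged database.
    passive.replay_paused = True
    passive.history.append("replay paused")

    # Step 2: ask the active copy for a good version of the page.
    page_data, max_required_gen = fetch_from_active(page_no)
    passive.history.append(f"page received, logs needed up to {max_required_gen}")

    # Step 3: keep copying logs until the required generation is on hand.
    for _ in range(logs_arriving):
        if passive.copied_generation >= max_required_gen:
            break
        passive.copy_log()

    if passive.copied_generation < max_required_gen:
        passive.history.append("still waiting for logs")
        return False

    # Step 4: restore the page, then resume replay to clear the backlog.
    passive.pages[page_no] = page_data
    passive.replay_paused = False
    passive.replay_available_logs()
    passive.history.append("page patched, replay resumed")
    return True


if __name__ == "__main__":
    copy = PassiveCopy(pages={3: b"bad"}, copied_generation=10, replayed_generation=10)
    patch_passive(copy, 3, lambda n: (b"good", 12), logs_arriving=5)
    print(copy.history)
```

The wait in step 3 is what keeps the patched page consistent: the page handed over by the active server reflects changes up to a given log generation, so the passive copy must hold every log up to that point before it can safely resume replay.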
One important point that is often overlooked is that single page patching does not work with lagged database copies. The logic is simple: lagged database copies are deliberately kept at a point in time behind the current time and only apply updates from transaction logs as those logs pass the desired lag interval. Because the pages in a lagged copy are held back in this way, the broadcast-for-a-patch mechanism used by regular database copies won’t work if a corruption is detected in a lagged database copy. The only way to fix such a corruption is to restore the lagged database from a backup.
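To make the lag behaviour concrete, here is a minimal sketch of how a lagged copy might decide which logs it is allowed to replay. The function name `logs_eligible_for_replay` and the data layout are assumptions made for illustration; they are not part of Exchange.

```python
from datetime import datetime, timedelta


def logs_eligible_for_replay(log_times, lag, now):
    """Return the log generations that are old enough to be replayed.

    log_times maps a generation number to the time that log was created.
    A log becomes eligible only once it is at least `lag` old.
    """
    return [gen for gen, created in sorted(log_times.items()) if now - created >= lag]


if __name__ == "__main__":
    now = datetime(2010, 6, 1, 12, 0)
    logs = {
        100: now - timedelta(days=3),
        101: now - timedelta(days=1),
        102: now - timedelta(hours=2),
    }
    # With a two-day lag, only the oldest log may be replayed.
    print(logs_eligible_for_replay(logs, timedelta(days=2), now))
```

Every log that is not yet eligible represents changes that the lagged database has not seen, which is why its pages cannot simply be swapped with those of an up-to-date copy.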
– Tony
Learn more about the workings of the Exchange 2010 Information Store in Microsoft Exchange Server 2010 Inside Out!