One of the best things about delivering training to smart people is the questions that they pose after you introduce a topic. During the recent Exchange 2010 Maestro seminars that Paul Robichaux and I delivered in Boston and Anaheim, I took the lead in talking about the Database Availability Group (DAG) and the deployment options that are now available to Exchange 2010 administrators. Some of the questions that were raised then caused me to consider the value of lagged database copies to a DAG, which then provoked this blog post.
Consultants and other commentators often consider the use of a lagged database copy within a DAG for Exchange 2010 deployments. Typically, once there are more than two passive database copies, thoughts turn to the creation of a lagged copy to provide the ability for a “point in time” recovery should the need arise. Perhaps people want to use new features; perhaps they are influenced by the comments of others. Let’s explore the topic a little more.
The best thing about a DAG is that you can achieve resilience against failure by creating multiple copies of databases that Exchange keeps up to date through log shipping. However, some published advice holds that the second passive copy should be lagged. For example, Symantec’s page titled “Best practices for Backup Exec 2010 Agent for Microsoft Exchange Server” advises “If you can make more than one passive copy, the second passive copy should use a log replay delay of 24 hours”.
Of course, we are still learning about the best and most effective practices for DAG designs and it’s natural that people would want to use one of the new DAG features in their deployments. I think that there are a number of points that you need to consider before you deploy a lagged database copy into production.
First, what is a lagged database copy? A lagged database copy is one that is not updated by replaying transactions as they become available. Instead, the transaction logs are kept for a set period and are then replayed. The lagged database copy is therefore maintained at a certain remove from the active database and the other non-lagged database copies.
The primary reason to use lagged database copies (7 or 14 days are common intervals) is to provide you with the ability to go back to a point in time when you are sure that a database is in a good state. By delaying the replay of replicated transaction logs into a database copy, you always have the ability to go to that copy and know that it represents a point in time in the past when the database was in a certain condition. Two mailbox database properties govern how lagged copies function. You can set these properties with the Set-MailboxDatabaseCopy cmdlet or indeed set them when you create a new copy with the Add-MailboxDatabaseCopy cmdlet:
- ReplayLagTime: the time span (entered in dd.hh:mm:ss format) governing the delay that Exchange applies to log replays for database copies (replay lag time). Setting this value to zero means that Exchange replays new transaction logs as soon as they are copied to servers that host database copies. The intention is that you have the chance to keep a database copy in a state slightly behind the active copy so that if a problem occurs on the active server that results in database corruption, you will be able to stop replication and prevent the corruption occurring in database copies. Typically, DAGs that use a lagged copy are configured so that there are two or three database copies kept up to date and one (usually in a disaster recovery site) that is configured with a time lag. The maximum lag time is 14 days.
- TruncationLagTime: the time span (entered in the same dd.hh:mm:ss format) governing log truncation delay. Again, you can set this value to zero to instruct Exchange to remove transaction logs immediately after their content has been replayed into a database copy, but most sites keep transaction logs around for at least 15 minutes to ensure that they are available if required to bring a database copy up to date should an outage occur. The maximum truncation lag time is seven days.
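To make the interplay of the two settings easier to picture, here is a minimal Python sketch of the life of a single transaction log on a passive copy. It is a simplified model, not Exchange's actual logic: the function name `log_state` and the sample values are hypothetical, it assumes that a log is replayed at the exact moment the replay lag expires, and it ignores the other conditions that affect truncation (such as backups and the state of the other copies).

```python
from datetime import datetime, timedelta


def log_state(copied_at, now, replay_lag, truncation_lag):
    """Classify one copied transaction log on a passive database copy.

    Simplified model: replay happens as soon as the replay lag expires,
    and the truncation lag timer starts at that moment.
    """
    replay_due = copied_at + replay_lag         # earliest moment the log may be replayed
    truncate_due = replay_due + truncation_lag  # earliest moment the log may be removed
    if now < replay_due:
        return "awaiting replay"
    if now < truncate_due:
        return "replayed, retained"
    return "eligible for truncation"


# Hypothetical settings: a 7-day replay lag and a 15-minute truncation lag.
replay_lag = timedelta(days=7)
truncation_lag = timedelta(minutes=15)
copied_at = datetime(2010, 6, 1, 12, 0)

for elapsed in (timedelta(days=3), timedelta(days=7, minutes=5), timedelta(days=8)):
    state = log_state(copied_at, copied_at + elapsed, replay_lag, truncation_lag)
    print(elapsed, "->", state)
```

The point of the sketch is that every log sits on disk for at least the replay lag plus the truncation lag, which matters when you come to size storage for the copy.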
We have to realize that a lagged database copy can occupy a large amount of storage. Apart from the normal requirement to provide storage for the database itself, you must assign space for all the transaction logs for the lag period and this could be significant for a busy database that supports hundreds or thousands of mailboxes and generates many gigabytes of transaction logs daily. The transaction logs for a lagged database copy contain transactions that are not yet committed to the database. Exchange commits the transactions when the lag period expires, so if you have a lag period of seven days, Exchange has to keep seven days’ worth of transaction logs.
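A rough back-of-the-envelope calculation shows how quickly the log volume adds up. The Python sketch below is only an estimate under stated assumptions: the function name, the 20% headroom default and the sample figures (a 500 GB database generating 10 GB of logs per day) are all hypothetical, and it assumes that log generation is constant from day to day.

```python
def lagged_copy_storage_gb(database_gb, daily_log_gb, replay_lag_days,
                           truncation_lag_days=0.0, headroom=0.2):
    """Estimate the disk space (GB) needed to host a lagged database copy.

    Logs are held for the replay lag plus any truncation lag; headroom is a
    safety margin expressed as a fraction (0.2 adds 20%).
    """
    values = (database_gb, daily_log_gb, replay_lag_days, truncation_lag_days, headroom)
    if min(values) < 0:
        raise ValueError("all inputs must be non-negative")
    retained_log_gb = daily_log_gb * (replay_lag_days + truncation_lag_days)
    return (database_gb + retained_log_gb) * (1 + headroom)


# Hypothetical example: 500 GB database, 10 GB of logs per day, 7-day lag.
estimate = lagged_copy_storage_gb(database_gb=500, daily_log_gb=10, replay_lag_days=7)
print(f"Plan for roughly {estimate:.0f} GB")
```

With these sample numbers the seven-day lag alone accounts for 70 GB of logs on top of the database, before any headroom is added. Real log generation rates vary from day to day, so measure your own servers before you size anything.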
Executing a smooth and stress-free recovery is the big issue that I see with lagged copies. Microsoft provides no user interface to recover data from a lagged database. The steps required to bring a lagged database copy online as the active copy are reasonably straightforward but they are manual and depend on a reasonable degree of knowledge on the part of the administrator. You can mount a lagged database as a recovery database if all you need is to recover one or more specific mailboxes to a point in time, but this operation is not well documented so expect to have to practice it before attempting it in production. If you decide that a point in time restore is required for a complete database (a pretty catastrophic situation) and make a lagged database the active copy, you force a reseed for all other database copies. This is a further impact on service delivery.
The need to assign and manage sufficient storage is reasonably simple to address. The lack of a wizard or other GUI to guide an administrator through the use of a lagged database copy in recovery operations is more serious. Few companies have staff who are experienced in this kind of interaction with a DAG (it will come with time), so if the time ever comes when the lagged database copy is required, there’s a fair chance that all hell will break loose and panic will ensue before people figure out what to do. It should be an interesting conversation with Microsoft support:
Administrator: “Hi, I need to bring a lagged database copy back online because (insert reason here)”
Microsoft support: “Interesting… hang on a moment… (pregnant pause)”
Administrator: “Hallo, is anyone there?”
Microsoft support: “I’m just checking our support tools to see how best to proceed…” (the story evolves from this point and everyone is happy in the end)
If this discussion causes you some concern, what can you do? I think there are two routes worthy of investigation. Expanded use of the enhanced “dumpster” in Exchange 2010 is an obvious solution for recovery of individual mailboxes. In other words, keep more data in the dumpster just in case someone needs to recover an item and hope that you are never asked to recover a complete mailbox to the state that it was at a point in time. If you are asked, you need to restore the database from a backup (you’re still taking backups – right?), run ESEUTIL to fix the database and allow a clean mount, mount the restored database as a recovery database, and then use the Restore-Mailbox or New-MailboxRestoreRequest (available from SP1 onwards) cmdlets to recover data into a PST that you can then import into the user mailbox or provide to the user.
Recovery of complete databases is a different matter. My recommendation is that you should invest in storage or backup technology that incorporates strong recovery capabilities. Some storage offers very good snapshot recovery capabilities so that recovery is a matter of selecting the appropriate snapshot and recovering from it; some backup products provide similar capabilities. Your choice will be dictated by personal preference, previous deployment history, and your knowledge of how strong support personnel are within your company. In other words, you’ll select the best tool for the job to fit the unique circumstances of your Exchange 2010 deployment.
I’m sure that others will have their own views on the topic. For now, I just can’t see how I could recommend the deployment of lagged database copies. Comments are more than welcome…