One of the best things about delivering training to smart people is the questions that they pose after you introduce a topic. During the recent Exchange 2010 Maestro seminars that Paul Robichaux and I delivered in Boston and Anaheim, I took the lead in talking about the Database Availability Group (DAG) and the deployment options that are now available to Exchange 2010 administrators. Some of the questions that were raised then caused me to consider the value of lagged database copies to a DAG, which then provoked this blog post.
Consultants and other commentators often consider the use of a lagged database copy within a DAG for Exchange 2010 deployments. Typically, once there are more than two passive database copies, thoughts turn to the creation of a lagged copy to provide the ability for a “point in time” recovery should the need arise. Possibly people want to use new features; possibly they are influenced by the comments of others. Let’s explore the topic a little more.
The best thing about a DAG is that you can achieve resilience against failure by creating multiple copies of databases that Exchange will keep up to date through log shipping. However, some published advice exists that the second passive copy should be lagged. For example, Symantec’s page titled “Best practices for Backup Exec 2010 Agent for Microsoft Exchange Server” advises “If you can make more than one passive copy, the second passive copy should use a log replay delay of 24 hours”.
Of course, we are still learning about the best and most effective practices for DAG designs and it’s natural that people would want to use one of the new DAG features in their deployments. I think that there are a number of points that you need to consider before you deploy a lagged database copy into production.
First, what is a lagged database copy? A lagged database copy is one that is not updated by replaying transactions as they become available. Instead, the transaction logs are kept for a certain period and are then replayed. The lagged database copy is therefore maintained at a certain remove from the active database and the other non-lagged database copies.
The primary reason to use lagged database copies (7 or 14 days are common intervals) is to provide you with the ability to go back to a point in time when you are sure that a database is in a good state. By delaying the replay of replicated transaction logs into a database copy, you always have the ability to go to that copy and know that it represents a point in time in the past when the database was in a certain condition. Two mailbox database properties govern how lagged copies function. You can set these properties with the Set-MailboxDatabaseCopy cmdlet or indeed set them when you create a new copy with the Add-MailboxDatabaseCopy cmdlet:
- ReplayLagTime: the time span (written as dd.hh:mm:ss) governing the delay that Exchange applies to log replays for database copies (replay lag time). Setting this value to zero means that Exchange should replay new transaction logs as soon as they are copied to servers that host database copies. The intention is that you have the chance to keep a server running in a state slightly behind the active copy so that if a problem occurs on the active server that results in database corruption, you will be able to stop replication and prevent the corruption occurring in database copies. Typically, DAGs that use a lagged copy are configured so that there are two or three database copies kept up to date and one (usually in a disaster recovery site) that is configured with a time lag. The maximum lag time is 14 days.
- TruncationLagTime: the time span (written as dd.hh:mm:ss) governing log truncation delay. Again, you can set this value to zero to instruct Exchange to remove transaction logs immediately after their content has been replayed into a database copy, but most sites keep transaction logs around for at least 15 minutes to ensure that they are available if required to bring a database copy up to date should an outage occur. The maximum truncation lag time is also 14 days.
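To make the two settings concrete, here is a minimal Python sketch that formats a lag as a `d.hh:mm:ss` time span string and checks it against the 14-day replay lag ceiling mentioned above. The helper names, the database copy identity `DB1\MBX3`, and the chosen values are illustrative assumptions. The sketch only builds a command string for review and does not talk to Exchange.

```python
from datetime import timedelta

# Ceiling for the replay lag, as quoted in the text above.
MAX_REPLAY_LAG = timedelta(days=14)


def to_timespan(lag: timedelta) -> str:
    """Format a timedelta as a d.hh:mm:ss string (whole seconds only)."""
    total = int(lag.total_seconds())
    if total < 0:
        raise ValueError("lag cannot be negative")
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}.{hours:02d}:{minutes:02d}:{seconds:02d}"


def build_lag_command(copy_identity: str, replay: timedelta, truncation: timedelta) -> str:
    """Build (but do not run) a Set-MailboxDatabaseCopy command line.

    copy_identity is a hypothetical "Database\\Server" name used for illustration.
    """
    if replay > MAX_REPLAY_LAG:
        raise ValueError("replay lag exceeds 14 days")
    return (
        f'Set-MailboxDatabaseCopy -Identity "{copy_identity}" '
        f"-ReplayLagTime {to_timespan(replay)} "
        f"-TruncationLagTime {to_timespan(truncation)}"
    )


if __name__ == "__main__":
    # A 7-day lagged copy that keeps replayed logs for a further 15 minutes.
    print(build_lag_command(r"DB1\MBX3", timedelta(days=7), timedelta(minutes=15)))
```

Rejecting an out-of-range value before the command is ever issued is the point of the check; the real cmdlet performs its own validation, so treat this as a planning aid only.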
We have to realize that a lagged database copy can occupy a large amount of storage. Apart from the normal requirement to provide storage for the database itself, you must assign space for all the transaction logs for the lag period and this could be significant for a busy database that supports hundreds or thousands of mailboxes and generates many gigabytes of transaction logs daily. The transaction logs for a lagged database copy contain transactions that are not yet committed to the database. Exchange commits the transactions when the lagged period expires, so if you have a lagged period of 7 days, Exchange has to keep 7 days’ worth of transaction logs.
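A back-of-the-envelope calculation shows how quickly that log volume adds up. The Python sketch below multiplies a daily log volume by the lag period and adds some headroom. The function name, the 20 GB/day figure, and the 20% headroom are assumptions chosen purely for illustration, not measurements from any real deployment.

```python
def lagged_log_storage_gb(daily_log_gb: float, lag_days: float, headroom: float = 0.2) -> float:
    """Estimate log storage (GB) needed by a lagged copy.

    daily_log_gb: transaction log volume generated per day for the database.
    lag_days:     how many days of logs are held before replay.
    headroom:     extra fraction reserved for bursts (0.2 means 20%).
    """
    if daily_log_gb < 0 or lag_days < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")
    return daily_log_gb * lag_days * (1 + headroom)


if __name__ == "__main__":
    # Hypothetical busy database: 20 GB of logs per day, 7-day lag.
    needed = lagged_log_storage_gb(20, 7)
    print(f"Plan for roughly {needed:.0f} GB of log storage")
```

With these assumed inputs the estimate comes to about 168 GB for logs alone, on top of the space for the database copy itself.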
Executing a smooth and stress-free recovery is the big issue that I see with lagged copies. Microsoft provides no user interface to recover data from a lagged database. The steps required to bring a lagged database copy online as the active copy are reasonably straightforward but they are manual and depend on a reasonable degree of knowledge on the part of the administrator. You can mount a lagged database as a recovery database if all you need is to recover one or more specific mailboxes to a point in time, but this operation is not well documented so expect to have to practice it before attempting it in production. If you decide that a point in time restore is required for a complete database (a pretty catastrophic situation) and make a lagged database the active copy, you force a reseed for all other database copies. This is a further impact on service delivery.
The need to assign and manage sufficient storage is reasonably simple. The lack of a Wizard or other GUI to guide an administrator through the use of a lagged database copy in recovery operations is more serious. Few companies have staff who are experienced in this kind of interaction with a DAG (it will come with time), so if a time ever comes when the lagged database copy is required, there’s a fair chance that all hell will break loose and panic will ensue before people figure out what to do. It should be an interesting conversation with Microsoft support:
Administrator: “Hi, I need to bring a lagged database copy back online because (insert reason here)”
Microsoft support: “Interesting… hang on a moment… (pregnant pause)”
Administrator: “Hallo, is anyone there?”
Microsoft support: “I’m just checking our support tools to see how best to proceed…” (the story evolves from this point and everyone is happy in the end)
If this discussion causes you some concern, what can you do? I think there are two routes worthy of investigation. Expanded use of the enhanced “dumpster” in Exchange 2010 is an obvious solution for recovery of individual mailboxes. In other words, keep more data in the dumpster just in case someone needs to recover an item and hope that you are never asked to recover a complete mailbox to the state that it was at a point in time. If you are asked, you need to restore the database from a backup (you’re still taking backups – right?), run ESEUTIL to fix the database and allow a clean mount, mount the restored database as a recovery database, and then use the Restore-Mailbox or New-MailboxRestoreRequest (available from SP1 onwards) cmdlets to recover data into a PST that you can then import into the user mailbox or provide to the user.
Recovery of complete databases is a different matter. My recommendation is that you should invest in storage or backup technology that incorporates strong recovery capabilities. Some storage offers very good snapshot recovery capabilities so that recovery is a matter of selecting the appropriate snapshot and recovering from it; some backup products provide similar capabilities. Your choice will be dictated by personal preference, previous deployment history, and your knowledge of how strong support personnel are within your company. In other words, you’ll select the best tool for the job to fit the unique circumstances of your Exchange 2010 deployment.
I’m sure that others will have their own views on the topic. For now, I just can’t see how I could recommend the deployment of lagged database copies. Comments are more than welcome…
Couldn’t agree more. The concept of Lagged Database Copies is a good one and even the name is enough to make people believe it’s the best idea since sliced bread. Personally I feel I have solid 2010 experience and skillset, but ask me to recover a lagged copy and I’ll be straight onto Google!
IMHO, lagged copies seem like a cool thing you can do with log shipping technology but don’t have much place in a production environment
I’ve always held to the idea that for email you shouldn’t hold any backup or recovery data of any kind without a full operational plan on how to use it. The last thing you need in a disaster recovery scenario is a choice of things to try and, even worse, the ensuing discussion around which you should try and who will be responsible for the decision. You want one thing that you know will work. With retention being a legal requirement these days anyway, the fully indexed, de-duped archive seems like a much better option for personal data integrity.
In the case of a corrupted DB, and noting that this is becoming increasingly rare, moving the entire DB back in time 7 days to restore its integrity is not an option that will allow you to keep your job. Unless you have very, very understanding users you would try to preserve the state for as many of them as you could by creating a new DB and migrating the available mailboxes.
It is not clear why anyone with a decent array-based solution would bother with lagged database copies. As you undoubtedly know, the backup API was removed from Exchange 2010 leaving only VSS for “backups.” End users are left with two choices to protect their data: Application-aware snapshots through VSS or trusting DAGs + lagged replicas + Personal Archives as a replacement for backups altogether.
With application-aware snapshots, hundreds of point in time snapshots can be kept online versus just 15 lagged copies. Furthermore, the databases are usually locked down in read-only snapshots so that they do not change over time like (even lagged) replicas do. Additionally, each lagged copy is a full copy of the database, whereas a snapshot generally only captures the underlying blocks that changed between snapshots.
Restoring an entire database from snapshot using storage vendor snapshots is usually a very fast operation. Mailbox recovery or even single item recovery is also made easier using Exchange-aware snapshots on the storage device along with third party mailbox recovery tools that do not require you to have a running Exchange server to mount the databases first.
Disclosure: I have worked for 2 storage vendors (3, counting acquisitions) who produce snapshot software (VSS writers / requestors) for Exchange.
Can you explain how the “federated log file truncation” in Exchange 2010 DAG really works? Is a transaction log file associated with the active copy of a database copy truncated only after it is successfully inspected (for lagged copies) by or successfully committed to all passive copies of the database? I recently read a blog that suggests the transaction log files associated with the active copy of a database are truncated only after they are committed to all passive copies of the database. Is this true? My understanding is that even if I back up a passive copy of a database, the details of the backup operation are first written to the active copy of the database and then replicated to all passive copies. If this is the case, the log file truncation should ideally happen first on the active copy and then on the passive copies even if it is one of the passive copies of the database that is backed up. Is this understanding correct?
According to MSDN, which I think is accurate:
“Backup initiated transaction log file truncation will be triggered based on the type of backup chosen. In non-DAG configurations, the Store Writer will truncate the transaction log files at the completion of successful Full or Incremental backups. In DAG replicated configurations, log truncation will be delayed by the Replication service until all necessary log files are replayed into all other copies. The Replication service will delete the backed up log files both from the active and the copy log file paths after it verifies that the to-be-deleted log files have successfully been applied to the copy database and both active database and the database copies checkpoint has passed the log files to be deleted.”
Thanks Tony! If I understand correctly, what this means is that a log file is truncated from a copy of a database (irrespective of whether it is an active or passive copy and the truncation lag time set for the copy) if and only if the following conditions are met.
1. The log file has been successfully backed up from one of the copies of the associated database.
2. The log file has been committed to all copies of the database.
3. The truncation lag time set on the database copy has elapsed for the log file.
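The three conditions listed above can be written as a single predicate. The Python sketch below models only this reader’s reading of the rule; the `LogFileState` type, its field names, and the idea that the lag clock starts when the copy replays the log are assumptions made for illustration, not a description of how the Replication service is implemented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogFileState:
    """Hypothetical view of one transaction log file on one database copy."""
    backed_up: bool               # condition 1: backed up from some copy
    replayed_on_all_copies: bool  # condition 2: committed to every copy
    replayed_at: datetime         # when this copy replayed the log


def can_truncate(log: LogFileState, truncation_lag: timedelta, now: datetime) -> bool:
    """Return True only when all three conditions hold for this log file."""
    lag_elapsed = now - log.replayed_at >= truncation_lag  # condition 3
    return log.backed_up and log.replayed_on_all_copies and lag_elapsed


if __name__ == "__main__":
    replayed = datetime(2010, 10, 1, 12, 0)
    log = LogFileState(backed_up=True, replayed_on_all_copies=True, replayed_at=replayed)
    # Ten minutes after replay with a 15-minute truncation lag: keep the log.
    print(can_truncate(log, timedelta(minutes=15), replayed + timedelta(minutes=10)))
    # Twenty minutes after replay: the log is eligible for truncation.
    print(can_truncate(log, timedelta(minutes=15), replayed + timedelta(minutes=20)))
```

Writing the rule this way makes it obvious that a single copy that has not yet replayed a log, such as a lagged copy, holds up truncation everywhere under this interpretation.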
Great article, thanks. Given we assume that database corruption due to software or storage failure is increasingly rare, the other reason I’ve been told for having a lagged copy is to protect against database corruption due to a virus – i.e. you get a virus outbreak that writes nonsense all over your mail database before you get the updated signatures.
I don’t think that changes your argument, in that you could still use snapshot backups to protect against viruses. If you use lagged copies, I agree that you definitely need to have tested the documented processes on how you will use them.
Good to hear from you again. When I think about virus outbreaks, I wonder how often you would actually want to restore a database to a particular point in time to eradicate infected messages. Most AV software today is pretty good at eliminating threats, so you’re really considering the kind of “day 0” virus that no one has ever seen before – and in this instance, a snapshot copy is always going to be a faster mechanism to restore to a point in time – after you’ve updated your AV signatures!
I am not a fan of Lagged Copies; they seem like a waste of a perfectly good DAG copy.
BUT, what about in a scenario where you have opted to NOT have any backup process? (eg: DAG with 4 copies of each DB, across 2 physical sites, no backup)
Can you envisage any scenario where a Lagged Copy would help you recover from an otherwise catastrophic failure? Thanks!
I guess I can envisage a scenario where a lagged copy becomes valuable. Let’s assume that we encounter a “day 0” virus – one that we have not seen before and that uses a new attack vector that can’t be detected by current AV technology. If the virus is sufficiently virulent, it might infect all of our mailbox databases and make them unusable. At this point, you could revert to a lagged copy from before the point of infection and recover using it. I can’t think of another scenario offhand (apart from maybe a complete datacenter meltdown in conjunction with a discovery that the backups from the last few days are trash), but maybe others can…
Sorry to post off topic but I can’t find a directly related one as you’ll see! As an independent messaging techie caring for corporate email since before Network Courier and MSMUG, I’m loving the new stuff in 2010 as I’m sure you are. The only problems I’m finding are realigning techies’ view of the changes and trying to come up with novel solutions for DAG designs. I’m writing a mini ‘101’ doc to cover the main things to appreciate and would welcome some input on a seemingly little discussed DAG design option.
On that side, one area I’m having trouble getting any information about is if there is any benefit to using Blank Database Voting Member nodes in a DAG hosted on a mailbox/hub role server? I am considering this compared to rolling out multiple Witness nodes for improved fault tolerance.
I am assuming that the main factor limiting the benefit of the BDVM would be that if deploying more than five databases, the BDVM node would need to be installed on Exchange 2010 Enterprise. Would this be the case? (I realise the hub/mailbox in this case would need to be Windows 2008 Enterprise for clustering anyway)
The worry I have is that this option is either something that is not really of benefit or has been superseded/retired. The hope is that as it hasn’t got a very easy acronym or design name (I’ve just coined BDVM), it will be difficult to get a search return that specifies it, thus I can’t find the relevant discussions!
Great blogs, I love some of the comments as I’m usually called in once everything has gone wrong and see so many similar scenarios! “…if you’re a consultant who parachutes in to do a design and departs immediately upon payment…” Class!
All the best,
Glad that you like the blog. I hope that continues!
As to your new concept, I’m at a loss… because I can’t see the point of deploying expensive hardware unless it’s doing something useful (another great consulting phrase), so I guess that I’d prefer to make use of these servers to be DAG members and give them databases to support. Remember, it’s a basic DAG principle that more member servers are better than fewer, and the more that you have, the more work can be distributed, copies created, and disasters rebuffed.
I realised I’d not left the link into the technet doc where I first found the description of this option. T’would have been clearer than my description!
It’s certainly not my new concept, I’m just trying to understand when one would want to use such a design! As far as I can see, even if already using single role servers, adding the mailbox role to hub transport servers would require the OS to be enterprise for clustering services. (And perhaps Exchange enterprise if supporting more than 5 DBs)
Yes it increases the tolerance for the number of servers that can fail before losing quorum but doesn’t seem to provide any further benefit. In that case, as you say, why deploy high cost equipment for what appears to be equivalent to an FSW node?
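The quorum arithmetic behind that comparison is easy to check. The Python sketch below assumes the usual majority rule: every DAG member has one vote, a witness adds one vote only when the member count is even, and the cluster stays up while more than half of the voters are reachable. The function names are made up for illustration.

```python
def voter_count(dag_members: int) -> int:
    """Votes in the cluster: members, plus a witness when the count is even."""
    if dag_members < 1:
        raise ValueError("a DAG needs at least one member")
    return dag_members if dag_members % 2 == 1 else dag_members + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a strict majority still remains."""
    return (voters - 1) // 2


if __name__ == "__main__":
    for members in (2, 3, 4, 5):
        voters = voter_count(members)
        print(f"{members} members -> {voters} voters, "
              f"survives {tolerated_failures(voters)} failure(s)")
```

Under these assumptions, four members plus a witness and five members without one both give five voters and survive two failures, which is why a database-less fifth member looks equivalent to the witness from a quorum point of view.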
I was really hoping to find some other discussions on this topic but if you’re not aware of this option either, I’m losing hope!
All the best,