Be careful with VM snapshots of Exchange 2010 servers


Those who are considering virtualizing production Exchange 2010 servers should read the fine print contained in the TechNet article “Exchange 2010 System Requirements”. In particular, this text is crucial:

“Some hypervisors include features for taking snapshots of virtual machines. Virtual machine snapshots capture the state of a virtual machine while it’s running. This feature enables you to take multiple snapshots of a virtual machine and then revert the virtual machine to any of the previous states by applying a snapshot to the virtual machine. However, virtual machine snapshots aren’t application aware, and using them can have unintended and unexpected consequences for a server application that maintains state data, such as Exchange. As a result, making virtual machine snapshots of an Exchange guest virtual machine isn’t supported.”

Eeek! In a nutshell, this means that all support bets are off if you take a snapshot of a running Exchange 2010 server with VMware or Hyper-V and then attempt to revert to the state of the server contained in the snapshot. Don’t expect sympathy from Microsoft support if you ring up to report that things don’t work so well after you’ve used a snapshot to go back to a known system configuration.

In practice, snapshots are fantastic in a lab environment as they allow you to deploy Exchange servers quickly and to go back to a known state if the need arises (the assumption being that a lab sees more of the errors that can leave a server unusable). In production, snapshots can work pretty well for Exchange 2010 servers that are largely stateless. If you have dedicated CAS or Hub Transport servers, you’ll probably not run into many difficulties if you need to revert to a snapshot of a previous configuration. You might screw up the transport dumpster a tad, but you won’t notice this unless you run into a more serious problem and require Exchange to replay some messages that should be in the dumpster. If the messages aren’t there, you might lose them unless they can be found in another dumpster.

Things are far more problematic with mailbox servers, especially those that operate within a Database Availability Group (DAG). These servers are super-stateful and may be communicating in all manner of mysterious ways, including block-mode replication. Because this is the case, it’s extremely likely that reverting to a previous snapshot of a running and loaded mailbox server will be a sorrowful event. You might run into problems such as the database copies on the server being unrecognized within the DAG, being forced to reseed database copies, or even having the server fail to rejoin the domain or cluster for one reason or another (expired computer password, etc.). All in all, it’s a messy place to be.

Because of the potential for problems, it’s best to avoid taking snapshots of running Exchange 2010 mailbox servers. For sure, you can take snapshots of inactive servers (for example, shut the computer down after installing a new service pack and then take a snapshot), but even so, don’t assume that these snapshots can be used to bring a reconstituted server back into production without encountering some glitches along the way.

Problems after reverting to a snapshot are not the only thing to be aware of with Exchange 2010 mailbox servers. You shouldn’t use features like vMotion to move DAG members to other hosts as this can also cause the DAG to have a severe headache. Microsoft’s perspective appears to be that customers should use the high availability features built into Exchange 2010 and not attempt to change the underlying platform when the DAG will not be aware of the change. This post provides a good overview of the issues involved with vMotion.

My preference is to use physical computers for mailbox servers. I’ll cheerfully virtualize the rest, including such esoteric components as load balancers, but given the choice, I’ll always go with the comfort factor that a well-specified mailbox server delivers. This is largely a matter of personal choice allied to a suspicion that problems are easier to sort out when things go wrong on a physical box.

Everyone is rightly interested in virtualization because of its potential to increase the utilization of hardware. But the fine print has a nasty habit of catching people who let their enthusiasm run ahead of the capabilities of technology. All the more reason to conduct realistic operational tests of any new server product before bringing it into production so that you know how to deal with different kinds of server outages on both physical and virtual platforms.

– Tony

For more information about Exchange 2010 and the many cool features included in this release, see Microsoft Exchange Server 2010 Inside Out, also available at Amazon.co.uk. The book is also available in a Kindle edition.


About Tony Redmond ("Thoughts of an Idle Mind")

Exchange MVP, author, and rugby referee
This entry was posted in Exchange 2010, Technology.

13 Responses to Be careful with VM snapshots of Exchange 2010 servers

  1. Pingback: mailmaster blog » Exchange and VM Snapshots

  2. Jay says:

    Thank you for such an informative post! A great heads up and warning when planning a virtualised infrastructure.

  3. I agree with you! Exchange admins are being misled. I don’t even think that snapshotting commits the transactions to the Exchange database, the way an actual Exchange-aware backup program does, or does it? One guy I just spoke with at a VMware user group wanted to virtualize his Exchange servers that have 3000 users and put them on an IBM blade server. The blade server has only one 8Gb Fibre Channel host bus adapter for all the blades, and 4 to 5 VMs per blade server. Does he expect his end users to be happy with the I/O performance? A small Exchange environment under 200-300 users may be okay with this setup, but 3000 plus, and nearly a terabyte of mail stores?

  4. Mark says:

    Jose, I am running 2000 Exchange 2010 users on my ESXi 4.1 cluster, with 1.2 TB of mailbox disk space, and I currently have 2 8Gb Fibre Channel adapters in each of the 4 hosts.
    Also, on those 4 VMware host servers, there are a total of 188 virtual machines with a total of 12 TB of disk.

    Granted, I have an entire LUN assigned just for Exchange Store and CAS servers, but there is no issue with performance at all.

    I see no performance issues at all.

    -Mark

  5. Isn’t this where quiesced snapshots come into play?

  6. Neil says:

    Hi,

    I wish I saw this post earlier….

    I have 2 CAS and 2 DAG servers on Exchange 2010 SP2.

    I took a snapshot of 1 CAS server and installed Exchange SP3 on it. The install went fine.

    Then I took a snapshot of one DAG server and started to install. It was late at night and it failed. So I canceled the install.

    Then I reverted my CAS server back to the snapshot because it has Exchange SP2 on it.

    Everything, email etc., is working fine now, but I did notice in the EMC that my reverted CAS server says it is SP3! But when I look in installed programs I notice it still has the Exchange SP2 rollups installed. Therefore I believe the EMC is reporting wrong.

    Any suggestions on how to resolve my issue and upgrade to Exchange SP3?

    Thanks!!

    • I think that I might consider applying the SP3 update to the servers again to move them back to a known situation. Right now, they sound a tad mixed up.

      TR

      • Neil says:

        OK, thanks. So I would keep the current state of my DR CAS server (that is a reverted snapshot) and try re-installing SP3 on it? Also, should I install SP3 on my 2 CAS servers then 2 DAG servers?

      • I’d reinstall SP3 on everything in order to bring them back to a known and supported configuration. It’s not good when a CAS server reports SP3 when SP2 is on it. Mind you, I have not been in this situation before (you are breaking new ground here – at least for me), so some caution might be required. I think EMC retrieves information from the registry to report version numbers so I can’t work out why the rollback of the snapshot didn’t revert to old values. What does EMS report for AdminDisplayVersion when you run Get-ExchangeServer?

        Have you logged a call with Microsoft to ask their advice? They might be able to determine what has happened. Just be careful…

        TR

      • Neil says:

        It reports that CAS server as 14.3 and the others as 14.2. I will try to drainstop the one CAS server and try to install Exchange SP3 on it….

        If that does not work I have another VM ready to try an Exchange recovery on. I was told this might work. Run the Exchange SP3 install on the new VM (only the OS on the VM).

        I extracted Exchange2010-SP3-x64, so I would run it as “setup.exe /m:recoverserver”?

        If that all fails I will be calling Microsoft for support…. Also, should I install SP3 on my 2 CAS servers then 2 DAG servers?

      • The answer (honest) is “I don’t know”. I am loath to give advice via this kind of interaction because I simply don’t know enough about your site, configuration, environment etc. Also, I am not a support professional and don’t have the kind of toolset that Microsoft Support uses to solve problems. End of health warning.

        All that being said, the CAS servers should be easier to update. I would go ahead and update them to get them to SP3. Once they are done, you can proceed to the DAG servers. Seeing that they are at SP2 and were not half-upgraded, the upgrade process should proceed smoothly.

        TR

  7. Neil says:

    Thanks for the info, Tony, greatly appreciated!! I will give it a go next Friday…

  8. mahesh says:

    You are a life saver. Thanks.
