Office 365 support case resolved – thankfully!


On April 29, I reported the poor support experience I had received as a result of the upgrade of my Office 365 tenant domain from the Wave 14 release to Wave 15. Essentially, a support call reported on April 8 had produced zero progress, despite many messages to and fro between myself and Microsoft’s Office 365 support team. All in all, it was a tiring and frustrating time.

Four hours after posting the report, I was contacted by a UK-based Microsoft escalation engineer. Coincidences do happen, but in this case I think that the public protest had the desired effect on Microsoft’s bunged-up support processes. In fact, it’s depressing that publishing a blog post produced an escalation because it points to a problem in the support process. Normal customers who don’t blog won’t get the same response. It is probable that my visibility within the Exchange community as someone who writes extensively on the topic also assisted in the escalation process.

The good news is that at 22:30 on April 30 Outlook informed me that it had to restart because of a change made by the administrator (in fact, Outlook forced me to restart 3 times, for a reason that I haven’t quite figured out). I logged into my tenant and discovered that OWA used the Wave 15 interface and that all the administrative functions worked as expected. ActiveSync and EWS clients connected flawlessly to the upgraded service. The problem was solved 22 days after being first reported.

Joy! Something might have happened...

What have I learned from the experience? Here are some thoughts:

Microsoft front-line staff are just a filter. No surprise here because all major support organizations use front-line staff to filter incoming calls, solve the most obvious problems (and some that are not), and pass a certain percentage to second-level support via an escalation process. What surprised me about this case was how long Microsoft allowed the call to remain at the first level despite frequent communication back and forth with me. I asked repeatedly for updates but nothing happened. Clearly the internal escalation process did not function properly.

Microsoft escalation engineers know their stuff (at least, the person I dealt with did). Once the case was escalated things happened more quickly (as you’d expect). The focus was sharper, the questions more pertinent, and action occurred. Tools such as those described in KB2598970 collected information from my workstation to help detect the source of the problem. Communications were restrained and content rich. All in all, a much better experience.

Expect a delay if something has to change in the datacenter. Second-level support can only go so far with massive cloud systems. Their role seems to be to investigate problems, collect information, and then figure out what needs to be done. In this case a change needed to be made to my tenant domain. Unlike what might happen in an on-premises situation, senior support staff cannot take action on user accounts (or their equivalents) because Office 365 is, by necessity, an extremely locked-down environment where only specific people can interact with user data under controlled conditions. The upshot is that some delay is built into the system to have information fed back to the datacenter team and for them to respond. I like this because it shows that Microsoft is serious about protecting customer data – no shortcuts that might compromise data are taken to solve problems.

The service keeps on running even when back-end migration problems happen. I reported the problem on April 8 and it was resolved on April 30. Sounds bad. But all clients continued to function properly and access Exchange, Lync, and SharePoint during this period. An end user would not have known that anything was wrong. I think that this must be the situation with many Office 365 issues because if something really does go wrong then huge numbers of people are affected. In this case, a partial migration had resulted in a Wave 15 administration front-end attempting to talk to Wave 14 servers at the back-end. The different protocols involved caused the error. As it turns out, I’m told that the problem originated when my tenant subscription was changed last year and that this has uncovered a problem that Microsoft will now fix.

Document everything. This advice is often given to people who experience the joys of reporting a problem to support. You have to know and record your facts because you will be asked about them. Facts help identify where the problem might lie and how it might be solved. Write everything down, including the details of the interactions with the support team (time, date, and duration) as you might need to use this data to force an escalation.

The bottom line is that my Office 365 tenant domain is now back to full health. I am genuinely surprised that it took so long for Microsoft to solve the problem but am glad that things eventually worked out. It’s just a pity that it took so long to resolve and that escalation only happened after the incident was exposed to the full glare of publicity.

I doubt that many other tenant domains will be in the same situation. Office 365 has not really been around long enough for many companies to switch subscription types and Microsoft is now aware of the issue and will fix it. But I sure hope that the folks who run Office 365 support take action to improve their escalation processes so that other customers do not experience the same kind of extended case resolution as occurred here.

Follow Tony @12Knocksinna

Update 2 May: I was called this morning by a Microsoft customer support manager to discuss the problem and how Microsoft worked as the issue unfolded. I thought that the discussion was very open and helpful, which is always a good thing.

Posted in Cloud, Email

Upgrading Office 365 to Wave 15: My support experience to date


One of the great promises held out by cloud-based services is that you do not have to worry about software upgrades and other common maintenance operations as the service operators will take care of these mundane operations. In the eyes of the marketing staff, all you have to do is use the service and take advantage of new features as they are “lit up” through software upgrades.

Earlier this year, Microsoft flagged that they were preparing to roll out the Wave 15 products to Office 365 tenants, who do not get to vote about when the upgrade happens. Instead, some process running inside Microsoft decides what date a tenant should be upgraded. So be it. You give up a lot of control when you sign up for a cloud service and trust that those running the service will do the right thing when upgrade time rolls around. In my case, the chronology of the service upgrade for my tenant domain was as follows:

5 March: received the initial notification proclaiming “We’re upgrading your Office 365 service in 2013”.

Well, I knew that the upgrade was coming, but it was nice to know that soon I would be using the Wave 15 products, especially Exchange 2013.

19 March: another note arrived saying “New features are coming to your Office 365 service soon”.

Tension building now. I could not wait to use Exchange 2013 within Office 365. And then:

Tremendous! The Wave 15 upgrade is complete - or is it?

29 March: yet another note to say that “Your Office 365 service upgrade is done – sign in and explore”.

Hmm… Not much had changed when I connected to Office 365. At least, not until approximately 18:00 on March 30 when I noticed that the Office 365 admin portal boasted the new Wave 15 branding, even if some errors were reported. However, OWA stubbornly displayed the familiar Wave 14 interface.

Errors reported by the Wave 15 Admin interface

The next change happened on April 2 when the Office 365 login page was updated to a much more colorful edition. Surely this was a sign that the elusive upgrade had completed? But no, OWA was still connected to Exchange 2010.

5 April: the Office 365 team invited me to “Tell us about your experience with the Office 365 service upgrade”.

Of course I took the opportunity to provide feedback but could not find the appropriate space to tell Microsoft that whereas they might consider my service to be upgraded, I did not. Several other errors interrupted my attempt to provide feedback so I let the chance lapse. Life is too short to waste time on badly functioning feedback loops.

By now I was worried. Casting aside the mental excuse I had constructed that the migration process must be very complex and simply needed some time to complete fully, I noticed that errors were reported when I attempted to manage either Exchange or SharePoint. OWA, on the other hand, continued to work beautifully, as did Outlook and ActiveSync. I could therefore have simply ignored the issues but decided that now was the time to engage with Office 365 support.

My support call was logged on April 8. One of the aspects of being a very small consumer of a very large service is that you simply have to wait your turn to receive service. It’s not as if a Microsoft support agent is ready and willing to leap into action immediately, especially when you only have a Plan P subscription. Life is different for larger enterprises that pay considerably more for Office 365, but I suspect only marginally. They might have a local Technical Account Manager (TAM) to shout at when things go wrong, but once support calls enter the black box of Office 365 it’s hard to find out what the real situation is with any issue.

I received a call back on April 9. The agent was perfectly pleasant but possibly used to dealing with people who might not have quite as much experience with Exchange as I have. But then again, first line support staff tend to have to follow a scripted engagement with callers to ensure that all bases are covered. I, on the other hand, knew that the Admin side of Office 365 exhibited all the signs of Wave 15 branding whereas OWA stubbornly remained connected to an Exchange 2010 mailbox server. After 30 minutes or so and after running some PowerShell commands, the fact that the upgrade wasn’t complete was determined to the satisfaction of all concerned. Or at least, enough evidence existed to allow an escalation to the “server team”, who possess a more elevated position in the Office 365 support hierarchy.

For the record, this command proved that the update had not completed:

Get-OrganizationConfig | Format-List Name, AdminDisplayVersion, IsUpgradingOrganization

Name                    : xxxxx.onmicrosoft.com
AdminDisplayVersion     : 0.10 (14.16.190.8)
IsUpgradingOrganization : False

As you can see, the AdminDisplayVersion still reports version 14, whereas an upgraded tenant that runs the Wave 15 products would report something like 15.0.586.12 to indicate version 15. Interestingly, IsUpgradingOrganization is False, which normally means that an Office 365 version upgrade is complete.
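If you want to script the same check, something along these lines might do. It is only a sketch: both function names are hypothetical helpers of my own invention (they are not Exchange cmdlets), and the assumption that "major version below 15 and no upgrade in progress" indicates a stalled upgrade is drawn purely from what I saw in my own tenant.

```powershell
# Hypothetical helper: extract the major version from an AdminDisplayVersion
# string such as "0.10 (14.16.190.8)" or "15.0.586.12".
function Get-MajorVersionFromAdminDisplayVersion {
    param(
        [Parameter(Mandatory = $true)]
        [string]$AdminDisplayVersion
    )

    # Find the first four-part build number (major.minor.build.revision)
    if ($AdminDisplayVersion -match '(\d+)\.\d+\.\d+\.\d+') {
        return [int]$Matches[1]
    }

    throw "No four-part build number found in '$AdminDisplayVersion'"
}

# Hypothetical helper: flag the state described above, where the tenant still
# reports a pre-15 build but no upgrade is marked as being in progress.
function Test-UpgradeLooksIncomplete {
    param(
        [Parameter(Mandatory = $true)]
        [string]$AdminDisplayVersion,

        [Parameter(Mandatory = $true)]
        [bool]$IsUpgradingOrganization
    )

    $major = Get-MajorVersionFromAdminDisplayVersion -AdminDisplayVersion $AdminDisplayVersion
    return ($major -lt 15) -and (-not $IsUpgradingOrganization)
}
```

In a session connected to the tenant, you would feed in the two values returned by Get-OrganizationConfig, converting AdminDisplayVersion to a string first.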

To be fair to the Office 365 support agent, I explained that I was an Exchange server MVP and that I also wrote about Exchange on a reasonably frequent basis. This happened after she sent me a set of EHLO blog posts to explain the wonders of Exchange 2013, a topic close to my heart.

The news that something had broken in the tenant domain used by someone who might write about the experience must have filtered upward in the support organization as I was then contacted by the support agent’s manager later on April 9. I was asked whether I was happy with the progress of the support case, to which I replied that not much progress had been seen and I was awaiting developments.

I stayed in that mode until April 18, receiving intermittent messages from my friends in Microsoft Support to say that not much was happening. After ten days of waiting for a resolution, it seemed fair to look for an escalation, so I emailed the support manager to ask for her help to move things along. No response was received, so I emailed again on April 22, a full two weeks after the support case was logged. This elicited a response and I was told that the server team had decided to escalate the case after a week of contemplating the situation.

Nothing much happened over the following two days, so I emailed the original support manager on April 24 to point out that the escalation seemed to be stuck and that the call had now been open for sixteen days and was not helping to improve their call close statistics or customer satisfaction rate.

No one called me over the next two days, so I sent another message on April 26 to ask what was happening. Although my Office 365 tenant domain remained fully functional from an end-user perspective, the loss of some admin functionality had begun to be a real concern. Microsoft responded to say that their server team was currently swamped with problems (ahem…) but that they would try to get the case moved forward. As I write, three days later, no one has been in contact to communicate the current status of the case.

As an ex-CTO for both Compaq Services and HP Services, I have some awareness of what happens in a support organization. Indeed, I even served some time on a European support team for Digital in the mid-1980s. It seems to me that this case has been poorly managed and that Microsoft should ask:

  • Is their escalation process efficient? Why did their systems fail to escalate the incident automatically to the next level of support after a certain period? Surely their problem tracking systems identify cases that are still open and active after five, ten, or fifteen days?
  • Is their communication with customers effective? Based on my experience, I do not think so. Good communication keeps people in touch and conveys information about progress.
  • Are their support personnel sufficiently well-trained and are their front-line managers aware of the details of the technology that they support? Again, I find fault here. Everyone I have spoken to has been easy to deal with without ever leaving me with a feeling that they understood the issue and knew what needed to be done to resolve the problem. Cloud systems can only function when they are standardized to a very detailed level. I imagine that this makes support easier than for on-premises systems, where implementations are left to the imagination and competence of the local administrators.

I really wanted this migration to work because I have an interest in using some of the Wave 15 functionality. It’s sad that the experience has been so bad. If anyone in Microsoft Support would like to investigate and find out just what happened in this migration epic and perhaps even move the problem forward toward completion, the case number is SRX1202350062ID. After three weeks it would be nice to see a resolution.

Posted in Office 365

Exchange 2013 Inside Out appears on Amazon


The wheels of the publishing market turn in mysterious ways. At least, their ways are mysterious to those who don’t publish books, including the authors who actually write the text. Earlier this month I let you know that O’Reilly will soon release draft chapters of my book Microsoft Exchange Server 2013 Inside Out: Mailbox and High Availability. These are chapters that have been through a copy edit and technical review, but are still not quite finished for various reasons. For example, some questions to the development group might not have been answered or we are simply waiting for the next cumulative update to appear to see what it might bring.

O’Reilly will also make some draft chapters available for the companion book, Microsoft Exchange Server 2013 Inside Out: Connectivity, Clients, and UM, by Paul Robichaux. Paul has been teaching the Unified Messaging component of the Exchange Ranger training class for many years so it should come as no surprise that this book contains the best discussion about the topic that I have ever seen. For the record, to keep everyone honest, Paul is doing the technical edit for my book and I am doing the same for his. Apart from anything else, this arrangement means that we both see all the content and can make sure that there’s no overlap or duplication across the two books.

Getting back to the mysterious ways, it seems strange that Amazon has now published the availability of both books in their online store some six months before the final pages are printed. That being said, please do check out Microsoft Exchange Server 2013 Inside Out: Mailbox and High Availability and Microsoft Exchange Server 2013 Inside Out: Connectivity, Clients, and UM. Paul has already commented that he can’t understand why Amazon has priced his book at $30.85 while mine costs $34.07. Apart from the obvious retort that mine is much more interesting than his (well, about $3.22 more interesting), Paul’s book is 600 pages while mine is 800. You definitely get more pages for your dollar with me while each of Paul’s pages contains premium content. Or something like that.

Given that we essentially have books ready, why wait until the October 22, 2013 date promised by Amazon? Well, we could rush the books out and have them available in the near future, but that removes the chance of learning just how Exchange 2013 actually functions in real deployments. Every day we learn more about the quirks of the product and these are the important facts that become the “inside out” referred to in the title.

In addition, Microsoft has a habit of updating the way that Exchange works as a version matures. We have already seen updates (such as the reintroduced ability for groups to manage groups) in Exchange 2013 CU1 and more are likely as CU2 and CU3 appear over the coming months. We want to track and report on these changes, insofar as is possible.

Hopefully you will enjoy the books. At least, that’s the plan.

Posted in Exchange 2013, Writing

Citer – good service for a change from a car hire company


Car hire companies, especially those who operate at airports, are often criticized for poor service and some pretty tacky habits, such as their desire to charge renters for a full tank of fuel when the rental commences even when the renter knows that they will only drive a few hundred kilometers and will not empty the tank. Or indeed the ever-popular allegation that the fuel tank was not full when the car was returned, necessitating an instant retrospective charge for a “fuel service”.

It’s important to recognize good service when you find it and this year we have rented cars three times from Citer/Enterprise at Nice Airport (NCE) and have received excellent service each time. The staff working at Citer have told us that their work practices have changed since Citer became part of the Enterprise Rent-A-Car organization in November 2011, so this might just be an example where an acquisition results in better customer service.

All is not perfect with the Citer experience as we have often experienced queues at their desks in the Nice Airport rental facility. Part of the reason for the delay in dealing with clients seems to be interminable phone calls made by the agents to all and sundry, seemingly a necessary part of the process to secure cars. But part of the delay is also due to the way that Citer staff walk customers to their car to check it out, explain how the car functions if necessary, and make sure that all existing damage is correctly noted. I’m sure that the Citer staff cover a lot of ground between rental desk and garage over the course of a day, but the impression made on customers is much better than the norm delivered by Citer’s competitors such as Avis and Hertz.

Another nice thing about Citer is that they offer many deals that include a second driver. Most of the other car hire firms that operate in Nice delight in

Citer’s current car fleet in Nice features a lot of Hyundais. We have driven a very nice i30 (it was brand new) but on the other hand, the last time out we had a tatty Lancia Delta that had suffered many dents and scratches in its 28,000 km rental career. I owned a Lancia Delta in 1982-83 and had fond memories of that car but its modern counterpart did not make the same impact and I doubt that we will choose a Delta again.

If you need to rent a car at Nice Airport in the near future, consider trying Citer. I usually use AutoEurope to find the best deals and scan down the (often) hundreds of packages to find the best one that’s available. Not that anything especially compelling will be available soon as the Côte d’Azur holiday season swings into full blast and car rental rates escalate.

Posted in Travel

Exchange 2013 Inside Out: Mailbox and High Availability makes an appearance


The nice people at O’Reilly Media have posted details of book one of the two-part Exchange 2013 Inside Out set. My book covers the mailbox server and high availability while Paul Robichaux is deep in the process of writing all about the client access server, clients, and other wonderful topics including unified messaging. I’m sure that O’Reilly will get to putting up some details about his book soon.

We’ll soon be making preview chapters available. These are chapters that are in the midst of the editing process. As such, they might contain errors. In fact, I’ll guarantee that they do because, despite several reviews, eradicating errors from text is an ongoing process when the software changes, as Exchange 2013 did recently when CU1 was released. Paul and I are attempting to keep the text updated as Microsoft upgrades Exchange but, as you can imagine, this is not particularly easy.

In any case, it’s nice to see the book entering the final stages. Much of the writing is now done. All that has to happen is updates, fact checking, revisions, indexing, more copy editing, fights with the editors about page counts: the normal kind of thing that happens to bring out a book.

Posted in Exchange 2013

Exchange 2013 CU1 appears


Exchange 2013 CU1 has now appeared. My review of the new release is available on WindowsITPro.com, where I have also posted some notes on the approach needed to update mailbox servers that are members of a Database Availability Group (DAG).

Happy Reading!

Tony

Posted in Exchange 2013

Exchange 2013: Stuck messages in OWA’s Drafts folder and DNS


One of the common things that OWA users notice about Exchange 2013 is that outgoing messages sometimes appear to get “stuck” in the Drafts folder. Not only do messages seem to linger in Drafts, no trace of the outbound messages ever shows up in the Outbox. Or so it seems, but really it’s an urban myth.

Exchange 2013 boasts a new architecture. The hub transport server role is no more and its processing has been subsumed into the mailbox server role. In turn, as the TechNet description of the Exchange 2013 mail flow makes clear, the Mailbox Transport service and the Transport service work together to process messages sent by clients.

Exchange 2013 Mail Flow (source: TechNet)

But how does the Drafts folder come into the picture? Well, OWA clients automatically capture copies of messages as they are being composed and store them in the Drafts folder. When the user issues a send command, the Mailbox submit agent (running within the Store driver) takes over and processes the outbound message by giving it to either the Transport service running on the same mailbox server or to the Transport service running on another mailbox server. The connection is made via SMTP.

Messages stay in the Drafts folder until they are successfully sent by being processed by the transport service. At this point, items are moved into the Sent Items folder. OWA 2013 behaves in the same way as OWA 2010 – nothing has changed in the way that messages are held in the Drafts folder until dispatch. What might account for user descriptions of items being “stuck” is when a problem occurs somewhere in the transport pipeline that prevents outbound messages being processed.

For instance, items will remain in the Drafts folder if the Store cannot pass them to the transport system. If the transport service is not running on any available server or the mailbox transport service is not running on the mailbox server that hosts the active database for the user’s mailbox, items will stay in the Drafts folder until the services come online and Exchange is able to process outbound items.
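A quick way to rule out stopped services might look like the sketch below. Two assumptions to flag: the three Windows service names are my understanding of what the Transport, Mailbox Transport Submission, and Mailbox Transport Delivery services are called on an Exchange 2013 mailbox server (they are not taken from the TechNet description), and Get-ServiceNotRunning is a hypothetical helper rather than a built-in cmdlet.

```powershell
# Assumed Windows service names for the Exchange 2013 transport components
$transportServiceNames = 'MSExchangeTransport', 'MSExchangeSubmission', 'MSExchangeDelivery'

# Hypothetical helper: pass through only the services that are not running.
# Works on anything with a Status property, such as Get-Service output.
function Get-ServiceNotRunning {
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $Service
    )

    process {
        if ($Service.Status -ne 'Running') {
            $Service
        }
    }
}
```

On a mailbox server you could then run Get-Service -Name $transportServiceNames | Get-ServiceNotRunning. No output means that all three services report a Running status, in which case the cause of the "stuck" items lies elsewhere.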

Now, the normal state of events is that all of the Exchange services are running along quite happily on the server. Certainly, if a service fails or is not running for some reason, it’s likely that the administrator will notice that this is the case and fix the problem. What else would stop transport being able to process outbound messages and force the Store to keep them in the Drafts folder?

Checking DNS Lookup properties for a server with EAC

Incorrect DNS binding to server NICs is one of the likely culprits. Unless the Exchange 2013 servers know how to route messages, the items stay where they are. Like any email server, Exchange makes heavy use of DNS, so it’s logical that if DNS is not configured properly, then messages are not going to be transported to either internal or external destinations. If users report “stuck” messages, you might just want to take a look at server properties with EAC to make sure that DNS lookups point to the right place (a server that can resolve the lookups). You can also check with EMS by running the Get-TransportService cmdlet to retrieve the ExternalDNSServers and InternalDNSServers properties.
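For the EMS route, a small sketch like the following could make the output easier to scan across several servers. Get-DnsLookupSummary is a hypothetical helper, not an Exchange cmdlet; it relies only on the Name, InternalDNSServers, and ExternalDNSServers properties of whatever objects are piped into it. An empty value simply means that no explicit DNS servers are listed on the transport service. The sketch makes no attempt to work out what Exchange falls back to in that case.

```powershell
# Hypothetical helper: flatten the DNS lookup settings of each transport
# service object into one line per server for easy comparison.
function Get-DnsLookupSummary {
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $TransportService
    )

    process {
        [pscustomobject]@{
            Server   = $TransportService.Name
            # @() guards against null or single-valued properties
            Internal = @($TransportService.InternalDNSServers) -join ', '
            External = @($TransportService.ExternalDNSServers) -join ', '
        }
    }
}
```

In EMS, the usage would be Get-TransportService | Get-DnsLookupSummary, after which you can compare the listed addresses against the DNS servers that you know can resolve lookups.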

If the server properties reveal that DNS lookups are going walkabout, you’ve just found the problem. On the other hand, if the DNS configuration is correct, you might have to talk to Microsoft Support to see why transport isn’t working as expected. I hear rumblings that Microsoft has improved the way that Exchange interacts with DNS in Exchange 2013 CU1 but we shall have to wait for that release to verify if this is correct.

Outlook processes outbound messages differently because it does use the Outbox folder. Remember that Outlook can be a client for many different versions of Exchange and other email servers. Where OWA keeps messages in the Drafts folder, Outlook continues to do what it has done for years and moves outbound items through the Outbox en route to Sent Items. When messages are stuck in the Outbox, it’s probably due to another factor such as messages being too large for the server to accept. In effect, items in Outlook’s Outbox folder have the same status as items in OWA’s Drafts folder – both are candidates to become outbound items that will be processed by the transport service.

Messages don’t get stuck in the Drafts folder without good reason. It’s not as if Exchange wants to keep messages there. After all, it is an email server… and email servers that don’t send messages would not be much good!

Posted in Exchange 2013

Microsoft CVP on the current state of Windows Phone and its competitors


Terry Myerson, who previously ran the Exchange development group, is now the Microsoft Corporate Vice-President (CVP) for Windows Phone. He has always had strong opinions and it came as no surprise that Terry would voice some trenchant views when he was interviewed at the recent Mobile World Congress (the video stuttered badly when I viewed it, but the program is only five minutes long so stick with it).

Among my favorite comments were:

“We’re ahead of iPhone in 7 markets and ahead of Blackberry in 26”. No detail was offered as to where these markets are exactly – the assumption is that these are individual countries. If so, it would be interesting to know where Windows Phone 8 is beating out iPhone. [Update March 28: According to ZDNet, IDC reported that Windows Phone pipped iOS in Argentina, India, Poland, Russia, South Africa and the Ukraine. The seventh market was a collection of countries, including Croatia, that IDC labels “rest of central and eastern Europe”.]

The interviewer put the Windows Phone market share at between 3 and 4% and Terry didn’t disagree. Terry also said that Microsoft had seen tremendous progress over the last nine months and that one billion app downloads had been made from Microsoft’s Store (not much compared to Apple, but a start). The interviewer suggested that BlackBerry was Microsoft’s closest rival, but Terry disagreed, saying that their “sights were higher”.

“Android is a confusing mess” and “iPhone is boring now”. Apparently live tiles make all the difference. I think he is right that the iPhone user interface has started to show its age; he’s also right that the diversity of Android across devices, manufacturers, and software versions can be confusing at times. However, consumers just care whether their phone works and supports the apps that they want to use, which is the huge strength of the iPhone in particular.

The major strength of the WP8 platform was cited as the ability to access the same content across multiple devices, perhaps a reference to SkyDrive. However, the Apple contingent can point to the way that iPad and iPhone share apps, music, and video with Macs using iTunes – data formats that are probably most interesting to consumers whereas the thought of being able to access an Excel worksheet or PowerPoint presentation on SkyDrive is more valuable to the business folks. A more interesting comment came in the reference to the camera and photographic capabilities delivered in Nokia devices, specifically the Lumia 920. Terry also said that WP8 did a better job for lower end phones than low-quality Android devices.

I haven’t looked back since I moved from iPhone to WP and am still happy with the Nokia Lumia 800 (now upgraded to WP 7.8). Nothing in iPhone 5 makes me want to move back and I still can’t get my head around using an Android (which one?). I guess I can stay on the sidelines for a little longer before deciding how to upgrade.

Posted in Technology

Exchange 2010 SP2 RU6 “can’t delete messages” KB released


Microsoft has released KB2822208 in response to the problems that users reported where they were unable to delete messages when working with Outlook in online mode.

After a great deal of hard work to investigate the possible root causes, Microsoft has determined that the issues lie with attachments generated by third-party add-on products for Exchange. The KB mentions two types:

  1. Unable to soft delete messages that contain voice mail attachments.
  2. Unable to soft delete messages sent from FAX server, printer or scanner which have attachments (such as .PDF).

The problem first emerged when users reported that they couldn’t deal with messages containing PDF attachments generated by multi-function printers, so Microsoft’s KB is in line with real-life experience. The problems with voicemail surfaced soon afterwards and seem to be associated with Cisco Unity rather than Exchange’s own Unified Messaging.

No software fix is available yet, so all you can do is either use Outlook in cached Exchange mode (the delete actions are processed locally and then synchronized back to the server) or perform a “hard delete” by pressing the Shift/Delete combination. This action bypasses the normal processing Outlook does when deleting messages (a soft delete which moves items into the Deleted Items folder) and uses a different code path to avoid the problem. Messages that are hard deleted go into the Recoverable Items folder instead of Deleted Items.

I’m sure that the Exchange team is working hard to figure out a better fix that can be implemented in software. For now all you can do is use the workarounds.

Posted in Email, Exchange, Exchange 2010, Outlook

Exchange 2013: Using crimson events to track transaction log truncation within a DAG


Prior to Exchange 2007, transaction log truncation (removal) used to be a pretty simple affair. A successful backup would truncate the log set or, if you used circular logging, Exchange used a small set of logs to capture transactions and would reuse the files once the transactions in the logs were committed into the database.

The advent of Cluster Continuous Replication (CCR) in Exchange 2007 (plus its variants, LCR and SCR) and more importantly, the Database Availability Group (DAG) in Exchange 2010 required the way that transaction logs are truncated to change. Exchange 2010 breaks the connection between databases and servers and multiple copies can exist within a DAG. Replication of transaction data between servers complicates matters a tad because of a rule that no data should ever be removed if it might be needed. Transaction logs exist to update databases, so it follows that Exchange never truncates logs if the chance exists that the logs might be required to update a database copy.

Some like to operate databases without circular logging. This is fine, as long as you have sufficient disk space to hold the logs. The old best practice of not using circular logging with production databases has evolved: it is now quite safe to use circular logging within a DAG, provided that sufficient copies exist to give a reasonable guarantee of redundancy. Three copies is good; four is even better. And indeed, once database copies are in use, a different form of circular logging called continuous replication circular logging (CRCL) is used.

TechNet explains the conditions that must exist before transaction logs are truncated by the Replication service:

  • The log file must have been successfully backed up, or circular logging must be enabled.
  • The log file must be below the checkpoint (the minimum log file required for recovery) for the database.
  • All other lagged copies must have inspected the log file.
  • All other copies (not lagged copies) must have replayed the log file.

The article also explains that the transaction logs for the active database copy are never truncated when one or more copies are suspended. Even more care is taken when a lagged database copy is maintained because the transaction logs are retained for the lagged period and cannot be removed until that period expires and the Replication service replays the logs to update the lagged database copy.
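The conditions listed above can be read as a single yes/no test applied to each log generation. What follows is only a sketch in Python to make the logic concrete. It is not how the Replication service is implemented, and every name in it (CopyState, can_truncate, the field names) is invented for illustration. The extra retention applied for the lag period of a lagged copy is deliberately not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CopyState:
    """Hypothetical view of one passive database copy (illustrative names only)."""
    name: str
    lagged: bool
    suspended: bool
    inspected_through: int  # highest log generation this copy has inspected
    replayed_through: int   # highest log generation this copy has replayed


def can_truncate(generation: int, *, backed_up_through: int,
                 circular_logging: bool, checkpoint_generation: int,
                 copies: List[CopyState]) -> bool:
    """Return True only if every documented condition holds for this generation."""
    # Logs are never truncated while any copy is suspended.
    if any(copy.suspended for copy in copies):
        return False
    # The log must have been backed up, unless circular logging is enabled.
    if not circular_logging and generation > backed_up_through:
        return False
    # The log must be below the checkpoint (minimum log required for recovery).
    if generation >= checkpoint_generation:
        return False
    for copy in copies:
        if copy.lagged:
            # Lagged copies only need to have inspected the log.
            if copy.inspected_through < generation:
                return False
        elif copy.replayed_through < generation:
            # Every other copy must have replayed the log.
            return False
    return True


# Made-up example data: one normal copy and one lagged copy.
example_copies = [
    CopyState("copy-a", lagged=False, suspended=False,
              inspected_through=3560, replayed_through=3556),
    CopyState("copy-b", lagged=True, suspended=False,
              inspected_through=3558, replayed_through=3000),
]
```

The point of the sketch is that a single failing condition anywhere in the DAG is enough to keep a log on disk, which is why logs can pile up when one copy is suspended or falls behind.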

All of this is quite clear, but when you look at the accumulation of transaction logs for a database, you might wonder on what basis the Replication service decides to truncate the logs. Because Exchange 2013 uses quite a big checkpoint depth (100MB) and each transaction log is 1MB, it’s usual to find a hundred or more transaction logs even when circular logging is enabled and the database is essentially quiescent. It’s far removed from the five or six transaction logs that a standalone database enabled for circular logging might use.

Truncation occurs when the “LogTruncator”, a component running inside the Replication service, examines the log set to assess what logs must still be retained. This happens on an ongoing basis and when a database is mounted. Some insight into the decision process can be gained by examining the “TruncationDebug” crimson channel in the Event Viewer. Exchange 2010 began to use crimson events to capture important information about high availability processing; Exchange 2013 captures a lot more information. In the screen shot, you can see that TruncationDebug is under the HighAvailability section. Three interesting events provide some insight into what the Replication service does when it examines logs to decide whether they can be truncated. In sequence, the events are:

  • 223: The Replication service gets information from servers that host database copies about what log files they have processed. In the screen shot, the information coming back from the EXSERVER2 server indicates that it wants generation 3557 to be preserved for its database copy. This is the minimum log file required for recovery.
  • 224: The Replication service decides what logs can be truncated.
  • 299: The Replication service truncates the log stream and removes whatever transaction logs are no longer necessary.

Crimson channel event reporting log truncation

In this instance, the database copy existed on only two servers (definitely not too redundant) and so the Replication service only had to take into account input from the two servers. One (as we’ve seen) advised that it needed generation 3557, so to be safe the Replication service truncated to generation 3555.
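To make the arithmetic of that decision concrete, here is a rough Python sketch. It assumes that the truncation point is the lowest generation any copy still needs, less a small safety margin. The margin of two generations is simply what the events in this example suggest, not a documented constant, and the function names, the second server name (EXSERVER1) and its value are invented for illustration.

```python
from typing import Dict, List


def truncation_point(required_by_copy: Dict[str, int], safety_margin: int = 0) -> int:
    """Lowest generation that must be kept, given what each copy reports it needs.

    safety_margin is a hypothetical knob used only to mirror the example,
    where a copy asked for 3557 and the log stream was truncated to 3555.
    """
    lowest_required = min(required_by_copy.values())
    return lowest_required - safety_margin


def generations_to_remove(existing: List[int], keep_from: int) -> List[int]:
    """Generations on disk that fall below the truncation point."""
    return [generation for generation in existing if generation < keep_from]


# EXSERVER2 and 3557 come from the event described above; EXSERVER1 is made up.
reports = {"EXSERVER2": 3557, "EXSERVER1": 3560}
keep_from = truncation_point(reports, safety_margin=2)
removable = generations_to_remove([3553, 3554, 3555, 3556, 3557], keep_from)
```

With these inputs keep_from is 3555 and the removable generations are 3553 and 3554, which matches the outcome reported in the events.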

Dumping a transaction log header to check its generation

On a practical level, any transaction logs belonging to generation 3554 or below are removed from the server, so when we look at the transaction log directory, we should see logs for generations 3555 and higher. Of course, Exchange uses a hex numbering scheme for transaction logs, so the log number for generation 3555 is DE3. The file name is the log prefix (E00 here) followed by the generation as eight hex digits, giving E0000000DE3.log. You can validate the generation number for a transaction log by running ESEUTIL with the /ML switch as shown above.
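If you want to check the decimal-to-hex conversion without reaching for a calculator, a couple of lines of Python will do it. This is just a convenience sketch; the function names are my own and have nothing to do with Exchange or ESEUTIL.

```python
def generation_to_hex(generation: int) -> str:
    """Convert a decimal log generation to the upper-case hex used in log names."""
    return format(generation, "X")


def hex_to_generation(hex_digits: str) -> int:
    """Convert the hex digits from a log file name back to a decimal generation."""
    return int(hex_digits, 16)


oldest_kept = generation_to_hex(3555)       # "DE3"
newest_removed = generation_to_hex(3554)    # "DE2"
```

So after truncation, the oldest log left in the directory should carry DE3 in its name and anything ending in DE2 or lower should be gone.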

Exchange takes enormous care to make sure that transaction logs are retained until they are no longer required. Given that DAGs can vary so much in construction from simple 2-member implementations right up to the sixteen-member mega-DAGs, it’s obvious that log truncation can be a tricky business. Fortunately the technology works very well in the background. I wish that the same statement could always be made about technology…

Posted in Exchange 2013