I have finally decided it is about time I started sharing what I learn working with SQL Server day in, day out. Over the years I have used other people’s posts to help me solve issues I have been faced with, and I feel it is about time I attempted to share my own experience. I have never been one to over-analyse issues or solutions throughout my career; it has been more a case of getting it done and moving on to the next task. With this ethos I am going to keep my posts as brief as possible and only detail what works/worked for me at the time, based on the specific environment I am/was using.

In this instance I am going to discuss how I resolved a high HADR_SYNC_COMMIT wait type within an Always On Availability Groups setup.

So here goes with my first post:

Background to High HADR_SYNC_COMMIT wait type

I was recently involved in an issue where, all of a sudden, the environment I supported was experiencing failed or delayed customer transactions. This was causing pain not only to the technology teams but also to the customer support teams and customers alike.

SQL Server environment

The SQL Server High Availability (HA) configuration is made up of two SQL Server 2014 instances, a primary and a secondary node, that use Always On Availability Group technology set up in synchronous-commit mode to manage failover with no data loss.
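For anyone wanting to confirm a similar configuration, the commit mode and health of each replica can be checked with a DMV query along these lines (a sketch for convenience, not part of the original troubleshooting):

```sql
-- Confirm the availability mode and health of each replica in the AG.
SELECT ag.name                          AS availability_group,
       ar.replica_server_name,
       ar.availability_mode_desc,       -- SYNCHRONOUS_COMMIT / ASYNCHRONOUS_COMMIT
       ars.role_desc,                   -- PRIMARY / SECONDARY
       ars.synchronization_health_desc
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar
  ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_availability_replica_states ars
  ON ars.replica_id = ar.replica_id;
```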

Issue in brief

Normal transactions were queueing considerably while the index maintenance was running, to the extent that they were timing out against the duration limits specified by the applications.

Straight away I could see, looking at the SQL Servers using the SQL Sentry dashboard, that the HADR_SYNC_COMMIT wait type was consuming the largest percentage of SQL Server time/resources during maintenance. I didn’t need much more evidence to determine that this was the culprit holding back normal business transactions from committing, ultimately stopping customers from completing their online transactions.
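If you don’t have a monitoring tool such as SQL Sentry to hand, the same conclusion can be reached with a wait-stats query along these lines (a minimal sketch; filtering out benign background waits is left as an exercise):

```sql
-- Cumulative wait stats since the last restart (or since the stats were
-- cleared). HADR_SYNC_COMMIT near the top during the maintenance window
-- points at synchronous-commit pressure on the AG.
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       wait_time_ms * 100.0 / SUM(wait_time_ms) OVER () AS pct_of_total
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```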

SQL Sentry Client Dashboard snapshot

SQL Sentry Client Dashboard tab - HADR_SYNC_COMMIT

SQL Server waits (HADR_SYNC_COMMIT represented by Purple)

Solution

The SQL Server Agent maintenance job was using Ola Hallengren’s IndexOptimize script with the following parameters:

@FragmentationLow = NULL,
@FragmentationMedium = 'INDEX_REBUILD_ONLINE',
@FragmentationHigh = 'INDEX_REBUILD_ONLINE',
@FragmentationLevel1 = 10,
@FragmentationLevel2 = 30,
@PageCountLevel = 1000,
@WaitAtLowPriorityMaxDuration = 3,
@WaitAtLowPriorityAbortAfterWait = 'SELF',
@MSShippedObjects = 'Y',
@LogToTable = 'Y',
@SortInTempdb = 'Y'

--Change: add parameters:
@MAXDOP = 1, @Delay = 15
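Put together, the amended job step would look roughly like this (a sketch: the @Databases scope and the dbo schema are assumptions, as only the parameters were captured above):

```sql
-- Amended IndexOptimize call (Ola Hallengren's maintenance solution).
-- @Databases = 'USER_DATABASES' is an assumption; scope yours accordingly.
EXECUTE dbo.IndexOptimize
    @Databases = 'USER_DATABASES',
    @FragmentationLow = NULL,
    @FragmentationMedium = 'INDEX_REBUILD_ONLINE',
    @FragmentationHigh = 'INDEX_REBUILD_ONLINE',
    @FragmentationLevel1 = 10,
    @FragmentationLevel2 = 30,
    @PageCountLevel = 1000,
    @WaitAtLowPriorityMaxDuration = 3,
    @WaitAtLowPriorityAbortAfterWait = 'SELF',
    @MSShippedObjects = 'Y',
    @LogToTable = 'Y',
    @SortInTempdb = 'Y',
    @MAXDOP = 1,     -- throttle each rebuild to a single scheduler
    @Delay = 15;     -- pause 15 seconds between commands
```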

 

Having made the change to throttle down the CPUs used, the rebuilding of indexes took longer to complete (approximately 3-4 times longer) but eradicated the blocking and locking. The delay parameter was added just to give SQL Server a bit of respite between index rebuilds.

So now when the maintenance runs, the Availability Group is not using up as many resources and the primary replica SQL instance has more resources available to commit normal transactions in a timely manner, as expected.

SQL Sentry Client Dashboard snapshot

SQL Sentry Client Dashboard tab - HADR_SYNC_COMMIT

SQL Server waits (HADR_SYNC_COMMIT represented by Purple – they have gone!)

Pre and Post Change Results

When I opened the monitoring dashboard within SQL Sentry it was immediately apparent that the wait types and queues relating to the Availability Group had reduced drastically compared with before I made the change; the HADR_SYNC_COMMIT waits are non-existent post-change.

Pre Change SentryOne Report snapshot

Primary node                                               Secondary node

KB to Replica/sec, Log Send Queue KB - HADR_SYNC_COMMIT          KB from Replica/sec, Recovery Queue KB - HADR_SYNC_COMMIT

Post Change SentryOne Report snapshot

Primary node                                               Secondary node

KB to Replica/sec, Log Send Queue KB - HADR_SYNC_COMMIT          KB from Replica/sec, Recovery Queue KB - HADR_SYNC_COMMIT

The AG replica throughput and recovery queues are now lower overall, which has enabled usual business transactions to commit as expected, without issue, while the maintenance is running.

 

                 PRIMARY node (maximum values)           SECONDARY node (maximum values)
                 KB to Replica/sec  Log Send Queue KB    KB from Replica/sec  Recovery Queue KB
Post change
  Day 4          ~20,000            ~120                 ~10,000              ~4,000
  Day 3          ~1,200             ~60                  ~600                 ~120
  Day 2          ~11,000            ~60                  ~6,000               ~4,000
  Day 1          ~6,000             ~120                 ~3,000               ~4,000
Pre change       ~40,000            ~600,000             ~25,000              ~1,500,000
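The same queue figures can also be sampled directly from the AG replica performance counters, along these lines (a sketch; there is one counter instance per database replica, so filter as needed):

```sql
-- Snapshot of the AG database replica queues exposed as performance counters.
SELECT object_name,
       counter_name,
       instance_name,   -- one row per database replica
       cntr_value       -- KB for the queue counters
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Database Replica%'
  AND counter_name IN ('Log Send Queue', 'Recovery Queue');
```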

 

Going Forward

The maintenance was monitored over several days and the issue never occurred again.

If required, the next step would have been to change synchronous-commit mode to asynchronous-commit mode before running the maintenance and change it back afterwards. I didn’t fancy over-engineering the solution and putting the system in a different HA mode, but this was discussed with Microsoft product support specialists and confirmed as a legitimate step to take if required…
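For completeness, that fallback would look roughly like this (a sketch; the AG and replica names are placeholders, and if the replica is configured for automatic failover you would need to set FAILOVER_MODE = MANUAL before dropping to asynchronous commit):

```sql
-- Before maintenance: drop the secondary to asynchronous commit.
-- Run on the primary; [MyAG] and 'SQL02' are placeholder names.
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON 'SQL02'
WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);

-- ...run the index maintenance job...

-- Afterwards: restore synchronous commit for no-data-loss failover.
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON 'SQL02'
WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
```

Bear in mind that while a replica is in asynchronous-commit mode the no-data-loss guarantee is suspended, which is exactly why I preferred to leave the HA mode alone.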