LessThanDot

Availability Groups have come on strong as a true Enterprise high availability feature in SQL Server. While the setup and maintenance of availability groups is fairly straight forward, adding other features into the mix can make operations quite complex and require additional preparations to ensure operations continue, in certain failures. One of those features that may be added with availability groups is transactional replication.

Transactional replication and availability groups do get along well enough to function successfully and add great value to business operations. There are considerations, such as the failover partner agent options and a few others to consider. More importantly, the operating state of replicas is something to take into account. For example, in the case of asynchronous disaster and recovery replicas, replication and the log reader will be relying on the same synchronization states in order to proceed with transactions.

Take the following example – Availability group setup for AdventureWorks2008, 3 replicas, two acting in automatic failover and one in asynchronous replica state on a secondary subnet. We will not go into detail on setting this configuration up at this time but the topology and configuration is what we want to focus on.

Within the availability group set, transactional replication has been setup from NODE1 to NODE2 for Person.Address. This was done for secondary business reasons. The subscribing database is not part of the availability group.

This setup is one of the more common combinations of availability groups and replication. The subscriber location is not important but the publishing aspect is.

Focusing back on the DR replica; if the asynchronous state becomes too far out of sync or the DR replica goes offline, replication will begin to fail with the following error.

“Replicated transactions are waiting for next Log backup or for mirroring partner to catch up.”

Which is found by querying the MSLogreader_History table

This is protection to ensure all replicas are in a healthy state and protection from replication performing actions that may not be in the best interest based on that replica state. However, in our case, an asynchronous replica was taken offline so in reality, we would not want replication to stop due to the event.

To show this further, the following test can be performed.

With the availability group mentioned previously and the publication, the following code is started in a new query window. This code will start updating random address rows that are part of replication.

DECLARE @LOOP INT = 1

WHILE @LOOP <= 100000
 BEGIN
	UPDATE Person.Address
	SET AddressLine2 = AddressLine2 + CAST(RAND() AS VARCHAR) + ' These are tests'
	WHERE AddressID = (SELECT TOP 1 AddressID FROM Person.Address ORDER BY NEWID())
 SET @LOOP += 1
 WAITFOR DELAY '00:00:02'
 END

Looking at the log reader history, transactions and commands start to replicate normally

Now to test a failure event of the asynchronous node for DR, it is taken offline completely with a shutdown. As shown below, within the normal 60 second operating speed of replication, the error mentioned previously appears.

The two options we have are to bring the asynchronous replica online and let it catch up or, turn on trace flag 1448 and essentially ignore the state of the replica. We’ll turn trace flag 1448 on.

DBCC TRACEON(1448,-1)

Almost immediately, replication moves on

The next consideration is, what will happen when the 3^rd replica is finally brought online or rejoins the availability group? In the test case, the replica is brought online and comes back into sync while not interrupting replication, due to trace flag 1448 being on still.

We can see that the 3^rd replica is far out of sync from looking at sys.dm_hadr_database_replica_states

When bringing the 3^rd replica back online while the time has not lapsed so long the recovery point cannot be performed, we can see the replica is brought back up to sync.

It’s important to note that if the 3^rd replica does come online and 1448 flag was not set, replication will still need to wait for the replicas to catch up before allowing replication o move on. In this case, setting flag 1448 on still may be needed. This may be needed for publications that are sending a great deal of data while replicas are commonly out of sync in an asynchronous state.

Summary

While running availability groups combined with log readers form transactional replication, weigh heavily on using the trace flag 1448 when a replica is in asynchronous operating mode. This trace flag can prevent replication from stopping when a replica loss that truly should not affect replication, is lost.

As with all high availability and disaster and recovery configurations in SQL Server, test them well and test them often. You do not want to be presented with a failure event and the answer not known simply due to not testing each scenario that may prevent data services from continuing.

Async replica loss causes transactional replication to fail

About Ted Krueger (onpnt)