“The replication agent has not logged a progress message in 10 minutes”
What is happening with my replication?
I had an interesting case a couple of months ago, which started with receiving several questions from a customer about SQL Server replication. Some of them were "Should I use replication at all?" and "Should I consider using replication over a satellite network?!" This caught my attention and I dived into the case; I had to find out why the people were so upset about SQL Server replication (I had some ideas, though 🙂 ).
Here is the short explanation of the problem that I got from the team (SQL Server 2008 R2):
“After a few hours the replication blocked, and in the source Replication Monitor we could see the following error message: The replication agent has not logged a progress message in 10 minutes. This might indicate an unresponsive agent or high system activity. Verify that records are being replicated to the destination and that connections to the Subscriber, Publisher, and Distributor are still active.
We tried different combinations of the profile parameters, but our efforts didn’t produce any results.
We found that:
- The continuous synchronization schedule is unacceptable because of a design issue: the history is updated only when the agent completes. This means we will not have the actual replication state (the monitoring tool will not work) if the mode is continuous. Another disadvantage is that if the agent fails, it will not be restarted.
- If we set HistoryVerboseLevel to 0 (distribution profile), the replication works fine, but the Publisher doesn’t receive notifications from the Subscriber, and at the end of the retention period (the default is 3 days) the replication expires with: The subscription(s) have been marked inactive and must be reinitialized. NoSync subscriptions will need to be dropped and recreated. (Source: MSSQLServer, Error number: 21074).
- No connection locks the distribution database while the problem occurs.”
So what is happening behind the UI?
- The main drama with replication has always been the network. If the network is slow, with high latency and packet loss, replication becomes annoyingly slow. This is the first and major problem. They need replication over a satellite network, where the latency is over 500 ms, there are packet drops, and the bandwidth is low.
- The second problem is the replication agent’s history-logging behavior and how important it is for the replication to work. The presumption is that if the replication agent cannot log a history record, it isn’t working. If it isn’t working, it isn’t synchronizing, and if this status persists for the duration of the retention setting, the subscription expires and requires reinitialization. This could be acceptable as a by-design rule, but if so, why is there an agent profile with a verbose level of zero? Besides that, when the network is very slow, message logging practically doesn’t happen.
- Last but not least, the GUI has never made troubleshooting replication an easy task. I think this is an inconvenience that makes people angry when using it.
All of these problems were present in my customer’s case. Let me focus on the case and the solutions.
Discovering the problem (to GUI or not):
In order to monitor transactional replication and its transfer rates, you don’t need a lot of scripts or even the Replication Monitor. You just need to query the [distribution].[dbo].[MSdistribution_history] table. Let me describe the important information you can get from it.
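As a sketch, the following query pulls the most recent history rows with the delivery rates and the comments column that carries the status information (the column names are the real MSdistribution_history columns; the TOP value and ordering are just an example):

```sql
-- Sketch: monitoring query against the distribution database.
SELECT TOP (50)
       h.[time],                  -- when the history row was logged
       h.comments,                -- status text, including the "stats state"
       h.delivered_transactions,  -- cumulative transactions this session
       h.delivered_commands,      -- cumulative commands this session
       h.delivery_rate            -- average commands per second
FROM [distribution].[dbo].[MSdistribution_history] AS h
ORDER BY h.[time] DESC;
```

Watching delivery_rate and the comments text over a few intervals gives you the same picture as the Replication Monitor, without the GUI.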
The record on row 180 is a summary record that is saved when the transaction count reaches the configured TransactionsPerHistory value of 1000 (I changed the TransactionsPerHistory parameter from its default of 100 to 1000). It contains summary statistics calculated from the beginning of the agent session (when the agent started) to the moment those 1000 transactions were delivered:
The stats state is an important piece of status information that you can get from the comments column. The state indications are the following: 1 – normal logging/status; 2 – the reader thread has to WAIT for the writer; 3 – the writer thread has to WAIT for the reader. As you can guess, the distribution agent has two parts (threads). The reader thread queries the distribution database for the list of transactions that are candidates for applying to the Subscriber: it executes sp_MSget_repl_commands and populates the rows into an in-memory queue. The distribution agent writer thread then transfers the commands and applies them to the Subscriber. While there can be many reasons behind state 2 or 3 (see the detailed article here), our case is definitely a writer thread delay due to the network latency.
The other records in [distribution].[dbo].[MSdistribution_history] are progress history records that are inserted every 5 minutes (the default value of the MessageInterval profile parameter) and store cumulative values until the summary record is saved. As you can see, I have average transfer rates calculated every 5 minutes of 5–7 cmd/s, which is very low.
The agent writes a history message every 5 minutes and after reaching the TransactionsPerHistory parameter value. When two consecutive history messages are missing from the history table (a total of 10 minutes), the error is logged (row 151).
Here is how and when the error message occurs:
Then the agent tries to initialize again, and if it successfully connects to the Distributor, it continues to sync and writes messages every 5 minutes.
Here is the job that performs this check – the Replication agents checkup job, with a schedule of 5 minutes defined by the MessageInterval parameter:
You have to change the schedule of this job to 30 minutes or more in order to reduce, or even eliminate, those messages, especially on a slow network. But the most important thing to understand is that these messages do not cause replication to stop; they are the result of connection and network problems. In other words, they show that the distribution agent couldn’t write its history message two times in a row, and that’s it. It could not write history messages because it lost the network connection, or because it could not sync or reach the TransactionsPerHistory row count within the current session. The message is a result of network issues; it is neither the reason for reduced replication performance nor a reason for replication to stop.
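As a sketch, the schedule can be changed with msdb.dbo.sp_update_schedule. The schedule name below is hypothetical; look up the real one with sp_help_jobschedule for the Replication agents checkup job on your Distributor:

```sql
-- Sketch: run the checkup every 30 minutes instead of every few minutes.
-- The schedule name is hypothetical; find the real one first with:
-- EXEC msdb.dbo.sp_help_jobschedule @job_name = N'Replication agents checkup';
EXEC msdb.dbo.sp_update_schedule
     @name                 = N'Replication agent schedule.',
     @freq_type            = 4,    -- daily
     @freq_interval        = 1,    -- every day
     @freq_subday_type     = 4,    -- repeat by minutes
     @freq_subday_interval = 30;   -- every 30 minutes
```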
Actually, the error does cause the agent to stop the current session (it is arguable whether it should work that way by design). But if the distribution agent is configured to run on a schedule (and it should be configured that way), for example every 15–20 minutes (not every 2–3 minutes), it will restart according to the schedule and try to initialize automatically, so whenever the network is available the sync process will continue. The same happens if the error is different, for example a network error such as a broken connection or a process that could not connect to the server.
So when the agent error appears, the replication as a whole doesn’t stop; only the current session stops. It automatically restarts on the next schedule and continues the sync process. If the network doesn’t drop, or if I lower the network latency (in the satellite network simulator), replication catches up quickly and lowers the number of undistributed commands.
The reason the message is not shown with a HistoryVerboseLevel of 0 is that in this case no history is expected to be written to the distribution database, so these checks are simply not made. It doesn’t mean that there are no problems with the network, or that replication is keeping up and the performance is great. I would suggest not going with level 0, because you become blind to what is happening and cannot monitor the replication at all. Choose level 0 only if you don’t perform any monitoring at all.
Replication Settings for slow networks:
There are some definite recommendations that came out of investigating these problems:
- TransactionsPerHistory (Distribution Agent profile) tells the agent when to log a statistics history message. Increase it; I made it 1000, for example.
- HistoryVerboseLevel determines how the logging record is saved in the history table in the distribution database; leave it at 1.
- MessageInterval is an important parameter; increase it to 15 minutes (900) or 30 minutes (1800) and test.
- Always use scheduled, not continuous, replication, and try to make the intervals long enough to give the agent a chance to complete a sync cycle. Another advantage of scheduled mode is that if the agent fails due to errors, it restarts at the next scheduled time.
- Use a PULL subscription instead of a PUSH one; it simply performs better.
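The profile parameters above can be changed in the Agent Profiles dialog or, as a sketch, with sp_change_agent_parameter on the Distributor. The @profile_id value below is hypothetical; list the real ids for your custom profile first:

```sql
-- Sketch, run in the distribution database on the Distributor.
-- Profile id 4 is hypothetical; list the real profiles with:
-- EXEC sp_help_agent_profile @agent_type = 3;  -- 3 = Distribution Agent
EXEC sp_change_agent_parameter
     @profile_id      = 4,
     @parameter_name  = N'TransactionsPerHistory',
     @parameter_value = N'1000';

EXEC sp_change_agent_parameter
     @profile_id      = 4,
     @parameter_name  = N'MessageInterval',
     @parameter_value = N'1800';  -- 30 minutes
```

Note that built-in profiles cannot be modified; create a custom profile, change its parameters, and assign it to the agent.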
Playing with other parameters of the Agent Profiles
I performed several tests with different values in the Distribution Agent profile, with no great success. You can imagine that the possible tests and combinations are practically infinite. I don’t aim to test all possible value combinations, only those that I think could potentially produce some change (good or bad), in order to find out whether there is an option that could push the replication in one direction or another.
None of those changes made a serious difference in performance. The SubscriptionStreams parameter also didn’t help; I think it increased the latency instead of reducing it. The articles already transfer commands as parameterized statements (article status > 16), so this is also in place and not improving performance.
Of course, more tests could be performed with different combinations of values, monitoring the results and the cmd/s rates. You can achieve some performance optimization (I don’t expect more than 10%), but you have to invest time in this.
The replication performance, or transferred cmd/s, depends primarily on two things:
- Network latency
- Record size
When I reduced the record size, I achieved an average transfer rate of 6 cmd/s against 3 cmd/s when I had a 1 KB record. But the rate increases and decreases depending on the network more than on the profile settings or any other changes. The tests were performed using the following WANem v2.0 satellite emulator settings, which are close to the real environment:
- Network bandwidth 64Kbps
- Network latency 1000ms
- Packet loss 10%
- Packet corruption 3%
But decreasing the network latency achieves a bigger gain than all the other possible optimization techniques put together. The problem is that the network and its parameters are given values that we cannot change. That is why I turned to optimizing the other factor affecting replication performance: the record size, or the amount of data transferred over the network. I tested replicating the execution of a stored procedure, and it performed very well in this environment, because it greatly reduces the amount of data sent over the network.
Replicating stored procedure execution
I am replicating the execution of the stored procedure instead of replicating the records it loads. Instead of sending every record, I send the EXEC statement that will execute in the subscriber database:
Exec datagenerator 100, 3
As we saw, the slowest component of the replication solution is the writer module of the distribution agent, in the part that sends data over the network to the Subscriber. By implementing stored procedure replication, I am optimizing this slowest part, the writer thread, and I prefer to increase the load on the Subscriber by executing the procedure there. Replicating stored procedure execution is a feature with optimizations since SQL Server 2008. The writer thread executes the stored procedure against the subscriber database, thus generating the rows there; no table rows are actually transferred over the wire.
Such a scenario greatly reduces the number of transactions and commands transferred, the size of the data transferred, and the checks and validations that the reader executes. Besides that, the remaining parts of the solution can easily be optimized, as they are executed locally.
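As a sketch, a stored procedure execution article is added with sp_addarticle using @type = N'serializable proc exec' (or N'proc exec'). The publication name below is hypothetical; datagenerator is the example procedure from this post:

```sql
-- Sketch, run at the Publisher in the published database.
-- @publication is a hypothetical publication name.
EXEC sp_addarticle
     @publication   = N'SatellitePub',
     @article       = N'datagenerator',
     @source_object = N'datagenerator',
     @type          = N'serializable proc exec';  -- replicate the execution,
                                                  -- not the rows it produces
```

The 'serializable proc exec' option replicates the execution only when it happens inside a serializable transaction, which is the safer of the two types for keeping the sites consistent.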
The results are below. This happens even with the increased latency that the tracer token shows:
Here is the replication status:
And the distribution agent log (HistoryVerboseLevel = 1):
And the undistributed commands look great too:
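The latency itself can be measured by posting a tracer token; a sketch (the publication name is hypothetical):

```sql
-- Sketch, run at the Publisher in the published database.
DECLARE @token_id int;
EXEC sp_posttracertoken
     @publication     = N'SatellitePub',  -- hypothetical publication name
     @tracer_token_id = @token_id OUTPUT;

-- Later, check the measured Publisher-to-Distributor and
-- Distributor-to-Subscriber latencies for that token:
EXEC sp_helptracertokenhistory
     @publication = N'SatellitePub',
     @tracer_id   = @token_id;
```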
There are some specifics that we have to take into account when considering such a solution. The requirements for the database schema are as follows:
- The stored procedure has to be available on both sites
- The objects that the stored procedure refers to have to be available on both sites
- You have no requirements for primary keys, uniqueidentifiers, etc., because you don’t replicate tables.
Stored procedure replication is not appropriate for all applications. If an article is filtered horizontally, so that there are different sets of rows at the Publisher and at the Subscriber, executing the same stored procedure at both returns different results. Similarly, if an update is based on a subquery of another, non-replicated table, executing the same stored procedure at both the Publisher and the Subscriber returns different results.
Here are some other points to consider: http://msdn.microsoft.com/en-us/library/ms152754.aspx
In general, I find this the best replication scenario in the case of a satellite connection between sites, if there are no design stoppers and where the factors that matter are:
- Good manageability
- Out of the box SQL Server solution and available in Standard Edition
A summary of my recommendations:
Using transactional replication in my customer’s environment is not impossible, but there are important measures to consider.
The main consideration is performance, and this is the main factor you have to weigh. You have to be aware of the possible rate (2 cmd/s ≈ 1 tran/s) you can achieve for the specific record size and specific network parameters; you establish it by testing. If the workload on the Publisher produces records faster than that, replication will never catch up: it will accumulate undistributed commands, and the record latency will increase and take forever.
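To see why, here is a tiny back-of-the-envelope calculation (plain arithmetic, not a SQL Server API): when commands are produced faster than the link can deliver them, the undistributed-commands backlog grows linearly and never drains.

```python
def backlog_after(hours, produced_per_sec, delivered_per_sec, start=0):
    """Undistributed commands after `hours`, assuming constant rates.

    The backlog never goes below zero: once the agent catches up,
    it simply stays current.
    """
    seconds = hours * 3600
    return max(0, start + (produced_per_sec - delivered_per_sec) * seconds)

# Producing 3 cmd/s while delivering only 2 cmd/s adds 86,400
# undistributed commands every 24 hours.
print(backlog_after(24, 3, 2))
# With production at or below the delivery rate the backlog stays at zero.
print(backlog_after(24, 2, 2))
```

The numbers are illustrative, but the shape of the problem is exactly what the Undistributed Commands tab shows on a saturated link.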
If you produce more records on average (for example, an average rate 50% higher per 24 hours), you can consider one of the following:
· Use stored procedure execution replication
· Use network boosters/accelerators/network compression hardware
· Use a third-party solution that uses a completely different technology than replication