Monday, October 24, 2011

Random errors when using MQSeries adapter in Clustered Environment

In this article, we describe an error we encountered in an MSCS cluster with MQSeries installed, running Microsoft BizTalk 2006 R2.

The Problem

Well, it's not exactly one error we've encountered. The problem actually consists of multiple random warnings popping up in the event viewer when sending messages through the MQSeries adapter (which uses the MQSAgent2 COM+ component). Strangely enough, we also noticed these warnings appearing when there was absolutely no traffic on the server.

Some samples of errors:

The adapter "MQSeries" raised an error message. Details "The remote procedure call failed and did not execute. (Exception from HRESULT: 0x800706BF)"

--

The adapter "MQSeries" raised an error message. Details "Unable to cast object of type 'System.__ComObject' to type 'Microsoft.BizTalk.Adapter.MQS.Agent.MQSProxy'."

--

The adapter failed to transmit message going to send port "prtSendSecLendMsgStatusSLT_MQSeries" with URL "MQS://BDAMCAPP100/MQPRD205/FIAS.QL.SLT_BT_IN.0001". It will be retransmitted after the retry interval specified for this Send Port. Details:"The remote procedure call failed and did not execute. (Exception from HRESULT: 0x800706BF)".

The Impact

It might seem that these warnings are "just" warnings, but whenever one of them occurs, the message being sent through the MQSeries adapter fails. This triggers the BizTalk retry mechanism, which performs a configured number of retries. The retry interval is configured in minutes, with a minimum value of 1.

This means that, when a problem occurs and a message needs to be resent, that message incurs a minimum delay of 1 minute. In an environment where large amounts of messages are being processed, this can be a huge pain point.

The Solution

After investigation, we found that the source of this issue lies at the network level. Idle TCP sessions on the client's network are closed after 1 hour. This seems to be a common default setting in a lot of environments, and is certainly not a bad thing in itself.

The problem is that the default TCP Keep Alive interval in Windows is 2 hours. So when a Keep Alive packet is sent towards the cluster over a connection that has been idle for more than 1 hour, it fails, since the TCP session has already been torn down on the network side.

Obviously, the most logical way to avoid these errors is to lower the Keep Alive timeout on the Windows (BizTalk) servers. This can be done by changing a registry value on your BizTalk servers, followed by a reboot.

The following link describes which value to set:
http://technet.microsoft.com/en-us/library/cc782936(WS.10).aspx
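As a sketch, the change boils down to setting the KeepAliveTime value (in milliseconds) under the TCP/IP parameters key. The default is 7,200,000 ms (2 hours); the value below is an illustrative choice of 1,800,000 ms (30 minutes), safely under the 1-hour network cutoff described above — pick whatever suits your own network's idle timeout:

```
rem Set the TCP Keep Alive interval to 30 minutes (value is in milliseconds).
rem KeepAliveTime defaults to 7200000 (2 hours) when the value is absent.
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v KeepAliveTime /t REG_DWORD /d 1800000 /f
```

Remember that, as mentioned above, the setting only takes effect after a reboot of the server.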

Good luck!
Andrew De Bruyne
