Sitworld: ITM Communications Validation

ITM Communications – Manual Validation

by

John Alvord, IBM Corporation

jalvord@us.ibm.com

Introduction

ITM Communication Services has network requirements. When those requirements are not met, things break in strange and non-obvious ways. Most communication is via TCP socket links which, once set up, are used to implement Remote Procedure Calls. This often works beautifully by default, but in new environments it pays to perform some manual checks. The same checks are also helpful when processes do not connect.

Manual Validation

Let's review the case where a hub TEMS has already been installed and is working. A new remote TEMS is installed and we want to validate that the network is prepared. Usually, once the problems are resolved for one case, many other cases are resolved as well.

1) The remote TEMS needs to know where the hub TEMS is located. This is controlled by a file named glb_site.txt, created during the install, which is found in the following location:

Windows: <installdir>\cms

Linux/Unix: <installdir>/tables/<temsnodeid>

z/OS: RKANPARU(KDCSSITE)

In the simplest case of a single hub TEMS, this will look like

protocol:htems

such as

ip.pipe:HTEMS

or

ip.pipe:#10.11.20.34

If there are two hub TEMSes [Fault Tolerant Option] you will see two such lines. That also requires the CMS_FTO=YES environment variable.

You should never have more than one line for a single hub TEMS. Two or more lines slow things down with no value.

The hub TEMS does not need a glb_site.txt. It does no harm to have one but doesn’t help anything.
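As a rough illustration of the rules above, here is a minimal Python sketch that reads glb_site.txt content and warns when there is more than one line without FTO. The function names parse_glb_site and check_entries are hypothetical, and the sketch assumes each line has the protocol:target form shown above, with a leading # treated as a marker in front of a literal IP address.

```python
from typing import List, Tuple


def parse_glb_site(text: str) -> List[Tuple[str, str]]:
    """Return (protocol, target) pairs from glb_site.txt content."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        protocol, sep, target = line.partition(":")
        if not sep or not target:
            raise ValueError(f"unexpected line: {raw!r}")
        # Assumption: a leading '#' precedes a literal IP address,
        # as in ip.pipe:#10.11.20.34
        entries.append((protocol.strip(), target.strip().lstrip("#")))
    return entries


def check_entries(entries: List[Tuple[str, str]], fto: bool = False) -> List[str]:
    """Return warnings based on the rules of thumb in the text."""
    warnings = []
    if not entries:
        warnings.append("no hub TEMS entry found")
    if len(entries) > 1 and not fto:
        warnings.append("more than one line without CMS_FTO=YES")
    return warnings
```

The fto argument stands in for whatever way you determine that CMS_FTO=YES is set; the sketch does not read the TEMS environment itself.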

2) To manually verify that the setup is correct, you can use the glb_site.txt values to test.

ping HTEMS

or

ping 10.11.20.34

The ping commands will not always get a response, depending on the network. However, you can at least verify that the name resolves to the correct address. If it does not, the Domain Name Server [DNS] may have incorrect information or the hosts file (/etc/hosts on Linux/Unix) might be incorrect.
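When ping is blocked, the name resolution half of this check can still be done on its own. The following is a small sketch using the Python standard library; the function name resolve is hypothetical, and the result reflects whatever the local resolver (DNS or hosts file) returns on the machine where it runs.

```python
import socket
from typing import List


def resolve(host: str) -> List[str]:
    """Return the sorted unique addresses a name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})
```

Calling resolve("HTEMS") on the remote TEMS system and comparing the result with the expected hub address shows quickly whether DNS or the hosts file is at fault.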

3) To manually verify the hub TEMS is reachable use telnet. Assuming you are using ip.pipe communications

telnet 10.11.20.34 1918

If you use ip.spipe the port target would be 3660.

If this fails, it usually means a firewall or router along the network path is missing the rule that allows such communication. If there is no firewall involved, there is no problem. However, if firewall rules exist, they must allow communication to the well-known port, 1918 in this case, and the rule must be bidirectional. If the test fails, your networking support team must change the firewall rules to allow the communication. Until that is done, there is no hope of a remote TEMS to hub TEMS connection working.
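On systems where telnet is not installed, the same reachability test can be approximated with a plain TCP connect. This is a minimal sketch with a hypothetical function name, can_connect; it only shows that a TCP connection can be opened and says nothing about whether ITM traffic will flow correctly afterwards.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For the example above, can_connect("10.11.20.34", 1918) corresponds to the ip.pipe test and can_connect("10.11.20.34", 3660) to the ip.spipe test.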

4) Another ITM communication requirement is that the entire path allow DF [do not fragment] packets. The packet size is most commonly seen as 1500 bytes however ITM will work with anything. From a performance standpoint a small MTU leads to more transmissions and lower throughput. Following are the tests for 1500 byte packets using ping options:

Linux:  ping -M do  -s 1472 10.11.20.34

AIX:  ping -s 1472 10.11.20.34

HPUX: ping -pv 10.11.20.34 1472

Solaris: ping -D 10.11.20.34 1472

Windows: ping -f -l 1472 10.11.20.34

If these work with no complaint, all is well. The size of 1472 allows for the 28 bytes of IP and ICMP headers that are added automatically, which brings the packet to 1500 bytes. The Linux -M do option means REALLY no fragmentation, even locally. A typical error seen recently looked like this:

From 10.99.0.250 icmp_seq=1 Frag needed and DF set (mtu = 1442)

That means that, along the network path, the router at 10.99.0.250 will not pass a DF packet larger than 1442 bytes, so the 1500 byte packet never arrives.
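The arithmetic behind these tests can be sketched as follows. The example assumes IPv4 with no IP options, so that 28 bytes of headers (20 IP plus 8 ICMP) are added to the ping payload, and it assumes the Linux error line format shown above. The function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# 20 byte IPv4 header (no options) + 8 byte ICMP header
IP_ICMP_HEADER_BYTES = 28

_FRAG = re.compile(r"From (\S+) .*Frag needed and DF set \(mtu = (\d+)\)")


def parse_frag_needed(line: str) -> Optional[Tuple[str, int]]:
    """Return (router address, reported MTU) from a Linux ping error line."""
    match = _FRAG.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def max_ping_payload(mtu: int) -> int:
    """Largest ping payload size that fits in one unfragmented packet."""
    return mtu - IP_ICMP_HEADER_BYTES
```

For the error line above this gives the router 10.99.0.250 and an MTU of 1442, so the largest ping size that should pass with DF set is 1414 bytes.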

See (7) below for network performance comments.

Your networking support team must resolve this issue before ITM communications can possibly work, but that can be relatively easy to manage. See the next step.

5) When 1500 byte packets fail.

One recent case had a Virtual Private Network [VPN] link in the path that added more bytes to the packet. A 1500 byte DF packet became a 1514 byte DF packet, an intermediate router dropped the packet, and communication failed. The solution was to change the interface on the hub TEMS from MTU 1500 to 1350. The remote TEMS and hub TEMS negotiated an MTU size of 1350 and then the added VPN bytes did not exceed the 1500 byte DF maximum at the routers. The MTU could have been set higher, of course. Changing MTUs on interfaces is platform dependent and you will normally get sysadmins or networking people involved to make such changes.
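The numbers in the VPN case can be checked with a few lines of arithmetic. This sketch takes the 14 added bytes from the case above (1514 minus 1500); the function names are hypothetical and the overhead of any other VPN would have to be measured.

```python
def fits(interface_mtu: int, added_bytes: int, path_limit: int = 1500) -> bool:
    """True if a full-size packet plus the added bytes stays within the path limit."""
    return interface_mtu + added_bytes <= path_limit


def largest_interface_mtu(added_bytes: int, path_limit: int = 1500) -> int:
    """Largest interface MTU that still fits once the extra bytes are added."""
    return path_limit - added_bytes
```

With 14 added bytes, an interface MTU of 1500 does not fit, 1350 does, and the largest value that would still fit is 1486, which is why a higher setting than 1350 was possible.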

Another recent case was a customer router configured with a DF maximum packet size of 1448 bytes. In that case the router was reconfigured to the more standard 1500 byte DF limit.

Another recent case was a Linux environment where the configured DF maximum packet size was 992 bytes. There was some good reason for this, so the hub and remote TEMS system interface MTUs were changed to that number.

A highly interesting case involved two remote TEMSes that were primary and backup for many agents. Half the agents had remote TEMS1 as primary and half had remote TEMS2 as primary. One day most of the agents were offline. We discovered that TEMS1 required an MTU of 1400 and was thus having serious issues connecting to the hub TEMS. The agents connecting to it were also having problems. Most agents switched to TEMS2. TEMS1 and TEMS2 then became entangled because of the agent fallback to primary logic. During the attempted switch from TEMS2 to TEMS1, agents became stuck and offline to both. When the interface that TEMS1 used was set to MTU 1400 and TEMS1 and its agents were restarted, things started working. When TEMS2 and its agents were restarted, things continued OK. After 75 minutes the agents with TEMS2 as primary TEMS migrated back to that TEMS.

6) z/OS Hypersockets

Another recent issue involved z/OS Hypersockets. The Hypersocket had an MTU of 16K and its logic prevented negotiating down to 1500 bytes. The solution was to configure a second Hypersocket instance set to an MTU of 1500 bytes.

7) TEMS to TEMS communication requirements

In the ping tests of step (4) earlier, note the average rtt [round trip time] and the percent packet loss. TEMS to TEMS communication is unstable if the average rtt is too high *or* if there is much packet loss. A general rule of thumb is that 50 milliseconds or lower is best and 100 milliseconds is OK. At 250 milliseconds or higher many installations will see instability, including remote TEMSes going offline.

These rules are extremely general and depend on the amount of TEMS to TEMS network traffic. A low traffic environment can often survive at higher latency levels.
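The rule of thumb can be applied to ping output with a short sketch like the one below. It assumes the summary lines printed by the Linux iputils ping ("N% packet loss" and "rtt min/avg/max/mdev = ..."); other platforms format the summary differently. The function names are hypothetical, and the "marginal" label for the range between 100 and 250 milliseconds is an assumption, since the text gives no guidance for that range.

```python
import re
from typing import Tuple


def parse_ping_summary(output: str) -> Tuple[float, float]:
    """Return (average rtt in ms, percent packet loss) from Linux ping output."""
    loss = re.search(r"([\d.]+)% packet loss", output)
    rtt = re.search(r"rtt min/avg/max/mdev = [\d.]+/([\d.]+)/", output)
    if loss is None or rtt is None:
        # With 100% loss ping prints no rtt line at all.
        raise ValueError("ping summary not found")
    return float(rtt.group(1)), float(loss.group(1))


def classify_rtt(avg_ms: float) -> str:
    """Apply the rule-of-thumb thresholds from the text."""
    if avg_ms <= 50:
        return "best"
    if avg_ms <= 100:
        return "ok"
    if avg_ms < 250:
        return "marginal"  # assumed label; the text is silent on this range
    return "unstable"
```

No threshold is applied to packet loss here because the text does not give one; the parsed loss figure is simply returned for review.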

The reason for this sensitivity in TEMS/TEMS communications is that much of the work happens with Remote Procedure Calls. After starting up, there are large call structures, up to 30,000 bytes or more. ITM divides each call into MTU [Maximum Transmission Unit] sized separate packets. All packets must arrive and be assembled at the target before logic can continue. If there is any degree of packet loss, many such attempted RPCs fail and need to be re-transmitted. At a higher level in ITM communications there are time out rules for transmission, typically 30 or 60 seconds. In cases of high latency and some packet loss, the resulting failures actually prevent normal work from proceeding. That means the remote TEMS does not get full instructions, like Situation definitions. It also means that the remote TEMS – which has been gathering situation results and likely generating events – is unable to send the events back to the hub TEMS.
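To get a feel for why even modest packet loss hurts, here is a deliberately simplified model. It assumes each packet is lost independently, ignores header overhead and any lower-level retransmission, and uses the 30,000 byte call size mentioned above. The function names are hypothetical.

```python
import math


def packets_per_call(call_bytes: int, mtu: int = 1500) -> int:
    """Number of MTU-sized packets needed to carry one call."""
    return math.ceil(call_bytes / mtu)


def chance_all_arrive(packets: int, loss_rate: float) -> float:
    """Probability that every packet arrives, assuming independent losses."""
    return (1.0 - loss_rate) ** packets
```

Under these assumptions a 30,000 byte call needs 20 packets at an MTU of 1500, and with 1% packet loss the chance that all 20 arrive on the first attempt is about 82%. At an MTU of 992 the same call needs 31 packets, so the chance drops further.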

The usual solution for a high latency link is to architect a hub TEMS at that location. That is extra work, of course, but it may be less expensive than upgrading a network. The link from the hub TEMS to an event receiver like Netcool/Omnibus is relatively insensitive to latency when transmitting event data.

8) Use traceroute [Unix/Linux: traceroute; Windows: tracert] to check each communication path. Traceroute lists the network points encountered along the way from source to target. When you compare the results from the two ends, the network points should be the same but in reverse order. If they are not, you have asymmetric routing, which is rarely good.

In one recent case there was a new remote TEMS and a number of agents that were configured to connect to it. One symptom was that none of the agents could connect reliably. A second symptom was that agent TCP activity broke the hub TEMS to remote TEMS initial table synchronization on a large [20,000 row] table. After this was identified by the customer network staff and corrected, the problems were fully resolved. The lead network System Administrator stated:

“The break came when … we ran traceroutes from Rtems to the clients and clients to Rtems and noticed the paths weren’t the same.”

This is known as asymmetric routing. The problem is virtually invisible to ITM communications. You may need to involve your network folks because traceroute is sometimes blocked for normal users. If there is a difference, the network folks will need to figure out why the differences exist and adjust the network routing to correct the issue.
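Once the intermediate hops from both directions have been collected by hand, the comparison itself is simple. The sketch below uses a hypothetical function name and assumes each list holds only the intermediate hop addresses in the order traceroute printed them. Routers can answer from a different interface address in each direction, so a mismatch is a prompt for the network team to investigate rather than proof of a fault.

```python
from typing import Sequence


def is_symmetric(forward_hops: Sequence[str], return_hops: Sequence[str]) -> bool:
    """True if the return path is the forward path in reverse order."""
    return list(forward_hops) == list(reversed(return_hops))
```

For example, a forward path of A, B, C is symmetric with a return path of C, B, A and asymmetric with a return path of C, X, A.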

Summary

There are other potential issues. The only good news is that most such cases are rare and that ITM has the controls to adapt to almost any environment. Contact IBM Support if further help is needed.

Sitworld: Table of Contents

If you are interested in ITM communication control options see this document:

Sitworld: ITM Protocol Usage and Protocol Modifiers
