Sitworld: Real Time Detection of Duplicate Agent Names

Version 0.56000 2 July 2017

John Alvord, IBM Corporation

jalvord@us.ibm.com


Inspiration

Once again I worked on a case where ITM misbehaved because some agents used duplicate names. This particular case involved “false alerts”: a situation event was observed for a missing process on a Linux system. When investigated, the Linux system did have that process running, so it looked like a false positive alert. These cases waste everyone’s time and degrade the monitoring experience. After considerable time this was determined to be a duplicate agent name case: there were two different systems, one with the missing process and one without. Both agents had the same name, so the investigation was made against the wrong system. There were 100+ such cases. The effort consumed meetings over several months and wasted time and energy.

Here is a list of observed problems over the last few years collected by a colleague:

Agents going offline

Agents going offline and online repeatedly

Agents switching back and forth between TEMSes

Situation does not fire as expected

Situation fires unexpectedly

Situation does not start as expected

The data in the situation is not correct

Agent does not respond to requests

RTEMS does not respond to requests

RTEMS is hung

RTEMS is disconnected

HUB does not respond to requests

HUB is hung

Unstable ITM environment

Slow TEP

TEP shows many navigator updates pending

TEP agent positioning flipping around

High CPU or network usage related to TEPS

And more…

Duplicate Agent Name Progress up to now

There has been work ongoing to identify and resolve these cases. Here are useful tools.

The TEPS Audit blog post is a good first line of detection. You set a trace at the TEPS and then get a report with everything that TEPS sees.

The TEMS Audit blog post has some good reports – such as agents that repeatedly show online or reports at remote TEMS where the arrival of heartbeats is irregular.

The Database Health Checker blog post has a report section based on TEIBLOGT where you can see things like multiple additions to system generated MSLs, which can imply duplicate agent names.

We expect future progress in this area, including advanced tracing and reports which identify cases where two agents with the same name are connecting to the same remote TEMS.

This post discusses a new cross TEMS check report on current live data.

Node Status Table Correlation Report

Each TEMS has an in-storage table, INODESTS or the Node Status table. A remote TEMS has entries corresponding to the nodes [agents] that are connected to it. In ideal cases, the hub TEMS and the remote TEMSes will contain the same information. If there are differences, such as the same agent name present in two different remote TEMSes, that is a very strong signal of a duplicate agent name. Detecting such differences is the goal of the current project.
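The cross TEMS comparison can be pictured with a short Python sketch. This is only an illustration of the idea and not the logic inside inodests_sum.pl: the row layout (source TEMS, agent name, thrunode, host address), the function name and the sample values are all assumptions made for the example.

```python
from collections import defaultdict

def find_duplicate_candidates(rows):
    """Group Node Status rows by agent name and keep the names that were
    seen with more than one distinct (thrunode, hostaddr) pair.

    rows: iterable of (source_tems, agent_name, thrunode, hostaddr).
    This tuple layout is hypothetical, chosen for the illustration.
    """
    seen = defaultdict(set)
    for source_tems, agent_name, thrunode, hostaddr in rows:
        seen[agent_name].add((thrunode, hostaddr))
    return {name: sorted(pairs) for name, pairs in seen.items() if len(pairs) > 1}

# Made-up sample data: the hub and RTEMS1 agree about host1:LZ,
# but RTEMS2 also has an agent with that name from another address.
rows = [
    ("HUB",    "host1:LZ", "RTEMS1", "ip.spipe:#10.0.0.1[7757]"),
    ("RTEMS1", "host1:LZ", "RTEMS1", "ip.spipe:#10.0.0.1[7757]"),
    ("RTEMS2", "host1:LZ", "RTEMS2", "ip.spipe:#10.0.0.2[7757]"),
    ("HUB",    "host3:LZ", "RTEMS1", "ip.spipe:#10.0.0.3[7757]"),
    ("RTEMS1", "host3:LZ", "RTEMS1", "ip.spipe:#10.0.0.3[7757]"),
]
suspects = find_duplicate_candidates(rows)
for name, pairs in suspects.items():
    print(name, pairs)
```

Agents where every TEMS agrees drop out of the result, which matches the report's default of showing only problem cases.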

This package uses a TEPS utility to get the TEMS data for the report. Therefore it is run on the same system as the TEPS.

Package Installation

The following assumes TEPS was installed in the default directory. The data collection work is done on the system which runs the TEPS.   If you are using a non-default install directory then you will need to set an environment variable or specify the install directory in a parameter.

The package is inodests_sum.0.56000. It contains:

1) Perl program inodests_sum.pl.

I suggest inodests_sum.pl be placed in an installation tmp directory. For Windows you need to create the <installdir>\tmp directory. For Linux/Unix create the tmp directory if it does not already exist. You can of course use any convenient directory.

Linux/Unix: /opt/IBM/ITM/tmp

Windows: c:\IBM\ITM\tmp

Linux and Unix almost always come with Perl installed. For Windows you can install a no cost Community version from http://www.activestate.com if needed.

Parameters for running inodests_sum.pl

All parameters are optional if defaults are taken

-h home installation directory for TEPS. Default is

Linux/Unix: /opt/IBM/ITM

Windows: c:\IBM\ITM

This can also be supplied with an environment variable

Linux/Unix: export CANDLEHOME=/opt/IBM/ITM

Windows: set CANDLE_HOME=c:\IBM\ITM

-o Output file name

 default is inodests_sum.csv in current directory

-h Help display

-work where to store TEMS database files, default is temp directory, period means current directory

-all record results for all agents, not just problem cases, default show only problem cases

-off include offline agents, usually not much value

-redo perform the report logic using the existing files. The hub.lst file must be manually determined and renamed. This is mostly for reporting defects to the author.

-aff handle one case of lst data from an older TEMS database level

-thrunode create thrunode.csv file for use in a TEMS Database File restoration project. These are consensus thrunodes based on hub and remote TEMSes. The new project recreates missing TNODELST NODETYPE=V records and TNODELST NODETYPE=M system generated Managed System List entries – which are sometimes missing.
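As a rough illustration of how the install directory choices listed above could fit together, here is a small Python sketch. The precedence shown (command line value first, then the environment variable, then the platform default) is an assumption and is not confirmed by the program itself. The function name resolve_install_dir is hypothetical.

```python
import os
import sys

def resolve_install_dir(option_value=None, environ=None, platform=None):
    """Pick the ITM install directory.

    Assumed order: explicit option, then environment variable, then default.
    environ and platform can be passed in to make the function easy to test.
    """
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform
    windows = platform.startswith("win")
    if option_value:
        return option_value
    # The variable name differs by platform, as listed above.
    env_name = "CANDLE_HOME" if windows else "CANDLEHOME"
    if environ.get(env_name):
        return environ[env_name]
    return r"c:\IBM\ITM" if windows else "/opt/IBM/ITM"

print(resolve_install_dir())
```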

Running the inodests_sum.pl

In the temporary directory

perl inodests_sum.pl

Report format

See below for comments.

[Image: dup_rep2, first report snippet]

Rows 49 and 50 are identical in meaning. Column B is the source: which TEMS supplied the data. Column C is the THRUNODE: where the agent connected. Column D is the HOSTADDR: what system the agent was on and what the listening port was.

Row 48 shows the same agent name reporting to another remote TEMS and using a different ip address.

The conclusion here is that two agents are running on two different systems with the same name. This causes problems and should be stopped.

See below for comments on second report snippet.

[Image: dup_rep3, second report snippet]

Rows 7 and 8 are identical in meaning. Column B is the source: which TEMS supplied the data. Column C is the THRUNODE: where the agent connected. Column D is the HOSTADDR: what system the agent was on and what the listening port was.

Row 6 shows the same agent name reporting to another remote TEMS from the same system using a different listening port.

The conclusion here is that two agent instances are running on the same system. That is unusual and it should be stopped.
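The two report snippets differ only in what changes inside HOSTADDR: a different ip address points to two systems, while the same address with a different port points to two instances on one system. The following Python sketch shows that distinction. It assumes a HOSTADDR layout like `ip.spipe:#10.0.0.1[7757]<NM>name</NM>`; that layout, the sample values and the function names are assumptions for illustration, so check them against your own report before relying on the pattern.

```python
import re

# Assumed layout: "<protocol>:#<address>[<port>]..." (an assumption, see above).
HOSTADDR_RE = re.compile(r"#(?P<ip>[^\[\]]+)\[(?P<port>\d+)\]")

def parse_hostaddr(hostaddr):
    """Return (address, port) or None when the text does not match."""
    match = HOSTADDR_RE.search(hostaddr)
    if match is None:
        return None
    return match.group("ip"), int(match.group("port"))

def classify(hostaddr_a, hostaddr_b):
    """Compare two HOSTADDR values seen for the same agent name."""
    a, b = parse_hostaddr(hostaddr_a), parse_hostaddr(hostaddr_b)
    if a is None or b is None:
        return "unknown"
    if a == b:
        return "same agent"
    if a[0] != b[0]:
        return "different systems"
    return "same system, different ports"

print(classify("ip.spipe:#10.0.0.1[7757]", "ip.spipe:#10.0.0.2[7757]"))
print(classify("ip.spipe:#10.0.0.1[7757]", "ip.spipe:#10.0.0.1[15949]"))
```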

Correcting Problems

The general procedure is to investigate and resolve. When two agent instances are running on the same system, log in to that system and see why. Perhaps one was supposed to shut down and the shutdown failed. Perhaps there are actually two different agents installed. When the same agent name is used on two different systems, the agents likely each have CTIRA_HOSTNAME configured but accidentally with the same value. One of the agents needs to be reconfigured.

Thrunode Report file

The -thrunode option creates the thrunode.txt file in the current directory. This file reports the calculated valid remote TEMS each agent is configured to connect to. If there is a conflict [an agent reporting to multiple remote TEMSes], that agent is left out of the report. The thrunode.txt report is planned for use in a new project to restore some cases of missing TNODELST objects.
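The consensus rule described here can be sketched in a few lines of Python. This is an illustration under assumptions, not the program's code: the input is taken to be (agent name, thrunode) pairs gathered from the hub and remote TEMSes, and the function name and sample values are invented for the example.

```python
from collections import defaultdict

def consensus_thrunodes(rows):
    """Map each agent to its thrunode when every TEMS agrees on it.

    rows: iterable of (agent_name, thrunode) pairs (hypothetical layout).
    Agents seen with more than one thrunode are left out.
    """
    seen = defaultdict(set)
    for agent_name, thrunode in rows:
        seen[agent_name].add(thrunode)
    return {
        name: next(iter(thrunodes))
        for name, thrunodes in sorted(seen.items())
        if len(thrunodes) == 1
    }

rows = [
    ("a:LZ", "RTEMS1"), ("a:LZ", "RTEMS1"),   # hub and remote agree
    ("b:LZ", "RTEMS1"), ("b:LZ", "RTEMS2"),   # conflict, so b:LZ is skipped
]
result = consensus_thrunodes(rows)
print(result)
```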

Reporting problems

The program captures the TEMS output of the Node Status table at each TEMS. If things do not work as expected, please capture those files in a zip or compressed tar file and send them to the author. I will endeavor to correct any issue promptly.

Summary

The information in the report will show cases where two or more TEMSes have differing information about particular agents. In the simplest cases that strongly suggests duplicate agent names.


History

inodests_sum.0.56000

option to export known good thrunodes – remote TEMSes that agents connect to

Note: Overhead Lights on New Cruise Ship

 
