
Sitworld: Table of Contents

Daffodils Rescued

John Alvord, IBM Corporation

jalvord@us.ibm.com

 

Follow on twitter

Inspiration

As the number of blog posts increases, it becomes harder to locate the ones of interest. The first section lists the six posts I consider most important.

The second section lists all the posts with very short comments.

Top 6 By Importance [My Prejudiced View]

ITM Database Health Checker

It is common to see a TEMS database [often called the EIB] with problems that cause confusion or sometimes a lack of monitoring. This project identifies and documents 50+ advisories that will make things better.

Best Practice TEMS Database Backup and Recovery

The most costly support cases occur when a customer does not have a proper backup. One memorable case followed a Storage Area Network device losing power when the most recent backup was over a year old. I talk to people every day where TSM is used to make copies of the TEMS databases, and that is almost always insufficient. This post was written jointly by a top L3 engineer and myself. If everyone followed it, the time to recover would drop substantially.

MS_Offline – Myth and Reality

MS_Offline type situations are extremely weighty and cause problems “at a distance”. For example, a recent case with 9545 agents and 22 MS_Offline situations at a 5 minute sampling interval spawned multiple IBM Support interactions. They all came back to this one issue. When Persist>1 is set, the problems are much worse. The blog photo shows a California Condor [VERY LARGE VULTURE] lurking outside a window. Treat MS_Offline type situations as dangerous creatures and you will reduce your risk of injury and pain.

TEMS Audit Process and Tool

This has been available since 2012. It is a perfect way to examine the dynamic impact of workload [Situations, SOAP, real time data requests, etc.] on a TEMS. With that knowledge you can make changes to avoid problem conditions. I have one customer who runs this on every TEMS each weekend and, if “advisory messages” are present [noted via a non-zero exit code], sends the report to an analyst for review. The rate of emergency IBM Support meetings has dropped to near zero… at least for this area.

ITM Agent Health Survey

This tool provides a view of agents which are online but possibly non-responsive. Cases like this mean that real time data response is slow or partially missing, situations are not running, and historical data is not being recorded. These are things everyone should worry about. This identifies the guard dog that doesn’t bark.

ITM Situation Audit

This tool performs a static analysis of all distributed situations and produces a report of warning messages. It also reports which situations require TEMS filtering [instead of Agent filtering], which is a prime performance killer. Together with TEMS Audit you can really increase efficiency, reducing the cost of monitoring. It also gives early warning of situations with problems. Surprisingly, 50 of 51,000 situations studied actually had syntax errors, like VALUE instead of *VALUE. Anyway, I expect this to be an important tool over time.

Sitworld All Posts – Most recent first

Sitworld: Eliminating Duplicate Agents 5/29/2020 Eliminating Duplicate Agents
Sitworld: Summarization and Pruning Audit 3/23/2020 Summarization and Pruning Audit
Sitworld: ITM Permanent Configuration Best Practices 1/17/2020 ITM Permanent Configuration Best Practices
Sitworld: Scrubbing Out Windows Agent Malconfiguration Remotely 2/6/2019 Scrubbing Out Windows Agent Malconfiguration Remotely
Sitworld: Agent Diagnostic Log Communications Summary 8/20/2018 Agent Diagnostic Log Communications Summary
Sitworld: Adventures in Communications #1 7/2/2018 Adventures in Communications #1
Event History #15 High Results Situation to No Purpose 5/25/2018 High Results Situation to No Purpose
Event History #14 Lodging Problems 5/21/2018 Lodging Problems
Event History #13 Delay Delay Delay 5/10/2018 Delay Delay Delay
Event History #12 High Impact Situations And Much More 5/1/2018 High Impact Situations And Much More
Event History #11 Detailed Attribute differences on first two merged results 4/27/2018 Detailed Attribute differences on first two merged results
Event History #10 lost events because DisplayItem missing or null Atoms 4/24/2018 lost events because DisplayItem missing or null Atoms
Event History #9 Two Open Or Close Events In A Row 4/22/2018 Two Open Or Close Events In A Row
Event History #8 Situation Events Opening And Closing Frequently 4/21/2018 Situation Events Opening And Closing Frequently
Event History #7 Events Created But Not Forwarded 4/19/2018 Events Created But Not Forwarded
Event History #6 Lost events with Multiple Results with same DisplayItem at same TEMS second 4/17/2018 Lost events with Multiple Results with same DisplayItem at same TEMS second
Event History #5 Multiple Results Same DisplayItem Same Second 4/16/2018 Multiple Results Same DisplayItem Same Second
Event History #4 Conflict Between DisplayItem and Attributes 4/13/2018 Conflict Between DisplayItem and Attributes
Event History #3 Lost Events Because DisplayItem has Duplicate Atoms 4/13/2018 DisplayItem has Duplicate Atoms
Event History #2 Duplicate DisplayItems At Same Second 4/10/2018 Duplicate DisplayItems At Same Second
Event History #1 The Situation That Fired Oddly 4/4/2018 The Situation that cried Wolf
Event History Audit 4/3/2018 Examine Event History in detail
Policing the Hatfields and the Mccoys 6/5/2016 Advanced Base/Until Sits
TEMS Audit Tracing Guide Tracing Guide Appendix 7/7/2017 TEMS Audit Tracing
ITM 6 Interface Guide Using KDEB_INTERFACELIST 6/30/2017 Document usage of KDEB_INTERFACELIST
ITM Agent Historical Data Export Survey 5/4/2017 Detect historical export issues at agents
FTO Configuration Audit 3/9/2017 Detect FTO configuration issues
Portal Client [TEP] on Windows Using a Private Java Install 12/28/2016 Avoid issues with system Java updates
TEMS Database Repair 11/18/2016 Recover from some broken TEMS database files
The Encyclopedia of ITM Tracing and Trace Related Controls 9/19/2016 Document tracing controls
ITM2SQL Database Utility 6/19/2016 Create TEMS database table report files
Real Time Detection of Duplicate Agent Names 3/23/2016 Duplicate Agent Live Detection
Portal Client Java Web Start JNLP File Cloner 3/18/2016 Create JNLP clone files for different types of TEP users
TEPSI Interface Guide 3/18/2016 Learn about TEPS Interfaces
Diagnostic Snapshot Utility 1/4/2016 Capture diagnostics on the fly
tacmd logs summary 12/31/2015 Summarize tacmd diagnostic logs
Restore Usability to ITCAM YN Custom Situations 12/24/2015 Fix some user custom situation affinities
TEPS Audit 9/15/2015 Report on Potential Duplicate Agent names
Re-re-re-mem-ember Situation Status Cache Growth Analysis 8/1/2015 Identify pure situation w/changing DisplayItems
Attribute and Catalog Health Survey 4/19/2015 Check for missing or mis-used cat/atr files
ITM Database Health Checker 3/24/2015 Check TEMS database for issues
Suppressing Situation Events By Time Schedule 3/13/2015 Simple example of Until with timer schedule
Alerting on Daylight Savings Time Truants 2/27/2015 Situation alert when time differences
Report on Daylight Savings Time Truants 2/20/2015 Report on Daylight Savings Time problems
Situation Formula with Calculations 1/28/2015 How to effectively calculate a formula
ITM Agent Census Scorecard 11/24/2014 Report avoidable TEMA defects
ITM Protocol Usage and Protocol Modifiers 10/21/2014 How to increase SOAP ports and much more
Agent Workload Audit 10/08/2014 What is actually happening at Agents
Situation Distribution Report 7/11/2014 What Situations are running where
CPAN Library for Perl Projects 7/11/2014 Using Perl without changing system
ITM Virtual Table Termite Control Project 6/17/2014 Recover from Performance Issue
ITM TEMS Health Survey 6/9/2014 Verify TEMS central services are working
The Situation That Cried Wolf 6/1/2014 Craft a situation for good practical results
Statistics After 50,000 Views 5/19/2014 Summary to date
*MIN and *MAX – the Little Column Functions That Couldn’t 5/15/2014 Two broken Column functions
A Situation By Any Other Name… 4/28/2014 Discovering situation names
Do It Yourself TEMS Table Display 4/28/2014 Do It Yourself – Run SQL
Running TEMS without SITMON 4/7/2014 Recovery when TEMS very broken
ITM Situation Audit 3/20/2014 Compiler or Lint for Situation Formulas
SOAP Flash Flood 2/1/2014 tacmd bulkexportsit -d stresses TEMS
Sample EIF Listener project 1/17/2014 Do It Yourself Event listener
Situation Limits 12/31/2013 Situations have many limits
Put Your Situations on a Diet Using Indexed Attribute 12/19/2013 Performance boost for some Situations
Sampled Situations and Until Situations 11/25/2013 Until Processing expose
TEMS Audit Process and Tool 11/16/2013 Measure Agent stress on TEMS
Detector/Recycler for ITM Windows OS Agent 11/2/2013 Windows OS Agent recycler high CPU
1997 Kasparov vs. Deep Blue Chess Match 9/17/2013 Virtual Table hub Update hidden issue
ITM Agent Health Survey 9/6/2013 Discover unhealthy agents
Sampled Situation Blinking Like a Neon Light 9/4/2013 When situation events auto-close
Sampling Interval and Time Tests 8/24/2013 Sampled situations and time to event
TEMS Audit Advisory Messages 8/13/2013 Included in TEMS Audit Process and Tool
Situations Caused Domain Name Server Overload 7/24/2013 Situation generated emails hurt DNS
Configuring a Stable SOAP Port 7/16/2013 Best Practice when SOAP is vital
Best Practice TEMS Database Backup and Recovery 7/12/2013 If you don’t have a backup plan read this
Action Command Wars – A New Beginning 7/9/2013 Running lots of action commands
Detecting and Recovering from High Agent CPU Usage 7/1/2013 Linux/Unix OS Agent High CPU recover
An Efficient Design for Starting a Background Process 6/20/2013 Elegant hack
Adding Environmental Data to Action Command Emails 6/12/2013 When attributes are not enough
Situation Managing Other Situations 6/5/2013 Situation creates MSL
Mixed Up Situations 5/28/2013 Multiple Attribute Situation issues
Efficient Situation for Two Missing Processes 5/22/2013 Elegant efficiency solution
Getting a Good Nights Sleep 5/15/2013 Creating events to keep operators happy
Rational Choices for Situation Sampling Intervals 5/8/2013 Best Practice Interval choices
The Derivative Log Pattern 5/1/2013 Two stage situation logic
Super Duper Situations 4/28/2013 Understanding _Z_ situations
MS_Offline – Myth and Reality 4/17/2013 Everything about MS_Offlines
Auditing TEMS for Improved Performance 4/4/2013 Included in TEMS Audit Process and Tool
ITM Silver Blaze – Agent Responsiveness Checker 3/28/2013 replaced by ITM Agent Health Survey
ITM TEMS Stress Tester Experiment 3/20/2013 ITM Analytics experiment

Summary

Wonderful World of Situations Table of Contents.

Photo Note: Daffodils rescued from the Big Sur house fire garden – February 2014

 


Sitworld: Introduction

During my normal work, I see many interesting puzzles on how to accomplish useful work in IBM Tivoli Monitoring [ITM]. Often these revolve around situations.
Over time I will present some basic education on the subject, but first there are some interesting cases that will benefit from some interaction. So for the moment, I will assume you are familiar with ITM and its jargon.
This blog has been cloned and extended from an IBM site called DeveloperWorks which was shut down on 2020/1/2.
John Alvord
Advisory Engineer
ITM L2 Support
jalvord@us.ibm.com
johngrahamalvord@gmail.com

Original Publish Date 2013/3/20

Sitworld: Simple Network Testing

orchid

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 6 August 2020 – Level 0.50000

Follow on twitter

Inspiration

Benchmarking ITM communication links is a tough job. You can validate many aspects using this document:

ITM Communications Validation

But testing for throughput, capacity, and errors is tough. Happily, a former ITM developer implemented a technique to do much of this work. The rest of this document explains the process. It is documented for Linux/Unix environments.

Introduction

These “SimpleNetworkTests” are intended to measure the theoretical memory bandwidth of a single machine and the effective network bandwidth between two machines. Using product distributed binaries, a bulk-data transfer is performed using Tivoli Basic Services. This allows measurement of the customer’s IO subsystem capacity available to Basic Services’ clients: TEMS, TEPS, TEMA, and WHP.

Tivoli Monitoring data moves between TEMA and TEMS using Basic Services’ RPC calls. These simple network tests also use Basic Services’ RPC calls to perform “bulk-data transfer”, but they do so outside of the Tivoli Framework Management Server and Agent processes. A communication issue discovered in BOTH the simple network tests and the customer’s TEMS and TEMA processes implicates the external I/O subsystem. Communications issues found in the TEMS (for example, a takeSample failure) with no corresponding issues in the “bulk-data transfer” simple network tests are assumed to be rooted in the Tivoli Framework.

Issues in the customer’s IO subsystem become visible when the performance of these simple network tests varies depending on the direction of the data flow. The ratio of the network transfer rate to the theoretical bandwidth for a system should be the same in both directions (measured inbound bandwidth should equal measured outbound bandwidth). Asymmetric results usually indicate IP routing malfeasance, mismatched or biased buffer configurations, or MTU inconsistencies.
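The symmetry rule can be checked with a small script once the two measured rates are in hand. Here is a minimal sketch using hypothetical rates; the real figures would come from the *.bandwidth.txt aggregate files described under “Interpreting test results”:

```shell
# Symmetry check sketch with hypothetical measured rates in MBytes/sec;
# real figures would come from the *.bandwidth.txt aggregate files.
inbound=5.0     # transfer rate measured machine A -> machine B
outbound=5.0    # transfer rate measured machine B -> machine A
ratio=$(awk -v a="$inbound" -v b="$outbound" 'BEGIN { printf "%.2f", a / b }')
echo "inbound/outbound ratio: $ratio"
# A ratio far from 1.00 suggests routing, buffer, or MTU asymmetry.
```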

Install, Configuration and Use

SimpleNetworkTests v630 are performed using existing operating system commands and the Tivoli Monitoring binaries and shared libraries resident on the customer platform, to minimize security and failure risks. Using the “itmcmd execute” command on the customer platform results in a run-time environment identical to and composed of the installed Tivoli Monitoring software.

The TESTS and the expected results

kdcexed is the server daemon process. Like the TEMS, kdcexed listens for client connections and receives data. In these simple network tests, kdcexed is always started first and is run as a background process. kdcexer is the client process. Like the TEMA, kdcexer connects to the kdcexed server and sends data. In these simple network tests, kdcexer is started with an integer parameter and run in the foreground.

LOOPBACK adapter tests. These tests are performed on a local system. Packets are exchanged on the loopback device only.

  1. Base connection test verifies local connectivity and is performed with two commands as root on the TEMS:
    1. $CANDLEHOME/bin/itmcmd execute ux “kdcexed &” to start the server daemon, then
    2. $CANDLEHOME/bin/itmcmd execute ux “kdcexer 0” to send a single client packet to the server. kdcexer launches the simple network tests client with the parameter “0”, telling the client to stop the server. If both kdcexed and kdcexer commands complete with a zero return code, the run-time environment is correctly established.
  2. Bulk-data transfer tests establish the machine’s theoretical bandwidth and are performed with these two commands as root on the TEMS:
    1. $CANDLEHOME/bin/itmcmd execute ux “kdcexed &” to start the server daemon, then
    2. $CANDLEHOME/bin/itmcmd execute ux “kdcexer [1,2, … , N]” to start “N” client threads, where each thread sends 2 GBytes of data to the server. The SUM of the data transmitted IN and received OUT divided by the wall-clock time establishes the theoretical bandwidth of the machine.

BULK-DATA transfer tests. These tests are performed across the network, between two machines, using the ITM binaries kdcexed and kdcexer. In this section, we are examining the theoretical bandwidth between two machines. We will use the tags “sending TEMS” and “receiving TEMS”, with the understanding that kdcexer runs on the “sending TEMS” and kdcexed runs on the “receiving TEMS”.

  1. Base connection test verifies remote connectivity and is performed with two commands as root:
    1. $CANDLEHOME/bin/itmcmd execute ux “kdcexed &” on the receiving TEMS to start the server daemon, then
    2. $CANDLEHOME/bin/itmcmd execute ux “kdcexer 0” on the sending TEMS to send a single client packet to the server. kdcexer launches the simple network tests client with the parameter “0”, telling the client to stop the server. If both kdcexed and kdcexer commands complete with a zero return code, the run-time environment is correctly established.
  2. Bulk-data transfer tests establish the network’s theoretical bandwidth and are performed with these two commands as root:
    1. $CANDLEHOME/bin/itmcmd execute ux “kdcexed &” on the receiving TEMS to start the server daemon, then
    2. $CANDLEHOME/bin/itmcmd execute ux “kdcexer [1,2, … , N]” on the sending TEMS to start “N” client threads, where each thread sends 2 GBytes of data to the server. The SUM of the data transmitted IN and received OUT divided by the wall-clock time establishes the theoretical bandwidth of the network.

Transaction tests. These tests are performed across the network, between two machines, using the ITM binary kdh1. Launched without any parameters, kdh1 performs as an http server.

    An http server daemon can be instantiated with the command:

  • $CANDLEHOME/bin/itmcmd execute ux “kdh1 &”

    An http client can be launched with the following command:

  • $CANDLEHOME/bin/itmcmd execute ux “kdh1 -i http_client_requests.urls” where the file $CANDLEHOME/http_client_requests.urls contains a list of http client requests.

Interpreting test results

The loopback logs for 519c4lp6 are these:

  • 519c4lp6_ux_kdcexed_5796eaca-01.log
  • 519c4lp6_ux_kdcexed_5796eaca-02.log
  • 519c4lp6_ux_kdcexed_5796eaca-03.log
  • 519c4lp6_ux_kdcexed_5796eaca.bandwidth.txt is aggregate
  • 519c4lp6_ux_kdcexer_5796eae1-01.log
  • 519c4lp6_ux_kdcexer_5796eae1-02.log
  • 519c4lp6_ux_kdcexer_5796eae1.bandwidth.txt is aggregate

The sending process, kdcexer, xmits 21,000 1K blocks (21 Meg) in 5 seconds. This is seen in 519c4lp6_ux_kdcexer_5796eae1.bandwidth.txt. The receiving process, kdcexed, receives 21,000 1K blocks (21 Meg) in the same 5 seconds. This is seen in 519c4lp6_ux_kdcexed_5796eaca.bandwidth.txt. [ (21 MBytes in + 21 MBytes out) / 5 seconds ] = 8.4 MBytes/sec.
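The arithmetic can be checked directly; this one-liner reproduces the 8.4 MBytes/sec figure from the block counts above:

```shell
# Reproduce the loopback bandwidth arithmetic from the logs above:
# 21 MBytes sent plus 21 MBytes received, over 5 seconds of wall-clock time.
awk 'BEGIN { printf "%.1f MBytes/sec\n", (21 + 21) / 5 }'
# prints: 8.4 MBytes/sec
```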

The loopback logs for bl59lp5 are these:

  • bl59lp5_ux_kdcexed_5796c98d-01.log
  • bl59lp5_ux_kdcexed_5796c98d.bandwidth.txt is aggregate
  • bl59lp5_ux_kdcexer_5796c9a4-01.log
  • bl59lp5_ux_kdcexer_5796c9a4.bandwidth.txt is aggregate

This machine is faster. The entire 42 MBytes is moved in 3 seconds, giving us a rate of 14 MBytes / sec.

The network logs of bl59lp5 sending to 519c4lp6 are these:

  • bl59lp5_ux_kdcexer_5796f2dd-01.log
  • 519c4lp6_ux_kdcexed_5796eefb-01.log
  • 519c4lp6_ux_kdcexed_5796eefb-02.log
  • 519c4lp6_ux_kdcexed_5796eefb-03.log

bl59lp5_ux_kdcexer_5796f2dd.bandwidth.txt shows we transferred 21 Meg in 4 seconds, giving a rate of 5 MBytes/sec.

The network logs of 519c4lp6 sending to bl59lp5 are these:

  • 519c4lp6_ux_kdcexer_5796eccb-01.log
  • 519c4lp6_ux_kdcexer_5796eccb-02.log
  • bl59lp5_ux_kdcexed_5796ecad-01.log

In either direction, the 21 MBytes was transferred in 4 seconds, for an effective machine-to-machine transfer rate of 5 MBytes/sec.

Notes

  • KBS_DEBUG=N and KDC_DEBUG=Y is the trace level required for all Simple Network Tests.
  • The aggregate ‘bandwidth’ reports are generated by grepping the RAS1 logs for “(1024)” and re-directing the output from all logs of a specific process instance (“xxx…-01.log”, “xxx…-02.log”, … “xxx…-NN.log”) to the file ‘*.bandwidth.txt’. Insert CR/LF using WordPad on ‘*.bandwidth.txt’ and save as “Text document – MS-DOS format”.
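The aggregation step can be sketched as follows, using the kdcexer log names from the earlier example; the grep -h flag simply suppresses file name prefixes in the combined output:

```shell
# Build one aggregate bandwidth report by grepping every RAS1 log segment
# of a single process instance for the "(1024)" block-transfer lines.
# Log names follow the pattern shown above (hostname_ux_process_pid-NN.log).
grep -h "(1024)" 519c4lp6_ux_kdcexer_5796eae1-*.log \
    > 519c4lp6_ux_kdcexer_5796eae1.bandwidth.txt
```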

 

Summary

This tool and process will ease the effort of measuring ITM communication capacity and detecting network problems before they cause confusion. The benefit is well worth the effort.

Sitworld: Table of Contents

History and Earlier versions

0.50000 – Initial publish

Photo Note: Orchids Galore on the Kitchen Counter

 

Sitworld: Eliminating Duplicate Agents

orchid

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #5 – 20 August 2020 – Level 0.61000

Follow on twitter

Inspiration

Duplicate agent names in an ITM environment present a multi-dimensional horror. That is when two different agents with the same name connect to a remote or hub TEMS. Here are the bad things that can happen:

1) Reduced monitoring, because only one agent at a time can report results. As each agent connects, the earlier one is ignored and the new one has to be populated with the right situations. Effectively no agent gets to warn of issues reliably, and thus monitoring is at best diminished and at worst almost eliminated. It can even lead to some historical data always being all zeroes.

2) Each time an agent is seen to change in some way [different IP address, different remote TEMS, different agent type, different agent version number, different listening port, etc.], the TEPS needs to recalculate the Topology: a representation of which ITM processes are connected to the remote and hub TEMSes. That can be very expensive, and duplicate agents cause the recalculation to happen frequently. TEPS performance gets very bad – visible as Navigator Updates Pending constantly showing up.

3) The process of dropping one agent instance and starting up another is a workload drain on the remote TEMS [mostly] and the hub TEMS somewhat. This sometimes causes TEMS instability including crashes.

4) Duplicate agents create human confusion and wasted work. An alert is seen at an agent [one of a duplicate pair]. An investigation is launched at its duplicate partner and no trouble is found! The issue continues and human work is wasted.

5) Historical data becomes suspect. Historical data is recorded by agent name, and with duplicate agents you get doubled data that really applies to the pair. More human confusion.

There are lots of reasons to avoid duplicate agents. Until now, repairing those cases has required intensive manual work. This new tool automates some of the work and reduces the manual labor in correcting the condition. It doesn’t solve every problem, but it speeds the process and exposes the data needed to begin manual corrections.

Parallel: the second Thomas Crown Affair movie (1999) tells the story of an art robbery. The thief is notably dressed in a sharp suit, thin tie, and bowler hat, and carries an attaché case. In the climactic scene, hundreds of hired actors dressed identically walk at random through the art museum halls and rooms, and the thief walks away with the prize. I have often thought of this as an excellent parallel to having duplicate agents.

Background – How Duplicate Agents Arise

Here are some of the ways duplicate agents arise. This list is certainly incomplete because new methods are found from time to time. Agent names are formed by default from 1) a prefix [like the Primary seen in Windows OS Agents]; 2) the hostname; and 3) a suffix which indicates the agent type. Those are concatenated with colons [:] between the values, the result is truncated to at most 32 characters, and that is the agent name. The system hostname can be overridden by setting the CTIRA_HOSTNAME and CTIRA_SYSTEM_NAME environment variables, which control the hostname and thus the agent name. That is a common theme in recovery actions and won’t be specifically called out in the following examples.

1) The hostname may be so long that the composed agent name is more than 32 characters long. If there are multiple agents on the same system, the truncation may create duplicate agent names even though the agents are of different types.
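The collision is easy to demonstrate with the simple right-truncation described above; the hostname here is invented:

```shell
# Two different agent types on the same long-named host end up with
# identical truncated names. The hostname is hypothetical.
host="longapplicationservername01.example"
for pc in NT LZ; do
  printf '%s\n' "Primary:${host}:${pc}" | cut -c1-32
done
# Both lines are the identical 32-character name; the :NT / :LZ suffix
# that distinguished the agent types was truncated away.
```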

2) The same type of agent may be installed in different install directories on the same system. In one memorable case, a Unix OS Agent was installed at /opt/IBM/ITM and a few years later a revised Unix OS Agent was installed on 345 systems at /opt/IBM/ITM/bin. In another case, the customer really wanted two OS Agents on a Linux system and just installed them in two installation directories. In a third case, a separate team installed an agent and was unaware of the existing agent.

3) Two different systems may be created with the same hostname. Thus two agents will be on different systems with the same name.

4) During a remote deploy operation, the old agent is stopped. Sometimes that shutdown fails and the old agent lingers with the same name. This can be identified because the two agents are from the same system with the same listening port. That is “impossible” for TCP communications, so it means the old agent is mostly stopped but is still sending node status updates even though it isn’t doing anything else. The recovery here is to stop the agent manually, kill -9 the remaining agent process, and then start the agent again.

5) Two agents could have the same CTIRA_HOSTNAME and CTIRA_SYSTEM_NAME settings. Often that happens when a system cloning process fails to adjust the above settings. The worst case seen [so far] was 500+ Windows OS Agents with the same name.

6) The two agents could be in a cluster environment: two or more systems which run a single service, like a database server; of course the database server is only active on one system at a time. In such a case a monitoring agent is sometimes set to run on all sides of the cluster. The hostname is often the same, and that creates the duplicate agent condition. The correct recovery is to 1) start and stop the agent as the system being monitored is started/stopped, 2) use CTIRA_HOSTNAME and CTIRA_SYSTEM_NAME to make sure the agent has the same name on every node, and 3) configure KDEB_INTERFACELIST=xx.xx.xx.xx, where xx.xx.xx.xx is the Virtual IP Address for the cluster – a single IP address used to access the service [like a database] regardless of which system in the cluster is active. In this way the TEMS sees a single system that occasionally stops and starts and NEVER changes name or IP address.
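In the agent’s environment file (for example lz.ini for a Linux OS Agent), the three cluster settings might look like this; the node name and Virtual IP Address are invented:

```shell
# Hypothetical cluster settings; identical on every node of the cluster,
# so the TEMS always sees one agent with one name and one address.
CTIRA_HOSTNAME=dbcluster01
CTIRA_SYSTEM_NAME=dbcluster01
KDEB_INTERFACELIST=10.20.30.40    # the cluster Virtual IP Address
```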

7) Some agents take on the role of managing agents. For example, the Tivoli Log Agent can run singly or with multiple subnode agents. Those subnode agents are specified in a configuration list. It is all too possible for two subnode agents, connected to different managing agents, to have the same name. In that case you need to change the managing agent configuration lists to ensure unique agent names.

8) The MQ agent has a unique way to specify a hostname. If that is not configured, the agent name will look like XXXXXX::MQ. That can create duplicate agents. The recovery here is to update the agent cfg and add “SET AGENT NAME(<hostname>)”. That fills in the blank name between the colons.

9) In ITM environments with z/OS agents, you may see some duplicate agents. There is no known way at present of resolving most of those issues.

10) Some large environments have little control of new agents being connected. You can eventually track down the issues but in the meantime you have a steady workload to recover.

A Semi-Automated Recovery Plan

Some of the above cases can only be recovered using manual configuration. The most labor-intensive cases arise when there are many of them. It is not uncommon to see dozens to hundreds. Some are the simple case of an OS Agent and a Tivoli Log Agent with the same name on two different systems. Imagine if there were 120 of those pairs. You also see cases of 50 to 500 identical agents. Tackling the 500+ identical Windows OS Agents could take weeks and require access to each system to reconfigure the agents.

The goal of this project is to automate the most painful tasks. This requirement was a paradox for a long time until I recognized this: if you have a duplicate agent pair, both agents are almost worthless for all the above reasons. Agent configured hostnames can be changed using a command like this:

./tacmd setagentconnection -n <os_agent_name> -t <pc>
-e CTIRA_HOSTNAME=xxxxxxxxx CTIRA_SYSTEM_NAME=xxxxxxxx

Where

os_agent_name: the agent name of the OS agent on the system

pc: the product code [like NT for Windows OS Agent]

xxxxxxxx: the new hostname.

*Note: we always change CTIRA_HOSTNAME [for TEMS awareness of the agent name] and CTIRA_SYSTEM_NAME [for TEPS awareness] and make them the same to avoid confusion.

If there were only two duplicate agents, we could use hostname-DUP1 as the revised name. When this completes, one of the two agents will have the old name and one will have a hostname with the characters -DUP1 appended. Thus there will be two different agent names and the duplicate agent name issue is resolved. More work is needed, of course.
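Assuming the placeholders are filled in, a concrete invocation might look like this; the agent and host names are invented:

```shell
# Hypothetical example: rename one half of a duplicate pair of Windows OS
# Agents (product code NT). After this completes, the renamed agent reports
# as Primary:HOST1234-DUP1:NT, resolving the name collision.
./tacmd setagentconnection -n Primary:HOST1234:NT -t NT \
  -e CTIRA_HOSTNAME=HOST1234-DUP1 CTIRA_SYSTEM_NAME=HOST1234-DUP1
```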

Any agent should be integrated into the ITM environment. Every agent is added to a system generated Managed System List like *NT_SYSTEM. The agent name may also have been added to custom Managed System Lists. Finally, the agent name may have been added to the direct distribution of some situations. There is also the use of Situation Group direct distribution, which is awaiting development [a test case is needed!].

To complete the work the user must

1) Rename the hostname-DUPn agent appropriately.
2) Add the new agent to any custom Managed System Lists where the old agent was used.
3) Add the new agent to the distribution of any situations where the old agent was used.
4) Determine why it happened and change processes so it doesn’t happen again.

The integration gets a little more lengthy if, say, you had 151 duplicate OS Agents. However the benefit is immediate: you get some monitoring going on, which is better than none. You can make all the subsequent changes remotely – for these cases anyway. There may be cases where manual configuration is needed.

You may also have some cleanup work to do. One customer with two sets of 100+ agents with the same name found that the automation partially failed. Some of the attempted changes were rejected because the agent was *OFFLINE. However many changes were successful and the manual work was seriously reduced. You can also repeat the automation process after the first wave of cleanup.

Package Installation

The package is dup2do.0.61000 and contains the dup2do.pl Perl program and a dup2do.cmd for Windows usage.

If this will be run on a Linux/Unix system, Perl is almost always already installed. If this will be run on a Windows system, you will have to install Perl. My current choice is Strawberry Perl.

http://strawberryperl.com/

It is good quality without serious license restrictions. Of course if it will be used on a company system, you will likely want to contact your manager and check with legal.

DUP2DO Process

[dup2do stands for Duplication work to do]

This project uses files created by three other Sitworld projects.

Sitworld: TEMS Audit Process and Tool

The first step is to review the TEMS Audit report file to see if duplicate agents have been identified. Here is a small extract of a good example:

TEMSREPORT082: Agent Flipping Report - Multiple Systems
Count,Agent,Hostaddrs, 
2,ibm_au_winatca5820:NT,ip.spipe:#9.13.192.26[REMOTE_AUULDPLITM020] ip.spipe:#10.114.95.68[REMOTE_AUBHDPLITM030],
2,ibm_id_cgkdcplesb01a:LZ,ip.spipe:#9.132.101.100[REMOTE_cgkibplitm010] ip.spipe:#10.132.187.63[REMOTE_cgkibplitm010],
2,ibm_id_cgkdcqwcom02:NT,ip.spipe:#9.132.101.30[REMOTE_cgkibplitm020] ip.spipe:#10.132.101.103[REMOTE_cgkibplitm010],

This extract involves three agent names, each reporting from two different systems. You see a count, the agent name, and the protocol/system with the remote TEMS involved.

Use a recent copy of TEMS Audit, 2.25000 or later. Rerun it against your hub TEMS logs directory:

perl temsaud.pl -v -logpath /opt/IBM/ITM/logs -dup

This will produce a file dedup.csv for dup2do. Here is an extract from the same source.

ibm_au_winatca5820:NT,ip.spipe:#9.114.95.68,
ibm_au_winatca5820:NT,ip.spipe:#9.13.192.26,
ibm_id_cgkdcplesb01a:LZ,ip.spipe:#9.132.101.100,
ibm_id_cgkdcplesb01a:LZ,ip.spipe:#9.132.187.63,
ibm_id_cgkdcqwcom02:NT,ip.spipe:#9.132.101.103,
ibm_id_cgkdcqwcom02:NT,ip.spipe:#9.132.101.30,

Copy that file to the directory where dup2do will be run. You can leave it in place and copy in the second and third needed files.

*NOTE* If there are relatively few duplicate agents [or none!], you can manually clean them up and skip the following more complex process. The process saves a lot of time when there are many duplicate agents but for a few agents manual configuration works fine.

The second file comes from this project

Sitworld: Situation Distribution Report

There is an initial data capture program which runs at the TEPS

Windows – sitinfo.cmd
Linux/Unix – sitinfo.sh [in sitinfo.tar]

Run that as instructed which will create a set of LST files.

Next run this command

perl sitinfo.pl -lst -onerow

The result will be in sitinfo.csv report file. Here is a small excerpt.

ibm_cpuutil_gntf_gsmabase,Fatal,9.13.192.26,ibm_au_winatca5820:NT,M|ibm_nt_infinity_prod;,*IF *VALUE NT_Processor.%_Processor_Time *GE 95 *AND *VALUE NT_Processor_Summary.High_Process_Name *NE mcshield *AND *VALUE NT_Processor.Processor *EQ '_Total',

ibm_dsp_gntc_win,Critical,9.13.192.26,ibm_au_winatca5820:NT,M|ibm_nt_infinity_prod;,*IF *VALUE NT_Logical_Disk.%_Used *GE 90 *AND *VALUE NT_Logical_Disk.Disk_Name *NE '_Total',

ibm_dsp_gntf_win,Fatal,9.13.192.26,ibm_au_winatca5820:NT,M|ibm_nt_infinity_prod;,*IF *VALUE NT_Logical_Disk.%_Used *GE 95 *AND *VALUE NT_Logical_Disk.Disk_Name *NE '_Total',

As you can see, it names the situation involved, the severity, the system IP address, the agent name, the distribution and the situation formula. The distribution field

M|ibm_nt_infinity_prod;

means it is distributed via a Managed System List [M] and gives the name.

Copy sitinfo.csv to the directory where dup2do will run. That might be in the logs directory.

The third file comes from the

Sitworld: Database Health Checker

That package comes with a shell file datasql.sh [in a datasql.tar container] or datasql.cmd. Follow the post instructions to run it using TEPS batch commands; it will create the files QA1DNSAV.DB.TXT and QA1CNODL.DB.TXT, which are also used by DUP2DO. This step is optional; if the files are not present, the dup2do_plus.csv output will simply be missing.

With those files ready, run the dup2do command

perl dup2do.pl

Options

-dupsleep nnnn     add a delay of nnnn seconds after each tacmd setagentconnection. This should usually be 660 seconds when there are cases of 3 or more systems running the duplicate agents. After a rename the original agent may not be online for a while. The process may therefore run for a long time [even overnight], but the results are beneficial.

-dupall hostname   for this hostname, rename all ITM agents on that system. The default is just the OS Agent. Review the dedup.csv manually and see if there are cases where the OS Agent is duplicated along with other agents. In that case set the dupall option, which means all agents on that system will change hostname with a single command. You can set a dupall option for each such case. Otherwise you will have to issue manual setagentconnection commands for the non-OS Agents on each such system.
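Putting the options together, an invocation might look like this [the hostname is an example from the reports above]:

```
# wait 660 seconds between waves and rename every agent type on ibm_s4450024
perl dup2do.pl -dupsleep 660 -dupall ibm_s4450024
```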

The dup2do.pl command creates several output files.

dedup.sh – Linux/Unix shell command to run the needed tacmd commands for OS Agents. Following is an example from that command [a single line].

./tacmd setagentconnection -n ibm_au_winatca5820:NT -t NT -e CTIRA_HOSTNAME=ibm_au_winatca5820-DUP1 CTIRA_SYSTEM_NAME=ibm_au_winatca5820-DUP1

dedup.cmd – Windows command file to run the needed tacmd commands; same as dedup.sh but for Windows.

dedup_nos.sh – Linux/Unix shell command to run the needed tacmd commands for other Agents.

dedup_nos.cmd – Windows command to run the needed tacmd commands for other Agents.

The above shell commands may contain comment lines about duplicate agents that require manual configuration. For example two TEPS might have the same name. Or an MQ agent has a duplicate name and needs a change to the cfg file to make the names unique.

The shell commands will run in waves with a 660 second delay between waves. That is because the duplicate agents not online need time to connect again. In extreme cases – like 150 duplicate names, it will still take a long time. The result is worth it.

There are also cases where duplicate agents are using indirect connections like with EPHEMERAL:Y or beyond a NATing firewall. In that case one can be converted but the process may need to be repeated.

dup2do.subnode.csv – Report on potential duplicate subnode agent names.

dup2do_correct.csv – Report on how to re-integrate the -DUPn agents into the ITM environment.
Example lines

MSL,ibm_nt_infinity_prod
add,ibm_au_winatca5820-DUP1:NT,

The new agent name has to be added to the above managed system list. The full report also shows the situations that need the agent name added to the distribution.

dup2do_edit.sh – Linux/Unix shell command to run the needed  commands

Example tacmd which will add the new agents to the named managed system list. In this case there were many new agents needing to be added.

./tacmd editsystemlist -e ibm_nt_infinity_prod -a ibm_au_aubhdpwcva001-DUP1:NT ibm_au_aubhdpwcva001-DUP2:NT …

dup2do_edit.cmd – Windows command file to run the needed commands; same as dup2do_edit.sh but for Windows.

The last file dup2do_plus.csv knits together the dedup.csv data and the QA1DNSAV.DB.TXT and QA1CNODL.DB.TXT data. It supplies supporting data when validating what will be happening. The second column “dup” means it came from the dedup.csv. The “msn” means it came from the TEMS database snapshot.

Case 1:  simplest case

9.22.71.58,dup,ibm_s4450024:06,ip.spipe:#9.22.71.58,
9.22.71.58,dup,ibm_s4450024:NT,ip.spipe:#9.22.71.58,

9.22.71.62,dup,ibm_s4450024:06,ip.spipe:#9.22.71.62,
9.22.71.62,dup,ibm_s4450024:NT,ip.spipe:#9.22.71.62,
9.22.71.62,msn,ibm_s4450024:06,Y,06,03.43.00,ip.spipe:#9.22.71.62[11853]<NM>ibm_S4450024</NM>,
9.22.71.62,msn,ibm_s4450024:NT,Y,NT,06.30.07,ip.spipe:#9.22.71.62[7757]<NM>ibm_s4450024</NM>,

The hostname ibm_s4450024 was used on two different systems, by an NT agent and a 06 agent. Both systems were called out as duplicate agents. The TEMS database shows that only two are actually online at the moment, as one system stole the other’s identity. This is a case where you can add -dupall ibm_s4450024 to the DUP2DO invocation and both the NT and 06 agents will get new names.

Case 2 – more complicated

9.30.34.38,dup,blinsts:ibm_smlsxls032:UD,ip.spipe:#9.30.34.38,
9.30.34.38,msn,ibm_smlsxlm032:KUL,Y,UL,06.22.02,ip.spipe:#9.30.34.38[15949]<NM>ibm_smlsxlm032</NM>,
9.30.34.38,msn,ibm_smlsxlm032:08,Y,08,03.20.00,ip.spipe:#9.30.34.38[7757]<NM>ibm_smlsxlm032</NM>,
9.30.34.38,msn,ibm_smlsxlm032:LZ,Y,LZ,06.30.07,ip.spipe:#9.30.34.38[11853]<NM>ibm_smlsxlm032</NM>,

9.30.34.39,dup,blinsts:ibm_smlsxls032:UD,ip.spipe:#9.30.34.39,
9.30.34.39,msn,blinsts:ibm_smlsxls032:UD,Y,UD,07.10.00,ip.spipe:#9.30.34.39[15949]<NM>ibm_smlsxls032</NM>,
9.30.34.39,msn,ibm_smlsxlm033:08,Y,08,03.20.00,ip.spipe:#9.30.34.39[7757]<NM>ibm_smlsxlm033</NM>,
9.30.34.39,msn,ibm_smlsxlm033:LZ,Y,LZ,06.30.07,ip.spipe:#9.30.34.39[11853]<NM>ibm_smlsxlm033</NM>,

Here you see a case where the UL/08/LZ agents have proper and distinct names, but the UD agent has the same hostname on both systems. At this moment the UD agent on 9.30.34.38 is not seen by the TEMS because the UD agent on 9.30.34.39 with the same agent name has stolen its identity. In this case someone needs to manually fix the UD agent on 9.30.34.38 so the two UD agents can coexist.

This may actually be a clustering case, where the UD agent(s) should be started and stopped with the DB2 systems, and the IP address forced to the virtual IP address that DB2 uses on whatever system it is running on.

Manual Finishing Work

The initial dedup.sh [or dedup.cmd] run creates non-duplicated agents, but with strange names.

Decide what your site agent names should be. Do another tacmd setagentconnection command to set each agent name as desired. Carefully track the old and new -DUPn names and the eventual site agent names. Run a tacmd cleanms command [or the TEP equivalent] to remove the -DUPn names which show as offline after that process.
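As a sketch [the final name winatca5820new is purely an example; check the exact cleanms syntax at your ITM level], the finishing rename and cleanup might look like:

```
# rename the -DUP1 agent to the chosen site-standard name...
./tacmd setagentconnection -n ibm_au_winatca5820-DUP1:NT -t NT -e CTIRA_HOSTNAME=winatca5820new CTIRA_SYSTEM_NAME=winatca5820new

# ...then, after it reconnects under the new name, remove the offline -DUP1 entry
./tacmd cleanms -m ibm_au_winatca5820-DUP1:NT
```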

Use the dedup_nos.sh or dedup_nos.cmd commands to clean up non OS Agents.

In all cases, review the shell comment lines and do manual corrections as needed.

Update the needed MSLs and Situation Distributions using the dup2do_correct.csv report file or the dup2do_edit.sh [or dup2do_edit.cmd] commands. Track the original name, the -DUPn name and the selected site agent name to make the right choices.

After a hub TEMS recycle and running for a while, redo the TEMS Audit and look at the TEMSREPORT082 report section to see what issues, if any, remain. Usually at this point a manual effort is required. Here is an example:

3,MSSQLSERVER:ibm_my_kuldcpwsql06:,ip.spipe:#9.136.174.49[REMOTE_IBBFDPLITM020] ip.spipe:#9.136.174.30[REMOTE_IBBFDPLITM020] ip.spipe:#9.136.174.48[REMOTE_IBBFDPLITM020],
2,MSSQLSERVER:ibm_sg_sgdcwpwsql01,ip.spipe:#9.105.171.34[REMOTE_sinibplitm030] ip.spipe:#9.105.171.33[REMOTE_sinibplitm030],

The names are so long that a truncation has occurred. In cases like these you need to examine each system and see what is happening. There could be clustering conflicts, or it could be subnode agents. Or maybe you should set CTIRA_SUBSYSTEM_ID to a shorter value. That sets the first section of the agent name… which defaults to blank except for the Windows OS Agent.

Note that z/OS agents are ignored in this process. Eliminating duplicate agent names is a future goal.

Summary

This tool and process will ease the effort of detecting and resolving duplicate agent name issues. This action will improve monitoring, reduce TEMS impact, reduce human confusion and help TEPS performance. The benefit is well worth the effort.

Sitworld: Table of Contents

History and Earlier versions

dup2do.0.61000
Handle case where only dedup.csv is present

dup2do.0.60000
On OS Agents, process even if missing in TNODESAV

dup2do.0.59000
Handle non-OS Instanced agents better

dup2do.0.58000
Handle OS agents and non-OS agents separately, track more error cases

dup2do.0.57000
Handle multiple levels better,warn multiple ephemerals

dup2do.0.56000
Handle more non-OS Agent cases, do not change HD/WPA agents

dup2do.0.55.000
Don’t use setagentconnection on MQ type agents
more non-os agent logic improvements

dup2do.0.54000
handle some non-OS Agent cases

dup2do.0.53000
handle managing agents better
make better output names
handle Situation Group distributions

dup2do.0.52000
handle long hostnames

dup2do.0.51000
correct sleep logic

dup2do.0.50000 initial version published

Photo Note: Orchids Galore on the Kitchen Counter

Sitworld: Summarization and Pruning Audit

Bandit_on_Keyboard

Version 0.51000 23 March 2020

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

ITM has a marvelous facility to store historical data. This facility includes logic to discard old data and summarize data into longer time periods. It is quite easy to get more data recorded, but balancing that with the limited database storage capacity is much harder. A recent customer had almost exceeded the 6 gigabytes available. If database storage ran out, historical data at the agent would not be transferred to the database and the agent file systems would be at risk of gradually filling up.

The S&P logs contain much of the needed information, but it is scattered about, S&P runs multiple threads at the same time, and since there can be hundreds or thousands of agents, manual extraction of the data is almost impossible.

Preparing for the install

Perl is usually pre-installed on Linux/Unix systems. For Windows you may need to install it from strawberryperl.com or any other source. The program only uses Perl core services; no CPAN modules are needed.

SP Audit has been tested with

This is perl 5, version 28, subversion 1 (v5.28.1) built for MSWin32-x64-multi-thread

zLinux with Perl v5.8.7

The package is a zip file, spaudit.0.51000, containing one file, spaud.pl. Install it somewhere convenient.

Run Time Options

Options:

-v Produce some progress messages in STDERR

The remaining parameter is a log file specification. This needs to be a single file like this

sutlpar71_sy_java_5a68cc39-01.log

Ideally this should be a selection of the log that represents a single S&P processing run. That can span several diagnostic log sections. Alternatively a single log section can represent multiple processing runs. So far I have not found a way to automate this selection, but I am still looking. View the sy_java.inv inventory file to see which diagnostic logs are current – the top line is the most recent log.

To isolate a segment search for “Trace resumed” for the starting point of a run and “Trace paused” for the ending point. Save those into a separate file for processing.
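As a sketch, the extraction can be scripted; this assumes the marker strings appear exactly as shown [the sample log here is fabricated for illustration – point awk at your real diagnostic log]:

```shell
# Build a tiny sample log to illustrate; in practice use the real
# diagnostic log such as sutlpar71_sy_java_5a68cc39-01.log.
printf '%s\n' 'startup noise' 'Trace resumed' 'KSY work line' 'Trace paused' 'trailing noise' > sample.log

# Keep only the lines of one S&P run, bounded by the marker strings.
awk '/Trace resumed/{grab=1} grab{print} /Trace paused/{grab=0}' sample.log > sp_run.log
```

The resulting sp_run.log can then be used as the file passed to spaud.pl.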

SP Audit Reports

Three reports, all keyed by the attribute group name [for example AIX_Network_Adapters].

sp_sum.csv – Summary Report – Sorted in descending order by Aggregate_size

Summarization and Pruning log Audit Report
Table,Nodes,Aggregate_Rows,Pruned,SQLs,Time,Aggregate_Size_Bytes,SizePC,TotSizePC,
Top_Memory_Processes,88,400360,0,923755,293742,746271040,12.75%,12.75%,
Network,88,403639,12096,198870,630066,645822400,11.03%,23.78%,

Table: The attribute group name
Aggregate_Rows: gathered from “Rows read” lines
Pruned: gathered from “Rows pruned” lines
SQLs: gathered from “For table” lines
Time: gathered from “Elapsed time” lines
Aggregate_Size_Bytes: aggregate rows * row size [from a built-in table]. The report is sorted by this column.
SizePC: Per cent of this table size compared to total size
TotSizePC: Cumulative per cent of table sizes

sp_det.csv – Detail Report

Summarization and Pruning log Detail
Table,Nodes,Aggregate,Pruned,SQLs,Time,
KVA_NETWORK_ADAPTERS_TOTALS,8,63805,6912,1358,39109,1,1,1,1,1,1,1,1,1,1,7,4008,255730440,0,
KVA_NETWORK_ADAPTERS_TOTALS_Y,0,0,0,0,1294,1,1,1,1,1,1,1,1,1,1,7,4008,0,0,
KVA_NETWORK_ADAPTERS_TOTALS_Q,0,0,0,0,131,1,1,1,1,1,1,1,1,1,1,7,4008,0,0,

Mostly used to diagnose the summary report; the columns are described minimally here:

Table
Nodes
Aggregate
Pruned
SQLs
Time
various unnamed columns which include the type of summarization and pruning and the row size.

sp_err.csv – Error Report – Track Down Issues

Summarization and Pruning Error Detail
AttributeGroup,Node,Line,SQL_exception,Batch_First_Exception,Exception,
KVA_PROCESSES_DETAIL,shtppvm01-vios1:VA,4633,SQL State = null , SQL Error Code = -4229,com.ibm.db2.jcc.am.SqlTransactionRollbackException: Error for batch element #1: The current transaction was rolled back because of error “-289”.. SQLCODE=-1476, SQLSTATE=40506, DRIVER=3.63.123,Failed to create aggregates for node: (shtppvm01-vios1:VA),

AttributeGroup: The attribute group name
Node: Agent name
Line: line number in the diagnostic log
Batch_First_Exception: lots of details
Exception: summary of the action

Sometimes the errors are obvious. If not, involve IBM Support to resolve the issue. Sometimes it is a database issue and sometimes it is agent application support.

Summary

Report on Summarization and Pruning processing.

Versions:

This project is also maintained at github.com/jalvo2014/spaudit and will sometimes be more up to date [and less tested] compared to the point releases. You can also use the github distribution to review history and propose changes via pull requests.

Here are recently published versions. In case there is a problem at one level you can always back up.

spaudit.0.51000

Correct spelling in titles

Sitworld: Table of Contents

Note: Bandit, a Maine Coon cat dreaming of a musical career

 

Sitworld: ITM Permanent Configuration Best Practices

Bandit_playing

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 1/16/2020 – Level 1.00000

Follow on twitter

Inspiration

I am often asked for advice on how to change an ITM process configuration. It could be something simple like adding a diagnostic trace definition. It could be more complex like adding EPHEMERAL:Y to a communication control. There are lots of ways to do this badly. This document will show reliable methods with no side effects. We will be mostly specifying how to introduce environment variables to make configuration changes.

These methods are platform specific. Therefore there are four sections for Linux/Unix, Windows, z/OS and i/5. The z/OS and i/5 do not have override schemes but are included for completeness.

Linux/Unix

The modern way to add or change environment variables is to add an environment file. This file lives in the same directory as the other configuration files, <installdir>/config. The environment file is used alongside the ini file. For example you will see an lz.ini file [for a Linux OS Agent] and the environment file will be lz.environment. Typically you create the environment file to match the file attributes of the ini file. For example

nmp180:/opt/IBM/ITM/config # ls -l lz.ini
-rwxrwxrwx 1 root root 2853 Feb 4 2015 lz.ini

cd /opt/IBM/ITM/config
touch lz.environment
chmod 777 lz.environment
chown root lz.environment
chgrp root lz.environment

Your environment will likely have different permissions, owner and group. The requirement is that the environment file have the same permissions/owner/group as the related ini file. Once created it is a zero-byte file which can be edited. You may find that such a file already exists. That can happen if you performed a tacmd setagentconnection operation with the -o option to set an environment variable. In such a case, you can just use it as is.

Environment File Usage and Purpose

In a Linux/Unix environment, the ini file and several other files are used to create a config file. This is the file the agent startup references for the environment variables that define how it should run. In many cases the config file is created each time the agent starts [or stops]. Some ITM processes are instanced agents; in that case the config file is created only during a ./itmcmd config operation. A TEMS is an instanced agent. During startup, after the config file is read in, the environment file is read and used to override the config file.

Environment files are known as permanent overrides because the file is not changed during ITM maintenance. Even when a ./itmcmd config is used to change settings, the environment file is still used unchanged. If you change the lz.ini file, the environment file will still override it. That can be both good and bad. If you forget about the environment file and make some changes, they might be overridden and not take effect, which would be confusing at best.

A good feature of the environment file is that you don’t need to redo the config operation on an instanced agent like a TEMS. It is permanent but also somewhat temporary. For example you could add a diagnostic trace setting and use it for a while and later delete it… All with no need for any reconfiguration.
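For instance, a temporary trace override might look like this in lz.environment [the trace string is an illustrative communications trace; use whatever unit tracing IBM Support requests]:

```
# Sample lz.environment content: raise diagnostic tracing for a while.
# Delete the line later and restart the agent to return to normal.
KBB_RAS1=ERROR (UNIT:KDC ALL)
```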

Environment file usage examples

There is one environment variable, KDC_FAMILIES, which should be overridden in an environment file. Standing alone in an ini file, KDC_FAMILIES will be overwritten in the constructed config file, so if you want to make changes, the ini file does not work well. For example, the following line in an environment file

KDC_FAMILIES=EPHEMERAL:Y HTTP_SERVER:N ${KDC_FAMILIES}

would take the previous definition of KDC_FAMILIES and prefix two protocol modifiers to 1) run without any listening port and 2) run without an internal web server. That is sometimes set up to avoid the agent having any listening ports – better for security.

An environment file is interpreted by the Korn Shell interpreter, so it can perform some logic. Here is a relatively complex example:

CTIRA_LOG_PATH=${CANDLEHOME}/logs
KBB_VARPREFIX=%
RHN=`hostname|cut -d. -f1`
PRODUCTCODE=ms
KBB_RAS1_LOG=%(CTIRA_LOG_PATH)/${RHN}_${PRODUCTCODE}_%(syspgm)_%(sysutcstart)-.log INVENTORY=%(CTIRA_LOG_PATH)/${RHN}_${PRODUCTCODE}_%(syspgm).inv COUNT=16 LIMIT=20 PRESERVE=1 MAXFILES=64

The purpose here is to configure more and larger diagnostic log segments. The %(xxx) syntax is unique to ITM basic services and means substitute an environment variable defined elsewhere. In the RHN definition you can see how a Linux/Unix command is run – the hostname command – with the first period-delimited part of the hostname set into that environment variable. Again this is all specific to the needed piece of configuration.

Environment File Summary for Linux/Unix

As seen above this is a smart and effective way to introduce configuration changes. It was introduced in ITM 623 FP2.

Before ITM 623 FP2 Use Source Include

ITM 623 went End of Support in April 2019. You can still use agents at earlier levels, but there are no further fixes available. You see this method with older agents, and especially the AIX premium agents, which are permanently at ITM 622.

A source include method is used to achieve the same goals. In this case there would be a file like lz.override and it would contain the needed overrides, usually single quoted. Here are two examples.

KBB_RAS1='error'

KDC_FAMILIES="EPHEMERAL:Y HTTP_SERVER:N ${KDC_FAMILIES}"

In shell processing, single-quoted text is not interpreted; double-quoted strings are interpreted.
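A quick illustration of the difference [the KDC_FAMILIES value here is just a sample]:

```shell
# Single quotes leave ${KDC_FAMILIES} as literal text;
# double quotes substitute the current value.
KDC_FAMILIES="ip.pipe port:1918"
echo 'EPHEMERAL:Y ${KDC_FAMILIES}'   # prints EPHEMERAL:Y ${KDC_FAMILIES}
echo "EPHEMERAL:Y ${KDC_FAMILIES}"   # prints EPHEMERAL:Y ip.pipe port:1918
```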

In the lz.ini file at the end is added

. /opt/IBM/ITM/config/lz.override

The line starts with a period and a space and then the fully qualified name of the override file. In shell terminology that is a source include. By this method the override file is logically included at the end of the lz.config as it is used. The result is about the same. You will also see this at modern supported levels and it works. However, environment files are the best practice.

 

Windows

There is a lot of confusion about how to add and override environment variables on Windows. The first thing to remember is that a Windows agent at startup reads from the Windows registry – NOT an xx.config file. If you are tempted to add some variables to a KXXENV file – which always exists – resist that temptation. When environment variables are found in both places, things often go very wrong. If you find an install that has done that, back out the changes and do things the right way.

The kinconfg.exe Program and its usages

This program is relatively little known but it is documented here. If you are familiar with the MTEMS GUI [Manage Tivoli Enterprise Monitoring Services], this program is the functional part of that GUI. It is present in the <installdir>\InstallITM directory.

The simplest usage is to add environment variables. Here is a simple example:
cd c:\IBM\ITM\InstallITM
kinconfg -aknt -oCTIRA_HOSTNAME=myhostname
kinconfg -aknt -oCTIRA_SYSTEM_NAME=myhostname
kinconfg -rknt

The -a option specifies the product code – knt means Windows OS Agent. The -o option specifies the environment variable you want to add or change.

If you want to change the communications so that EPHEMERAL:Y is set and the internal web server is not started:

kinconfg -aknt -oKDC_FAMILIES=EPHEMERAL:Y HTTP_SERVER:N @Protocol@
kinconfg -rknt

The @Protocol@ has the definition of the originally configured KDC_FAMILIES during installation.

If you have made the error of adding such variables to the KNTENV file, you will need to remove them. Here is a post which explains how to do that remotely:

Scrubbing Out Windows Agent Malconfiguration Remotely

Verifying modification changes.

The kinconfg program, run with the -r option, updates the Windows Registry. It also records the change in a disk file <installdir>\TMAITM6_x64\kntcma.ini. At the end of the kxxcma.ini file is a section labeled

[Override Local Settings]

where changes will be recorded. Remember that is documentation only, the Windows Registry is where the agent gets its information.
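A quick way to review the recorded overrides [the path assumes a default 64-bit install; adjust for your environment]:

```
rem list the override section and any CTIRA settings recorded in the ini
findstr /N /C:"Override Local Settings" /C:"CTIRA_HOSTNAME" "C:\IBM\ITM\TMAITM6_x64\kntcma.ini"
```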

z/OS Configuration

ITM z/OS processes do not have an override procedure.

ITM z/OS processes use a PARMGEN process where controls are specified interactively and then batch jobs are generated [and run] to implement those controls. [There is also the prior scheme called ICAT.] The following can be used by anyone configuring a z/OS agent or TEMS.

One is a freeform WCONFIG(KDS$PENV) RTE imbed, ideal for advanced-type parms, for the RKANPARU(KDSENV) member of the z/OS TEMS. This could be used for KDEB_INTERFACELIST or CMS_FTO settings. The imbeds are documented in more detail here.

Another is an option, KDS_X_KDE_TRANSPORT_GBL_OPTIONS, which supplies controls in the leading or global position of the KDE_TRANSPORT control. The z/OS processes always use KDE_TRANSPORT instead of KDC_FAMILIES. That is how HTTP_SERVER:N and other settings can be specified.

i/5 Configuration

The i/5 platform [previously AS/400] stores environment variables in the file QAUTOTMP/KMSPARM file member KBBENV. There is no override mechanism.

Summary

This document describes reliable ways to make permanent ITM configuration changes on each platform.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Maine Coon Bandit Christmas 2019

 

Sitworld: TEP Java Web Start – Basic Diagnostic Capture

TEP Java Web Start – Basic Diagnostic Capture

by

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 23 April 2018

Introduction

Tivoli Enterprise Portal [TEP] does not capture any diagnostic information by default. When you are experiencing any issue, basic diagnostic data will be very useful to identify what is actually happening.

Basic TEP Diagnostic Log Capture Instructions

Instructions:

Locate the Java used to start the TEP Java Web Start client. Usually that means the directory that contains the javaws.exe program.

If using Oracle Java the location will be similar to this: C:\Program Files (x86)\Java\jre8\bin

If using IBM Java the location will be similar to this: C:\Program Files (x86)\ibm\Java80\jre\bin

The Java location often varies – my latest Oracle Java: C:\Program Files\Java\jre1.8.0_171\bin

This document will use the first path above for simplicity.

Run these steps on the TEP client machine (the machine where you start the TEP to connect to the TEPS).

1) Bring up a command prompt

2) Go to the directory containing javaws.exe (double quotes are needed because the path has a space in it). Your directory may differ depending on what version of Java you use:

cd "C:\Program Files (x86)\Java\jre8\bin"

3) Run this command to clear the cache. The “-uninstall” flag only clears the cache… it does NOT uninstall Java itself.

javaws.exe -uninstall

4) Run this command to bring up the Java control panel and turn on tracing/logging. This is always needed because the uninstall in step (3) will clear the log and debug flags.

javacpl.exe

The Java control panel will appear.

5) Click the Advanced tab and enable the tracing and logging flags.

Click Apply and OK until the control panel goes away.

6) Start the Web Start client directly from this same directory – substitute your TEPS server fully qualified name or IP address for the string <TEPS_Server>:

javaws.exe http://<TEPS_Server>:15200/tep.jnlp

7) Assuming it still fails, collect the TEP client logs from their storage directory. The directory will be in a location similar to one of these, depending on whether you are using Oracle Java or IBM Java.

C:\Users\<user_ID>\AppData\LocalLow\IBM\Java\Deployment\log

C:\Users\<user_ID>\AppData\LocalLow\Sun\Java\Deployment\log

Summary

This shows how to collect basic diagnostics for Tivoli Enterprise Portal [TEP]. Most of the time that is enough to diagnose an issue.  A later revision will show how to set specific traces.

Sitworld: Table of Contents

THANKS!! to Terry Wright trwright@us.ibm.com who shared his notes on this subject. Those notes formed the outline for this post.

 

Sitworld: ITM Communications Validation

ITM Communications – Manual Validation

by

John Alvord, IBM Corporation

jalvord@us.ibm.com

Introduction

ITM Communication Services has requirements. When the requirements are not met things break in strange and non-obvious ways. Most communication is via TCP Socket links. After setup these are used to implement Remote Procedure Calls. This often works beautifully by default but in new environments it pays to perform some manual checks. It is also helpful when processes do not connect.

Manual Validation

Let’s review the case where a hub TEMS has already been installed and is working. A new remote TEMS is installed and we want to validate that the network is prepared. Usually after problems are resolved in one case, many cases are resolved.

1) The remote TEMS needs to know where the hub TEMS is located. This control is a file named glb_site.txt, created during install, which is located in:

Windows: <installdir>\cms

Linux/Unix: <installdir>/tables/<temsnodeid>

z/OS: RKANPARU(KDCSSITE)

In the simplest case of a single hub TEMS, this will look like

protocol:htems

such as

ip.pipe:HTEMS

or

ip.pipe:#10.11.20.34

If there are two hub TEMSes [Fault Tolerant Option] you will see two such lines. That also requires the CMS_FTO=YES environment variable.

You should never have more than one line for a single hub TEMS. Two or more lines slow things down with no value.

The hub TEMS does not need a glb_site.txt. It does no harm to have one but doesn’t help anything.

2) To manually verify the setup is correct you can use the glb_site.txt values to test.

ping HTEMS

or

ping 10.11.20.34

The ping commands will not always respond, depending on the network. However you can at least verify that the name resolves correctly. If not, the Domain Name Server [DNS] may have incorrect information or the /etc/hosts file might be incorrect.

3) To manually verify the hub TEMS is reachable use telnet. Assuming you are using ip.pipe communications

telnet 10.11.20.34 1918

If you use ip.spipe the port target would be 3660.

If this fails, there is a firewall router along the network path which is missing the rule to allow such communications. If there is no firewall involved, there is no problem. However if a rule exists it must allow communication to the well-known port – 1918 in this case. The rule must be bidirectional. If the test fails, your networking support team must make changes in the router firewall rules to allow the communication. Until that is done, there is no hope of a remote TEMS to hub TEMS connection working.

4) Another ITM communication requirement is that the entire path allow DF [do not fragment] packets. The packet size is most commonly seen as 1500 bytes however ITM will work with anything. From a performance standpoint a small MTU leads to more transmissions and lower throughput. Following are the tests for 1500 byte packets using ping options:

Linux:  ping -M do  -s 1472 10.11.20.34

AIX:  ping -s 1472 10.11.20.34

HPUX: ping -pv 10.11.20.34 1472

Solaris: ping -D 10.11.20.34 1472

Windows: ping -l 1500 -f 10.11.20.34

If these work with no complaint, all is well. The Linux/Unix payload size of 1472 allows for the 28 bytes of IP and ICMP headers that are automatically added. The Linux -M do option means REALLY no fragmentation, even locally. A typical error seen recently looked like this:

From 10.99.0.250 icmp_seq=1 Frag needed and DF set (mtu = 1442)

That means along the network path, the router at that address is preventing packet transmission.

See (7) below for network performance comments.

Your networking support team must resolve this issue before ITM communications can possibly work, but that can be relatively easy to manage. See the next section.

5) When 1500 byte packets fail.

One recent case had a Virtual Private Network [VPN] link in the path that added more bytes to the packet. A 1500 byte DF packet became a 1514 byte DF packet and an intermediate router dropped the packet and communication failed. The solution was to change the interface on the hub TEMS from MTU 1500 to 1350. The remote TEMS and hub TEMS negotiated a MTU size of 1350 and then the added VPN bytes did not exceed the 1500 byte DF maximum at the routers. They could have gone higher of course.  Changing MTUs on interfaces is platform dependent and you will normally get sysadmins or networking people involved to make such changes.
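When hunting for a workable size, a small sweep can help. This sketch assumes the Linux ping options shown above and a placeholder address; run it from one TEMS toward the other:

```
# Try progressively smaller DF payloads; add 28 bytes of IP/ICMP header
# to convert a passing payload size into the effective path MTU.
for size in 1472 1452 1422 1372 1322; do
  if ping -M do -c 1 -s $size 10.11.20.34 >/dev/null 2>&1; then
    echo "payload $size passes -> path MTU $((size+28))"
    break
  fi
done
```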

Another recent case involved a customer router configured with a DF maximum packet size of 1448 bytes. In that case the router was reconfigured to the more standard 1500 byte DF limit.

Another recent case was a Linux environment where the configured DF maximum packet size was 992 bytes. There was some good reason for this, so the hub and remote TEMS system interface MTUs were changed to that number.

A highly interesting case involved two remote TEMSes that were primary and backup for many agents. Half the agents had remote TEMS1 as primary and half had remote TEMS2 as primary. One day most of the agents were offline. We discovered that TEMS1 required an MTU of 1400 and was thus having serious issues connecting to the hub TEMS. The agents connecting to it were also having problems. Most agents switched to TEMS2. TEMS1 and TEMS2 then became entangled because of the agent fallback-to-primary logic. During the attempted switch from TEMS2 to TEMS1, agents became stuck and offline to both. When the interface that TEMS1 used was set to MTU 1400 and TEMS1 and its agents were restarted, things started working. When TEMS2 and its agents were restarted, things continued OK. After 75 minutes the agents with TEMS2 as primary migrated back to that TEMS.

6) z/OS Hypersockets

Another recent issue involved z/OS Hypersockets. It had a MTU of 16K and its logic prevented negotiating down to 1500 bytes. The solution was to configure a second Hypersocket instance set to a MTU of 1500 bytes.

7) TEMS to TEMS communication requirements

In step (4) earlier, note the rtt average and the percent packet loss. TEMS to TEMS communication is unstable if the rtt average is too high *or* if there is much packet loss. A general rule of thumb is that 50 milliseconds or lower is best. 100 milliseconds is OK. At 250 milliseconds or higher many installations will see instability, including remote TEMS going offline.

These rules are extremely general and depend on the amount of TEMS to TEMS network traffic. A low traffic environment with not that much communications can often survive at higher latency levels.
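The rough latency bands above can be expressed as a small classifier. The thresholds are this section's rules of thumb, not hard product limits:

```shell
# Sketch: classify an average round-trip time (in milliseconds) against
# the rough bands described in this section.
classify_rtt() {
  if   [ "$1" -le 50 ];  then echo "best"
  elif [ "$1" -le 100 ]; then echo "ok"
  elif [ "$1" -lt 250 ]; then echo "risky"
  else                        echo "likely unstable"
  fi
}
classify_rtt 40
classify_rtt 300
```

As the text notes, a low traffic environment may survive at latencies these bands would flag.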

The reason for this sensitivity in TEMS/TEMS communications is that much of the work happens with Remote Procedure Calls. After starting up, there are large call structures, up to 30,000 bytes or more. ITM divides each call into MTU [Maximum Transmission Unit] sized separate packets. All packets must arrive and be assembled at the target before logic can continue. If there is any degree of packet loss, many such attempted RPCs fail and need to be re-transmitted. At a higher level in ITM communications there are time out rules for transmission, typically 30 or 60 seconds. In cases of high latency and some packet loss, the resulting failures actually prevent normal work from proceeding. That means the remote TEMS does not get full instructions, like Situation definitions. It also means that the remote TEMS – which has been gathering situation results and likely generating events – is unable to send the events back to the hub TEMS.

The usual solution for a high latency link is to architect a hub TEMS at that location. That is extra work of course, but it may be less expensive than upgrading a network. The hub TEMS connection to an event receiver like Netcool/Omnibus is relatively insensitive to latency.

8) Use traceroute [Unix/Linux: traceroute; Windows: tracert] checks on each communication path. The traceroute lists the network points encountered along the way from source to target. The network points should be the same, but in reverse order, when comparing results between the two endpoints. If that is not true you have asymmetric routing, and that is rarely good.

In one recent case there was a new remote TEMS and a number of agents configured to connect to it. One symptom was that none of the agents could connect reliably. A second symptom was that agent TCP activity broke the hub TEMS to remote TEMS initial table synchronization on a large [20,000 row] table. After this was identified by the customer network staff and corrected, the problems were fully resolved. The lead network System Administrator stated:

“The break came when … we ran traceroutes from Rtems to the clients and clients to Rtems and noticed the paths weren’t the same.”

This is known as asymmetric routing. The problem is virtually invisible to ITM communications. You may need to involve your network folks because traceroute is sometimes blocked for normal users. If there is a difference, the network folks will need to figure out why the differences exist and adjust the network routing to correct the issue.
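The symmetry check the network team performed can be sketched as comparing the forward hop list with the reversed return hop list. The hop names below are invented for illustration:

```shell
# Sketch: detect asymmetric routing by reversing the return path's hop
# list and comparing it to the forward path. Hop names are hypothetical;
# in practice the lists come from traceroute run in each direction.
forward="routerA routerB routerC"
return_path="routerC routerB routerA"
reversed=""
for hop in $return_path; do reversed="$hop $reversed"; done
reversed=${reversed% }   # trim the trailing space
if [ "$forward" = "$reversed" ]; then verdict="symmetric"; else verdict="asymmetric"; fi
echo "$verdict"
```

If the verdict is asymmetric, the routing adjustment described above is needed.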

Summary

There are other potential issues. The good news is that most such cases are rare and that ITM has the controls to adapt to almost any environment. Contact IBM Support if further help is needed.

Sitworld: Table of Contents

If you are interested in ITM communication control options see this document:

Sitworld: ITM Protocol Usage and Protocol Modifiers

Sitworld: Scrubbing Out Windows Agent Malconfiguration Remotely

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 6 February 2019 – Level 1.00000

Follow on twitter

Introduction

Sometimes there is a mistake made in ITM Windows Agent configuration. It is made with good intentions but the result is an ITM Agent which constantly loses connection to the TEMS it is configured to and then reconnects over and over. This prevents normal monitoring operations at that agent. It triggers heavy TEMS activity and can even result in TEMS crashes if enough agents have the same incorrect configuration.

Background

ITM Agents for Windows get most of their control information from the Windows registry. At the same time there is a file KXXENV which can contain environment variables. The install time configuration sets up those two sources of data. For 64 bit agents the default spot where the KXXENV file is located is C:\IBM\ITM\TMAITM6_x64.

The issue arises when you want to alter the communications string. One example would be disabling the internal web server by adding HTTP_SERVER:N. The original communication string from install time configuration might look like this [copied from a registry entry for the Windows OS Agent]:

[HKEY_LOCAL_MACHINE\SOFTWARE\Candle\KNT\Ver610\Primary\Environment]

"KDC_FAMILIES"="IP.PIPE PORT:1918 IP use:n SNA use:n IP.SPIPE use:n IP6 use:n IP6.PIPE use:n IP6.SPIPE use:n"

and the need is to run with this

"KDC_FAMILIES"="http_server:n IP.PIPE PORT:1918 IP use:n SNA use:n IP.SPIPE use:n IP6 use:n IP6.PIPE use:n IP6.SPIPE use:n"

You can read all about changes to communication strings in the post on Protocol Modifiers. There are enough of them to induce sleep.

Distributed agents use only KDC_FAMILIES to define communication protocols and related settings. There is a parallel environment variable, KDE_TRANSPORT, which is used in z/OS TEMS and Agents.

The problem comes when distributed agents are configured with both KDC_FAMILIES and KDE_TRANSPORT. This combination does not play nicely together and almost always causes problems. The only exception is if you can arrange for them to have identical settings. However that is very difficult, since the Windows Registry entry is created during the Windows agent installation while the KXXENV file is also created at install time but can be changed by hand.

When both are defined, usually one in the Windows Registry and the other in KXXENV, things go horribly wrong. Different parts of ITM at the agent can use one or the other, and if they differ the result is communication outages.

Most importantly – don’t do that. Do not set KDC_FAMILIES or KDE_TRANSPORT in the KXXENV file. Don’t even think about it… you will waste weeks of effort suffering the consequences and then weeks of effort undoing that change. You may have read of that as a possible way to go, but it is a terrible plan. The two values do not play nice together unless they are identical. Usually they fight like mad and waste everyone’s time. They do not magically merge values, they brawl like ruffians.

Agents are in trouble, what do you do?

First make sure that whatever caused the problem is stopped. In most cases that is a post-install script that updates the KXXENV file. Change that script so it does NOT do that any more.
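One way to audit for the bad pattern is to scan a KXXENV-style file for either variable. The sample file content below is invented for illustration; in practice you would point the check at the real KNTENV file:

```shell
# Sketch: flag KDC_FAMILIES or KDE_TRANSPORT lines in a KXXENV-style file.
# The sample content is hypothetical; run the check against your real file.
check_env() { grep -E '^(KDC_FAMILIES|KDE_TRANSPORT)=' "$1" || echo "clean"; }
sample=$(mktemp)
cat > "$sample" <<'EOF'
KBB_IGNOREHOSTENVIRONMENT=Y
KDE_TRANSPORT=HTTP_SERVER:N EPHEMERAL:Y
EOF
result=$(check_env "$sample")
echo "$result"
rm -f "$sample"
```

Any line printed [rather than "clean"] marks a variable that should not be in the file.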

For the agents in trouble, here is a procedure that was worked out at a site that had 5,500 problem agents. The example is for the Windows OS Agent and you need to adapt it to the agent that needs working on. In this case it was known that KDE_TRANSPORT was present. If KDC_FAMILIES is also present, it needs to be deleted also.

1. Check the KNTENV file on the system to make sure it has the problem.

    tacmd executecommand -m Primary:VA33VTWSFC003B:NT -c "type \IBM\itm\TMAITM6_x64\KNTENV" -l -o -v -e -r

Results :

KDEBE_KEYRING_STASH=C:\IBM\ITM\keyfiles\keyfile.sth

KDEBE_KEY_LABEL=IBM_Tivoli_Monitoring_Certificate

KBB_IGNOREHOSTENVIRONMENT=Y

JAVA_HOME=C:\IBM\ITM\java\java50\jre

KBB_IGNOREHOSTENVIRONMENT=N

KDE_TRANSPORT=HTTP_SERVER:N HTTP_CONSOLE:N EPHEMERAL:Y

GSK_PROTOCOL_SSLV2=OFF

GSK_PROTOCOL_SSLV3=ON

GSK_V3_CIPHER_SPECS=352F0A

KUIEXC000I: Executecommand request was performed successfully. The return value of the command run on the remote systems is 0

2. Run the following command to remove KDE_TRANSPORT from KNTENV :

    tacmd executecommand -m Primary:VA33VTWSFC003B:NT -c "type C:\IBM\ITM\TMAITM6_x64\KNTENV | findstr /v KDE_TRANSPORT= >c:\temp\KNTENV.NEW && copy /Y c:\temp\KNTENV.NEW C:\IBM\ITM\TMAITM6_x64\KNTENV >NUL" -l -o -v -e -r

3.  Check Results:

    tacmd executecommand -m Primary:VA33VTWSFC003B:NT -c "type \IBM\itm\TMAITM6_x64\KNTENV" -l -o -v -e -r

Results :

KDEBE_KEYRING_STASH=C:\IBM\ITM\keyfiles\keyfile.sth

KDEBE_KEY_LABEL=IBM_Tivoli_Monitoring_Certificate

KBB_IGNOREHOSTENVIRONMENT=Y

JAVA_HOME=C:\IBM\ITM\java\java50\jre

KBB_IGNOREHOSTENVIRONMENT=N

GSK_PROTOCOL_SSLV2=OFF

GSK_PROTOCOL_SSLV3=ON

GSK_V3_CIPHER_SPECS=352F0A

KUIEXC000I: Executecommand request was performed successfully. The return value of the command run on the remote systems is 0

4. Fix KDC_FAMILIES in the registry

$CANDLEHOME/bin/tacmd setagentconnection -n Primary:VA33VTWSFC003B:NT -t NT -e KDC_FAMILIES="HTTP_SERVER:N EPHEMERAL:Y @Protocol@"

Check Results:

Validate that the agent restarted

Check kntcma.ini that the override settings are in place

tacmd executecommand -m Primary:VA33VTWSFC003B:NT -c "type \IBM\itm\TMAITM6_x64\kntcma.ini" -l -o -v -e -r

ExitDLL=@EtcPath@\KNTCTRD.DLL

PostProcess=DllRegisterUnregisterServer

[KIN64BIT]

[Override Local Settings]

KDC_FAMILIES=HTTP_SERVER:N EPHEMERAL:Y @Protocol@

CTIRA_HIST_DIR=@LogPath@\History\@CanProd@

KUIEXC000I: Executecommand request was performed successfully. The return value of the command run on the remote systems is 0

5) Lastly, check that HTTP is disabled by pointing a web browser at the system where the agent is running: http://xx.xx.xx.xx:1920/ where you substitute the actual system IP address for xx.xx.xx.xx

If that fails, the internal web server is not running – the desired result here.
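The findstr pipeline in step 2 can be expressed on Linux/Unix with grep -v. The demo file below is a throwaway stand-in for KNTENV, not a real agent file:

```shell
# Sketch: remove a KDE_TRANSPORT= line the same way step 2 does with
# "findstr /v", using "grep -v". The temporary demo file stands in for
# the real KNTENV file.
f=$(mktemp)
printf 'KBB_IGNOREHOSTENVIRONMENT=Y\nKDE_TRANSPORT=EPHEMERAL:Y\nGSK_PROTOCOL_SSLV3=ON\n' > "$f"
grep -v '^KDE_TRANSPORT=' "$f" > "$f.new" && mv "$f.new" "$f"
cat "$f"
```

As with the findstr version, the file is rewritten with the offending line filtered out and all other lines intact.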

Real Life Variations

The KXXENV file might have a KDC_FAMILIES also, which also needs to be removed.

The KXXENV KDE_TRANSPORT value might have other protocol modifiers that are needed, such as EPHEMERAL:Y. In that case the tacmd setagentconnection command above needs to include them.

In one case, the Windows Registry itself had been updated to contain a KDE_TRANSPORT. In that case you need to go back in manually and remove that setting:

1) Start MTEMS [Managed Tivoli Enterprise Monitoring Systems]. This process will restart the agent

2) Right click on Agent line

3) Select Advanced

4) Select Edit Variables…

5) Locate the problem added variable KDE_TRANSPORT, select it and click on delete.

Summary

This post documents a best practice method for removing a problem configuration from an ITM Agent running on Windows.

Sitworld: Table of Contents

History and Earlier versions

1.00000

Initial publication

Photo Note:  Cruise Ship Under Construction – Control Room

 

Sitworld: AOA Critical Issue – TEMS to TEMS High Latency network connection

Version 1.00000 –  8 October 2018

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

In August 2014, the Database Health Checker began running at IBM ECUREP as an Analysis On Arrival task on each incoming hub and remote TEMS pdcollect. Since then TEMS Audit and Event History Audit reports have been added. The reports are very useful for identifying known error conditions and thus speeding ITM diagnosis of issues. Each of the tools can be run by any customer, but the AOA reports are not immediately visible. Any customer could ask for them, but since they are not visible no one ever asks. At the same time the reports have become more complex and challenging to digest.

With a recent change, the process has been extended to create a short list of critical issues which will automatically be added to the S/F Case or PMR as a short email text. That creates visibility for critical issues. This document presents one specific critical issue – High Latency Connection between two TEMS – usually a hub TEMS and a remote TEMS.

Please note that the conditions identified may not be the issue the problem case was opened for. For example one recent case was an FTO hub TEMS switch to backup that was unexpected. After close study, the major issues were mal-configured agents including duplicate name cases, Virtual Hub Table Update floods and several other items. There are also rare cases where a report will be produced concerning an obsolete TEMS that is definitely installed but not in active use. In that case the report can be ignored – although uninstalling the TEMS would be a good idea.

Getting more information

If you are viewing this document as a customer working with IBM Support, you are welcome to request copies of the Analysis On Arrival reports if they are available. Be sure to mention the unpack directory from the AOA Critical Issue report.

TEMS Audit – temsaud.csv [any hub or remote TEMS]

Database Health Checker – datahealth.csv [any hub TEMS]

Event History Audit – eventaud.csv [any hub or remote TEMS]

There are cases when no report is generated. Sometimes that means there were no advisories. TEMS Audit is not produced when the relevant log files cannot be identified. Database Health Checker is skipped if the TEMS appears to be a remote TEMS. Event History Audit and Database Health Checker are not run if errors are detected in the table extract process.

Visit the links above to access the AOA programs if you want to run them on your own schedule.

Network issues

temsaud.crit: Early remote SQL failures [&syncdist_early]

TEMS to TEMS communication requires relatively low latency and near zero packet loss. There is no absolute rule about when problems occur. However many large customers keep latency under 20 milliseconds. Many customers run with latency at 100 milliseconds. At 250 milliseconds or more most customers have problems. The symptoms are many – basically the distant TEMS will show as offline and not do the expected work. This all depends on how much work is happening. With less work there is a better chance of success.

One useful tool is APM: ITM Communications Validation. Especially useful is the special form of ping using large packets in the Do Not Fragment mode. That is basically what ITM uses.

Often the quoted message above is seen on high latency links. When first connecting to the hub TEMS, the remote TEMS copies a number of large tables using Remote SQL which has a default timeout of 600 seconds. When that fails, the message is produced.

If the latency cannot be reduced, the usual work around is to configure a hub TEMS at the distant site. It can send events to an event receiver and that process is not latency sensitive.

Summary

This document shows how to manage high latency issues between two TEMSes.

History

1.00000

Initial release

Note: 2018 – Home Grown Meyer Lemons

 

Sitworld: AOA Critical Issue – TEMS Possible TCP Blockage

Version 1.10000 –  5 December 2018

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

In August 2014, the Database Health Checker began running at IBM ECUREP as an Analysis On Arrival task on each incoming hub and remote TEMS pdcollect. Since then TEMS Audit and Event History Audit reports have been added. The reports are very useful for identifying known error conditions and thus speeding ITM diagnosis of issues. Each of the tools can be run by any customer, but the AOA reports are not immediately visible. Any customer could ask for them, but since they are not visible no one ever asks. At the same time the reports have become more complex and challenging to digest.

With a recent change, the process has been extended to create a short list of critical issues which will automatically be added to the S/F Case or PMR as a short email text. That creates visibility for critical issues. This document presents one specific critical issue – TEMS Possible TCP Blockage.

Please note that the conditions identified may not be the issue the problem case was opened for. For example one recent case was an FTO hub TEMS switch to backup that was unexpected. After close study, the major issues were mal-configured agents including duplicate name cases, Virtual Hub Table Update floods and several other items. There are also rare cases where a report will be produced concerning an obsolete TEMS that is definitely installed but not in active use. In that case the report can be ignored – although uninstalling the TEMS would be a good idea.

We are still learning about this following rare condition. Some cases have been diagnosed and this document will be updated as we have new information.

Getting more information

If you are viewing this document as a customer working with IBM Support, you are welcome to request copies of the Analysis On Arrival reports if they are available. Be sure to mention the unpack directory from the AOA Critical Issue report.

TEMS Audit – temsaud.csv [any hub or remote TEMS]

Database Health Checker – datahealth.csv [any hub TEMS]

Event History Audit – eventaud.csv [any hub or remote TEMS]

There are cases when no report is generated. Sometimes that means there were no advisories. TEMS Audit is not produced when the relevant log files cannot be identified. Database Health Checker is skipped if the TEMS appears to be a remote TEMS. Event History Audit and Database Health Checker are not run if errors are detected in the table extract process.

Visit the links above to access the AOA programs if you want to run them on your own schedule.

TEMS Possible TCP Blockage

This error is identified in the TEMS Audit task:

temsaud.crit: Possible TCP Blockage: Recv-Q[13,1290] Send-Q[10,33203]

This is a relatively unusual condition. TCP communications traffic normally flows smoothly; on Linux/Unix the command "netstat -an" will show something like this:

Active Internet connections (including servers)

Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)

tcp        0      0  *.21                   *.*                    LISTEN

tcp4       0      0  9.155.11.97.22         9.3.4.141.62599        ESTABLISHED

tcp        0      0  *.111                  *.*                    LISTEN

tcp4       0      0  9.155.11.97.22         9.30.70.5.64999        ESTABLISHED

Recv-Q is the number of bytes in the receive queue… ready for reading but not processed yet.

Send-Q is the number of bytes in the send queue… ready to send but not delivered yet.

Local Address is on the current system. The two entries above with port 22 are for the SSH daemon, if I remember correctly.

Foreign Address is the remote system.

The Critical Issue warning is limited to TCP sockets reading from or writing to ITM associated port numbers. The above example critical issue text means that there were 13 ITM related socket connections where the Recv-Q was over 1024 bytes and the maximum was 1290 bytes. There were also 10 ITM socket connections with Send-Q over 1024 bytes and the maximum was 33203. Normally you would review the TEMS Audit report itself to see the details.
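The report's screening can be approximated from netstat output by flagging sockets whose Recv-Q or Send-Q exceeds 1024 bytes. The sample lines below are invented to mimic the example; in real use you would pipe in "netstat -an" and filter for ITM port numbers first:

```shell
# Sketch: flag sockets with Recv-Q (field 2) or Send-Q (field 3) over
# 1024 bytes from netstat-style output. The sample lines are hypothetical.
flagged=$(awk '$2 > 1024 || $3 > 1024 {print $4, "Recv-Q=" $2, "Send-Q=" $3}' <<'EOF'
tcp4 0 0 9.155.11.97.1918 9.3.4.141.62599 ESTABLISHED
tcp4 1290 0 9.155.11.97.1918 9.30.70.5.64999 ESTABLISHED
tcp4 0 33203 9.155.11.97.1918 9.30.70.6.51000 ESTABLISHED
EOF
)
echo "$flagged"
```

Each flagged line gives the local socket address and its queue depths, matching the [count,maximum] summary style of the critical issue message.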

External Symptoms

This is most often seen at a remote TEMS and the remote TEMS shows as going offline solidly or intermittently. Recycling the remote TEMS often temporarily relieves the issue but it often recurs. It was seen at a hub TEMS once.

Diagnoses

Some cases have been diagnosed and follow.

1) A TEMS to Agent socket showed high Send-Q. When the ITM agents on that foreign system were stopped, the remote TEMS was stable after a TEMS restart. On the system where the ITM agents were stopped, there was a database server running with extremely high amounts of TCP traffic – mostly through localhost. The admins for that agent system were involved and they recycled the database server and the problem stopped happening. The ITM agents connected afterwards and ran fine with no impact on the TEMS.

2) A problem similar to (1), but at the system running the agents there was a very busy ITM Summarization and Pruning agent that was running almost 24×7. The S&P was re-configured to use only a single thread instead of eight threads. After that the remote TEMS ran without any impact.

3) Client had a large system with 5000+ mal-configured Windows OS Agents. In particular each OS Agent had the normal KDC_FAMILIES specified in the Windows Registry and also [invalidly] a KDE_TRANSPORT= line in the KNTENV file. This caused constant switching back and forth and this TCP blockage occurred many times a week at several remote TEMS. In this case the netstat -an showed a large number of foreign systems with high Recv-Q buffer bytes. When the Windows OS Agents were properly configured the TEMSes ran without incident.

4) Several large installations had many accidental duplicate agent name cases. This caused many issues and TCP Blockage was seen in some pdcollects.

5) In one case the customer had a single WPA [HD agent – collects historical data from agents for trans-shipment to the database]. At times this intense activity caused a TCP Blockage condition.

6) A site had a very high level of Virtual Hub Table updates. This causes intense communication loading every few minutes, all concentrated in a single second. This was seen to cause TCP blockage in some cases. See Sitworld: ITM Virtual Table Termite Control Project for how to correct the issue.

7) A remote TEMS was going constantly offline and then online at the hub TEMS. The netstat -an at the remote TEMS showed a single agent with High Send-Q and many agents with low Recv-Q. We logged into the system running the High Send-Q agent. It was a Unix OS Agent and that was the only ITM agent running. A normal stop was issued ./itmcmd agent stop ux, and the stop failed. A forced stop was issued ./itmcmd agent -f stop ux and this worked. The Unix OS Agent was started ./itmcmd agent start ux. After that the remote TEMS behaved normally.

Recovery Action

If there are a few large Send-Q buffer cases, identify what ITM agents are running on the foreign addresses and stop them after getting pdcollects for IBM Support to review. Remember that in most cases there will be a few problem cases and a lot more victims. Look for something at the agent system generating high TCP traffic.

Look for general problems as seen in the TEMS Audit: duplicate agent name cases, agents with many listening ports, etc. If the remote TEMS or hub TEMS is overloaded that should be alleviated as part of the solution.

Summary

The information in the report explains how to manage an AOA Critical Issue concerning possible TCP blockage.

History

1.01000

Added example case 7.

1.00000

Initial release

Note: 2018 – Home Grown Meyer Lemons

 

Sitworld: TEMS Database File Damage

Version 1.00000 –  8 October 2018

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

In August 2014, the Database Health Checker began running at IBM ECUREP as an Analysis On Arrival task on each incoming hub and remote TEMS pdcollect. Since then TEMS Audit and Event History Audit reports have been added. The reports are very useful for identifying known error conditions and thus speeding ITM diagnosis of issues. Each of the tools can be run by any customer, but the AOA reports are not immediately visible. Any customer could ask for them, but since they are not visible no one ever asks. At the same time the reports have become more complex and challenging to digest.

With a recent change, the process has been extended to create a short list of critical issues which will automatically be added to the S/F Case or PMR as a short email text. That creates visibility for critical issues. This document presents the issue where there is evidence of TEMS database file damage.

Please note that the conditions identified may not be the issue the problem case was opened for. For example one recent case was an FTO hub TEMS switch to backup that was unexpected. After close study, the major issues were mal-configured agents including duplicate name cases, Virtual Hub Table Update floods and several other items. There are also rare cases where a report will be produced concerning an obsolete TEMS that is definitely installed but not in active use. In that case the report can be ignored – although uninstalling the TEMS would be a good idea.

 

Getting more information

If you are viewing this document as a customer working with IBM Support, you are welcome to request copies of the Analysis On Arrival reports if they are available. Be sure to mention the unpack directory from the AOA Critical Issue report.

TEMS Audit – temsaud.csv [any hub or remote TEMS]

Database Health Checker – datahealth.csv [any hub TEMS]

Event History Audit – eventaud.csv [any hub or remote TEMS]

There are cases when no report is generated. Sometimes that means there were no advisories. TEMS Audit is not produced when the relevant log files cannot be identified. Database Health Checker is skipped if the TEMS appears to be a remote TEMS. Event History Audit and Database Health Checker are not run if errors are detected in the table extract process.

Visit the links above to access the AOA programs if you want to run them on your own schedule.

TEMS Database Files with errors

One type of error comes from the AOA interface programs. These convert the TEMS database files from the .DB format into text files.

itm_ref_checker.crit: QA1CSTSH.DB:unexpected size difference at tems2sql.pl line 1066.

It is also seen from itm_tems_eventaud.crit. The itm_ref_checker checks more files; not all files are checked in the prepare stage.

Different errors are seen from TEMS Audit. There could be additional errors which may be added later.

 

temsaud.crit:TEMS database table $f with $etct Open Index errors

temsaud.crit:TEMS database table $f with $etct Verify Index errors

temsaud.crit:TEMS database table $f with $stct RelRec errors

If this occurs with a hub TEMS database file, you must proceed very carefully and only with IBM Support help. There are certain files or pairs of files that can be replaced. However many of the hub TEMS database files contain critical information such as situation definitions. If those are reset, that data could take weeks to recover manually and no one wants that. While we are on that subject, please read Sitworld: Best Practice TEMS Database Backup and Recovery and implement a proper TEMS database backup plan.

The TEMS database file must be corrected. For cases involving remote TEMSes the answer is simple: just replace the TEMS database files with emptytable files. They are not all empty but they are in the same state as during a new TEMS install. This post Sitworld: TEMS Database Repair contains pretty much all you need to know including links to files containing the emptytable files for Unix/Linux/Windows. We usually suggest replacing all the files on a remote TEMS since errors may be present but not diagnosed through this report.

Do exactly the same if you have a problem with an FTO Mirror hub TEMS and you have confidence in the existing FTO Primary hub TEMS.

We rarely know exactly why the damage happened. A system power off while the TEMS is running was one case. Another was a manual copy of the index file from one system to another – not copying the data file – which did not work well at all. Another was a restoration of all files from an on-the-fly TSM backup. In any event, having a good backup always helps matters.

Sample Recovery Action Plan Template for TEMS Database Files – Remote TEMS

Here are instructions for REMOTE_ibm *REMOTE to reset the TEMS database files to emptytable status.

The instructions could be duplicated on any remote TEMS. [NEVER on the hub TEMS!!!]

This is concerning the remote TEMS REMOTE_ibm that keeps experiencing problems.

There is evidence that the remote TEMS has a broken database file.

The idea is to refresh all the remote TEMS database files and let them be rebuilt

naturally from the hub TEMS as in a new install.  Here are the instructions:

0) Here is how to access the needed file

You would get them from the TEMS Database Repair post links. The following assumes use of

ITM630_emptytables.bigendian.tar which is used for AIX/HP-UX/Solaris/Linux on Z platforms

at the ITM 630 level

1) copy that file ITM630_emptytables.bigendian.tar [in binary] to the remote TEMS system REMOTE_ibm

I suggest /opt/IBM/ITM/tmp

2) un-tar that file

cd /opt/IBM/ITM/tmp

tar -xf ITM630_emptytables.bigendian.tar

This will create empty QA1* files. They are not entirely empty, but

they are in the same state as they would be during a new install. We

are going to use all files and it would be perhaps useful to save

them for the future. In general you should not use these except with

advice and instruction from IBM Support.

3) Change the empty table file attributes so they are identical to the

current ones which you can verify this way:

ls -l  /opt/IBM/TEMS/tables/REMOTE_ibm/QA1CSTSH.DB

I think I see

-rwxrwxrwx  1 root      system    35274789 Sep 24 08:08 /opt/IBM/TEMS/tables/REMOTE_ibm/QA1CSTSH.DB

and I think the following will do the work – but please verify

cd /opt/IBM/ITM/tmp

chmod 777 QA1*.*

chown root QA1*.*

chgrp system QA1*.*

4) Stop the remote TEMS when convenient.

5) Copy the emptytable files to the tables directory

cd /opt/IBM/TEMS/tables/REMOTE_ibm

cp /opt/IBM/ITM/tmp/QA1*.* .

6) Start the remote TEMS

7) Monitor for stability and normal operations – for example remote TEMS staying online.
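Steps 3 through 5 above can be collected into a dry-run sketch that only prints the commands so they can be reviewed before being run for real. The paths follow this article's example; as stated above, use the emptytable files only with IBM Support guidance:

```shell
# Sketch: dry-run of the emptytable refresh (steps 3-5 above). It prints
# each command instead of running it; run the real commands only after
# review and with IBM Support guidance.
TMP=/opt/IBM/ITM/tmp
TABLES=/opt/IBM/TEMS/tables/REMOTE_ibm
plan=$(for cmd in \
  "chmod 777 $TMP/QA1*.*" \
  "chown root $TMP/QA1*.*" \
  "chgrp system $TMP/QA1*.*" \
  "cp $TMP/QA1*.* $TABLES/"
do
  echo "WOULD RUN: $cmd"
done)
echo "$plan"
```

The ownership, group, and permission values should be taken from an ls -l of the current table files, as step 3 describes, rather than assumed.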

Summary

The information in the report explains how to manage an AOA Critical Issue concerning TEMS database files which are damaged.

History

1.00000

Initial release

Note: 2018 – Home Grown Meyer Lemons

 

Sitworld: AOA Critical Issue – Port Scanning Testing

Version 1.00000 –  8 October 2018

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

In August 2014, the Database Health Checker began running at IBM ECUREP as an Analysis On Arrival task on each incoming hub and remote TEMS pdcollect. Since then TEMS Audit and Event History Audit reports have been added. The reports are very useful for identifying known error conditions and thus speeding ITM diagnosis of issues. Each of the tools can be run by any customer, but the AOA reports are not immediately visible. Any customer could ask for them, but since they are not visible no one ever asks. At the same time the reports have become more complex and challenging to digest.

With a recent change, the process has been extended to create a short list of critical issues which will automatically be added to the S/F Case or PMR as a short email text. That creates visibility for critical issues. This document presents one specific critical issue – port scanning of ITM processes.

Please note that the conditions identified may not be the issue the problem case was opened for. For example, one recent case was an unexpected FTO hub TEMS switch to backup. After close study, the major issues were mal-configured agents, including duplicate name cases, Virtual Hub Table Update floods and several other items. There are also rare cases where a report will be produced concerning an obsolete TEMS that is definitely installed but not in active use. In that case the report can be ignored – although uninstalling the TEMS would be a good idea.

Getting more information

If you are viewing this document as a customer working with IBM Support, you are welcome to request copies of the Analysis On Arrival reports if they are available. Be sure to mention the unpack directory from the AOA Critical Issue report.

TEMS Audit – temsaud.csv [any hub or remote TEMS]

Database Health Checker – datahealth.csv [any hub TEMS]

Event History Audit – eventaud.csv [any hub or remote TEMS]

There are cases when no report is generated. Sometimes that means there were no advisories. TEMS Audit is not produced when the relevant log files cannot be identified. Database Health Checker is skipped if the TEMS appears to be a remote TEMS. Event History Audit and Database Health Checker are not run if errors are detected in the table extract process.

Visit the links above to access the AOA programs if you want to run them on your own schedule.

Port Scanning Testing

temsaud.crit: Definite Evidence of port scanning [$scantype] which can destabilize any ITM process including TEMS

Read the following development-approved document for how ITM behaves in response to port scanning tests:

APM: Port scanner usage and known limitations with IBM Tivoli Monitoring  

ITM will do its best to defend against such conditions, but that usually involves stopping existing connections and thus breaking communications and monitoring. Do not perform port scanning on ITM processes. The alternative is to be prepared to recycle ITM processes after such a test.

Warning When Not Port Scanning

There have been recent cases where “port scanning” type error messages are seen when some other condition is present. One example was when access to a TEMS was set up via a network proxy. The TEMS communications logic did not understand that sort of communication and rejected it. As time goes by we may see other issues which show as port scanning when another issue is happening.

Summary

This document explains how to handle port scanning testing, which can cause ITM processes to become unstable.

History

1.00000

Initial release

Note: 2018 – Home Grown Meyer Lemons

 

Sitworld: AOA Critical Issue – High Virtual Hub Table Updates

lemons

Version 1.00000 –  8 October 2018

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

In August 2014, the Database Health Checker began running at IBM ECUREP as an Analysis On Arrival task on each incoming hub and remote TEMS pdcollect. Since then TEMS Audit and Event History Audit reports have been added. The reports are very useful for identifying known error conditions and thus speeding ITM diagnosis of issues. Each of the tools can be run by any customer, but the AOA reports are not immediately visible. Any customer could ask for them, but since they are not visible, no one ever asks. At the same time the reports have become more complex and challenging to digest.

With a recent change, the process has been extended to create a short list of critical issues which will automatically be added to the S/F Case or PMR as a short email text. That creates visibility for critical issues. This document presents one specific critical issue – High Virtual Hub Table Updates.

Please note that the conditions identified may not be the issue the problem case was opened for. For example, one recent case was an unexpected FTO hub TEMS switch to backup. After close study, the major issues were mal-configured agents, including duplicate name cases, Virtual Hub Table Update floods and several other items. There are also rare cases where a report will be produced concerning an obsolete TEMS that is definitely installed but not in active use. In that case the report can be ignored – although uninstalling the TEMS would be a good idea.

Getting more information

If you are viewing this document as a customer working with IBM Support, you are welcome to request copies of the Analysis On Arrival reports if they are available. Be sure to mention the unpack directory from the AOA Critical Issue report.

TEMS Audit – temsaud.csv [any hub or remote TEMS]

Database Health Checker – datahealth.csv [any hub TEMS]

Event History Audit – eventaud.csv [any hub or remote TEMS]

There are cases when no report is generated. Sometimes that means there were no advisories. TEMS Audit is not produced when the relevant log files cannot be identified. Database Health Checker is skipped if the TEMS appears to be a remote TEMS. Event History Audit and Database Health Checker are not run if errors are detected in the table extract process.

Visit the links above to access the AOA programs if you want to run them on your own schedule.

Virtual Hub Table Updates

datahealth.crit: Virtual Hub Table updates peak $peak_rate per second more then nominal $opt_peak_rate –  per hour [$vtnode_tot_hr] – total agents $vtnode_tot_ct – See DATAREPORT020

This is a relatively common condition where certain agents stress the remote and hub TEMS by sending updates to hub TEMS in-storage tables which are not used for anything useful. The critical level is recorded if more than 32 updates arrive in each peak second [every 1, 2 or 3 minutes]. At the 32 level, that alone consumes 32 of the 64 ITM communication pipes. For context, the peak second has been calculated at more than 3000, and the higher the rate the worse the problem. The issue and background are documented here:

Sitworld: ITM Virtual Table Termite Control Project

https://www.ibm.com/developerworks/community/blogs/jalvord/entry/sitworld_itm_virtual_table_termite_control_project?lang=en
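The critical-level test just described can be sketched in shell. This is an illustration of the arithmetic only, not the Database Health Checker's actual code:

```shell
# Flag a peak second with more than 32 arriving updates as critical,
# since 32 alone consumes half of the 64 ITM communication pipes.
nominal_peak=32

check_peak() {
  # $1 = updates arriving in the busiest single second
  if [ "$1" -gt "$nominal_peak" ]; then
    echo "critical"
  else
    echo "ok"
  fi
}

check_peak 3000   # near the highest peak rate mentioned above
```

At 3000 updates in a peak second the condition is far beyond the nominal 32, which is why this advisory is rated critical.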

Only a relatively small number of agents are involved:

HV – Monitoring Agent for Microsoft Hyper-V Server

OQ – Monitoring Agent for Microsoft SQL Server

OR – Monitoring Agent for Oracle

OY – Monitoring Agent for Sybase Server

Q5 – Monitoring Agent for Microsoft Cluster Server

One agent distribution was altered in the ITM 623 GA time frame to avoid the issue:

UX – Monitoring Agent for UNIX OS

Traditionally IBM Support creates the recovery action plan and needed files; however, you are welcome to use the above tool.

Example Recovery Action plan – Unix/Linux Style

This environment is being hit very hard with virtual hub table updates.

This ITM area is not that well known and I documented it here

Sitworld: ITM Virtual Table Termite Control Project

https://www.ibm.com/developerworks/community/blogs/jalvord/entry/sitworld_itm_virtual_table_termite_control_project?lang=en

In your case, every 2 minutes there is a burst of 2074 incoming updates from the 513 OQ [Agent for Microsoft MS-SQL] agents and the 11 OY [Agent for Sybase] agents. These update in-storage tables which are not, in fact, used for anything.

The data volume is not that high, but the sudden bursts occurring at the same second can cause delays and time outs in communication. At the very latest levels, ITM communications is limited to only 64 at a time.

This recovery action plan eliminates the objects and then recycles all the affected agents. This might have to be repeated if some remote TEMS agents are offline. The files are available for access here

The following are the usual files needed to implement the recovery plan. You can create them yourself or IBM Support can create the files from a hub TEMS pdcollect.

delete.sql

recycle.sh

recycle.cmd

show.sql

1) Copy the recycle.sh file to the hub TEMS /opt/IBM/ITM/bin.

2) Login to the system running the TEPS and copy the delete.sql file to /tmp and then use it to update the TEMS database files. The following assumes you use the same install directory as the hub TEMS… otherwise use that bin directory.

cd /opt/IBM/ITM/bin

./itmcmd execute cq "KfwSQLClient /v /f /tmp/delete.sql"

Some people like to do a show.sql before and after…

This will delete all the potential problem objects from all the TEMSes… including some objects that might not be installed. This can be done anytime and has no immediate effect on any running agents.

There are two ways to complete the work. Each works well and you only need to do one.

3) Recycle all the agents involved:

On the hub TEMS

cd /opt/IBM/ITM/bin

./tacmd login -s …. [to hub TEMS]

sh recycle.sh

This will recycle all the OQ and OY agents involved. They will get new instructions NOT including the problem/unuseful ones. This process will, of course, recycle all the 513 Agents for Microsoft MS-SQL and the 11 Agent for Sybase agents. So that might need some scheduling…

4) As an alternative, you can recycle all TEMSes – hub and remote TEMSes.

Either (3) or (4) will work just fine – you only need to do one. 

Summary

This document explains the High Virtual Hub Table Updates condition and how to cure it.

History

1.00000

Initial release

Note: 2018 – Home Grown Meyer Lemons

 

Sitworld: AOA Critical Issue – High Incoming Workload

lemons

Version 1.00000 –  8 October 2018

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

In August 2014, the Database Health Checker began running at IBM ECUREP as an Analysis On Arrival task on each incoming hub and remote TEMS pdcollect. Since then TEMS Audit and Event History Audit reports have been added. The reports are very useful for identifying known error conditions and thus speeding ITM diagnosis of issues. Each of the tools can be run by any customer, but the AOA reports are not immediately visible. Any customer could ask for them, but since they are not visible, no one ever asks. At the same time the reports have become more complex and challenging to digest.

With a recent change, the process has been extended to create a short list of critical issues which will automatically be added to the S/F Case or PMR as a short email text. That creates visibility for critical issues. This document presents one specific critical issue – high incoming workload, usually from situations.

Please note that the conditions identified may not be the issue the problem case was opened for. For example, one recent case was an unexpected FTO hub TEMS switch to backup. After close study, the major issues were mal-configured agents, including duplicate name cases, Virtual Hub Table Update floods and several other items. There are also rare cases where a report will be produced concerning an obsolete TEMS that is definitely installed but not in active use. In that case the report can be ignored – although uninstalling the TEMS would be a good idea.

 

Getting more information

If you are viewing this document as a customer working with IBM Support, you are welcome to request copies of the Analysis On Arrival reports if they are available. Be sure to mention the unpack directory from the AOA Critical Issue report.

TEMS Audit – temsaud.csv [any hub or remote TEMS]

Database Health Checker – datahealth.csv [any hub TEMS]

Event History Audit – eventaud.csv [any hub or remote TEMS]

There are cases when no report is generated. Sometimes that means there were no advisories. TEMS Audit is not produced when the relevant log files cannot be identified. Database Health Checker is skipped if the TEMS appears to be a remote TEMS. Event History Audit and Database Health Checker are not run if errors are detected in the table extract process.

Visit the links above to access the AOA programs if you want to run them on your own schedule.

High TEMS workload indications

eventaud.crit: Estimated Incoming result rate $ppc_result_rate worried $ppc_worry_pc

temsaud.crit: Hub TEMS has lost connection to HUB $hublost_total times

temsaud.crit: High incoming results $trespermin per minute worried[$wpc]

TEMSes can be destabilized by high incoming workload. That is usually from agents sending situation result data. Additional sources are agents sending historical data, real time data requests, and agents that do SQL for internal purposes such as ITCAM for Transactions. However it is mostly situation results. When a situation is true, the agent sends confirmation results each sampling interval. That composes most of the situation workload.

The usual worry point is 500K bytes/minute, or 100% worry. That choice is taken from experience. Certainly installations can go higher or run into problems at a lower point. It all depends on the system where the TEMS is running and how much capacity and network performance is available. The peak rate seen was 93 megs/min, and the 128 remote TEMSes [and 8 hub TEMSes] were just about killed.
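As a back-of-envelope illustration (not the tool's actual code), the worry percentage is simply the observed rate measured against that nominal 500K bytes/minute point:

```shell
# Worry percentage relative to the nominal 500K bytes/minute point.
nominal=500000

worry_pc() {
  # $1 = observed incoming result bytes per minute
  echo $(( $1 * 100 / nominal ))
}

worry_pc 500000     # the nominal rate itself is 100% worry
worry_pc 93000000   # the 93 megs/min peak case mentioned above
```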

The eventaud.crit incoming results advisory creates an estimate of workload based on the recent history. It sometimes underestimates the actual load because a situation could be true but not recorded in the 8192-entry wrap-around data. If you see it high, reality might well be higher.

The parallel temsaud.crit incoming results advisory requires a TEMS trace to be present: KBB_RAS1=error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er). Some clients turn that on permanently. The added diagnostic trace output is minimal [one line per result set arriving].

The last indication, “Hub TEMS has lost connection to HUB”, implies a severe hub TEMS work overload. The warning message is paradoxical but makes sense in context. The SITMON process is attempting to update a status using an SQL to the dataserver. There has been a time out of 20 minutes and the SQL is not complete. Most times that is a severe workload issue… however it could be other things such as excessive TEMS action commands or an external process starving the hub TEMS of cpu time.

Often these need a proper TEMS Audit workload trace and analysis. When a situation is identified as a culprit it can be evaluated for reasonableness. Situations should be

1) Exceptional

2) Rare

3) Fixable

4) Backed by resources available to fix the issue

If those do not apply, it is a waste of resource to run the situation and send out tickets. On one memorable occasion, 80% of a hub TEMS workload came from a single situation on a single Unix OS Agent running on a system that was supposed to be powered off and decommissioned. The situation was not associated with any TEPS node and was not forwarded to an event receiver… it was just sitting in the background burning up resources and hurting important processing.

Summary

The information in the report will show how to handle cases when the TEMS is being subjected to high workload.

History

1.00000

Initial release

Note: 2018 – Home Grown Meyer Lemons

 

Sitworld: AOA Critical Issue – Excess MS_Offline type Situations

lemons

Version 1.00000 –  8 October 2018

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

In August 2014, the Database Health Checker began running at IBM ECUREP as an Analysis On Arrival task on each incoming hub and remote TEMS pdcollect. Since then TEMS Audit and Event History Audit reports have been added. The reports are very useful for identifying known error conditions and thus speeding ITM diagnosis of issues. Each of the tools can be run by any customer, but the AOA reports are not immediately visible. Any customer could ask for them, but since they are not visible, no one ever asks. At the same time the reports have become more complex and challenging to digest.

With a recent change, the process has been extended to create a short list of critical issues which will automatically be added to the S/F Case or PMR as a short email text. That creates visibility for critical issues. This document presents one specific critical issue – excess MS_Offline type situation usage.

Please note that the conditions identified may not be the issue the problem case was opened for. For example, one recent case was an unexpected FTO hub TEMS switch to backup. After close study, the major issues were mal-configured agents, including duplicate name cases, Virtual Hub Table Update floods and several other items. There are also rare cases where a report will be produced concerning an obsolete TEMS that is definitely installed but not in active use. In that case the report can be ignored – although uninstalling the TEMS would be a good idea.

Getting more information

If you are viewing this document as a customer working with IBM Support, you are welcome to request copies of the Analysis On Arrival reports if they are available. Be sure to mention the unpack directory from the AOA Critical Issue report.

TEMS Audit – temsaud.csv [any hub or remote TEMS]

Database Health Checker – datahealth.csv [any hub TEMS]

Event History Audit – eventaud.csv [any hub or remote TEMS]

There are cases when no report is generated. Sometimes that means there were no advisories. TEMS Audit is not produced when the relevant log files cannot be identified. Database Health Checker is skipped if the TEMS appears to be a remote TEMS. Event History Audit and Database Health Checker are not run if errors are detected in the table extract process.

Visit the links above to access the AOA programs if you want to run them on your own schedule.

MS_Offline conditions

MS_Offline dataserver evaluation rate $prate agents/sec dangerously high

MS_Offline SITMON evaluation rate $prate agents/sec dangerously high

MS_Offline type situations – $miss_reason are missing the Reason *NE FA test. See DATAREPORT017

MS_Offline type situations are high impact, and too many running too often in a large system can affect hub TEMS stability. See Sitworld: MS_Offline: Myth and Reality for a deep dive introduction to the operation of the offline process. The quick takeaways are

1) The offline detection process is rather leisurely, on the order of 10-20 minutes, so a low sampling interval like 1 minute or 30 seconds wastes resources to little advantage.

2) The Reason *NE FA test is critical to avoid checking offline status when the hub TEMS has literally no idea of current status, such as just after startup.

3) The dataserver evaluation rate is how many times a second the TEMS dataserver or SQL process has to evaluate the offline condition.

4) The SITMON evaluation rate is how often the Situation Monitor logic has to calculate what is happening. This is mostly driven by MS_Offline type situations with Persist>1 but it is also driven by offline agents. SITMON evaluation is 10-20 times more expensive than dataserver evaluation.

The recovery action plan is simple: Stop all the MS_Offline type situations and set Run at Startup to off. Use the product-provided MS_Offline situation but with a 10 or 15 minute sampling interval. If multiple end users must be notified, do that work in the event receiver. In one recent small environment without an event receiver, the email alert was changed to a shell command which determined who needed to be emailed based on product or agent name. Never do that for large numbers of agents because those emails can all start at once and destabilize the hub TEMS.
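For the small-environment shell-command approach just mentioned, a sketch might look like the following. The product code suffix after the last colon of a managed system name is standard ITM naming, but the function name and email addresses are invented for illustration:

```shell
# Hypothetical routing of an MS_Offline alert by agent product code.
route_offline_alert() {
  msn="$1"          # managed system name, e.g. "dbserver01:OQ"
  pc="${msn##*:}"   # product code is the suffix after the last colon
  case "$pc" in
    LZ|UX) echo "unix-team@example.com" ;;
    NT)    echo "windows-team@example.com" ;;
    OQ|OR) echo "dba-team@example.com" ;;
    *)     echo "itm-admins@example.com" ;;
  esac
}

route_offline_alert "dbserver01:OQ"
```

As the text warns, this only makes sense for small agent counts; a flood of simultaneous emails can destabilize the hub TEMS.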

Summary

This document explains excess MS_Offline type situation activity and how to correct the issue.

History

1.00000

Initial release

Note: 2018 – Home Grown Meyer Lemons

 

Sitworld: AOA Critical Issue – Duplicate Agent Names

lemons

Version 1.00000 –  8 October 2018

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

In August 2014, the Database Health Checker began running at IBM ECUREP as an Analysis On Arrival task on each incoming hub and remote TEMS pdcollect. Since then TEMS Audit and Event History Audit reports have been added. The reports are very useful for identifying known error conditions and thus speeding ITM diagnosis of issues. Each of the tools can be run by any customer, but the AOA reports are not immediately visible. Any customer could ask for them, but since they are not visible, no one ever asks. At the same time the reports have become more complex and challenging to digest.

With a recent change, the process has been extended to create a short list of critical issues which will automatically be added to the S/F Case or PMR as a short email text. That creates visibility for critical issues. This document presents one specific critical issue – Duplicate Agent Name cases

Please note that the conditions identified may not be the issue the problem case was opened for. For example, one recent case was an unexpected FTO hub TEMS switch to backup. After close study, the major issues were mal-configured agents, including duplicate name cases, Virtual Hub Table Update floods and several other items. There are also rare cases where a report will be produced concerning an obsolete TEMS that is definitely installed but not in active use. In that case the report can be ignored – although uninstalling the TEMS would be a good idea.

Getting more information

If you are viewing this document as a customer working with IBM Support, you are welcome to request copies of the Analysis On Arrival reports if they are available. Be sure to mention the unpack directory from the AOA Critical Issue report.

TEMS Audit – temsaud.csv [any hub or remote TEMS]

Database Health Checker – datahealth.csv [any hub TEMS]

Event History Audit – eventaud.csv [any hub or remote TEMS]

There are cases when no report is generated. Sometimes that means there were no advisories. TEMS Audit is not produced when the relevant log files cannot be identified. Database Health Checker is skipped if the TEMS appears to be a remote TEMS. Event History Audit and Database Health Checker are not run if errors are detected in the table extract process.

Visit the links above to access the AOA programs if you want to run them on your own schedule.

Duplicate Agent Name cases

temsaud.crit: $idupcnt duplicate agent name cases – see TEMSREPORT069 Report

ITM depends on each agent having a unique name. When that is violated the TEMS can become unstable and the TEPS experiences severe performance problems. Even worse in some ways, the affected agents are not being properly monitored since only one at a time can report conditions. On a more prosaic note, people using the event data often waste considerable time because an event on one agent, when investigated, shows a false positive – not a problem. However there was a real problem, but it was for another agent with the same name on another system.

You can make progress by asking for the full TEMS Audit report. You can also use Sitworld: TEPS Audit, which reports on the duplicate agents the TEPS is aware of.
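If you merge agent name lists gathered from several sources or over time into one plain-text extract (an assumed input format for illustration, with the managed system name in the first column), repeated names can be spotted quickly:

```shell
# Print managed system names that appear more than once in the
# first column of the input.
dup_names() {
  awk '{print $1}' | sort | uniq -d
}

# Example with a three-line extract; the repeated name is printed.
printf 'dbserver01:OQ\nwebhost02:LZ\ndbserver01:OQ\n' | dup_names
```

This is only a rough screen; the TEMS Audit and TEPS Audit reports remain the authoritative way to find duplicates, including those hidden within a remote TEMS.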

You also need to discover how the duplicate agents are being created and change the process so that duplication is avoided.

You can also involve IBM Support because there is extended TEMS tracing which will show other cases hidden within the remote TEMS.

Summary

This document explains how to handle duplicate agent name cases.

History

1.00000

Initial release

Note: 2018 – Home Grown Meyer Lemons

 

Sitworld: Agent Diagnostic Log Communications Summary

CrusieShipEnergy

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #3 – 6 August 2020 – Level 0.63000

Follow on twitter

Inspiration

I was working through a case where an agent kept losing connection with a remote TEMS. Seeing the big picture was very tough; the raw data was scattered here and there through many diagnostic log instances. After spending a day collecting cut/paste notes from diagnostic logs, I realized an earlier project, Sitworld: Agent Workload Audit, had accomplished something vaguely similar but more complex. So I spent a few days cloning that project and writing this communications summary report.

ITM Agent Diagnostic Log Communications Summary Installation

The Agent Diagnostic Log Communications Summary package includes one Perl program, logcomm.pl. It is contained in a zip file, logcomm.0.63000. The program has been tested in several environments using data from other environments. Windows has had the most intense testing. It was also tested on Linux. Many Perl 5 levels will be usable. Here are the details of the testing environments.

1) Strawberry Perl 5.26.1

perl -v

This is perl 5, version 26, subversion 1 (v5.26.1) built for MSWin32-x64-multi-thread

2) Perl on Linux on Z

perl -v

This is perl, v5.10.0 built for s390x-linux-thread-multi

Copyright 1987-2007, Larry Wall

Agent Diagnostic Log Communications Summary Configuration

The Agent Diagnostic Log Communications Summary package has controls to match installation requirements but the defaults work well. All controls are in the command line options. Following is a full list of the controls.

The following table shows all options. All command line options except -h and -ini and three debug controls can be entered in the ini file. The command line takes precedence if both are present. In the following table, a blank means the option will not be recognized in that context. All controls are lower case only.

command default notes
-z off Log is RKLVLOG from z/OS agent
-o logcomm.csv Report file name
-h <null> Help messages
-v off Messages on console also
-nohdr off Do not print report header lines
-logpath off Path to log files
-pc off Specify the agent product code involved
-allinv off Use with -pc to generate reports for each diagnostic log collection in separate reports. Will also create a merge.csv of all summary report sections.

The parameter left over is the log file name specification. That can be a single file or a partial diagnostic file name. For example, if a diagnostic log name is nmp180_lz_klzagent_5421d2ef-01.log, the filename specifier is nmp180_lz_klzagent_5421d2ef.
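In shell terms, the filename specifier is just the segment name with the trailing "-NN.log" suffix removed, using the example name from above:

```shell
# Derive the partial filename specifier from a full diagnostic log
# segment name by stripping the trailing "-NN.log" segment suffix.
log="nmp180_lz_klzagent_5421d2ef-01.log"
spec="${log%-*.log}"
echo "$spec"   # nmp180_lz_klzagent_5421d2ef
```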

The diagnostic log segments wrap around in a regular pattern. The Agent Workload Audit calculates the correct analysis order. In some cases that order is incorrect and a manual collection must be created. This usually shows when values in the report show a negative time value.

Note: The -z option for z/OS agent logs will be validated later. You are welcome to try it now and if there are issues please contact the author. The basic logic has worked “forever” in TEMS Audit but testing is always an important step.

Agent Diagnostic Log Communications Summary Usage

There are no special configuration options needed for this tool.

z/OS Agent Configuration

This is not tested yet. If you are interested please contact me.

Usage

Make the agent logs directory be the current directory.

1) Run against a specific log file

perl logcomm.pl hpcnvhc1_lz_klzagent_5b6b11e0-01.log

output will be in logcomm.csv

2) Run against a specific agent type

perl logcomm.pl -pc lz

output will be in logcomm_lz.csv

3) Run against all logs recorded in the inventory file – in this case  hpcnvhc1_lz_klzagent.inv

perl logcomm.pl -pc lz -allinv

Individual reports will be created and also a merge.csv file which sometimes goes back a year!

Agent Diagnostic Log Communications Summary report

Advisory Message Report – *NOTE* See advisory notes at report end

Impact,Advisory Code,Object,Advisory,

90,COMMAUDIT1001W,COMM,Activity Not in Call count [62]

90,COMMAUDIT1002W,COMM,Invalid Transport Correlator error count [32]

COMMREPORT001: Timeline of TEMS connectivity

LocalTime,Hextime,Line,Advisory/Report,Notes,

20180808115304,Log,Start

20180808115305,REMOTE_odibmp003,ip.spipe:#151.171.86.23[3660],Connecting to TEMS,

20180808120935,REMOTE_odibmp003,ip.spipe:#151.171.86.23[3660],reconnect to TEMS REMOTE_odibmp003 without obvious comm failure after 0/00:16:30,

20180808120935,REMOTE_odibmp003,ip.spipe:#151.171.86.23[3660],Connecting to TEMS,

20180808121105,REMOTE_odibmp003,ip.spipe:#151.171.86.23[3660],reconnect to TEMS REMOTE_odibmp003 without obvious comm failure after 0/00:01:30,

20180808121105,REMOTE_odibmp003,ip.spipe:#151.171.86.23[3660],Connecting to TEMS,

……

COMMREPORT002: Timeline of Communication events

LocalTime,Hextime,Line,Advisory/Report,Notes,

20180808115304,5B6B11E0,18,Log,Start,

20180808115304,5B6B11E0,70,EnvironmentVariables,KDE_TRANSPORT=KDC_FAMILIES=”HTTP_CONSOLE:N HTTP_SERVER:N HTTP:0 ip.spipe port:3660 ip.pipe use:n sna use:n ip use:n ip6.pipe use:n ip6.spipe use:n ip6 use:n HTTP_SERVER:N”,

20180808115304,5B6B11E0,74,EnvironmentVariables,KDEB_INTERFACELIST=”!151.171.33.235″,

20180808115305,5B6B11E1,1149,ANIC,14fe484587be.42.02.97.ab.21.eb.7e.b5: 1,1,5B4B1265,5B4B1265,

20180808115305,5B6B11E1,1167,ANIC,14fe4845886c.42.02.97.ab.21.eb.7e.b5: 1,1,5B4B1265,5B4B1265,

20180808115305,5B6B11E1,1258,OPLOG,Connecting to CMS REMOTE_odibmp003,

20180808115305,5B6B11E1,1261,Communications,Successfully connected to CMS REMOTE_odibmp003 using ip.spipe:#151.171.86.23[3660],

20180808115305,5B6B11E1,1261a,Communications,3660,

20180808115305,5B6B11E1,1603,ANIC,14fe4845badc.42.02.97.ab.21.eb.7e.b5: 1,1,5B4B1265,5B4B1265,

20180808115305,5B6B11E1,1703,ANIC,14fe4845bea2.42.02.97.ab.21.eb.7e.b5: 1,1,5B4B1265,5B4B1265,

…..

COMMAUDIT1002W

Text: Invalid Transport Correlator error count [count]

Tracing: error

+5B6B15BF.0001     e-secs: 0                  mtu: 944         KDE1_stc_t: 1DE0004D

Meaning: This is a strong signal of a duplicate agent case.

ITM uses remote procedure calls to do most of its communications, and this error means that the partner in the communication process rejected the attempted communication because the type of communication did not match. For example, an ip.pipe communication was sent but the partner knew it needed an ip.spipe. It could also be a conflict between a simple connection and an EPHEMERAL:Y connection, or many other cases.

Recovery plan: Investigate the TEMS the agent connects to for evidence of duplicate agents – especially this one – and resolve the issue.

What to do with the Report

It is most important to correlate logged events with agent configuration and network incidents. This report will summarize what happened but will usually raise more questions than it answers. The specific report excerpt above was associated with a case of duplicate agent names. When the agent configurations were changed so each agent had a unique name, as ITM expects, the agent stopped losing connection.

Summary

The Agent Diagnostic Log Communications Summary tool was presented.

Sitworld: Table of Contents

History and Earlier versions

There is a distribution here https://github.com/jalvo2014/logcomm which may be somewhat less tested than the point releases. If the current version of the Agent Diagnostic Log Communications Summary tool does not work, you can try recent published binary object zip files. At the same time please contact me to resolve the issues. If you discover an issue, try intermediate levels to isolate where the problem was introduced.

logcomm.0.63000
Handle instanced logs

logcomm.0.62000
Make KDE_TRANSPORT/KDC_FAMILIES check work on Windows

logcomm.0.61000
Add hostname/installer/gskit_level when cinfo.info is available

logcomm.0.60000
Add advisory for different CTIRA_HOSTNAME and CTIRA_SYSTEM_NAME

logcomm.0.59000
Add in KDC_PARTITION checking – rare and usually an error

logcomm.0.58000
Add in ENV checking if the files are present

logcomm.0.57000
Add in system name and some CTIRA variables if present

logcomm.0.56000
Add Default host address to timeline

logcomm.0.55000
Advisory on mixed KDC_FAMILIES and KDE_TRANSPORT

logcomm.0.54000
Capture Port Scanning type messages

logcomm.0.53100

Collect data from RPC-Lost messages

Photo Note: Cruise Ship Energy Storage – 2017

 

Sitworld: Adventures in Communications #1

barn_swallows_fog

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 5 July 2018 – Level 1.00000

Follow on twitter

Inspiration

There have been a lot of communication issues recently that were challenging to solve. It is interesting to look at the issues and the resolutions.

The symptom was a stalled remote TEMS that hardly did any communication. Restarting the remote TEMS resolved the issue for a day or so, but eventually it got stuck again.

This was seen in the TEMS Audit

Advisory:  99,TEMSAUDIT1088W,TCP,TCP Queue Delays 22 Send-Q [max 66131] Recv-Q [max 9448] – see Report TEMSREPORT051

This means that 22 TCP sockets were showing non-zero buffer usage. The maximum Send-Q buffer was 66131 bytes.

In the Report051 section:

f1000e0005d0cbb8 tcp4       0  66131  159.202.134.25.65100  129.39.95.12.39482    ESTABLISHED

So the local address was a Warehouse Proxy Agent [WPA or HD] and the target was some system where agents were running. After reviewing the hub TEMS database, it appeared that a Tivoli Log Agent and also a Summarization and Pruning agent were running on that system.

   stlafiitm11:SY                   Y        SY      06.23.03 ip.spipe:#129.39.95.12[15949]<NM>stlafiitm11</NM>

   SCAN:stlafiitm11:LO              Y        LO      06.30.00 ip.spipe:#129.39.95.12[7757]<NM>stlafiitm11</NM>

This report section comes from a netstat -an capture. There were more high buffer values. In such cases usually one socket is the culprit and the rest are victims. A high Send-Q buffer is almost always the key indicator. You look at the foreign address – an agent system – and review that system. If that system also has high Send-Q/Recv-Q values, it needs a closer look. We suggest stopping all the ITM agents on systems which the netstat -an shows with high Send-Q values [more than 8192 bytes]. After stopping all the agents, recycle the affected TEMS. Ideally you review the potential problem agent systems, but you could just start each agent up one at a time and watch for issues.
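The review above can be scripted. Here is an illustrative sketch (not part of any ITM tool) that flags sockets over the 8192-byte Send-Q guideline; it assumes the common `Proto Recv-Q Send-Q Local Foreign State` netstat -an column layout, which varies by platform:

```python
# Sketch: flag TCP sockets whose Send-Q exceeds the 8192-byte guideline
# suggested above. Assumes "Proto Recv-Q Send-Q Local Foreign State"
# columns; some platforms (e.g. AIX netstat -Aan) add a leading socket id.
THRESHOLD = 8192

def high_sendq(netstat_lines, threshold=THRESHOLD):
    suspects = []
    for line in netstat_lines:
        parts = line.split()
        if len(parts) < 6 or not parts[0].startswith("tcp"):
            continue
        try:
            send_q = int(parts[2])
        except ValueError:
            continue
        if send_q > threshold:
            # record local address, foreign address, Send-Q bytes
            suspects.append((parts[3], parts[4], send_q))
    return suspects

sample = [
    "tcp4 0 66131 159.202.134.25.65100 129.39.95.12.39482 ESTABLISHED",
    "tcp4 0 0     159.202.134.25.1918  129.39.95.99.40000 ESTABLISHED",
]
print(high_sendq(sample))
```

The foreign addresses it reports are the agent systems to review first.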

So What is the big Deal?

Well-running systems never show large Send or Receive buffer bytes pending. Whatever is there is always transient. When there are a lot of bytes pending, the buffer for doing new TCP work is exhausted and no new communications can proceed. This is a definite worst case and the condition often persists until the TEMS is recycled. In the meantime all the agents go offline as well as the TEMS, so it really is a bad condition. There is no monitoring going on at those agents and recovery is disruptive. Monitoring is degraded.

The ITM TEMS is a real time system that defers to other processes – it does its best to be a good neighbor. If there is a batch process using a LOT of TCP, then the TEMS can be blocked out for long periods – and sometimes until it is recycled. If this happens at an agent, a normal ITM process like a TEMS can attempt contact and be blocked itself. In that case the agent side TCP traffic backs up and locks up the TEMS the agent is connected to. When the buffer space is full, the TEMS itself is logically blocked and unable to work properly.

What was happening HERE!!

In this case there was a Summarization and Pruning agent running. This is a vital service when you are collecting historical data. Without it the storage space would grow and grow “forever”. The S&P agent was configured with 8 threads. That meant that when it was operational [at 2am for several hours] S&P would dominate all the TCP communications. It was running as a batch process working as fast as it could. The ITM communications were blocked out. The TEMS and WPA services attempted to communicate. That could not continue since the agent side system was blocked up. And thus the TEMS/WPA services were totally blocked. In the end the TEMS/WPA needed to be recycled. And the next night the same risk was present.

You might see the same thing happening on a system with a WPA. One reviewed recently was using 30 threads and it was running on the same system as the hub TEMS. Reducing that to 4 threads [the system had 8 cores] eliminated the conflict. Better yet would be to configure WPAs at each remote TEMS so the WPA communication workload would be spread out and the hub TEMS WPA would have little network competition.

Solution NOW

The S&P agent was reconfigured to use only a single thread. It took a bit longer to complete overnight but now it “played nice” with communications and the TEMS/WPA ran smoothly.

Summary

Communications adventure #1 – caused by an over-active Summarization and Pruning agent.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Barn Swallows in the Fog – Big Sur 2010

 

Sitworld: ITM Port Usage and Managing Port Usage

mud

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #2 – 22 July 2018 – Level 2.01000

Follow on twitter

Inspiration

I get asked questions about ITM port usage. Years ago I wrote an extensive technical document and published it as a technote, ITM Port Usage and Limiting Port Usage. The answers have a lot of complexity. Since the first technote several new aspects have been researched, and this is release 2.0 of that document. A version 3.0 is not impossible!

Question

How do you Limit Port Usage in IBM Tivoli Monitoring Version 6?

Answer

Overview

ITM uses TCP/IP [see Note 1 for exceptions] to communicate between and within ITM processes. It uses port numbers to perform this activity. Customers who control port usage closely have questions about what ports are used and what options are available to control and limit those ports. This document answers those questions. If there is a topic you know well, please skip ahead to the later sections which explain exactly how to control the ports. The presentation uses standard default port numbers but almost anything can be configured.

 

Terminology

In TCP/IP communications a port is a number from 0 to 65535. A process can ask to be notified if another process – on this or another server – wants to connect. A call to the TCP bind routine links the incoming communication to the listening process.

A network interface is a hardware and/or software construction that has one or more IP addresses. That can be IPV4 or IPV6 addresses. In simple systems there is typically a single hardware part which implements the network interface and usually a software network interface for localhost or 127.0.0.1, which is only known locally. Servers may have many network interfaces and a single hardware interface can even have multiple IP addresses.

When an external process connects to the listening port, that produces a socket connection. The connection can be read or written from either side and the TCP/IP software transmits the data.
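As a minimal illustration of bind, listen, and the resulting socket connection (plain Python sockets, nothing ITM-specific):

```python
import socket

# A listener asks TCP to be notified of incoming connections: bind()
# links a port to this process and listen() marks it as accepting.
def roundtrip(payload=b"hello"):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0 = let TCP pick a free port
    listener.listen(1)
    host, port = listener.getsockname()

    # Another process (here, the same one for simplicity) connects;
    # accept() yields the socket connection both sides can read and write.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect((host, port))
    conn, _peer = listener.accept()
    client.sendall(payload)
    data = conn.recv(len(payload))
    for s in (conn, client, listener):
        s.close()
    return data

print(roundtrip())
```

Note that the listener here uses localhost, the software network interface described above, so nothing leaves the machine.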

 

ITM Usage of TCP/IP

ITM does not control outbound communication. It writes data to a target ip address and port number and lets TCP/IP calculate the best way to do that work. All TCP/IP environments have “route” commands which let the system administrator control which network interface gets used for the communication.

By default, ITM listens on all interfaces using an anonymous BIND call. The KDEB_INTERFACELIST [and KDEB_INTERFACELIST_IPV6] environment variable can be used to force an exclusive bind. Remember you cannot mix anonymous and exclusive binds for ITM usage in a single system. See Sitworld: ITM Protocol Usage and Protocol Modifiers for all the details.

 

ITM and Location Broker

Each ITM process uses a communication environment variable KDE_TRANSPORT [mostly z/OS] or KDC_FAMILIES [mostly Linux/Unix/Windows/i5] which names the protocols supported and uses protocol modifiers. See the ITM Protocol document linked above for details. Here is a relatively simple example connection string as shown in a TEMS diagnostic log:

  • KDE_TRANSPORT=KDC_FAMILIES=”ip.pipe port:1918 ip use:n ip.spipe use:n sna use:n HTTP:1920”

That text is composed from the data set by the user during an ITM configuration.

The incoming ports on hub and remote TEMSes are owned by a Location Broker. During that initial connection, information is provided which allows the connecting program to look up a service and the ip address, port number and other items needed for a connection. Location Brokers run at each TEMS and a hub TEMS maintains a master list – the Global Location Broker.

When a service registers, that information is added to the Location Broker data. For example a TEMS will register all the available network interfaces. If KDEB_INTERFACELIST is supplied, the ip addresses listed will be registered with those as first priority. In this way a service can tell a user what ip address and port to use.

The important thing to remember is that services register and users look up the data. The decentralized approach makes the process more resilient and high performance because there are fewer choke points.

 

ITM Port Usage – Agents

In a default configuration, agents use the following ports:

1) Connection to a TEMS base port – for example the ip.pipe protocol default port is 1918. The communications string defines the protocols used. The CT_CMSLIST environment variable names the servers where a TEMS may be running. The initial connection at port 1918 gives access to the Location Broker data and enough information to call TEMS routines. However from the standpoint of configuration, this is a socket to a 1918 listening port on the server running the TEMS.

2) The initial agent listening port at 1918+N*4096. If there is just one agent installed, the listening port will be 1918+4096 or 6014. If more than one agent is installed, the agents contend for the listening ports. That means incidentally that there is a maximum of 15 agents using the default configuration. The listening port is used for several purposes including retrieval of real time data and receiving broadcasts about a new WPA address.

3) The first working CT_CMSLIST entry that works is recorded as the primary TEMS for the agent. That is important later when Fallback to Primary logic is used.

4) The agent listening port follows that rule for first contact. BUT! BUT! BUT! if communication is lost to the TEMS and later reconnected, a new listening port is acquired from TCP. The fundamental Internet RFC documents require that – to handle late arriving packets or packet fragments smoothly. The port obtained from TCP is known as a temporary or ephemeral port. [Ephemeral is from Greek and means “lives for just one day”]. TCP hands out temporary ports based on platform rules [for example, Linux temporary ports are always greater than 1024]. All TCP implementations also make use of a service file which states which ports must be reserved for expected processes. If every process followed those rules there could never be a conflict!! In any case, an agent listening port can be almost anything after it has been running for a while. ITM is a dynamic system which can manage 20,000 agents or more, and a dynamic environment means things change.

5) Most agents connect to a Warehouse Proxy Agent or WPA. This can be at any port but is now defaulted to 63358 = 1918+15*4096 [or 65100 = 3660+15*4096 for ip.spipe]. The agent finds the contact information in the Location Broker data. The WPA registers at the hub TEMS, the Global Location Broker information is propagated to the Local Location Broker at the remote TEMS, and the agent looks it up as part of the Location Broker data.

6) By default all Agents and most ITM processes start an internal web server. The default ports used are 1920 and 3661 [http and https]. There are several purposes including the ability to start and stop diagnostic tracing dynamically. The first ITM process owns the 1920/3661 ports and the others register with it. If that process stops, the 1920 ownership switches to another ITM process if possible. In addition each internal web server will have two additional ephemeral listening ports. These are used to implement a single web image view. When you connect to the internal web server, you get to a service index page which puts all the resources from all the internal web services in one spot. See Note 2 at the end on how that neat trick is performed. Many ITM installations prefer to turn the internal web server off – exactly how is explained later.

7) Ephemeral ports. ITM makes use of ports which are received from TCP/IP as “the next free port”. These are used to communicate between ITM sub-systems. You can control what ephemeral ports are allocated using the POOL Protocol Modifier.

8) Localhost ports. These are on 127.0.0.1, which is not an internet-capable address. They are used to maintain awareness between ITM processes, such as handling the internal web server switch process. At ITM 630 FP6 there is an environment variable to control what port numbers are used.

9) Each TCP socket connection uses two ports. One is a well known port like the TEMS 1918. The other is a temporary or ephemeral port which is never in listen mode. You can see it in a netstat -an output [local and foreign targets]. It is used to implement the dual-fifo asynchronous pipe logic which TCP sockets provide.

Almost every single number you see in the above description is configurable except for 4096. So if you need to make a change you can always make that change. The rest of the technote shows how to limit or control port usage.
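The default port arithmetic from items 2 and 5 above can be verified with a couple of lines (base ports 1918 for ip.pipe and 3660 for ip.spipe; note that 3660+15*4096 works out to 65100):

```python
# Default ITM agent listening port arithmetic: base + slot * 4096.
BASE_PIPE = 1918    # ip.pipe base port
BASE_SPIPE = 3660   # ip.spipe base port
SKIP = 4096

def agent_listen_port(base, slot):
    """Default listening port for agent slot N [1..15]."""
    return base + slot * SKIP

print(agent_listen_port(BASE_PIPE, 1))    # first agent on a server
print(agent_listen_port(BASE_PIPE, 15))   # WPA default, ip.pipe
print(agent_listen_port(BASE_SPIPE, 15))  # WPA default, ip.spipe
```

Slot 15 being the WPA default is why at most 15 agents fit in the default scheme.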

You can never firmly tie down what ports will be used by any agent long term. You can control initial configuration but then things can drift. In fact the TEMS Audit report uses the fact that an agent got multiple listening ports as a signal that the agent is having communication or configuration issues. Even within the TEMS, you can see that there is no long term socket connection to an agent. If there is a need for real time data for example, the ip address and port are looked up at the time needed. See Note 3 about what the TEMS is REALLY using to communicate… it isn’t always an ip address.

 

ITM Port Usage – Servers

In most ways servers like TEMS/TEPS/S&P/WPA are also agents. You use the same communications string to control them.

Hub TEMS and TEPS must have the internal web server present for normal operations.

The remote TEMS has an added control named gbl_site.txt which identifies which TEMS it connects to. That is unrelated to this technote.

If Fault Tolerant Option is configured, each hub TEMS connects to the other one using the correct base port.

 

Changing the Agent Configuration

To implement changes you need to make a permanent change to the Agent configuration. If you make a communications string change to one ITM process it often has to be made to all ITM processes on that server. Here are two technotes that help explain that process. The first relates to just changing the communications string:

 

 

The second is a more general technote which has been vetted by L3 and Development as valid long term.

 

Limiting ITM Agent Internal Web Server ports.

The internal web server start can be suppressed. To do that just add

 

  • http_server:n

 

to the start of the communications string.

This will eliminate the ports associated with the internal web server:  1920/3661 ports and also the two ephemeral ports which mirror 1920/3661.

Alternatively, you can also eliminate specific HTTP ports by adding HTTP:0 or HTTPS:0 protocol modifiers to the communications string.

NOTE: The internal web server is required on the TEPS and the hub TEMS, otherwise important ITM functions will fail.

 

Limiting ITM listening ports

If you add EPHEMERAL to the connection string like this

  • ip.pipe port:1918 EPHEMERAL:Y use:y

 

then the agent will not use an open listening port or a WPA port. There is a limitation: if historical data is going to be collected, either 1) a WPA is required on the TEMS the agent reports to, or 2) historical data collection must be at the TEMS. This can actually be configured at the TEMS by adding EPHEMERAL:INBOUND to the TEMS connection string.

This also makes setting up connections through firewalls easier since only the base port 1918 needs to be permitted. There is no significant performance impact with this choice. This modifier can be made to one agent but not another on the same server.

The diagnostic log will reference internal virtual ports [like 6015 for example] but these are invisible to TCP.

 

Controlling Ephemeral ports

These ports cannot be eliminated but you can configure them to certain number ranges using the POOL protocol modifier as described here:

 

Sitworld: ITM Protocol Usage and Protocol Modifiers

Universal Agent Ports

The Universal Agent is an ITM component that lets you extend ITM by writing your own agents.

UA uses all the same ports and by default will also use port 1919 to communicate with collectors [IANA registered]. Each data collector process will use an ephemeral port when the socket is formed.

KUMP_LOCAL_DATA=Y configures non-socket communication on a single server. In a very few cases that configuration causes collection issues.

Please consider use of Agent Builder instead, which is being actively developed.

The tacmd createNode Function uses Ports

The tacmd createNode function is largely implemented by a java program running at the hub TEMS. It listens for work on [default] port 1978, a specific bind to the ip address of the system where the java program and TEMS are running. This port can be altered using the TEMS environment variable KDY_MANAGE_PORT and would usually be set in ms.ini or in ms.environment persistent configuration override file.

This function gets used for the first install of an OS Agent on a system. Linux/Unix uses SSH/RSH/REXEC from the hub TEMS to the target agent; for example, SSH usually uses port 22. During agent createNode processing the service port and port 1918 from agent to hub will be used. Afterwards the agent will usually connect to a remote TEMS.

Summary

This document describes ITM port usage and shows ways you can eliminate or control port usage.

Sitworld: Table of Contents

 

Note 1

In z/OS, SNA communications can be used for communication. This document does not apply to that option. The Portal Client can use two communication techniques: http/https and then CORBA communications, which are also not covered in this document. The ITM EIF facility and the LDAP facility can use ports not described here.

 

Note 2 – Multiple Internal Web Servers

This discussion is only for http/https at ports 1920/3661; the logic also applies to the IPV6 equivalents. Suppose there are three agents starting up at the same time. They all attempt to bind to the 1920 and 3661 listening ports. One succeeds and owns those listening ports. The two failing internal web servers make a connection to the winner and they also register the two ephemeral ports which parallel 1920 and 3661. The winner accumulates that data and generates the service index page. For example, if you click on a certain service on the index page, you might well be redirected to the ephemeral parallel port on another internal web server associated with another agent.

Should the agent running the winner be recycled, the other two agents notice that their connections to the winner have failed. At that point they “start all over” and one of them becomes the winner and owner of ports 1920/3661. When the original winner agent starts again, it attempts to get 1920/3661, fails, and then registers like all good losers.

In that way a single image is preserved. It all works just fine unless there are firewall rules that allow only specific ports. You can get to the 1920 port, let’s say, but when you click on a service, you are directed to some ephemeral port and the firewall probably blocks it.

As usual there is considerable flexibility. As seen in the Protocols blog post, you can use HTTP:nnnn and HTTPS:nnnn to specify different listening ports. Incidentally, the ITM ports 1918/1919/1920/3660/3661 are registered with the IANA – Internet Assigned Numbers Authority. That tends to limit accidental conflicts.

 

Note 3 – The TEMS PIPE_ADDR control

In simple cases, TEMS appears to be using the agent ip address. But things are not always simple!

For example, an agent might be behind a Network Address Translation firewall. To connect to a TEMS it might have to use ip address 1.2.3.4 and port 4590. On the TEMS side access would be made to the TEMS ip address and port 1918. ITM Communications logic makes this happen with no configuration needed at all.

TCP socket logic is modeled and extended. A normal TCP socket is the functional equivalent of Unix dual-fifo asynchronous pipes; in effect either side can write to the other at any time. The TEMS PIPE_ADDR is a control that tracks how to communicate with an agent process. For the simplest of cases, it will be like an IP address and all the rest of the information is default and mostly ignored. For the case of an incoming “beyond firewall” connection, an artificial address is used like 0.0.0.20 and some arbitrary port. In that case the TEMS side firewall address and port are recorded, as well as the beyond-firewall agent side address and port… and of course some controls indicating what sort of translation is needed.

These PIPE_ADDR controls are specific to a TEMS. So a hub TEMS would have no idea how to contact an agent connected via a remote TEMS. The hub TEMS passes the work off and the remote TEMS has the PIPE_ADDR information and knows what to do.

This is also used to handle agents configured with EPHEMERAL:Y – where an Agent=>TEMS socket connection multiplexes three different virtual tcp socket connections.

This is also used to handle agents connecting via KDE_Gateway logic – a more complex socket multiplexing solution. KDE_Gateway is needed when there is more than one NATing firewall router between the Agent and the TEMS; the TEMS can handle just a single address translation link.

The last area I have seen this used is when a link on z/OS utilizes SNA communications. It translates between SNA and TCP transparently. There may be others.

This is complex but communications is often complex and ITM6 was built to work in that environment with minimal configuration.

 

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Big Sur 2018 – Highway 1 Restored 18 Months after Mud Creek Collapse

 

Sitworld: Event History #15 High Results Situation to No Purpose

backdeck

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 25 May 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

This was seen in the Summary Section

Total Result Bytes: 1023369249 989.47 K/min Worry[197.89%]

This environment is receiving almost one megabyte of results data per minute. Experience has shown that problems often occur if the result rate is over 500K per minute; that is the source of the “worry” percentage. Your mileage may vary based on the server running the workload. Even if the server can handle the workload, it is never a good idea to perform useless work.
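The worry percentage is simply the observed result rate divided by that 500K-per-minute guideline; checking the summary line above:

```python
# Worry % = observed result rate / 500K-per-minute guideline, as a percent.
GUIDELINE_K_PER_MIN = 500.0

def worry_pct(rate_k_per_min, guideline=GUIDELINE_K_PER_MIN):
    return 100.0 * rate_k_per_min / guideline

# The 989.47 K/min observed rate from the summary line:
print(f"{worry_pct(989.47):.2f}%")   # 197.89%
```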

This was seen in the Report011:  Event/Results Budget Situations Report by Result Bytes

EVENTREPORT011: Event/Results Budget Situations Report by Result Bytes

Situation,Table,Rowsize,Reeval,Event,Event%,Event/min,Results,ResultBytes,Result%,Miss,MissBytes,Dup,DupBytes,Null,NullBytes,SampConfirm,SampConfirmBytes,PureMerge,PureMergeBytes,transitions,nodes,PDT

deb_prccpu_xuxw_aix,UNIXPS,2784,60,978,13.46%,0.97,269576,750499584,73.34%,0,0,0,0,0,0,269576,750499584,0,0,978,39,*IF *VALUE Process.CPU_Pct *GE 1.00 *AND *VALUE Process.Process_Command_U *EQ ‘/opt/BESClient/bin/BESClient’ *AND *VALUE Process.CPU_Pct *LT 4.00,

So there is a situation deb_prccpu_xuxw_aix which runs every 60 seconds and checks for one process, alerting when the CPU% is between 1% and 4%. It runs on 39 agents connected to this remote TEMS.

Remarkably, this one situation causes an estimated 73.34% of the total estimated workload. This is an estimate because the data does not include information about situations which started before the Event Status History data. The actual result data can be higher because of real time data requests.

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted first by node, then by situation, then by time at the TEMS. I will first describe what you see and the guidance from the column description line.

This will show only a single open event and then close event, but there were many listed in the full report.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can be different from the Full Name you see in the Situation editor, for example when the name is too long, among other cases.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time is seen in multiple TEMS seconds because the Agent can produce data faster than the TEMS can process it at times. Simple cases have a last three digits of 999. Other cases will have tie breakers of 000,001,…,998 when a lot of data is being generated. This is the UTC [earlier GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This is the UTC [earlier GMT] time.

Deltastat – Event status. You generally see Y for open and N for close. There are more statuses not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest case is 1, and you would see that if you used the -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.
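The Agent_Time and TEMS_Time values above use the 16-digit ITM timestamp layout CYYMMDDHHMMSSmmm, where the leading century digit 1 indicates years 2000 and later. A small parser, as a sketch (the trailing three digits are milliseconds in simple cases, or the tie breakers noted above):

```python
from datetime import datetime

def parse_itm_time(ts):
    """Parse a 16-digit ITM timestamp CYYMMDDHHMMSSmmm.
    Assumes century digit 1 = year 2000+, 0 = 1900+."""
    ts = str(ts)
    century = 2000 if ts[0] == "1" else 1900
    return datetime(century + int(ts[1:3]),        # year
                    int(ts[3:5]), int(ts[5:7]),    # month, day
                    int(ts[7:9]), int(ts[9:11]),   # hour, minute
                    int(ts[11:13]),                # second
                    int(ts[13:16]) * 1000)         # mmm -> microseconds

# The Agent_Time from the descriptor line below:
print(parse_itm_time("1180410002104999"))
```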

The Descriptor line – before we see the results.

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410002104999,1180410001843000,Y,60,1,,,3850,*IF *VALUE Process.CPU_Pct *GE 1.00 *AND *VALUE Process.Process_Command_U *EQ ‘/opt/BESClient/bin/BESClient’ *AND *VALUE Process.CPU_Pct *LT 4.00,

Situation was deb_prccpu_xuxw_aix, agent was deb_gb02cap070debx7:KUX, thrunode was REMOTE_gbnhham080tmsxm. Agent_time was 1180410002104999 and TEMS_time was 1180410001843000, so the agent clock is running a few minutes ahead. It was an Open event [Y], the sampling interval was 60, there was one result, and there was no DisplayItem. The record came from line number 3850 in the input. The PDT is shown at the end.
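Mapping such a descriptor line to the column list is easy to script; a naive sketch (note the PDT field can itself contain commas, so a simple split is only reliable for the leading fields):

```python
# Pair the report's column names with the descriptor line fields.
# PDT is omitted here because it may contain embedded commas.
header = ("Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,"
          "Reeval,Results,Atomize,DisplayItem,LineNumber,PDT").split(",")
line = ("deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,"
        "REMOTE_gbnhham080tmsxm,1180410002104999,1180410001843000,"
        "Y,60,1,,,3850")
record = dict(zip(header, line.split(",")))
print(record["Deltastat"], record["Reeval"], record["LineNumber"])
```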

Following the descriptor line are one or more P [Predicate/formula] lines as used by the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=UNIXPS.CPUPERCENT >= 100 AND UNIXPS.UCOMMAND = N’/opt/BESClient/bin/BESClient’ AND UNIXPS.CPUPERCENT < 400,

Following the predicate are one or more result lines. These are all in the form Attribute=value, using the Table.Column=raw_data form. There is a leading count giving the index of this result line. In this case there was one P line and one result line. Sometimes there are many more, but not this time.



,,,,,,,0,UNIXPS.ADDR=b424b459;UNIXPS.BCMD=BESClient;UNIXPS.CHILDSTIME=0;UNIXPS.CHILDTIME=0;UNIXPS.CHILDUTIME=0;UNIXPS.CMD=/opt/BESClient/bin/BESClient;UNIXPS.COMMAND=/opt/BESClient/bin/BESClient;UNIXPS.CONTSWIT64=460694;UNIXPS.CONTSWITCH=460694;UNIXPS.CPU=120;UNIXPS.CPUID=-1;UNIXPS.CPUPERCENT=144;UNIXPS.CPUTIME=62;UNIXPS.EGID=0;UNIXPS.EGRPN=system;UNIXPS.ELAPTIME=000d04:25:13;UNIXPS.EUID=0;UNIXPS.EUSERN=root;UNIXPS.EVENT=*;UNIXPS.EXECSTATE=A;UNIXPS.FLAG=   40001;UNIXPS.GID=0;UNIXPS.GRPN=system;UNIXPS.HEAP=-1;UNIXPS.INVCONTS64=11490;UNIXPS.INVCONTSWT=11490;UNIXPS.MAJORFAU64=9;UNIXPS.MAJORFAULT=9;UNIXPS.MEMPERCENT=100;UNIXPS.MINORFAU64=65168;UNIXPS.MINORFAULT=65168;UNIXPS.NICE=22;UNIXPS.ORIGINNODE=deb_gb02cap070debx7:KUX;UNIXPS.PGID=8847390;UNIXPS.PID=11206814;UNIXPS.PPID=1;UNIXPS.PRIORITY=64;UNIXPS.PROCCOUNT=1;UNIXPS.PSU=10792;UNIXPS.RDS=10792;UNIXPS.READWRI64=610219238;UNIXPS.READWRITE=610219238;UNIXPS.RTS=5177;UNIXPS.SCHEDCLASS=N/A;UNIXPS.SESSIONID=8847390;UNIXPS.SIZE=63876;UNIXPS.STACK=-1;UNIXPS.STARTTIME=1

180409195521000;UNIXPS.SYSTEMTIM=000d00:00:11;UNIXPS.SYSTEMTYPE=AIX;UNIXPS.TEXT_SIZE=15207;UNIXPS.THREADCNT=4;UNIXPS.TIME=00001:02;UNIXPS.TIMESTAMP=1180410002034000;UNIXPS.TOTALTIME=000d00:01:02;UNIXPS.TOTCPUPERC=38;UNIXPS.TTY=-;UNIXPS.UCMD=/opt/BESClient/bin/BESClient;UNIXPS.UCOMMAND=/opt/BESClient/bin/BESClient;UNIXPS.UID=0;UNIXPS.UPROCFILT=;UNIXPS.USERNAME=root;UNIXPS.USERTIME=000d00:00:51;UNIXPS.UUSERNAME=root;UNIXPS.VSIZE=58372;UNIXPS.WAITCPUTIM=0;UNIXPS.WAITLKTIME=0;UNIXPS.WLM_NAME=Unclassified;UNIXPS.WPAR_NAME=Global;UNIXPS.ZONEID=-1;UNIXPS.ZONENAME=-1,

Here is where I extracted the value result. This is raw data and represents 1.44%.



==> UNIXPS.CPUPERCENT=144  
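The raw value is the attribute scaled by 100 (two implied decimal places), which is also why the formula’s *GE 1.00 compiles to >= 100 in the predicate:

```python
# UNIXPS.CPUPERCENT is stored scaled by 100 (two implied decimal places),
# so the formula's "*GE 1.00" becomes ">= 100" in the compiled predicate.
def cpu_pct(raw):
    return raw / 100.0

print(cpu_pct(144))   # 1.44
```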

I will skip repeating the  full details. Next you see the results coming in false and then true. Each time it is true I record the value.

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410003604999,1180410003343000,N,60,0,

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410013304999,1180410013043000,Y,60,1

==> UNIXPS.CPUPERCENT=336

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410013504999,1180410013243000,N,60,0,

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410020704999,1180410020444000,Y,60,1

==> UNIXPS.CPUPERCENT=110

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410020804999,1180410020543000,N

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410043404999,1180410043143000,Y,60,1

==> UNIXPS.CPUPERCENT=121

What is the problem and How to fix it?

From this capture, 13.46% of the events and 73.34% of the result workload were from this situation. And this from only 39 agents!

Doing that work constitutes a substantial investment. It fails the basic test of a good situation, which is to be Rare, Exceptional, and Fixable. It is certainly not a rare condition; it seems to be happening all the time. It is happening a lot and no one is “fixing” the condition.

So this situation is not rare, not exceptional, and clearly no one is “fixing” it. Therefore the situation should be rethought and reworked until it is rare, exceptional and fixable. If that is not possible, the situation should be stopped and deleted to make room for other useful work at the agent(s), the TEMS, and the event receivers. If it were stopped, the workload would drop substantially. Thus the situation should be reviewed and justified.

Summary

Tale #15 of using Event Audit History to understand and review a high overhead situation and thus potentially save resources.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Back Deck – Big Sur 1999

 

Sitworld: Event History #14 Lodging Problems

Summer2013

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 21 May 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

This started as diagnosing a situation that produced a lot of results, what that means, and what to do. Then it became more complicated and interesting.

This was seen in the Advisory Report

100,EVENTAUDIT1008E,TEMS,Situations [2] had lodge failures [2] – See report EVENTREPORT020

What could that mean?

This was seen in the Report020: Deltastat X (Problem) Report

Situation,Count,

UADVISOR_O4SRV_TEIBLOGT,1,

bnc_filnp_wntp_bncfmm_em34,1,

Before a situation can run at an agent, the defining SQL statement must be compiled. When that compile completes, the situation is said to be “Lodged”.

This report shows cases where the compilation failed. The most common reason is that the related application support is back-level or even missing. There are less common cases where the situation creator manually created the formula, loaded it with tacmd createSit or tacmd editSit, and the formula had a syntax error. Normally this is impossible, since situations are expected to be authored in the TEP Situation Editor, which works to ensure correct syntax. However it does happen.

What are the problems and How to fix them?

The named situations are not running. In this case it includes a historical data collection against a TEMS database table.

Review the TEMS diagnostic log and discover what attributes are in question and correct the condition.

You can use the TEP itself to locate most of the problems centrally.

In a Portal Client session, from the Enterprise navigation node:

1) right click on Enterprise navigation node

2) select Managed Tivoli Enterprise Management Systems

3) In the bottom left view, right-click on the workspace link [before the hub TEMS entry] and select Installed Catalogs

4) In the new display on the right, right-click in the table, select Properties, click Return all rows, and OK out

5) Resolve any missing or out-of-date application support data. You can right-click and export the data to a local CSV file for easier tracking.

This process reconciles the catalog files; catalog and attribute files are paired, so both get corrected.

Summary

Tale #14 of using Event Audit History is about identifying and correcting situations which are not running because of SQL compile problems.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Big Sur backdoor – Summer 2013

 

Sitworld: Event History #13 Delay Delay Delay

SouthOfNepentheinBigSur

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 10 May 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

This was about diagnosing a situation that produces a lot of results and what that means and what to do. Then it became more complicated and interesting.

This was seen in the Summary Report

Delay Estimate opens[110] over_minimum [6] over_average [3.33 seconds]

There were 110 Situation Event Open conditions. From these, the minimum delay for each agent was calculated, that is, the time in seconds between the Agent time and the TEMS time. On a quick local network that might always be 0 or 1 second. This technique is used to judge delays in the face of time zone differences and time setting differences. In this case 6 Situation Open Events took more than the minimum time, and the average time over the minimum was 3.33 seconds.
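The delay arithmetic can be sketched in Python. This is a minimal sketch, assuming the 16-digit CYYMMDDHHMMSSmmm layout of the ITM timestamps shown in these reports (leading century digit 1 meaning the 2000s); the helper names are mine, not part of any tool:

```python
from datetime import datetime

def parse_itm_timestamp(ts: str) -> datetime:
    """Parse a 16-digit ITM timestamp CYYMMDDHHMMSSmmm.

    The leading century digit 1 means years 2000-2099.
    """
    year = 1900 + int(ts[0]) * 100 + int(ts[1:3])
    return datetime(year, int(ts[3:5]), int(ts[5:7]),
                    int(ts[7:9]), int(ts[9:11]), int(ts[11:13]),
                    int(ts[13:16]) * 1000)  # milliseconds -> microseconds

def delay_seconds(agent_time: str, tems_time: str) -> int:
    """Whole seconds between the Agent timestamp and the TEMS timestamp."""
    span = parse_itm_timestamp(tems_time) - parse_itm_timestamp(agent_time)
    return int(span.total_seconds())

# The descriptor-line example below: agent 23:19:12, TEMS 23:19:16
print(delay_seconds("1180429231912000", "1180429231916015"))  # prints 4
```

Per agent, the tool takes the minimum of these delays as the baseline and reports only the excess over that minimum, which is what cancels out time zone and clock-setting differences.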

This was seen in the Report016: Delay Report by Node and Situation

Node,Situation,Atomize,Delay,Min_delay,GBLTMSTMP,Line,

RZ:multip-multip-vaathmr406:RDB,wlp_fretbsp_grzc_std,,4,0,1180429231916015,2757,

RZ:multip-multip-vaathmr406:RDB,wlp_fretbsp_grzw_std,,3,0,1180429225815145,8103,

RZ:multip-multip-vaathmr406:RDB,wlp_fretbsp_grzw_std,,5,0,1180429235517039,2645,

RZ:multip-multip-vaathmr406:RDB,wlp_tbspro_grzc_std,,6,0,1180429235520040,5155,

You can see the minimum delay was 0 seconds. In the report here the observed delay was 4/3/5/6 seconds. The TEMS timestamp is shown along with the line number of the data file which had the information.

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted first by node, then by situation, then by time at the TEMS. I will first describe what you see, guided by the column description line.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can differ from the Full Name you see in the Situation editor, for example when the Full Name is too long.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time appears in multiple TEMS seconds, because the Agent can at times produce data faster than the TEMS can process it. Simple cases have last three digits of 999. Other cases will have tie breakers of 000,001,…,998 when a lot of data is being generated. This is the UTC [formerly GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This is the UTC [formerly GMT] time.

Deltastat – Event status. You generally see Y for open and N for close. There are more statuses, not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest case is 1, and you would see that if you used the -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.

The Descriptor line – before we see the results.

wlp_fretbsp_grzc_std,RZ:multip-multip-vaathmr406:RDB,REMOTE_va10plvtem021,1180429231912000,1180429231916015,Y,180,5,,,2757,*IF *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘TEMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDO’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘RBS’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘ROLLBACK’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘FNTMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘2013’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS1’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS2’ *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *LE 5.00 *AND *SIT wlp_tbspro_grzc_std *EQ *TRUE,

It is an Oracle Agent situation looking for free-to-allocated space less than or equal to 5%. Note that it does not have any DisplayItem. The delay is measured here: the Agent_Time was 1180429231912000 and the TEMS time was 4 seconds later, 1180429231916015.

Following the descriptor line are one or more P [Predicate/formula] lines as used by the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=STRSCAN(KRZTSNLUE.TNAME, N’TEMP’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’UNDO’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’RBS’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’ROLLBACK’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’FNTMP’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’2013′) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’UNDOTBS1′) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’UNDOTBS2′) = 0 AND KRZTSNLUE.PCIFREE <= 00500  AND KRZTSOVEW.STATUS <> N’READ’,

,

Following the predicate are one or more result lines. These are all in the form Attribute=value, that is, Table/Column=raw_data. Each result line carries a leading index count. In this case there was one P line and 5 result lines.

,,,,,,,0,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=512000;KRZTSNLUE.KBUSED=95744000;KRZTSNLUE.KBYTES=96256000;KRZTSNLUE.KLARGEST=512000;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=500;KRZTSNLUE.MBUSED=93500;KRZTSNLUE.MBYTES=94000;KRZTSNLUE.MLARGEST=500;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9929;KRZTSNLUE.PCIFREE=53;KRZTSNLUE.PCIMAX=72;KRZTSNLUE.PCIUSED=9947;KRZTSNLUE.TIMESTAMP=1180429231627000;KRZTSNLUE.TNAME=CASOL_CH201706_DAT;KRZTSNLUE.TSFGNUM=1,

,,,,,,,1,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=2150400;KRZTSNLUE.KBUSED=83865600;KRZTSNLUE.KBYTES=86016000;KRZTSNLUE.KLARGEST=2150400;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=2100;KRZTSNLUE.MBUSED=81900;KRZTSNLUE.MBYTES=84000;KRZTSNLUE.MLARGEST=2100;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9938;KRZTSNLUE.PCIFREE=250;KRZTSNLUE.PCIMAX=64;KRZTSNLUE.PCIUSED=9750;KRZTSNLUE.TIMESTAMP=1180429231627000;KRZTSNLUE.TNAME=CASOL_CH201706_IDX;KRZTSNLUE.TSFGNUM=1,

,,,,,,,2,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=4096000;KRZTSNLUE.KBUSED=86016000;KRZTSNLUE.KBYTES=90112000;KRZTSNLUE.KLARGEST=4096000;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=4000;KRZTSNLUE.MBUSED=84000;KRZTSNLUE.MBYTES=88000;KRZTSNLUE.MLARGEST=4000;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9936;KRZTSNLUE.PCIFREE=455;KRZTSNLUE.PCIMAX=67;KRZTSNLUE.PCIUSED=9545;KRZTSNLUE.TIMESTAMP=1180429231627000;KRZTSNLUE.TNAME=CASOL_CH201707_DAT;KRZTSNLUE.TSFGNUM=1,

,,,,,,,3,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=2764800;KRZTSNLUE.KBUSED=77107200;KRZTSNLUE.KBYTES=79872000;KRZTSNLUE.KLARGEST=2764800;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=2700;KRZTSNLUE.MBUSED=75300;KRZTSNLUE.MBYTES=78000;KRZTSNLUE.MLARGEST=2700;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9943;KRZTSNLUE.PCIFREE=346;KRZTSNLUE.PCIMAX=60;KRZTSNLUE.PCIUSED=9654;KRZTSNLUE.TIMESTAMP=1180429231627000;KRZTSNLUE.TNAME=CASOL_CH201707_IDX;KRZTSNLUE.TSFGNUM=1,

,,,,,,,4,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=512000;KRZTSNLUE.KBUSED=95744000;KRZTSNLUE.KBYTES=96256000;KRZTSNLUE.KLARGEST=512000;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=500;KRZTSNLUE.MBUSED=93500;KRZTSNLUE.MBYTES=94000;KRZTSNLUE.MLARGEST=500;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9929;KRZTSNLUE.PCIFREE=53;KRZTSNLUE.PCIMAX=72;KRZTSNLUE.PCIUSED=9947;KRZTSNLUE.TIMESTAMP=1180429231627000;KRZTSNLUE.TNAME=CASOL_CH201708_DAT;KRZTSNLUE.TSFGNUM=1,

The distinguishing attribute is KRZTSNLUE.TNAME. For example result zero has CASOL_CH201706_DAT and result 1 has CASOL_CH201706_IDX, and so on.
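Pulling the distinguishing attribute out of a result line amounts to splitting the semicolon-separated Attribute=value pairs. A minimal Python sketch (the parser is mine; the field names are taken from the sample lines above):

```python
def parse_result_line(line: str) -> dict:
    """Split a result line of Table.Column=value pairs (';' separated) into a dict."""
    pairs = {}
    for field in line.rstrip(",").split(";"):
        if "=" in field:
            name, _, value = field.partition("=")
            pairs[name.strip()] = value
    return pairs

# An abbreviated copy of result zero above
result0 = parse_result_line(
    "KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.PCIFREE=53;"
    "KRZTSNLUE.TNAME=CASOL_CH201706_DAT;KRZTSNLUE.TSFGNUM=1,")
print(result0["KRZTSNLUE.TNAME"])    # prints CASOL_CH201706_DAT
print(result0["KRZTSNLUE.PCIFREE"])  # prints 53
```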

What are the problems and How to fix them?

The first problem is that there are five results and they are being merged – so four are lost. That could be resolved by configuring DisplayItem to KRZTSNLUE.TNAME.

The time delay of several seconds is a hint that the TEMS involved is overloaded, the network might be slow at times, or the agent may be impacted by another process. It isn’t a big issue like missing situation events but it should be considered and checked.

Summary

Tale #13 of using Event Audit History is about reviewing a case where a situation was observed with delays, which also led to a case of missing situation events because no DisplayItem was configured.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Looking South from Nepenthe Restaurant – Big Sur 2003

 

Sitworld: Event History #12 High Impact Situations And Much More

OpenAirPool

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 1 May 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

This was about diagnosing a situation that produces a lot of results and what that means and what to do. Then it became more complicated and interesting.

This was seen in the Summary Report

Total Result Bytes: 11698309 49.38 K/min Worry[9.88%]

49.38K bytes per minute incoming to a hub TEMS is not usually something to worry about; the standard worry point is 500K bytes per minute. Even so it is always interesting to review situations which are dominating and can be rethought. In case you ever wondered, the maximum incoming result rate ever seen was 93 megabytes/minute, and 128 remote TEMS were crippled.
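The Worry figure is just the observed rate measured against that 500K/minute worry point; a quick check of the arithmetic:

```python
# Worry% = observed incoming result rate / standard worry point
rate_k_per_min = 49.38          # K bytes/minute, from the summary line
worry_point = 500.0             # standard worry point, K bytes/minute
worry_pct = 100.0 * rate_k_per_min / worry_point
print(f"Worry[{worry_pct:.2f}%]")   # prints Worry[9.88%]
```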

This was seen in the Report011: Event/Results Budget Situations Report by Result Bytes

EVENTREPORT011: Event/Results Budget Situations Report by Result Bytes

Situation,Table,Rowsize,Reeval,Event,Event%,Event/min,Results,ResultBytes,Result%,Miss,MissBytes,Dup,DupBytes,Null,NullBytes,SampConfirm,SampConfirmBytes,PureMerge,PureMergeBytes,transitions,nodes,PDT

wlp_fretbsp_grzw_std,KRZTSOVEW,384,180,13,7.26%,0.06,14415,5535360,47.32%,13986,5370624,0,0,0,0,422,162048,0,0,13,1,*IF *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘TEMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDO’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘RBS’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘ROLLBACK’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘FNTMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘2013’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS1’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS2’ *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *GT 5.00 *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *LE 10.00 *AND *SIT wlp_tbspro_grzc_std *EQ *TRUE,

For this discussion, we look at columns 8, 9, 10, and 22.

8 Results                             14415

9 ResultBytes                     5535360

10 Result%                        47.32%

22 Nodes                           1

These numbers are estimated from the Event History Status table. If the TEMS has been running a long time and the event history table has wrapped, there could well be unseen workload components. For accuracy you need a TEMS Audit report, which directly measures incoming result workload.

This one situation, wlp_fretbsp_grzw_std, was observed running on a single agent and was producing almost half the incoming result workload. Again the impact is not high, but it does seem unusual and needs further analysis. The formula excludes a lot of tablespace names and then looks for a percentage free-to-allocated between 5% and 10%. That is the sort of formula that may violate the general rule of good situations: Rare, Exceptional, and Fixable. The fact that there are so many positive results strongly suggests that it is neither Rare nor Exceptional and that no one is fixing the condition. More reports below supply more details.

This was seen in the Report018: Situations processed but not forwarded

EVENTREPORT018: Situations processed but not forwarded

Situation,Count,Nodes,

wlp_fretbsp_grzw_std,13,1,

This means that some situations had events forwarded to an event receiver, but this one did not. So 47% of the TEMS workload provided monitoring for a condition that was never sent to an event receiver. That might explain why no one has fixed the condition. In one very bad case, a hub TEMS was suffering and 80% of the incoming events were created by one situation on one AIX Unix OS Agent running on a decommissioned AIX system that was supposed to be powered off. A non-forwarded situation is not always wrong, of course, but it always needs to be reviewed and justified.

This was seen in the Report004:  Situations with Multiple results at TEMS with same DisplayItem at same second

EVENTREPORT004: Situations with Multiple results at TEMS with same DisplayItem at same second

Situation,Type,TEMS_Second,Results,Agent,Atomize,Atom,

wlp_fretbsp_grzc_std,Sampled,1180429225812000,5,RZ:multip-multip-vaathmr406:RDB,,,

wlp_fretbsp_grzc_std,Sampled,1180429231916000,5,RZ:multip-multip-vaathmr406:RDB,,,

Here we see that Situation Events are being produced and each time there are 5 results. The Atomize and Atom columns are null, which means that no DisplayItem has been set. This also means that events are being lost and never seen. Of course, since the situation is not forwarded to an event receiver, that may not matter. More information appears in the detail report later.

This was seen in the Report026:  Situations showing high Open<->Close rate

EVENTREPORT026: Situations showing high Open<->Close rate

Situation,Reeval,Rate,Node_ct,PDT

wlp_fretbsp_grzw_std,180,3.37,1,*IF *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘TEMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDO’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘RBS’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘ROLLBACK’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘FNTMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘2013’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS1’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS2’ *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *GT 5.00 *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *LE 10.00 *AND *SIT wlp_tbspro_grzc_std *EQ *TRUE,

We see here that the situation has a re-evaluation interval of 180 seconds, 3 minutes. Every hour there are an average of 3.37 open->close or close->open transitions [per agent], and only one agent/node is involved.

Situations that show a rapid rate of opening and closing are suspect. It usually means they fail the basic test of Rare/Exceptional: how can a condition be rare if it shows a new open event multiple times an hour? This needs to be examined for reasonableness. The lack of a DisplayItem may be causing this effect; when DisplayItems are configured, each result can separately control a situation event, and that means there is less internal TEMS confusion. However, since the situation is not forwarded, this may be just internal friction, wasting resources to no benefit.
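The Report026 rate boils down to counting open<->close status flips over the period covered. A minimal sketch, with hypothetical inputs (the real tool works from the TSITSTSH history dump):

```python
def transition_rate(statuses, hours: float) -> float:
    """Count open<->close transitions in a time-ordered status list
    ('Y' = open, 'N' = close) and return transitions per hour."""
    transitions = sum(1 for prev, cur in zip(statuses, statuses[1:]) if prev != cur)
    return transitions / hours

# A situation flapping on every sample: 8 status changes over 2 hours
print(transition_rate(list("YNYNYNYNY"), 2.0))  # prints 4.0
```

A steady condition gives a rate near zero; a rate in the single digits per hour, as here, is the flapping signature worth investigating.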

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted first by node, then by situation, then by time at the TEMS. I will first describe what you see, guided by the column description line.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can differ from the Full Name you see in the Situation editor, for example when the Full Name is too long.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time appears in multiple TEMS seconds, because the Agent can at times produce data faster than the TEMS can process it. Simple cases have last three digits of 999. Other cases will have tie breakers of 000,001,…,998 when a lot of data is being generated. This is the UTC [formerly GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This is the UTC [formerly GMT] time.

Deltastat – Event status. You generally see Y for open and N for close. There are more statuses, not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest case is 1, and you would see that if you used the -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.

The Descriptor line – before we see the results.

wlp_fretbsp_grzw_std,RZ:multip-multip-vaathmr406:RDB,REMOTE_va10plvtem021,1180429225812000,1180429225815145,Y,180,5,,,8103,*IF *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘TEMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDO’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘RBS’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘ROLLBACK’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘FNTMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘2013’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS1’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS2’ *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *GT 5.00 *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *LE 10.00 *AND *SIT wlp_tbspro_grzc_std *EQ *TRUE,

This is mostly what we have seen before. We do get to see the TEMS and Agent Time. We can also see that this was seen at line 8103 of the input.

Following the descriptor line are one or more P [Predicate/formula] lines as used by the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=STRSCAN(KRZTSNLUE.TNAME, N’TEMP’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’UNDO’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’RBS’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’ROLLBACK’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’FNTMP’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’2013′) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’UNDOTBS1′) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’UNDOTBS2′) = 0 AND KRZTSNLUE.PCIFREE > 00500 AND KRZTSNLUE.PCIFREE <= 01000  AND KRZTSOVEW.STATUS <> N’READ’,

,

Following the predicate are one or more result lines. These are all in the form Attribute=value, that is, Table/Column=raw_data. Each result line carries a leading index count. In this case there was one P line and 5 result lines.

,,,,,,,0,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=7168000;KRZTSNLUE.KBUSED=131072000;KRZTSNLUE.KBYTES=138240000;KRZTSNLUE.KLARGEST=7168000;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=7000;KRZTSNLUE.MBUSED=128000;KRZTSNLUE.MBYTES=135000;KRZTSNLUE.MLARGEST=7000;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9902;KRZTSNLUE.PCIFREE=519;KRZTSNLUE.PCIMAX=103;KRZTSNLUE.PCIUSED=9481;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201605_DAT;KRZTSNLUE.TSFGNUM=1,

,,,,,,,1,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=26214400;KRZTSNLUE.KBUSED=466227200;KRZTSNLUE.KBYTES=492441600;KRZTSNLUE.KLARGEST=26214400;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=25600;KRZTSNLUE.MBUSED=455300;KRZTSNLUE.MBYTES=480900;KRZTSNLUE.MLARGEST=25600;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9653;KRZTSNLUE.PCIFREE=532;KRZTSNLUE.PCIMAX=367;KRZTSNLUE.PCIUSED=9468;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201605_IDX;KRZTSNLUE.TSFGNUM=1,

,,,,,,,2,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=17408000;KRZTSNLUE.KBUSED=234496000;KRZTSNLUE.KBYTES=251904000;KRZTSNLUE.KLARGEST=17408000;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=17000;KRZTSNLUE.MBUSED=229000;KRZTSNLUE.MBYTES=246000;KRZTSNLUE.MLARGEST=17000;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9825;KRZTSNLUE.PCIFREE=691;KRZTSNLUE.PCIMAX=188;KRZTSNLUE.PCIUSED=9309;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201612_DAT;KRZTSNLUE.TSFGNUM=1,

,,,,,,,3,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=11264000;KRZTSNLUE.KBUSED=206848000;KRZTSNLUE.KBYTES=218112000;KRZTSNLUE.KLARGEST=11264000;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=11000;KRZTSNLUE.MBUSED=202000;KRZTSNLUE.MBYTES=213000;KRZTSNLUE.MLARGEST=11000;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9846;KRZTSNLUE.PCIFREE=516;KRZTSNLUE.PCIMAX=163;KRZTSNLUE.PCIUSED=9484;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201702_DAT;KRZTSNLUE.TSFGNUM=1,

,,,,,,,4,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=25395200;KRZTSNLUE.KBUSED=462131200;KRZTSNLUE.KBYTES=487526400;KRZTSNLUE.KLARGEST=25395200;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=24800;KRZTSNLUE.MBUSED=451300;KRZTSNLUE.MBYTES=476100;KRZTSNLUE.MLARGEST=24800;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9656;KRZTSNLUE.PCIFREE=521;KRZTSNLUE.PCIMAX=363;KRZTSNLUE.PCIUSED=9479;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201705_IDX;KRZTSNLUE.TSFGNUM=1,

Reviewing the attributes, it appears that the differentiating attribute is KRZTSNLUE.TNAME, which takes on five different values; the last is CASOL_CR201705_IDX.

What are the problems and How to fix them?

The first problem is that there are five results and they are being merged – so four are lost. That could be resolved by configuring DisplayItem to KRZTSNLUE.TNAME.

The second problem is that the Situation event goes open and closed an average of 3.37 times per hour. This means it is not Rare/Exceptional, so the formula needs to be reviewed for reasonableness. Most important: who is going to fix the condition? If no one is going to fix the condition, the situation should be stopped and deleted. The best performance gain comes from not doing unnecessary work.

The third problem is that the Situation is not forwarded. This is only alerted when some situations are forwarded but this one is not. It is often seen when a situation is being developed [and tested on a single system] and then forgotten. There can also be good reasons for not forwarding, such as when the situation is used by a Workflow Policy or in an *UNTIL/*SIT control to block other situations. However it is most often an oversight, and such a situation should be stopped and deleted.

Summary

Tale #12 of using Event Audit History is about reviewing a case where half the TEMS workload is coming from a single situation.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Open Air Pool – Cruise Ship Build 2015

 

Sitworld: Event History #11 Detailed Attribute differences on first two merged results

RadarDomeLift

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 27 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

One of the largest difficulties was understanding what happened when two [or more] results were merged into a single event. There are so many attribute values to compare that it can be tedious. This needed a new report section!

This was seen in the Event Audit History Report Section

EVENTREPORT007: Detailed Attribute differences on first two merged results

Situation,Node,Agent_Time,Reeval,Results,Atom,Atomize,Attribute_Differences

bnc_wasaftergc_gynp_dseiiwas,PWNESTBN:bnc_itmaxpapnestba:KYNS,1180410062604000,300,2,KYNGCCYC.SERVER_NAM,PWNESTBA,KYNGCCYC.AF_NO 1[31178] 2[31128],KYNGCCYC.BYTE_FREED 1[1793354] 2[1786754],KYNGCCYC.BYTE_USED 1[303798] 2[310398],KYNGCCYC.FINAL_REFS 1[826] 2[1165],KYNGCCYC.GC_NO 1[31179] 2[31129],KYNGCCYC.GC_TIME 1[1180410062522599] 2[1180410062451969],KYNGCCYC.HEAP_AVAIL 1[1793354] 2[1786754],KYNGCCYC.SOFT_REFS 1[4] 2[0],KYNGCCYC.TIME_COMP 1[441] 2[569],KYNGCCYC.TIME_MARK 1[172] 2[169],KYNGCCYC.WEAK_REFS 1[340] 2[412],,

This involved the bnc_wasaftergc_gynp_dseiiwas situation, which was delivered from agent PWNESTBN:bnc_itmaxpapnestba:KYNS; the Agent time was 1180410062604000. The sampling interval was 300 seconds [a Sampled situation] and there were two results merged. There was a DisplayItem of KYNGCCYC.SERVER_NAM and the Atomize value was PWNESTBA, which explains why they were merged.

Attribute by Attribute comparison.

At the end of each report line is a comparison between each attribute that is different between the first and the second result rows. If there were more than two results, this comparison is still only between the first two. The idea is to make it easier to compare the two. More comments after.

KYNGCCYC.AF_NO 1[31178] 2[31128],

KYNGCCYC.BYTE_FREED 1[1793354] 2[1786754],

KYNGCCYC.BYTE_USED 1[303798] 2[310398],

KYNGCCYC.FINAL_REFS 1[826] 2[1165],

KYNGCCYC.GC_NO 1[31179] 2[31129],

KYNGCCYC.GC_TIME 1[1180410062522599] 2[1180410062451969],

KYNGCCYC.HEAP_AVAIL 1[1793354] 2[1786754],

KYNGCCYC.SOFT_REFS 1[4] 2[0],

KYNGCCYC.TIME_COMP 1[441] 2[569],

KYNGCCYC.TIME_MARK 1[172] 2[169],

KYNGCCYC.WEAK_REFS 1[340] 2[412],,

Sometimes you can spot an attribute that would make a better DisplayItem, not here though.
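The attribute-by-attribute comparison amounts to diffing two result dictionaries and printing the mismatches in the report's 1[…] 2[…] style. A minimal sketch (the function name is mine; the sample values come from the report line above):

```python
def attribute_differences(first: dict, second: dict):
    """List attributes whose values differ between two merged results,
    formatted in the report's '1[...] 2[...]' style."""
    diffs = []
    for name in sorted(first):
        if name in second and first[name] != second[name]:
            diffs.append(f"{name} 1[{first[name]}] 2[{second[name]}]")
    return diffs

r1 = {"KYNGCCYC.SOFT_REFS": "4", "KYNGCCYC.TIME_MARK": "172",
      "KYNGCCYC.SERVER_NAM": "PWNESTBA"}
r2 = {"KYNGCCYC.SOFT_REFS": "0", "KYNGCCYC.TIME_MARK": "169",
      "KYNGCCYC.SERVER_NAM": "PWNESTBA"}
print(attribute_differences(r1, r2))
# prints ['KYNGCCYC.SOFT_REFS 1[4] 2[0]', 'KYNGCCYC.TIME_MARK 1[172] 2[169]']
```

Note that identical attributes, like SERVER_NAM here, drop out of the diff, which is exactly why the report only shows the differing ones.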

The KYNGCCYC.GC_TIME is really interesting: selecting out the minute and second, the first is 25:22 and the second is 24:51, about 31 seconds prior. Since the sampling interval is 300 seconds, these two result sets cannot be from the same agent, even though they have the same server name KYNGCCYC.SERVER_NAM of PWNESTBA. Next notice the agent name PWNESTBN:bnc_itmaxpapnestba:KYNS. The first section is often the hostname and this is PWNESTBN, just one character away from PWNESTBA.
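That timestamp reasoning can be checked mechanically: parse the two GC_TIME values and confirm the gap is far smaller than the 300-second sampling interval, which is what rules out a single agent. A sketch, again assuming the 16-digit CYYMMDDHHMMSSmmm timestamp layout (helper name is mine):

```python
from datetime import datetime

def parse_itm_timestamp(ts: str) -> datetime:
    """Parse a 16-digit ITM timestamp CYYMMDDHHMMSSmmm (leading 1 = 20xx)."""
    year = 1900 + int(ts[0]) * 100 + int(ts[1:3])
    return datetime(year, int(ts[3:5]), int(ts[5:7]), int(ts[7:9]),
                    int(ts[9:11]), int(ts[11:13]), int(ts[13:16]) * 1000)

gc1 = parse_itm_timestamp("1180410062522599")   # first result,  06:25:22.599
gc2 = parse_itm_timestamp("1180410062451969")   # second result, 06:24:51.969
gap = (gc1 - gc2).total_seconds()
print(round(gap))   # prints 31 -- far less than the 300-second sampling interval
print(gap < 300)    # prints True: one agent could not sample twice this fast
```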

What is the problem and How to fix it?

The problem is that there are two results and they are being merged – so one is lost.

From the analysis above, there are two agents which have been accidentally configured with the same name and they are conflicting with each other. They are sending results every 300 seconds. The results arrive in a large collection area identified only by agent name [and situation name and DisplayItem and time etc]. The TEMS dataserver [SQL processor] wakes up every 300 seconds and looks for results for that situation. It finds them [two in this case] and creates a potential situation event package that SITMON [situation monitor logic] then bundles together.

The solution is to review the environment and determine what the duplicated agents are and correct the incorrect agent configuration. That way the agent names will be unique and when this happens again there will be two situation events created.

Often the knowledge of a potential duplicate condition and the agent name is enough to lead the agent owners to the correct ones to fix.

Other times these can be detected with a TEPS Audit report. Agents like this often send inconsistent node statuses, like a changing IP address. The TEPS is very sensitive and complains [produces error messages], and the TEPS Audit will summarize such complaints. In other cases the hub and remote TEMS need diagnostic tracing, and a TEMS Audit report will point the way.

Summary

Tale #11 of using Event Audit History is about reviewing a case where there is evidence of duplicate agent names.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Radar Dome Lift – January 2016

 

Sitworld: Event History #10 lost events because DisplayItem missing or null Atoms

fittingEngine

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 24 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

This was seen in the Event Audit History Advisory section:

25,EVENTAUDIT1010W,TEMS,Situations [1] lost events because DisplayItem missing or null Atoms – see EVENTREPORT001

One situation showed evidence that events were lost because DisplayItem was not configured or because the Atomize value was null. This is parallel to Event History #5 but from a different viewpoint.

And in that report section:

EVENTREPORT001: Multiple results in one second but DisplayItem missing or null atoms found

Situation,Type,Agent_Second,Results,Agent,Atomize,Atom,

dow_evtlog_4ntw01_bkpv1,Pure,1180421001913000,3,Primary:roh_dewfs02:NT,,,

There is a pure situation dow_evtlog_4ntw01_bkpv1. At Agent second 1180421001913000 there were 3 total results from agent Primary:roh_dewfs02:NT. There was no DisplayItem or Atomize value.

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted first by node, then by situation, then by time at the TEMS. I will first describe what you see, guided by the column description line.

This will show only a single open event and then close event, but there were many listed in the full report.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can be different from the Full Name you see in the situation editor, for example when the name is too long.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time appears in multiple TEMS seconds, because at times the Agent can produce data faster than the TEMS can process it. Simple cases have a last three digits of 999. Other cases have tie breakers of 000,001,…,998 when a lot of data is being generated. This is the UTC [earlier GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This is the UTC [earlier GMT] time.

Deltastat – event status. You generally see Y for open and N for close. There are more not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest cases are 1 and you would see that if you used -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.

The Descriptor line – before we see the results.

dow_evtlog_4ntw01_bkpv1,Primary:roh_dewfs02:NT,REMOTE_nltnzsdowm018,1180421001912000,1180421001913000,Y,0,4,,,7071,*IF ( ( *VALUE NT_Event_Log.Event_ID *EQ 34080 *AND *VALUE NT_Event_Log.Source *EQ ‘Backup Exec’ *AND *VALUE NT_Event_Log.Log_Name *EQ Application *AND *VALUE NT_Event_Log.Type *NE Information ) *OR ( *VALUE NT_Event_Log.Event_ID *EQ 34113 *AND *VALUE NT_Event_Log.Source *EQ ‘Backup Exec’ *AND *VALUE NT_Event_Log.Log_Name *EQ Application *AND *VALUE NT_Event_Log.Type *NE Information ) *OR ( *VALUE NT_Event_Log.Event_ID *EQ 57476 *AND *VALUE NT_Event_Log.Source *EQ ‘Backup Exec’ *AND *VALUE NT_Event_Log.Log_Name *EQ Application *AND *VALUE NT_Event_Log.Type *NE Information ) *OR ( *VALUE NT_Event_Log.Event_ID *EQ 57477 *AND *VALUE NT_Event_Log.Source *EQ ‘Backup Exec’ *AND *VALUE NT_Event_Log.Log_Name *EQ Application *AND *VALUE NT_Event_Log.Type *NE Information ) ),

Situation was dow_evtlog_4ntw01_bkpv1, agent was Primary:roh_dewfs02:NT, thrunode was REMOTE_nltnzsdowm018. Agent_time was 1180421001912000 and TEMS_time was 1180421001913000, one second later. It was an Open event [Y], the sampling interval was 0 meaning a pure situation, there were four results and no DisplayItem or Atomize. The record came from line number 7071 in the input. The PDT is shown at the end.
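Descriptor lines like this can be pulled apart programmatically. Because the PDT field itself contains commas, a naive CSV split breaks; one workable sketch (my helper, shown with a hypothetical shortened line, assuming the first eleven fields never contain commas) splits with a field limit so everything after the eleventh comma stays in PDT:

```python
KEYS = ["Situation", "Node", "Thrunode", "Agent_Time", "TEMS_Time",
        "Deltastat", "Reeval", "Results", "Atomize", "DisplayItem",
        "LineNumber", "PDT"]

def parse_descriptor(line: str) -> dict:
    # Split at the first 11 commas only; the remainder (the PDT, which can
    # contain commas of its own) stays intact as the final field.
    return dict(zip(KEYS, line.split(",", 11)))

# Hypothetical, shortened descriptor line in the report's layout
rec = parse_descriptor(
    "example_sit,Primary:host01:NT,REMOTE_tems1,"
    "1180421001912000,1180421001913000,Y,0,4,,,7071,"
    "*IF *VALUE NT_Event_Log.Event_ID *EQ 34080 *AND "
    "*VALUE NT_Event_Log.Source *EQ 'Backup Exec',")
```

With the fields in a dictionary it is easy to filter, say, all Open events for one situation.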

Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=( ( NTEVTLOG.EVENTID = 34080 AND NTEVTLOG.SOURCE = ‘Backup Exec’ AND NTEVTLOG.LOGNAME = ‘Application’ AND NTEVTLOG.TYPE <> ‘Information’ ) OR ( NTEVTLOG.EVENTID = 34113 AND NTEVTLOG.SOURCE = ‘Backup Exec’ AND NTEVTLOG.LOGNAME = ‘Application’ AND NTEVTLOG.TYPE <> ‘Information’ ) OR ( NTEVTLOG.EVENTID = 57476 AND NTEVTLOG.SOURCE = ‘Backup Exec’ AND NTEVTLOG.LOGNAME = ‘Application’ AND NTEVTLOG.TYPE <> ‘Information’ ) OR ( NTEVTLOG.EVENTID = 57477 AND NTEVTLOG.SOURCE = ‘Backup Exec’ AND NTEVTLOG.LOGNAME = ‘Application’ AND NTEVTLOG.TYPE <> ‘Information’ ) ),

Following the predicate are three result lines and a second predicate line. These are all in the form Attribute=value, that is Table/Column=raw_data. There is a leading count giving the index of each result line. In this case there were two P lines and three result lines.

,,,,,,,0,NTEVTLOG.CATEGORY=None;NTEVTLOG.COMPUTER=DEWFS02;NTEVTLOG.DESCRIP=Backup Exec Alert: Job Failed (Server: “DEWFS02”) (Job: “DEWFS02_FULL”) DEWFS02_FULL — The job failed with the following error: A communications failure has occurred.      For more information, click the following link:  http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml;NTEVTLOG.DUPCNT=0;NTEVTLOG.ENTRYTIME=1180420231906000;NTEVTLOG.EVENTDATE=04/20/18;NTEVTLOG.EVENTID=34113;NTEVTLOG.EVENTIDSTR=34113;NTEVTLOG.EVENTTIME=23:19:06;NTEVTLOG.LOGNAME=Application;NTEVTLOG.ORIGINNODE=Primary:roh_dewfs02:NT;NTEVTLOG.RECNUMBER=498366;NTEVTLOG.SOURCE=Backup Exec;NTEVTLOG.TIMESTAMP=1180420231907000;NTEVTLOG.TYPE=Error;NTEVTLOG.UCATEGORY=None;NTEVTLOG.UDESCRIP=Backup Exec Alert: Job Failed (Server: “DEWFS02”) (Job: “DEWFS02_FULL”) DEWFS02_FULL — The job failed with the following error: A communications failure has occurred.      For more information, click the following link:  http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml;NTEVTLOG.ULOGNAME=Application;NTEVTLOG.USERID=N/A;NTEVTLOG.USOURCE=Backup Exec;NTEVTLOG.UUSERID=N/A,

,,,,,,,1,NTEVTLOG.CATEGORY=None;NTEVTLOG.COMPUTER=DEWFS02;NTEVTLOG.DESCRIP=Backup Exec Alert: Job Failed (Server: “DEWFS02”) (Job: “DEWFS02_FULL”) DEWFS02_FULL — The job failed with the following error: A communications failure has occurred.      For more information, click the following link:  http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml;NTEVTLOG.DUPCNT=0;NTEVTLOG.ENTRYTIME=1180420231906000;NTEVTLOG.EVENTDATE=04/20/18;NTEVTLOG.EVENTID=34113;NTEVTLOG.EVENTIDSTR=34113;NTEVTLOG.EVENTTIME=23:19:06;NTEVTLOG.LOGNAME=Application;NTEVTLOG.ORIGINNODE=Primary:roh_dewfs02:NT;NTEVTLOG.RECNUMBER=498366;NTEVTLOG.SOURCE=Backup Exec;NTEVTLOG.TIMESTAMP=1180420231907000;NTEVTLOG.TYPE=Error;NTEVTLOG.UCATEGORY=None;NTEVTLOG.UDESCRIP=Backup Exec Alert: Job Failed (Server: “DEWFS02”) (Job: “DEWFS02_FULL”) DEWFS02_FULL — The job failed with the following error: A communications failure has occurred.      For more information, click the following link:  http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml;NTEVTLOG.ULOGNAME=Application;NTEVTLOG.USERID=N/A;NTEVTLOG.USOURCE=Backup Exec;NTEVTLOG.UUSERID=N/A,

Sometimes there are multiple captures bundled together in a single result row arrival. In that case you see a second P line.

,,,,,,,P,*PREDICATE=( ( NTEVTLOG.EVENTID = 34080 AND NTEVTLOG.SOURCE = ‘Backup Exec’ AND NTEVTLOG.LOGNAME = ‘Application’ AND NTEVTLOG.TYPE <> ‘Information’ ) OR ( NTEVTLOG.EVENTID = 34113 AND NTEVTLOG.SOURCE = ‘Backup Exec’ AND NTEVTLOG.LOGNAME = ‘Application’ AND NTEVTLOG.TYPE <> ‘Information’ ) OR ( NTEVTLOG.EVENTID = 57476 AND NTEVTLOG.SOURCE = ‘Backup Exec’ AND NTEVTLOG.LOGNAME = ‘Application’ AND NTEVTLOG.TYPE <> ‘Information’ ) OR ( NTEVTLOG.EVENTID = 57477 AND NTEVTLOG.SOURCE = ‘Backup Exec’ AND NTEVTLOG.LOGNAME = ‘Application’ AND NTEVTLOG.TYPE <> ‘Information’ ) ),

,,,,,,,2,NTEVTLOG.CATEGORY=None;NTEVTLOG.COMPUTER=DEWFS02;NTEVTLOG.DESCRIP=Backup Exec Alert: Job Failed (Server: “DEWFS02”) (Job: “DEWFS02_FULL”) DEWFS02_FULL — The job failed with the following error: A communications failure has occurred.      For more information, click the following link:  http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml;NTEVTLOG.DUPCNT=0;NTEVTLOG.ENTRYTIME=1180420231906000;NTEVTLOG.EVENTDATE=04/20/18;NTEVTLOG.EVENTID=34113;NTEVTLOG.EVENTIDSTR=34113;NTEVTLOG.EVENTTIME=23:19:06;NTEVTLOG.LOGNAME=Application;NTEVTLOG.ORIGINNODE=Primary:roh_dewfs02:NT;NTEVTLOG.RECNUMBER=498366;NTEVTLOG.SOURCE=Backup Exec;NTEVTLOG.TIMESTAMP=1180420231907000;NTEVTLOG.TYPE=Error;NTEVTLOG.UCATEGORY=None;NTEVTLOG.UDESCRIP=Backup Exec Alert: Job Failed (Server: “DEWFS02”) (Job: “DEWFS02_FULL”) DEWFS02_FULL — The job failed with the following error: A communications failure has occurred.      For more information, click the following link:  http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml;NTEVTLOG.ULOGNAME=Application;NTEVTLOG.USERID=N/A;NTEVTLOG.USOURCE=Backup Exec;NTEVTLOG.UUSERID=N/A,

The P lines and the index lines are created by this event history report for easier viewing. In the raw data they are all run together, separated by tildes [~], often with embedded blanks.

What is the problem and How to fix it?

In this case there were three identical results; two were “lost” and only one situation event was created. In the full report there were hundreds of identical results. One way to reduce the impact of all the results being created, transmitted and processed would be to configure the agent to suppress duplicate results. Another question that must be asked is whether the condition will ever be fixed: is someone responsible for correcting it, or for understanding how to avoid it in the future? If you are going to generate and process alert warnings, that needs to be someone’s responsibility. If not, the situation could be suppressed by altering the formula, or even stopped and deleted. The best performance gain is not doing work that isn’t needed.
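The duplicate-suppression idea can be sketched as follows (illustrative only; the real control is agent-side configuration, and the exact option varies by agent type):

```python
def suppress_duplicates(results):
    # Drop results identical to one already seen in this interval, so only
    # the first of a flood of identical event-log records is forwarded.
    seen = set()
    kept = []
    for r in results:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

flood = [{"EVENTID": 34113, "SOURCE": "Backup Exec"}] * 3
assert len(suppress_duplicates(flood)) == 1
```

The point is that deduplication close to the source saves agent, network and TEMS work all at once.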

Summary

Tale #10 of using Event Audit History is about reviewing a case where a flood of identical events arrives at a TEMS.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Will that Cruise Ship Engine Fit? 2014

 

Sitworld: Event History #9 Two Open Or Close Events In A Row

NestingSwallows

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 22 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

This was seen in the Event Audit History Advisory section:

10,EVENTAUDIT1005W,TEMS,Situations [25] showing open->open transitions – see EVENTREPORT025

One situation showed a sequence of an open event followed by a second open event. There is also an EVENTAUDIT1006W advisory showing a close followed by a close. As the low impact value of 10 suggests, this is unusual but often not very important.

And in that report section:

EVENTREPORT025: Situations showing Open->Open and Close->Close Statuses

Situation,Type,Count,Node_ct,Nodes,

ccp_cpu_rlzc_redhat,YY,1,1,zev_rtpprdles2:LZ,

There is a situation ccp_cpu_rlzc_redhat. It experienced an Open->Open transition once, at one node, zev_rtpprdles2:LZ.

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted first by node, then by situation, then by time at the TEMS. I will first describe what you see, guided by the column description line.

This will show only a single open event and then close event, but there were many listed in the full report.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can be different from the Full Name you see in the situation editor, for example when the name is too long.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time appears in multiple TEMS seconds, because at times the Agent can produce data faster than the TEMS can process it. Simple cases have a last three digits of 999. Other cases have tie breakers of 000,001,…,998 when a lot of data is being generated. This is the UTC [earlier GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This is the UTC [earlier GMT] time.

Deltastat – event status. You generally see Y for open and N for close. There are more not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest cases are 1 and you would see that if you used -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.

The Descriptor line – before we see the results.

ccp_cpu_rlzc_redhat,zev_rtpprdles2:LZ,REMOTE_usrdrtm011ccpr2,1180404222639999,1180404222640000,Y,600,1,,,728,*IF *VALUE Linux_CPU.CPU_ID *EQ Aggregate *AND *VALUE Linux_CPU.Busy_CPU *GE 95.00,

Situation was ccp_cpu_rlzc_redhat, agent was zev_rtpprdles2:LZ, thrunode was REMOTE_usrdrtm011ccpr2. Agent_time was 1180404222639999 and TEMS_time was 1180404222640000, one second later. It was an Open event [Y], the sampling interval was 600, there was one result and no DisplayItem or Atomize. The record came from line number 728 in the input. The PDT is shown at the end.

Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=LNXCPU.CPUID = -1 AND LNXCPU.BUSYCPU >= 9500,

Following the predicate is one result line. These are all in the form of Attribute=value in the Table/Column=raw_data form. There is a leading count of the index of this result line. In this case there was one P line and one result line. Sometimes there are many more but not this time.

,,,,,,,0,LNXCPU.BUSYCPU=9999;LNXCPU.CPUID=-1;LNXCPU.IDLECPU=1;LNXCPU.ORIGINNODE=zev_rtpprdles2:LZ;LNXCPU.SYSCPU=1306;LNXCPU.TIMESTAMP=1180404222255000;LNXCPU.USRCPU=8693;LNXCPU.USRNCPU=0;LNXCPU.USRSYSCPU=665;LNXCPU.WAITCPU=0,

The goal was to check on BUSYCPU >= 9500 and it was 9999. Notice here that the raw data is not formatted. The formula mentioned Linux_CPU.Busy_CPU *GE 95.00 but the agent itself deals with integers. The *PREDICATE is a test for >= 9500 and the result was 9999 [95.00% and 99.99%]. Formatting takes place later on; that is how formatting for different languages is handled. For example, in German the decimal separator is a comma [,]. This can be important when an attribute value is used in an action command run at the Agent – the raw data will be seen, and formatting is the responsibility of the action command author.
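The downstream formatting step can be sketched like this (my illustration of the principle, not product code): the agent ships the integer 9999, and a presentation layer inserts the locale's decimal separator.

```python
def format_busy_cpu(raw: int, decimal_sep: str = ".") -> str:
    # Raw agent value 9999 represents 99.99 percent; inserting the locale's
    # decimal separator happens downstream, never at the agent.
    whole, frac = divmod(raw, 100)
    return f"{whole}{decimal_sep}{frac:02d}"

assert format_busy_cpu(9999) == "99.99"
assert format_busy_cpu(9500, ",") == "95,00"   # German-style separator
```

An action command author would have to do the equivalent conversion by hand, since the command sees only the raw integer.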

Next another Open is seen – Descriptor line

ccp_cpu_rlzc_redhat,zev_rtpprdles2:LZ,REMOTE_usrdrtm011ccpr2,1180406073115999,1180406073115000,Y,600,1,,,2130,*IF *VALUE Linux_CPU.CPU_ID *EQ Aggregate *AND *VALUE Linux_CPU.Busy_CPU *GE 95.00,

As can be seen from the TEMS timestamp, it was processed at 2018-04-06 07:31:15, some 33 hours later. The *PREDICATE and results are shown next:

,,,,,,,P,*PREDICATE=LNXCPU.CPUID = -1 AND LNXCPU.BUSYCPU >= 9500,

,,,,,,,0,LNXCPU.BUSYCPU=9979;LNXCPU.CPUID=-1;LNXCPU.IDLECPU=21;LNXCPU.ORIGINNODE=zev_rtpprdles2:LZ;LNXCPU.SYSCPU=2236;LNXCPU.TIMESTAMP=1180406072115000;LNXCPU.USRCPU=7743;LNXCPU.USRNCPU=0;LNXCPU.USRSYSCPU=346;LNXCPU.WAITCPU=0,

What is the problem and How to fix it?

This often means that an agent was offline for some time because of a system problem, a network problem, or perhaps a planned outage for upgrades. The same is true of N or Close events. The first time a situation is evaluated, if the condition is not true, a 0-result record is sent from the agent. Unlike Open events, there are no confirm results sent, so the TEMS impact is extremely low.

If these are seen in high volume they should be investigated for network issues or even agent misconfiguration. Otherwise it is just an interesting side note on processing logic.

Summary

Tale #9 of using Event Audit History is about understanding and reviewing an open->open transition in the Event History.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Barn Swallow Chicks Nesting in Big Sur Eves – 2009

 

Sitworld: Event History #8 Situation Events Opening And Closing Frequently

RedCanisters

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 21 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

This was seen in the Event Audit History Advisory section:

20,EVENTAUDIT1003W,TEMS,Situations [1] showing more than 1 open<->close transitions per hour per agent – see EVENTREPORT026

One situation showed a frequent rate of event opens and event closes.

And in that report section:

EVENTREPORT026: Situations showing high Open<->Close rate

Situation,Reeval,Rate,Node_ct,PDT

IBMSecOnGuardComm_W_Service,120,4.99,2,*IF *VALUE NT_Services.Current_State *NE Running *AND ( ( ( *VALUE NT_Services.Service_Name *EQ ‘LS Communication Server’ ) *OR ( *VALUE NT_Services.Service_Name *EQ TermService ) ) ),

There is a situation IBMSecOnGuardComm_W_Service. It is a Sampled situation at 120 seconds. About 5 times per hour the situation opens and closes on two different agents.
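The 4.99 rate is consistent with transitions divided by node count and capture hours. A quick check (my arithmetic; the one-week capture length is an assumption, not stated in the report):

```python
def open_close_rate(transitions: int, nodes: int, capture_hours: float) -> float:
    # Average open<->close transitions per hour per agent over the capture.
    return transitions / nodes / capture_hours

# 1677 transitions across 2 nodes; assuming a one-week (168 hour) capture
# this reproduces the 4.99 rate shown in EVENTREPORT026.
rate = open_close_rate(1677, 2, 168)
```

The 1677 transition count comes from the EVENTREPORT011 line shown below.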

From EVENTREPORT011 we see

Situation,Table,Rowsize,Reeval,Event,Event%,Event/min,Results,ResultBytes,Result%,Miss,MissBytes,Dup,DupBytes,Null,NullBytes,SampConfirm,SampConfirmBytes,PureMerge,PureMergeBytes,transitions,nodes,PDT

CorpSecOnGuardComm_W_Service,NTSERVICE,1468,120,1677,38.06%,0.17,2120790,3113319720,93.82%,0,0,0,0,0,0,2120790,3113319720,0,0,1677,2,*IF *VALUE NT_Services.Current_State *NE Running *AND ( ( ( *VALUE NT_Services.Service_Name *EQ ‘LS Communication Server’ ) *OR ( *VALUE NT_Services.Service_Name *EQ TermService ) ) ),

and it represented  a whopping 38% of all events and 94% of the estimated incoming results workload on this remote TEMS. This is an estimate since it is a capture of what is seen in the Situation Event History table. Events that opened outside that capture could be influencing the results workload beyond what is seen here.
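As a quick sanity check (my arithmetic, not output of the tool), the ResultBytes figure in that EVENTREPORT011 line matches a simple rowsize-times-results estimate exactly:

```python
rowsize = 1468          # NTSERVICE row size in bytes, from the report line
results = 2120790       # results seen in the capture
result_bytes = rowsize * results
assert result_bytes == 3113319720   # matches the ResultBytes column
```

That makes it easy to see why a single chatty situation can dominate the incoming results workload.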

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted first by node, then by situation, then by time at the TEMS. I will first describe what you see, guided by the column description line.

This will show only a single open event and then close event, but there were many listed in the full report.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can be different from the Full Name you see in the situation editor, for example when the name is too long.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time appears in multiple TEMS seconds, because at times the Agent can produce data faster than the TEMS can process it. Simple cases have a last three digits of 999. Other cases have tie breakers of 000,001,…,998 when a lot of data is being generated. This is the UTC [earlier GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This is the UTC [earlier GMT] time.

Deltastat – event status. You generally see Y for open and N for close. There are more not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest cases are 1 and you would see that if you used -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.

The Descriptor line – before we see the results.

IBMSecOnGuardComm_W_Service,Primary:LCDSA4173:NT,REMOTE_ibmz360,1180405132941999,1180405132941000,Y,120,1,NTSERVICE.SRVCNAME,LS Communication Server,362,*IF *VALUE NT_Services.Current_State *NE Running *AND ( ( ( *VALUE NT_Services.Service_Name *EQ ‘LS Communication Server’ ) *OR ( *VALUE NT_Services.Service_Name *EQ TermService ) ) ),

Situation was IBMSecOnGuardComm_W_Service, agent was Primary:LCDSA4173:NT. thrunode was REMOTE_ibmz360. Agent_time was 1180405132941999 and TEMS_time was 1180405132941000, the same second. It was an Open event [Y], sampling interval was 120, there was one result, the DisplayItem was NTSERVICE.SRVCNAME with a value of  “LS Communication Server”. The record came from line number 362 in the input. The PDT is shown at the end.

Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=NTSERVICE.CURRSTAT <> ‘Running’ AND ( ( ( NTSERVICE.SRVCNAME = ‘LS Communication Server’ ) OR ( NTSERVICE.SRVCNAME = ‘TermService’ ) ) ),

Following the predicate is one or more result lines. These are all in the form of Attribute=value in the Table/Column=raw_data form. There is a leading count of the index of this result line. In this case there was one P line and one result line. Sometimes there are many more but not this time. 

,,,,,,,0,NTSERVICE.ACCONTID=LocalSystem;NTSERVICE.BINARYEX=”C:\Program Files (x86)\OnGuard\Lnlcomsrvr.exe”;NTSERVICE.CURRSTAT=Stopped;NTSERVICE.DISPNAME=LS Communication Server;NTSERVICE.LORGROUP=;NTSERVICE.ORIGINNODE=Primary:LCDSA4173:NT;NTSERVICE.SRVCNAME=LS Communication Server;NTSERVICE.STARTYPE=Automatic;NTSERVICE.TIMESTAMP=1180405132753000;NTSERVICE.UACCONTID=LocalSystem;NTSERVICE.UBINARYEX=”C:\Program Files (x86)\OnGuard\Lnlcomsrvr.exe”;NTSERVICE.UDISPNAME=LS Communication Server;NTSERVICE.USRVCNAME=LS Communication Server,

Next there is another descriptor line for the N or Close Event record 16 minutes after the open event.

IBMSecOnGuardComm_W_Service,Primary:LCDSA4173:NT,REMOTE_ibmz360,1180405134541999,1180405134541000,N,120,0,NTSERVICE.SRVCNAME,LS Communication Server,374,*IF *VALUE NT_Services.Current_State *NE Running *AND ( ( ( *VALUE NT_Services.Service_Name *EQ ‘LS Communication Server’ ) *OR ( *VALUE NT_Services.Service_Name *EQ TermService ) ) ),

What is the problem and How to fix it?

From this capture, 38% of the events and 94% of the result workload came from this single situation. Doing that work constitutes a substantial investment. It fails the basic test of a good situation, which is to be Rare, Exceptional, and Fixable. The condition is certainly not rare – it seems to be happening all the time – and no one is “fixing” it.

The formula itself does not make sense. It fires if two particular Windows Services are not in Running state. It is very rare for a process to be always running. Perhaps a highly computational task, like calculating how molecules fold, might behave that way. However, most real-life processes spend a good amount of time sleeping, waiting for new work, waiting for paging to complete, and so on. That is why systems can run many tasks: the workload is shareable, since most tasks are not running most of the time.

So this situation is not rare, not exceptional and clearly no one is “fixing” it. Therefore the situation should be rethought and reworked until it is rare, exceptional and fixable. If that is not possible, the situation should be stopped and deleted to make room for other useful work at the agent(s) and the TEMS and the event receivers.

A Warning Sign

Situations that open and close more than once an hour should be reviewed for reasonableness. They are a sign that no one really cares about the events.

Summary

Tale #8 of using Event Audit History to understand and review a possibly meaningless situation and thus save resources.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Red Canisters – Cruise Ship Build 2016

 

Sitworld: Event History #7 Events Created But Not Forwarded

Radar1

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 19 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

This was seen in the Event Audit History Advisory section:

40,EVENTAUDIT1004W,TEMS,Situations [145] showing event statuses but event not forwarded – see EVENTREPORT018

In this environment some situations forwarded events to an event receiver like Netcool/OMNIbus. However, the events from the 145 situations listed here were not forwarded.

And in that report section:

EVENTREPORT018: Situations processed but not forwarded

Situation,Count,Nodes,

bnc_disksp_grzm_oraclev10gr,222,8,

bnc_prcmis_xuxp_aixgr,32,5,

bnc_prcmis_xuxp_webadmin_lnkc6,12,1,

… [142 more]

This report shows the situation name, the count of events seen, and the number of agents which sent the results that turned into events.

What is the problem and How to fix it?

Collecting results, sending them, creating and managing events is a lot of work for the Agent and the network and the TEMS. Sometimes people create testing situations and then forget to stop and delete them. Sometimes situations are for certain customers who are no longer customers. Sometimes situations were supposed to be forwarded to an event receiver but a mistake was made and the situation was never configured properly. Sometimes it is exactly what is wanted.

In any case, the non-forwarded situations should be reviewed to confirm they are behaving as desired. In one case the workload on a TEMS was decreased by 75% by deleting situations that were no longer being used. There is potential for more efficiency and for better monitoring in general. After all, if a wanted event is never processed at an event receiver, who will ever know that it is missing?
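When reviewing, it can help to rank the EVENTREPORT018 rows so the heaviest producers of never-forwarded events are examined first. A small sketch using the rows shown above:

```python
def review_order(rows):
    # rows: (situation, event_count, node_count) tuples from EVENTREPORT018.
    # Heaviest producers of never-forwarded events are reviewed first.
    return sorted(rows, key=lambda r: r[1], reverse=True)

report018 = [
    ("bnc_prcmis_xuxp_aixgr", 32, 5),
    ("bnc_disksp_grzm_oraclev10gr", 222, 8),
    ("bnc_prcmis_xuxp_webadmin_lnkc6", 12, 1),
]
for sit, count, nodes in review_order(report018):
    print(f"{sit}: {count} events from {nodes} agents")
```

Fixing or deleting the top few entries usually recovers most of the wasted workload.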

Summary

Tale #7 of using Event Audit History to understand and review situations that are not forwarded.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Where Are We? Radar Domes – Cruise Ship Build 2016

 

Sitworld: Event History #6 Lost events with Multiple Results with same DisplayItem at same TEMS second

whatsnext

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 17 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

This was seen in the Event Audit History Advisory section:

50,EVENTAUDIT1013W,TEMS,Situations [9] lost [merged] events Multiple Results  with same DisplayItem at same TEMS second – see EVENTREPORT004

This particular case showed up when DisplayItem was configured but two results were produced and processed at the TEMS in the same TEMS second. The Atomize value was identical, and therefore the second result did not produce an event. An event was lost.

And in that report section:

EVENTREPORT004: Situations with Multiple results at TEMS with same DisplayItem at same second

Situation,Type,Agent_Second,Results,Agent,Atomize,Atom,

bnc_errpt_xulm_aixgr_01,Pure,1180410020221000,2,bnc_viomtl18x:KUL,ULLOGENT.UENTRYDESC,DC73C03A: SOFTWARE PROGRAM ERROR,

There is a situation bnc_errpt_xulm_aixgr_01. It is a Pure situation, and at Agent second 1180410020221000 on agent bnc_viomtl18x:KUL there were 2 results. The DisplayItem was ULLOGENT.UENTRYDESC and the Atomize value was “DC73C03A: SOFTWARE PROGRAM ERROR”. As a result an event was hidden.
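The detection behind EVENTREPORT004 can be sketched like this (an illustration, not the actual Event Audit code): group pure results by situation, agent, TEMS second and atom; any key with a count above one means an event was hidden.

```python
from collections import Counter

def hidden_event_candidates(rows):
    # rows: (situation, agent, tems_second, atom) tuples, one per result.
    # More than one result with the same key in the same TEMS second means
    # the extra results could not become separate events.
    counts = Counter(rows)
    return {key: n for key, n in counts.items() if n > 1}

rows = [
    ("bnc_errpt_xulm_aixgr_01", "bnc_viomtl18x:KUL", "1180410020221000",
     "DC73C03A: SOFTWARE PROGRAM ERROR"),
    ("bnc_errpt_xulm_aixgr_01", "bnc_viomtl18x:KUL", "1180410020221000",
     "DC73C03A: SOFTWARE PROGRAM ERROR"),
]
assert len(hidden_event_candidates(rows)) == 1
```

A more distinctive Atomize attribute would keep the keys unique and let both events through.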

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted first by node, then by situation, then by time at the TEMS. I will first describe what you see, guided by the column description line.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can be different from the Full Name you see in the situation editor, for example when the name is too long.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time appears in multiple TEMS seconds, because at times the Agent can produce data faster than the TEMS can process it. Simple cases have a last three digits of 999. Other cases have tie breakers of 000,001,…,998 when a lot of data is being generated. This is the UTC [earlier GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This is the UTC [earlier GMT] time.

Deltastat – event status. You generally see Y for open and N for close. There are more not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest cases are 1 and you would see that if you used -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.

The Descriptor line – before we see the results.

bnc_errpt_xulm_aixgr_01,bnc_viomtl18x:KUL,REMOTE_camtram070nbfra,1180410020217999,1180410020221000,Y,0,2,ULLOGENT.UENTRYDESC,DC73C03A: SOFTWARE PROGRAM ERROR,2227,*IF *VALUE Log_Entries.Log_Name_U *EQ ‘errlog’ *AND *VALUE Log_Entries.Class *EQ Software *AND *SCAN Log_Entries.Description *NE ‘A924A5FC:’ *AND *SCAN Log_Entries.Description *NE ‘813FE820:’ *AND *SCAN Log_Entries.Description *NE ‘8FED25B9:’ *AND *SCAN Log_Entries.Description *NE ‘C5C09FFA:’ *AND *SCAN Log_Entries.Description *NE ‘1BA7DF4E:’ *AND *SCAN Log_Entries.Description *NE ‘A6DF45AA’ *AND *VALUE Log_Entries.Type *IN (‘P’,’T’,’U’) *AND *SCAN Log_Entries.Description *NE 0873CF9F *AND *SCAN Log_Entries.Description *NE 2F64580C *AND *SCAN Log_Entries.Description *NE 573790AA *AND *SCAN Log_Entries.Description *NE FE2DEE00,

,

Situation was bnc_errpt_xulm_aixgr_01, agent was bnc_viomtl18x:KUL, thrunode was REMOTE_camtram070nbfra. Agent_time was 1180410020217999 and TEMS_time was 1180410020221000, 4 seconds later. It was an Open event [Y] with a sampling interval of 0 [a pure event]. The DisplayItem was ULLOGENT.UENTRYDESC with a value of “DC73C03A: SOFTWARE PROGRAM ERROR”. The line number of the input was 2227, and the PDT is shown as well.

Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=ULLOGENT.ULOGNAME = N’errlog’ AND ULLOGENT.ENTRYCLASS = ‘S’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘A924A5FC:’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘813FE820:’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘8FED25B9:’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘C5C09FFA:’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘1BA7DF4E:’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘A6DF45AA’) = 0 AND ( ULLOGENT.ENTRYTYPE = ‘P’ OR ULLOGENT.ENTRYTYPE = ‘T’ OR ULLOGENT.ENTRYTYPE = ‘U’ ) AND STRSCAN(ULLOGENT.ENTRYDESC, ‘0873CF9F’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘2F64580C’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘573790AA’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘FE2DEE00’) = 0,

Following the predicate is one or more result lines. These are all in the form Attribute=value, that is, Table/Column=raw_data. A leading count gives the index of each result line. Ignore the funny emoticons that some browsers make out of an equal sign [=] followed by a semicolon [;]. If needed you can copy/paste the line into a line mode editor for study. Clearly the results were arriving very fast.

,,,,,,,0,ULLOGENT.ENTRYCLASS=S;ULLOGENT.ENTRYDESC=DC73C03A: SOFTWARE PROGRAM ERROR;ULLOGENT.ENTRYSRC=fscsi1;ULLOGENT.ENTRYSYS=viomtl18x;ULLOGENT.ENTRYTIME=1180410020000000;ULLOGENT.ENTRYTYPE=T;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=bnc_viomtl18x:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180410020043000;ULLOGENT.UENTRYDESC=DC73C03A: SOFTWARE PROGRAM ERROR;ULLOGENT.UENTRYSRC=fscsi1;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,

,,,,,,,1,ULLOGENT.ENTRYCLASS=S;ULLOGENT.ENTRYDESC=DC73C03A: SOFTWARE PROGRAM ERROR;ULLOGENT.ENTRYSRC=fscsi1;ULLOGENT.ENTRYSYS=viomtl18x;ULLOGENT.ENTRYTIME=1180410020000000;ULLOGENT.ENTRYTYPE=T;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=bnc_viomtl18x:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180410020045000;ULLOGENT.UENTRYDESC=DC73C03A: SOFTWARE PROGRAM ERROR;ULLOGENT.UENTRYSRC=fscsi1;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,
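A result line of this shape can be pulled apart with a short sketch [parse_result is a hypothetical helper; it assumes attribute values contain no semicolons, which is not guaranteed for every attribute]:

```python
# Hedged sketch: split one report result line into a dict of
# TABLE.COLUMN -> raw value. The seven empty leading columns and the
# result index are skipped; the trailing comma is dropped.
def parse_result(line: str) -> dict:
    body = line.strip().rstrip(",").split(",", 8)[-1]
    return dict(pair.split("=", 1) for pair in body.split(";"))

sample = ",,,,,,,0,ULLOGENT.ENTRYCLASS=S;ULLOGENT.LOGNAME=errlog,"
print(parse_result(sample)["ULLOGENT.LOGNAME"])  # errlog
```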

What is the problem and How to fix it?

As can be seen, the attributes are identical except for ULLOGENT.TIMESTAMP, which is 1180410020043000 in the first result and 1180410020045000 in the second. It might not make much difference that the identical result is seen once or twice. However if it does make a difference, the TEMS can be configured for Pure Situations to force a separate situation event for each result row.
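The duplicate test can be sketched directly; the dicts below are trimmed to the relevant attributes of the two result lines:

```python
# Hedged sketch: show that two result rows differ only in TIMESTAMP.
r1 = {"ULLOGENT.UENTRYDESC": "DC73C03A: SOFTWARE PROGRAM ERROR",
      "ULLOGENT.TIMESTAMP": "1180410020043000"}
r2 = {"ULLOGENT.UENTRYDESC": "DC73C03A: SOFTWARE PROGRAM ERROR",
      "ULLOGENT.TIMESTAMP": "1180410020045000"}

diff = {k for k in r1 if r1[k] != r2.get(k)}
print(diff)  # {'ULLOGENT.TIMESTAMP'}
```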

ITM Pure Situation events and Event Merging Logic

On the other hand, loading down the agent and the TEMS processing the same data over and over seems wasteful. The Tivoli Log Agent can be configured to filter duplicate events and thus save resources at the Agent and the TEMS. Or the situation formula can be altered to exclude this case.

A Warning Note

This situation may violate the general guidance that situations should be Rare, Exceptional and Fixable. It certainly doesn’t seem rare. It feels like someone should fix the underlying issue and thus avoid the monitoring overhead completely.

Summary

Tale #6 of using Event Audit History to understand and correct a type of incorrect DisplayItem condition and thus get more results.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: What is Next? – Cruise Ship Build 2016

 

Sitworld: Event History #5 Multiple Results Same DisplayItem Same Second

grandentrace2016

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 16 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

A Situation Event Conflict Between DisplayItem and Attributes

This was seen in the Event Audit History Advisory section:

50,EVENTAUDIT1009W,TEMS,Situations [7] with multiple results at agent with same DisplayItem at same second – see EVENTREPORT005

This particular case showed up when the DisplayItem was not configured and so the Atomize value was null. However it could also occur when DisplayItem was specified and the result was null or just an actual duplicate atomize value.

And in that report section:

EVENTREPORT005: Situations with Multiple results at Agent with same DisplayItem at same second

Situation,Type,Agent_Second,Results,Agent,Atomize,Atom,

bnc_fs_tws_exists_ora_twsq,Sampled,1180410020224000,2,bnc_axedaora09:KUX,,,

There is a situation bnc_fs_tws_exists_ora_twsq. It is a Sampled situation, and at Agent second 1180410020224000 on agent bnc_axedaora09:KUX there were multiple results. It has no DisplayItem configured. As a result an event was hidden.

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted first by node, then by situation, then by time at the TEMS. I will first describe what you see and the guidance from the column description line.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can be different from the Full Name you see in the Situation Editor, for example when the Full Name is too long.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time appears in multiple TEMS seconds because the Agent can at times produce data faster than the TEMS can process it. Simple cases have a last three digits of 999. Other cases will have tie breakers of 000,001,…,998 when a lot of data is being generated. This is the UTC [earlier GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This is the UTC [earlier GMT] time.

Deltastat – event status. You generally see Y for open and N for close. There are more not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest case is 1, and you would only see those cases if you used the -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.

The Descriptor line – before we see the results.

bnc_fs_tws_exists_ora_twsq,bnc_axedaora09:KUX,REMOTE_camtram070nbfra,1180410020224000,1180410020226000,Y,300,2,,,2249,*IF *SCAN Disk.Mount_Point_U *EQ ‘/app/tws/twsq’,

Situation was bnc_fs_tws_exists_ora_twsq, agent was bnc_axedaora09:KUX, thrunode was REMOTE_camtram070nbfra. Agent_time was 1180410020224000 and TEMS_time was 1180410020226000, 2 seconds later. It was an Open event [Y], there was no DisplayItem, the sampling interval was 300 seconds, and the situation formula checked for the existence of a certain mount point.
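The 2-second gap can be checked by decoding the two timestamps; itm_to_dt is a hypothetical helper assuming the CYYMMDDHHMMSSmmm layout, with the millisecond digits ignored here:

```python
# Hedged sketch: agent-to-TEMS delay from the descriptor timestamps.
from datetime import datetime

def itm_to_dt(ts: str) -> datetime:
    # C YY MM DD HH MM SS mmm; C=1 means a 2000s year
    return datetime(2000 + int(ts[1:3]), int(ts[3:5]), int(ts[5:7]),
                    int(ts[7:9]), int(ts[9:11]), int(ts[11:13]))

delay = itm_to_dt("1180410020226000") - itm_to_dt("1180410020224000")
print(delay.total_seconds())  # 2.0
```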

Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=STRSCAN(UNIXDISK.UMOUNTPT, N’/app/tws/twsq’) = 1,

Following the predicate is one or more result lines. These are all in the form Attribute=value, that is, Table/Column=raw_data. A leading count gives the index of each result line. Ignore the funny emoticons that some browsers make out of an equal sign [=] followed by a semicolon [;]. If needed you can copy/paste the line into a line mode editor for study. Clearly the results were arriving very fast.

,,,,,,,0,UNIXDISK.DSKNAME=/dev/lv_TWS_twsq;UNIXDISK.DSKSIZE=524288;UNIXDISK.DSKSIZE64=524288;UNIXDISK.DSKSIZEGB=1;UNIXDISK.DSKSIZEGB6=50;UNIXDISK.DSKSIZEMB=512;UNIXDISK.DSKSIZEMB6=51200;UNIXDISK.FILESYSTYP=jfs2;UNIXDISK.FSSTATUS=2;UNIXDISK.INODEFRE64=58591;UNIXDISK.INODEFREE=58591;UNIXDISK.INODESIZ64=61663;UNIXDISK.INODESIZE=61663;UNIXDISK.INODEUSE64=3072;UNIXDISK.INODEUSED=3072;UNIXDISK.MOUNTOPT=rw,log=/dev/hd8;UNIXDISK.MOUNTPT=/app/tws/twsq;UNIXDISK.ORIGINNODE=bnc_axedaora09:KUX;UNIXDISK.PCTINDAVAL=95;UNIXDISK.PCTINDUSED=5;UNIXDISK.PCTSPCAV=49;UNIXDISK.PCTSPCUSED=51;UNIXDISK.SPAVGB=1;UNIXDISK.SPAVGB64=25;UNIXDISK.SPAVMB=256;UNIXDISK.SPAVMB64=25552;UNIXDISK.SPCAVAIL=261644;UNIXDISK.SPCAVAIL64=261644;UNIXDISK.SPCUSED=262644;UNIXDISK.SPCUSED64=262644;UNIXDISK.SPUSEDGB=0;UNIXDISK.SPUSEDGB64=25;UNIXDISK.SPUSEDMB=256;UNIXDISK.SPUSEDMB64=25648;UNIXDISK.TIMESTAMP=1180410020224000;UNIXDISK.UDSKNAME=/dev/lv_TWS_twsq;UNIXDISK.UMOUNTPT=/app/tws/twsq;UNIXDISK.VGN=rootvg;UNIXDISK.ZFILLED=-1;UNIXDISK.ZFILLEDPCT=-1;UNIXDIS

K.ZQUOTA=-1;UNIXDISK.ZREFQUOTA=-1;UNIXDISK.ZREFRESERV=-1;UNIXDISK.ZRESERV=-1;UNIXDISK.ZUCHILDREN=-1;UNIXDISK.ZUDATASET=-1;UNIXDISK.ZUREFRES=-1;UNIXDISK.ZUSNAPS=-1,

,,,,,,,1,UNIXDISK.DSKNAME=/dev/lv_logtwsq;UNIXDISK.DSKSIZE=524288;UNIXDISK.DSKSIZE64=524288;UNIXDISK.DSKSIZEGB=1;UNIXDISK.DSKSIZEGB6=50;UNIXDISK.DSKSIZEMB=512;UNIXDISK.DSKSIZEMB6=51200;UNIXDISK.FILESYSTYP=jfs2;UNIXDISK.FSSTATUS=2;UNIXDISK.INODEFRE64=114500;UNIXDISK.INODEFREE=114500;UNIXDISK.INODESIZ64=116489;UNIXDISK.INODESIZE=116489;UNIXDISK.INODEUSE64=1989;UNIXDISK.INODEUSED=1989;UNIXDISK.MOUNTOPT=rw,log=/dev/hd8;UNIXDISK.MOUNTPT=/app/tws/twsq/TWS/stdlist;UNIXDISK.ORIGINNODE=bnc_axedaora09:KUX;UNIXDISK.PCTINDAVAL=98;UNIXDISK.PCTINDUSED=2;UNIXDISK.PCTSPCAV=98;UNIXDISK.PCTSPCUSED=2;UNIXDISK.SPAVGB=1;UNIXDISK.SPAVGB64=50;UNIXDISK.SPAVMB=503;UNIXDISK.SPAVMB64=50292;UNIXDISK.SPCAVAIL=514988;UNIXDISK.SPCAVAIL64=514988;UNIXDISK.SPCUSED=9300;UNIXDISK.SPCUSED64=9300;UNIXDISK.SPUSEDGB=0;UNIXDISK.SPUSEDGB64=0;UNIXDISK.SPUSEDMB=9;UNIXDISK.SPUSEDMB64=908;UNIXDISK.TIMESTAMP=1180410020224000;UNIXDISK.UDSKNAME=/dev/lv_logtwsq;UNIXDISK.UMOUNTPT=/app/tws/twsq/TWS/stdlist;UNIXDISK.VGN=rootvg;UNIXDISK.ZFILLED=-1;UNIXDISK.ZFILL

EDPCT=-1;UNIXDISK.ZQUOTA=-1;UNIXDISK.ZREFQUOTA=-1;UNIXDISK.ZREFRESERV=-1;UNIXDISK.ZRESERV=-1;UNIXDISK.ZUCHILDREN=-1;UNIXDISK.ZUDATASET=-1;UNIXDISK.ZUREFRES=-1;UNIXDISK.ZUSNAPS=-1,

What is the problem and How to fix it?

As can be seen, the attribute UNIXDISK.UMOUNTPT was /app/tws/twsq in the first result and /app/tws/twsq/TWS/stdlist in the second. Since DisplayItem was not configured, only the first generated an event. As a result the second result and its potential event was hidden.

One solution is to set the DisplayItem to UNIXDISK.DSKNAME, or “Disk.Name” as it would be seen in the Situation Editor. In that way you would get two events for the two conditions and thus better monitoring.

A Warning Note

This situation violates the general guidance that situations should be Rare, Exceptional and Fixable. If this situation is true, it is true pretty much forever, and likely no one is going to fix it. What will happen is that situation processing will generate confirmation results every sampling interval, “forever”. On the other hand, maybe there are relatively few of them and someone is going around uninstalling software packages, so the open events will gradually close. In any event, you should think carefully about always-true situations and conditions that are never going to be fixed.

Summary

Tale #5 of using Event Audit History to understand and correct a type of incorrect DisplayItem condition and thus get more results.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Future Grand Entrance – Cruise Ship Build 2016

 

Sitworld: Event History #4 Conflict Between DisplayItem and Attributes

ballroom

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 13 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

A Situation Event Conflict Between DisplayItem and Attributes

This was seen in the Event Audit History Advisory section:

90,EVENTAUDIT1014E,TEMS,Situations [1] had DisplayItem configured which was not in results – See report EVENTREPORT024

This arose during testing and was a surprise.

And in that report section:

EVENTREPORT024: Situations using unknown DisplayItems

Situation,DisplayItem,

ccp_fss_ulzf_suse,KLZDISK.MOUNTPT,

There is a situation ccp_fss_ulzf_suse.  It has a DisplayItem KLZDISK.MOUNTPT that is unknown – in the sense that the table/column is not found in the attributes. As a result the Atomize value is always null in the results. Because of this condition events can be hidden.

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted first by node, then by situation, then by time at the TEMS. I will first describe what you see and the guidance from the column description line.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can be different from the Full Name you see in the Situation Editor, for example when the Full Name is too long.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time appears in multiple TEMS seconds because the Agent can at times produce data faster than the TEMS can process it. Simple cases have a last three digits of 999. Other cases will have tie breakers of 000,001,…,998 when a lot of data is being generated. This is the UTC [earlier GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This is the UTC [earlier GMT] time.

Deltastat – event status. You generally see Y for open and N for close. There are more not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest case is 1, and you would only see those cases if you used the -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.

The Descriptor line – before we see the results.

ccp_fss_ulzf_suse,zec_uspokpchd01:LZ,REMOTE_us22rtm031ccpr1,1180410002908999,1180410002908008,Y,300,2,KLZDISK.MOUNTPT,,6576,*IF ( ( *VALUE Linux_Disk.Space_Used_Percent *GE 95 *AND *VALUE Linux_Disk.Mount_Point_U *IN ( ‘/’,’/usr’,’/var’,’/tmp’,’/home’ ) ) *OR ( *VALUE Linux_Disk.Space_Used_Percent *GE 95 *AND *SCAN Linux_Disk.Mount_Point_U *EQ ‘/ABAPCS/MON’) *OR ( *VALUE Linux_Disk.Space_Used_Percent *GE 95 *AND *SCAN Linux_Disk.Mount_Point_U *EQ ‘/JAVACS/MON’ ) *OR ( *VALUE Linux_Disk.Space_Used_Percent *GE 95 *AND *SCAN Linux_Disk.Mount_Point_U *EQ ‘/RDBMS/MON’ ) *OR ( *VALUE Linux_Disk.Space_Used_Percent *GE 95 *AND *SCAN Linux_Disk.Mount_Point_U *EQ ‘/SAPAS/MON’ ) *OR ( *VALUE Linux_Disk.Space_Used_Percent *GE 95 *AND *SCAN Linux_Disk.Mount_Point_U *EQ ‘/opt’ ) ),

,

,

Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=( ( LNXDISK.PCTSPCUSED >= 95 AND ( LNXDISK.MOUNTPTU = N’/’ OR LNXDISK.MOUNTPTU = N’/usr’ OR LNXDISK.MOUNTPTU = N’/var’ OR LNXDISK.MOUNTPTU = N’/tmp’ OR LNXDISK.MOUNTPTU = N’/home’ ) ) OR ( LNXDISK.PCTSPCUSED >= 95 AND STRSCAN(LNXDISK.MOUNTPTU, N’/ABAPCS/MON’) = 1 ) OR ( LNXDISK.PCTSPCUSED >= 95 AND STRSCAN(LNXDISK.MOUNTPTU, N’/JAVACS/MON’) = 1 ) OR ( LNXDISK.PCTSPCUSED >= 95 AND STRSCAN(LNXDISK.MOUNTPTU, N’/RDBMS/MON’) = 1 ) OR ( LNXDISK.PCTSPCUSED >= 95 AND STRSCAN(LNXDISK.MOUNTPTU, N’/SAPAS/MON’) = 1 ) OR ( LNXDISK.PCTSPCUSED >= 95 AND STRSCAN(LNXDISK.MOUNTPTU, N’/opt’) = 1 ) ),

Following the predicate is one or more result lines. These are all in the form Attribute=value, that is, Table/Column=raw_data. A leading count gives the index of each result line. Ignore the funny emoticons that some browsers make out of an equal sign [=] followed by a semicolon [;]. If needed you can copy/paste the line into a line mode editor for study. Clearly the results were arriving very fast.

,,,,,,,0,LNXDISK.DSKNAME=/dev/mapper/vgsystem-lv_root;LNXDISK.DSKSIZE=101600;LNXDISK.FSTYPE=ext3;LNXDISK.INODEFREE=6363019;LNXDISK.INODESIZE=6610944;LNXDISK.INODEUSED=247925;LNXDISK.MOUNTPT=/;LNXDISK.MOUNTPTU=/;LNXDISK.ORIGINNODE=zec_uspokpchd01:LZ;LNXDISK.PCTINDAVAL=96;LNXDISK.PCTINDUSED=4;LNXDISK.PCTSPCUSED=97;LNXDISK.SPCAVAIL=3158;LNXDISK.SPCUSED=93282;LNXDISK.TIMESTAMP=1180410001324000,

,,,,,,,1,LNXDISK.DSKNAME=/dev/mapper/vgsystem-lv_root;LNXDISK.DSKSIZE=101600;LNXDISK.FSTYPE=ext3;LNXDISK.INODEFREE=6363014;LNXDISK.INODESIZE=6610944;LNXDISK.INODEUSED=247930;LNXDISK.MOUNTPT=/;LNXDISK.MOUNTPTU=/;LNXDISK.ORIGINNODE=zec_uspokpchd01:LZ;LNXDISK.PCTINDAVAL=96;LNXDISK.PCTINDUSED=4;LNXDISK.PCTSPCUSED=97;LNXDISK.SPCAVAIL=3157;LNXDISK.SPCUSED=93282;LNXDISK.TIMESTAMP=1180410002830000,

What is the problem and How to fix it?

As can be seen, the agent used the attribute group tablename LNXDISK for all the attributes. However the DisplayItem was KLZDISK.MOUNTPT, which does not match anything in the attributes and thus is assigned the null atomize value.
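The check behind advisory EVENTAUDIT1014E can be sketched in a few lines; the attribute names are taken from the result lines above, and the code is illustrative rather than the tool’s actual logic:

```python
# Hedged sketch: flag a DisplayItem whose TABLE.COLUMN spec never
# appears among the attributes returned in the results.
display_item = "KLZDISK.MOUNTPT"
result_attrs = {"LNXDISK.DSKNAME", "LNXDISK.MOUNTPT", "LNXDISK.MOUNTPTU",
                "LNXDISK.PCTSPCUSED", "LNXDISK.ORIGINNODE"}

if display_item not in result_attrs:
    print(f"EVENTAUDIT1014E: DisplayItem {display_item} not found in results")
```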

Historically, LNX was the attribute group tablename prefix. Long ago [I think at ITM 6.2 in 2007] this was changed to a KLZ tablename prefix to avoid conflicts with the Unix OS Agent. For compatibility the old names are still recognized and mapped onto each other. The current Situation Editor could never produce such a situation today. The only way this could have been generated would be with a situation dump [tacmd viewsit -s sitname -e sitname.xml] followed by a manual edit of the xml file and then a replace [tacmd createsit -i sitname.xml]. In that circumstance no validity checking is performed.

In any case the situation no longer works as expected: in this exact case only a single event will be created when two would be expected. This is a monitoring degradation.

Summary

Tale #4 of using Event Audit History to understand and correct a type of incorrect DisplayItem condition and thus get more results.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Future Ballroom – Cruise Ship Build 2018

 

Sitworld: Event History #3 Lost Events Because DisplayItem has Duplicate Atoms

holding_in_place

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 13 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

A Situation Event Hidden because Duplicate DisplayItem Values

This was seen in the Event Audit History Advisory section:

65,EVENTAUDIT1011W,TEMS,Situations [1] lost events because DisplayItem has duplicate atoms

Note that this is different from #2, which looked at situation events lost because they occurred at the same second at the TEMS.

And in that report section:

EVENTREPORT002: Multiple results in one second and DisplayItem defined

Situation,Type,Agent_Second,Results,Agent,Atomize,Atom,

CFG_KLO_SWI_LOG_QA,Pure,1180413055342000,16,swift:WAPPRIB00001Q0D:LO,KLOLOGEVTS.CUSLOT1,D:/Alliance/WebPlatformSE/log/swpservice.out,

There is a situation CFG_KLO_SWI_LOG_QA. It is Pure [looking at a log]. At the TEMS second 1180413055342000 [2018-04-13 05:53:42] 16 results were seen. As will be seen later there were actually 34 result segments. The agent name was swift:WAPPRIB00001Q0D:LO – a Tivoli Log Agent. The DisplayItem was KLOLOGEVTS.CUSLOT1 [first custom slot] and the value that all 16 results had was “D:/Alliance/WebPlatformSE/log/swpservice.out”.

Only a single result will be reported unless you have the TEMS configured for “One Row One Result”. However the alert from all but one of the detected conditions has been lost forever. Monitoring has been degraded since you are not getting all the information that is available. If the information is important, you should probably consider setting a different DisplayItem – if possible. Otherwise you can use the “One Row One Result” configuration.

In this case the results look almost all the same, so it doesn’t really matter from a monitoring standpoint, although the performance impact on the agent and the TEMS is heavy. This one situation accounted for 91.24% of the 382.46 Kbytes/minute TEMS workload. The Tivoli Log Agent should probably be configured to suppress duplicate events in this case.

It is also important to determine if the information is really useful. If the condition is normal and no one is going to fix it, why should it be monitored? That is a customer decision but the question needs to be asked.
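The condition behind advisory EVENTAUDIT1011W amounts to grouping events by situation, agent second and atom, then flagging groups with more than one result. A minimal sketch, with trimmed example tuples standing in for the full report rows:

```python
# Hedged sketch: count duplicate atoms per (situation, second) group.
from collections import Counter

events = [("CFG_KLO_SWI_LOG_QA", "1180413055342000",
           "D:/Alliance/WebPlatformSE/log/swpservice.out")] * 16

dups = {key: n for key, n in Counter(events).items() if n > 1}
for (sit, second, atom), n in dups.items():
    print(f"{sit}: {n} results with atom {atom} in second {second}")
```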

Deep dive Into the report details

The full report section is in an appendix since it is so very long. This section selects a typical part.

Scan or search ahead for Report 999. It is sorted first by node, then by situation, then by time at the TEMS. I will first describe what you see and the guidance from the column description line.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can be different from the Full Name you see in the Situation Editor, for example when the Full Name is too long.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time appears in multiple TEMS seconds because the Agent can at times produce data faster than the TEMS can process it. Simple cases have a last three digits of 999. Other cases will have tie breakers of 000,001,…,998 when a lot of data is being generated. This is the UTC [earlier GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This is the UTC [earlier GMT] time.

Deltastat – event status. You generally see Y for open and N for close. There are more not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest case is 1, and you would only see those cases if you used the -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.

The Descriptor line – before we see the results.

CFG_KLO_SWI_LOG_QA,swift:WAPPRIB00001Q0D:LO,RTEMS_LEMRNCB00009009,1180413055341999,1180413055342000,Y,0,34,KLOLOGEVTS.CUSLOT1,D:/Alliance/WebPlatformSE/log/swpservice.out,3894,*IF ( ( *VALUE KLO_LOGFILEEVENTS.Class *EQ ‘SEVERE’ *AND *VALUE KLO_LOGFILEEVENTS.Occurrence_Count *GE 1 ) *OR ( *VALUE KLO_LOGFILEEVENTS.Class *EQ ‘WARNING’ *AND *VALUE KLO_LOGFILEEVENTS.Occurrence_Count *GE 1 ) ),

Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

Following the predicate is one or more result lines. These are all in the form Attribute=value, that is, Table/Column=raw_data. A leading count gives the index of each result line. Ignore the funny emoticons that some browsers make out of an equal sign [=] followed by a semicolon [;]. If needed you can copy/paste the line into a line mode editor for study. Clearly the results were arriving very fast.



,,,,,,,4,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,



,,,,,,,5,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

What is the problem and How to fix it?

The agent logic produced a massive number of result reports during those few seconds. Every one had the DisplayItem KLOLOGEVTS.CUSLOT1 with value D:/Alliance/WebPlatformSE/log/swpservice.out.

Let’s break out the separate attributes gathered. Here is an example, not one of those shown above:

KLOLOGEVTS.CUINT1=0;

KLOLOGEVTS.CUINT2=0;

KLOLOGEVTS.CUINT3=0;

KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;

KLOLOGEVTS.CUSLOT10=;

KLOLOGEVTS.CUSLOT2=;

KLOLOGEVTS.CUSLOT3=;

KLOLOGEVTS.CUSLOT4=;

KLOLOGEVTS.CUSLOT5=;

KLOLOGEVTS.CUSLOT6=;

KLOLOGEVTS.CUSLOT7=;

KLOLOGEVTS.CUSLOT8=;

KLOLOGEVTS.CUSLOT9=;

KLOLOGEVTS.EIFEVENT=;

KLOLOGEVTS.EVTYPE=0;

KLOLOGEVTS.LOGNAME=swpservice.out;

KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out WARNING;

KLOLOGEVTS.OCOUNT=1;

KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;

KLOLOGEVTS.REMHOST=;

KLOLOGEVTS.TECCLASS=WARNING;

KLOLOGEVTS.TIMESTAMP=1180413055736000,

The formula was:

*IF ( ( *VALUE KLO_LOGFILEEVENTS.Class *EQ ‘SEVERE’ *AND *VALUE KLO_LOGFILEEVENTS.Occurrence_Count *GE 1 ) *OR

       ( *VALUE KLO_LOGFILEEVENTS.Class *EQ ‘WARNING’ *AND *VALUE KLO_LOGFILEEVENTS.Occurrence_Count *GE 1 )

     )

and the agent surely delivered the conditions as requested.

As it happens every single result came from a single agent: swift:WAPPRIB00001Q0D:LO

That condition should be studied and corrected if possible. The situation events themselves are being forwarded to TDST=3 – so maybe the folks responsible for that event destination should be contacted and asked why they are ignoring the events.

The situation formula itself looks over-general. In this case it led to 30-40 results per second arriving at the TEMS, all essentially the same and from the same agent. Maybe the Occurrence_Count test should be *GE 2. Maybe the Tivoli Log Agent should be configured to not send duplicate reports. Maybe the situation itself does not have sufficient value to run.

Correcting this one situation would drop workload on this TEMS by over 90% – and leave room for more useful work.

Summary

Tale #3 of using Event Audit History to understand and correct one type of incorrect DisplayItem condition and thus get more results.

Sitworld: Table of Contents

Appendix: Full detailed report

This includes two TEMS seconds because the TEMS did not handle all of the results in one second. However the agent created them all at the local second 1180413055341999, and the data capture times seen in KLOLOGEVTS.TIMESTAMP ranged from 1180413055736000 to 1180413055737000. You often see this sort of spread-out response in high result creation mode. It stresses the agent and the TEMS and is often viewed as “wasting” resources.

TEMS Second 1

CFG_KLO_SWI_LOG_QA,swift:WAPPRIB00001Q0D:LO,RTEMS_LEMRNCB00009009,1180413055341999,1180413055341002,Y,0,2,KLOLOGEVTS.CUSLOT1,D:/Alliance/WebPlatformSE/log/swpservice.out,3892,*IF ( ( *VALUE KLO_LOGFILEEVENTS.Class *EQ ‘SEVERE’ *AND *VALUE KLO_LOGFILEEVENTS.Occurrence_Count *GE 1 ) *OR ( *VALUE KLO_LOGFILEEVENTS.Class *EQ ‘WARNING’ *AND *VALUE KLO_LOGFILEEVENTS.Occurrence_Count *GE 1 ) ),

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,0,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out WARNING;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=WARNING;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,1,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out WARNING;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=WARNING;KLOLOGEVTS.TIMESTAMP=1180413055736000,

TEMS Second 2

CFG_KLO_SWI_LOG_QA,swift:WAPPRIB00001Q0D:LO,RTEMS_LEMRNCB00009009,1180413055341999,1180413055342000,Y,0,34,KLOLOGEVTS.CUSLOT1,D:/Alliance/WebPlatformSE/log/swpservice.out,3894,*IF ( ( *VALUE KLO_LOGFILEEVENTS.Class *EQ ‘SEVERE’ *AND *VALUE KLO_LOGFILEEVENTS.Occurrence_Count *GE 1 ) *OR ( *VALUE KLO_LOGFILEEVENTS.Class *EQ ‘WARNING’ *AND *VALUE KLO_LOGFILEEVENTS.Occurrence_Count *GE 1 ) ),

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,0,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out WARNING;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=WARNING;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,1,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,2,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,3,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,4,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,5,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,6,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,7,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,8,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,9,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,10,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,11,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,12,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,13,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,14,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,15,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,16,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055736000,

,,,,,,,17,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,18,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,19,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,20,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,21,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,22,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,23,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,24,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,25,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,26,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,27,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,28,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,29,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,30,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,31,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,P,*PREDICATE=( ( KLOLOGEVTS.TECCLASS = N’SEVERE’ AND KLOLOGEVTS.OCOUNT >= 1 ) OR ( KLOLOGEVTS.TECCLASS = N’WARNING’ AND KLOLOGEVTS.OCOUNT >= 1 ) ),

,,,,,,,32,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,

,,,,,,,33,KLOLOGEVTS.CUINT1=0;KLOLOGEVTS.CUINT2=0;KLOLOGEVTS.CUINT3=0;KLOLOGEVTS.CUSLOT1=D:/Alliance/WebPlatformSE/log/swpservice.out;KLOLOGEVTS.CUSLOT10=;KLOLOGEVTS.CUSLOT2=;KLOLOGEVTS.CUSLOT3=;KLOLOGEVTS.CUSLOT4=;KLOLOGEVTS.CUSLOT5=;KLOLOGEVTS.CUSLOT6=;KLOLOGEVTS.CUSLOT7=;KLOLOGEVTS.CUSLOT8=;KLOLOGEVTS.CUSLOT9=;KLOLOGEVTS.EIFEVENT=;KLOLOGEVTS.EVTYPE=0;KLOLOGEVTS.LOGNAME=swpservice.out;KLOLOGEVTS.MSG=D:/Alliance/WebPlatformSE/log/swpservice.out SEVEREl;KLOLOGEVTS.OCOUNT=1;KLOLOGEVTS.ORIGINNODE=swift:WAPPRIB00001Q0D:LO;KLOLOGEVTS.REMHOST=;KLOLOGEVTS.TECCLASS=SEVERE;KLOLOGEVTS.TIMESTAMP=1180413055737000,


History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Cruise Ship 2018 Needs a Horn

 

Sitworld: Event History #2 Duplicate DisplayItems At Same Second

holding_in_place

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 10 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

A Situation Hid Events

This was seen in the Event Audit History Advisory section:

50,EVENTAUDIT1013W,TEMS,Situations [4] lost [merged] events Multiple Events with same DisplayItem at same TEMS second – see EVENTREPORT004

And in that report section:

EVENTREPORT004: Situations with Multiple results at TEMS with same DisplayItem at same second

Situation,Type,Agent_Second,Results,Agent,Atomize,Atom,

ccp_errlog_xul1_aix,Pure,1180331084920000,4,z9x_z9xssodb01:KUL,ULLOGENT.ENTRYDESC,DE3B8540: PATH HAS FAILED,

There is a situation ccp_errlog_xul1_aix. It is Pure [looking at a log]. At the TEMS second 1180331084920000 [2018-03-31 08:49:20] 4 results were seen. The agent name was z9x_z9xssodb01:KUL – a Unix Log Agent. The DisplayItem was ULLOGENT.ENTRYDESC [the description] and the value that all four results shared was “DE3B8540: PATH HAS FAILED”.

Only a single result will be reported unless you have the TEMS configured for “One Row One Result”. However, the alerts from three of the four detected conditions have been lost forever. Monitoring has been degraded since you are not getting all the information that is available. If the information is important, you should probably set a different DisplayItem – if possible. Otherwise you can use the “One Row One Result” configuration.

It is also important to determine whether the information is really useful. Probably it is, because someone constructed the situation. However, if it is not really useful, the situation should be stopped and deleted.

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted first by node, then by situation, then by time at the TEMS. I will first describe what you see, following the guidance of the column description line.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can be different from the Full Name that you see in the Situation editor [for example, when the Full Name is too long].

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time appears in multiple TEMS seconds, because the Agent can at times produce data faster than the TEMS can process it. Simple cases have 999 as the last three digits. Other cases will have tie breakers of 000, 001, …, 998 when a lot of data is being generated. This is the UTC [earlier GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This is the UTC [earlier GMT] time.

Deltastat – Event status. You generally see Y for open and N for close. There are more statuses, not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest case is 1, and you would only see that if you used the -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null, meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells which line of the TSITSTSH data dump supplied this information.

PDT  – The Predicate or Situation Formula as it is stored.
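Putting those columns together, here is a minimal Python sketch of splitting one Report 999 descriptor line into named fields [a sketch, not part of the tool itself; it assumes, as in the samples shown here, that only the final PDT column can contain commas]:

```python
REPORT999_COLUMNS = [
    "Situation", "Node", "Thrunode", "Agent_Time", "TEMS_Time",
    "Deltastat", "Reeval", "Results", "Atomize", "DisplayItem",
    "LineNumber", "PDT",
]

def parse_descriptor(line: str) -> dict:
    # The PDT (last column) may itself contain commas, so only split
    # on the first 11 separators and keep the remainder intact.
    fields = line.rstrip(",\n").split(",", 11)
    return dict(zip(REPORT999_COLUMNS, fields))

line = ("ccp_errlog_xul1_aix,z9x_z9xssodb01:KUL,REMOTE_usrdrtm041ccpr2,"
        "1180331084920999,1180331084920000,Y,0,4,ULLOGENT.ENTRYDESC,"
        "DE3B8540: PATH HAS FAILED,2602,"
        "*IF ( *VALUE Log_Entries.Type *EQ P ) *UNTIL ( *TTL 7:00:00:00 ),")
rec = parse_descriptor(line)
print(rec["Results"], rec["Atomize"], rec["DisplayItem"])
# 4 ULLOGENT.ENTRYDESC DE3B8540: PATH HAS FAILED
```

The sample line above uses a shortened PDT for readability; a real descriptor line carries the full formula.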

The Descriptor line comes first – before we see the results:

ccp_errlog_xul1_aix,z9x_z9xssodb01:KUL,REMOTE_usrdrtm041ccpr2,1180331084920999,1180331084920000,Y,0,4,ULLOGENT.ENTRYDESC,DE3B8540: PATH HAS FAILED,2602,*IF ( ( *VALUE Log_Entries.Log_Path *EQ ‘/var/adm/ras/’ *AND *VALUE Log_Entries.Log_Name *EQ errlog *AND *VALUE Log_Entries.Type *EQ P *AND *VALUE Log_Entries.Class *EQ Hardware *AND *SCAN Log_Entries.Description *NE 4865FA9B *AND *SCAN Log_Entries.Description *NE 476B351D ) *OR ( *VALUE Log_Entries.Log_Path *EQ ‘/var/adm/ras/’ *AND *VALUE Log_Entries.Log_Name *EQ errlog *AND *VALUE Log_Entries.Type *EQ U *AND *SCAN Log_Entries.Description *EQ 4B6BA416 ) ) *UNTIL ( *TTL 7:00:00:00 ),

Following the descriptor line are one or more P [Predicate/formula] lines, as used by the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=( ( ULLOGENT.LOGPATH = ‘/var/adm/ras/’ AND ULLOGENT.LOGNAME = ‘errlog’ AND ULLOGENT.ENTRYTYPE = ‘P’ AND ULLOGENT.ENTRYCLASS = ‘H’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘4865FA9B’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘476B351D’) = 0 ) OR ( ULLOGENT.LOGPATH = ‘/var/adm/ras/’ AND ULLOGENT.LOGNAME = ‘errlog’ AND ULLOGENT.ENTRYTYPE = ‘U’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘4B6BA416’) = 1 ) ),

Following the predicate are one or more result lines. These are all in the form Attribute=value, using the Table/Column=raw_data form. There is a leading count giving the index of each result line. In this case there were 3 P lines and 4 index lines. More comments follow. Ignore the funny emoticons that some browsers create by converting an equal sign [=] followed by a semicolon [;]. If needed, you can copy/paste the line into a line mode editor for study. Clearly the results were coming in very fast, but apparently they arrived in three separate bundles totaling 4 results.

,,,,,,,0,ULLOGENT.ENTRYCLASS=H;ULLOGENT.ENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.ENTRYSRC=hdisk7;ULLOGENT.ENTRYSYS=z9xssodb01;ULLOGENT.ENTRYTIME=1180331084900000;ULLOGENT.ENTRYTYPE=P;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=z9x_z9xssodb01:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180331084920000;ULLOGENT.UENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.UENTRYSRC=hdisk7;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,

,,,,,,,P,*PREDICATE=( ( ULLOGENT.LOGPATH = ‘/var/adm/ras/’ AND ULLOGENT.LOGNAME = ‘errlog’ AND ULLOGENT.ENTRYTYPE = ‘P’ AND ULLOGENT.ENTRYCLASS = ‘H’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘4865FA9B’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘476B351D’) = 0 ) OR ( ULLOGENT.LOGPATH = ‘/var/adm/ras/’ AND ULLOGENT.LOGNAME = ‘errlog’ AND ULLOGENT.ENTRYTYPE = ‘U’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘4B6BA416’) = 1 ) ),

,,,,,,,1,ULLOGENT.ENTRYCLASS=H;ULLOGENT.ENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.ENTRYSRC=hdisk9;ULLOGENT.ENTRYSYS=z9xssodb01;ULLOGENT.ENTRYTIME=1180331084900000;ULLOGENT.ENTRYTYPE=P;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=z9x_z9xssodb01:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180331084920000;ULLOGENT.UENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.UENTRYSRC=hdisk9;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,

,,,,,,,P,*PREDICATE=( ( ULLOGENT.LOGPATH = ‘/var/adm/ras/’ AND ULLOGENT.LOGNAME = ‘errlog’ AND ULLOGENT.ENTRYTYPE = ‘P’ AND ULLOGENT.ENTRYCLASS = ‘H’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘4865FA9B’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘476B351D’) = 0 ) OR ( ULLOGENT.LOGPATH = ‘/var/adm/ras/’ AND ULLOGENT.LOGNAME = ‘errlog’ AND ULLOGENT.ENTRYTYPE = ‘U’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘4B6BA416’) = 1 ) ),

,,,,,,,2,ULLOGENT.ENTRYCLASS=H;ULLOGENT.ENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.ENTRYSRC=hdisk10;ULLOGENT.ENTRYSYS=z9xssodb01;ULLOGENT.ENTRYTIME=1180331084900000;ULLOGENT.ENTRYTYPE=P;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=z9x_z9xssodb01:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180331084920000;ULLOGENT.UENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.UENTRYSRC=hdisk10;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,

,,,,,,,3,ULLOGENT.ENTRYCLASS=H;ULLOGENT.ENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.ENTRYSRC=hdisk11;ULLOGENT.ENTRYSYS=z9xssodb01;ULLOGENT.ENTRYTIME=1180331084900000;ULLOGENT.ENTRYTYPE=P;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=z9x_z9xssodb01:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180331084920000;ULLOGENT.UENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.UENTRYSRC=hdisk11;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,

What is the problem and how to fix it?

The agent logic produced four results tagged with the same agent time, yet only a single event was created. This *might* not be a problem if they were all identical – although you might want to see if the agent has controls to suppress duplicate results. Or, if the data is uninteresting and no one needs to fix it, adjust the formula or even stop and delete the situation. The biggest performance improvement you can ever achieve is NOT doing unneeded work.

In this particular case, DisplayItem was set to the first 128 characters of ULLOGENT.ENTRYDESC. In fact all four results had “DE3B8540: PATH HAS FAILED”, so all were merged into one event and three events were lost. Looking through the result row attributes I see

0: ULLOGENT.ENTRYSRC=hdisk7

1: ULLOGENT.ENTRYSRC=hdisk9

2: ULLOGENT.ENTRYSRC=hdisk10

3: ULLOGENT.ENTRYSRC=hdisk11

For this particular case, if you made DisplayItem be ULLOGENT.ENTRYSRC you would get all four events that were actually present, instead of one shown and three hidden. It passes a reasonableness test in that the number of possible values looks limited. If a DisplayItem has something like a date, a time or a call stack, it can cause a TEMS performance problem, because every situation/node/DisplayItem status combination is kept in storage and in the TSITSTSC table, and that grows and grows forever [or at least until the TEMS is recycled].
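The merge behavior can be sketched from any shell using the four result rows above. This is an illustrative sketch only – the real merging happens inside the TEMS – but counting unique DisplayItem values predicts how many events survive:

```shell
# Four result rows from the report above, as "ENTRYDESC,ENTRYSRC" pairs
results='DE3B8540: PATH HAS FAILED,hdisk7
DE3B8540: PATH HAS FAILED,hdisk9
DE3B8540: PATH HAS FAILED,hdisk10
DE3B8540: PATH HAS FAILED,hdisk11'

# DisplayItem = ENTRYDESC: one unique value, so all rows merge into one event
desc_events=$(echo "$results" | cut -d, -f1 | sort -u | grep -c .)

# DisplayItem = ENTRYSRC: four unique values, so four individual events
src_events=$(echo "$results" | cut -d, -f2 | sort -u | grep -c .)

echo "ENTRYDESC events: $desc_events, ENTRYSRC events: $src_events"
```

The same counting test can be applied to any candidate DisplayItem attribute before committing to it.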

This is just one case, so you should review others and talk it through with Subject Matter Experts on the contents of that log.  Your work isn’t done until you understand all the possible ramifications.

Summary

Tale #2 of using Event Audit History to understand and correct one type of incorrect DisplayItem condition and thus get more situation events.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Holding Things Together on Cruise Ship Build 2018

 

Sitworld: Event History #1 The Situation That Fired Oddly


John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 4 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

A Situation That Created Too Many Events

Title: False ESX reboot alert

Description: … We are getting false alert for ESX up time monitoring, it is happening randomly for different servers . … I found Alert get triggered when system up time show values as 4294967295 . it is happening for all those false triggered alert…

The situation formula as seen from a tacmd viewsit was simple.

Formula        : *IF *VALUE KVM_SERVER.System_up_time *LT 600

Sampling Interval   : 0/0:5:0

Just after startup, the system being monitored has an up time below 600 seconds. That formula creates a situation event. A while later the up time goes above 600 seconds and the event closes.

Event History Audit

The new tool was run on the remote TEMS database files which were sent with the problem report. If you were doing this yourself you would first gather the data with the eventdat.pl script and then run eventaud.pl. The example follows that end user path.

perl eventaud.pl -lst -allresults

The -allresults means the last report section displays all the results in an easier to understand format. Usually it shows only the situations that triggered advisories.

That final report section is displayed in order by 1) node or agent name, 2) Thrunode [usually the remote TEMS], 3) Situation Name, 4) DisplayItem, and 5) TEMS processing second. There is a massive amount of information present, and I will show one small snippet of the report relating to the problem at hand.

Here are the lines of interest:

IBM_ESXReboot_W_Test,VM:XXXX232V-ibmesxcdc030:ESX,REMOTE_IBM010,1180327071516999,1180327071516000,Y,300,1,KVMSERVERG.SH,ibmesxcdc030.amer.ibm.corp,1891,*IF *VALUE KVM_SERVER.System_up_time *LT 600,

,,,,,,,P,*PREDICATE=KVMSERVERG.SUT < 600,

,,,,,,,0,KVMSERVERG.AVCPR=83;KVMSERVERG.BIOS_DATE=;KVMSERVERG.BN=3248547;KVMSERVERG.CEM=;KVMSERVERG.CLUSTER=CDC-Linux-Non-Prod-Cluster-01;KVMSERVERG.CP=0;KVMSERVERG.CS=connected;KVMSERVERG.DATACENTER=CSC Chicago Data Center (CSC);KVMSERVERG.DEMAND=0;KVMSERVERG.DM=datacenter-862;KVMSERVERG.DS=283436;KVMSERVERG.EU=0;KVMSERVERG.FQN=;KVMSERVERG.HE=0;KVMSERVERG.IDIS0=0;KVMSERVERG.IDIS1=0;KVMSERVERG.IDIS2=0;KVMSERVERG.IDIS3=0;KVMSERVERG.IDIS4=0;KVMSERVERG.IDIS5=0;KVMSERVERG.IDIS6=0;KVMSERVERG.IDIS7=0;KVMSERVERG.IDIS8=0;KVMSERVERG.IDIS9=0;KVMSERVERG.IP_ADDRESS=;KVMSERVERG.LATENCY=0;KVMSERVERG.MEM=;KVMSERVERG.MM=0;KVMSERVERG.NICS=8;KVMSERVERG.NODEID=;KVMSERVERG.NUMBER_VMS=22;KVMSERVERG.NVO=13;KVMSERVERG.OCU=7;KVMSERVERG.OMU=47;KVMSERVERG.ORIGINNODE=VM:FAIN232V-ibmesxcdc030:ESX;KVMSERVERG.OS=green;KVMSERVERG.PC=12;KVMSERVERG.PC0=0;KVMSERVERG.PEP=0;KVMSERVERG.PER=0;KVMSERVERG.PF=;KVMSERVERG.PM=196544;KVMSERVERG.PRODUCT=VMware ESXi;KVMSERVERG.PS=;KVMSERVERG.PU=0;KVMSERVERG.SAML=0;KVMSERVERG.SH=ibmesxcdc030.amer.ibm.corp;KVMSERVERG.SM=;KVMSERVERG.SN=;KVMSERVERG.SPML=0;KVMSERVERG.SUT=4294967295;KVMSERVERG.SV=;KVMSERVERG.TCM=31908;KVMSERVERG.TIMESTAMP=1180327090935000;KVMSERVERG.TVCM=0;KVMSERVERG.TVPS=0;KVMSERVERG.UCM=0;KVMSERVERG.UD=130917;KVMSERVERG.UUID=2001FF01-0000-0000-0000-00000000005F;KVMSERVERG.VE=Yes;KVMSERVERG.VERSION=5.5.0,

This is a little messy to view but it is simple compared to some. The first line is a header summary which starts with the situation name, the agent name, the remote TEMS, agent time, TEMS time, Status [Y=open], 300 [the sampling interval in seconds], number of results, the DisplayItem value, and the Situation Formula or PDT. In the full report section there is a header title line.

The second line is the predicate or formula summary. The names here all use the Attribute Group Table and Attribute Column forms. It is exactly parallel to the PDT on the first line.

The third line shows the attributes sent with the open situation event. In particular note KVMSERVERG.SUT=4294967295. Using a decimal-to-hex calculator, that number is equivalent to X'FFFFFFFF', and in signed arithmetic that is -1. Numeric attributes are usually kept in signed 4-byte integers.

In summary, the situation fired because -1 < 600 is a true statement. The agent also continued to send the identical information every 300 seconds – which is how situation processing works. This happened at 22 agents [connecting to this remote TEMS] and comprised 4.27% of the estimated situation workload.
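The hex form and the signed reinterpretation can be checked from any shell rather than a web calculator; a quick sketch:

```shell
# 4294967295 is X'FFFFFFFF'; the same 32 bits read as a signed 4-byte integer mean -1
printf '%X\n' 4294967295              # prints the hex form FFFFFFFF

# Subtracting 2^32 reproduces the signed 32-bit wraparound
echo $(( 4294967295 - 4294967296 ))   # prints -1
```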

What is -1?

Much data gathered by the agent comes from API calls to the monitored system. The negative values typically mean the API call could not return the data for some reason. Sometimes that is documented in the agent manual, sometimes not. It cannot really be an actual seconds count because 4294967295 seconds would be roughly 136 years.

Situation Formula reworked

The rework is rather easy.

*IF *VALUE KVM_SERVER.System_up_time *LT 600 *AND *VALUE KVM_SERVER.System_up_time *GE 0

will screen out the negative values.
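As a sanity check, the reworked test can be simulated against the bad sample value. This is an illustrative sketch only – the real evaluation happens inside the agent, not in a script:

```shell
# Simulate the reworked formula (SUT < 600 AND SUT >= 0) against the bad row
awk 'BEGIN {
  sut = 4294967295 - 4294967296      # raw X"FFFFFFFF" read as signed 32-bit = -1
  if (sut < 600 && sut >= 0) print "event fires"
  else                       print "screened out"
}'
```

With the added `*GE 0` condition the -1 sample no longer satisfies the formula.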

It might be that -1 is a signal of a true error condition. That would probably need to be worked out with the agent support people, or perhaps with the vendor regarding API usage. In that case you could have a separate situation to track those cases down and fix them.

*IF *VALUE  KVM_SERVER.System_up_time *LT 0

Summary

Tale #1 of using Event Audit History to diagnose a Situation mystery.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Between Deck Vent on Cruise Ship Build 2018

 

Sitworld: Event History Audit


Version 1.39000 –  23 August 2020

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

This project had several inspirations.

The first came from a desire to estimate situation workload impact without any extra tracing and, in general, from discovering aspects of how results are processed at the agent and events are processed in the TEMS.

The second arose from a customer case involving a situation that did not fire [a guard dog that didn’t bark!]. After lengthy effort, the root cause was determined: the situation used a DisplayItem to make sure that all the monitored database processes offline were reported. In this problem case, the DisplayItem attribute “database name” was blank or null. Since there were several such disabled database processes, only one of the blank ones generated an event and the rest were suppressed. The situation was rewritten to use a different DisplayItem that always supplied the database name, and a re-test showed events were created for all the database processes. Thus passed two months of pain. Diagnosing such cases faster or ahead of time would improve monitoring with minimal effort.

The third was a goal to calculate other metrics such as the time delay between data processed at the agent and data processed at the TEMS. This can point out system performance issues and even system clock issues.

TL;DR

Use the supplied script eventdat.pl to capture the TSITSTSH table at the hub and each remote TEMS. Then use the eventaud.pl script to detect and correct any problems. You will get more valid situation events. The eventaud.csv report contains explanations of the advisory messages and the report sections. Read the section here on the summary report. There are many other report sections, and these are documented at the end of the report itself.

DisplayItem

When situations are started, processing runs on the agents to evaluate conditions defined by the situation formula. The situation formula is a filter that takes the available data and determines which results should be passed on to the TEMS. Some formulas are run periodically, and those are called Sampled situations. Other formulas are run as needed to filter incoming sequential log-type data, and those produce Pure events.

Sometimes a formula can pass back multiple results. Consider a check for Unix mount points with free space lower than desired. Multiple disks can be lower than desired and multiple result rows returned. The TEMS uses a DisplayItem to create individual situation events. A DisplayItem is the first 128 [or fewer] characters of another attribute. When these DisplayItems create unique values per result row, the TEMS can create individual events. If a DisplayItem is needed but not supplied, only one event is created, thereby hiding real events. The surviving event is chosen somewhat randomly. If the selected DisplayItem is present but does not supply unique values [clearly an agent issue – why supply a DisplayItem that does not uniquely identify a result row?] you can again miss events.

That issue can also happen with Pure situations. In addition, the TEMS has performance logic such that if the same pure event with the same DisplayItem arrives from the same agent in the same second, the second and subsequent results are merged. In the TEP display the merged events can be seen; however, they are not transmitted to the event receiver. If a DisplayItem is not defined [or not available], you can again have missed events. Even if you have a DisplayItem, if a flurry arrives in the same second with the same DisplayItem, you again have missed events. Sometimes no one minds: if you have one event or twenty in the same second, does anyone really care? There is a way to configure the TEMS to guarantee that every pure result is processed into a single event [see product documentation here]. I must point out that such a configuration can create a serious impact on the TEMS and event receiver logic – is there any real purpose in generating scores of events per second for essentially duplicate issues? As an alternative, many such agent sources like the Tivoli Log Agent have agent-side controls to suppress duplicate identical reports.

Result Workload Measurement

Measuring result workload also has value. One recent case involved a simple situation designed to measure Unix OS Agent CPU usage and alert when too high [more than 50%]. The Event Audit report showed that a staggering 96% of the incoming workload on that remote TEMS was coming from that one situation. The logic of working backward – from the Situation Status History table to the agent-side result creation and the TEMS handling of the incoming results – was complex but extremely rewarding.

This can also be rewarding with a negative result. A recent case showed signs of overload, but a time-based estimate of incoming results showed nothing special around the time of the instability. The TEMS was processing peak levels of 3-4 results per second and most often cruised along at 0.5 to 1.0 results per second. As a result, a lot of customer work gathering data and doing restarts was avoided. The root cause was an external influence which blocked TCP communication.

Other reports can show cases where a situation fluctuates between open and close, over and over. Often that suggests a useless situation, since it fails the basic goal of situations – to be rare, exceptional and fixable.

Getting the report

You can make this report yourself as described using an available package. You can also create a Case with IBM Support and upload one or more hub or remote TEMS pdcollects. Ask in the Case that the eventaud.csv report be sent back to you.

Package Installation

The package is eventaud.1.39000. It contains

1) Perl script eventdat.pl – to gather the needed data from the ITM environment

2) Perl script eventaud.pl – to process the data and produce an Event Audit report.

I suggest eventdat.pl be placed in an ITM installation tmp directory. For Windows you need to create the <installdir>\tmp directory. For Linux/Unix use the existing <installdir>/tmp directory. You can of course use any convenient directory. The examples following will use these defaults:

Linux/Unix:  /opt/IBM/ITM/tmp

Windows: c:\IBM\ITM\tmp

Linux and Unix almost always come with the Perl shell processor installed. For Windows you can install a no cost Community version from http://www.activestate.com if needed. You can also extract the files and copy them to a system where Perl is installed. No CPAN [community donated source] packages are needed.

Parameters for running eventdat.pl

All parameters are optional if defaults are taken

-h home installation directory for TEPS. Default is

Linux/Unix: /opt/IBM/ITM

Windows: c:\IBM\ITM

This can also be supplied with an environment variable

Linux/Unix: export CANDLEHOME=/opt/IBM/ITM

Windows: set CANDLE_HOME=c:\IBM\ITM

-v produce progress messages, default off

-tems specify a TEMS to get event history – can supply multiples using multiple -tems. Default is all online TEMSes except FTO mirror hub TEMS.

-work  directory to store work and report files, default is current directory

-off include offline agents, default off

-aff needed in rare database cases where affinities are in different structure

-redo used during testing by reusing result files

The eventdat.pl script will create several files

HUB.TSITDESC.LST

HUB.TNAME.LST

and one or more situation event history files of this form tems_name.TSITSTSH.LST.

The tems_name is HUB for the hub TEMS and otherwise it is the TEMS nodeid.

The eventdat.pl script also creates two shell files: run_eventaud.cmd and run_eventaud.sh. They invoke the eventaud.pl script with the needed parameters – “.cmd” for Windows, “.sh” for Linux/Unix; the only difference is the line-end characters to make each environment happier.

Parameters for running eventaud.pl

All parameters are optional if defaults are taken.

-h  Help display

-v Verbose display – more messages

-nohdr skip report header – used in regression testing

-txt read input from TEMS2SQL processed files  QA1CSITF.DB.TXT, QA1DNAME.DB.TXT, QA1CSTSH.DB.TXT

-lst [with no following TEMS name]  get data from HUB.QA1CSITF.LST, HUB.QA1DNAME.LST, HUB.QA1CSTSH.LST

-lst with a following TEMS name, get data from HUB.QA1CSITF.LST, HUB.QA1DNAME.LST, <temsnodeid>.QA1CSTSH.LST

-allresults display more results data, especially when the default single result is present

-sum produce a summary file eventaud.txt

-time  [to be implemented]

-days limit look back to this many days before the most recent event; default 7 days; value 0 means use all data

-o name of output file, default is eventaud.csv

-odir directory for output files, default /tmp or c:\temp

-tsitstsh filename of Situation Status History extract file

-workpath  directory to store working files

Running eventaud.pl is just a matter of putting the needed files into a directory and running it like this:

cd /tmp   [where HUB.QA1CSITF.LST, HUB.QA1DNAME.LST, HUB.QA1CSTSH.LST are present]

perl eventaud.pl -lst

If you had a remote TEMS with the name RTEMS1 the commands would be

cd /tmp

perl eventaud.pl -lst -tsitstsh RTEMS1

The report file is eventaud.csv and contains the advisories, the reports, and explanations at the end of the report.

Summary Report format

The eventaud.csv report is lengthy with advisories and report sections. To avoid the burden of over-documentation that few will read, the report itself contains documentation at the end which describes each advisory and report section including the meaning, portions of an example and suggested recovery action if required. This internal documentation is under revision as new diagnoses are made. For this blog post we will show and discuss a summary report section and what it all means. Most of the exception cases in the summary are explained in more detail in the report sections that follow.

EVENTREPORT000: Event/Result Summary Budget Report

Duration: 4476 Seconds

==> The time between the oldest event history and the most recent, in seconds. The default -days 7 will translate to 604800 seconds. This is especially important for testing environments where you may have some years-old events. The only times measured are those of the Y [open event], N [close event] and X [problem] DELTASTAT column values. The S [Start Situation] records are used to limit the event calculation to the time since the situation was last started – usually when the TEMS started. Start Situations are not included in the time calculation. They are usually followed by a flood of N [close event] records which are not very interesting.

Total Open/Close Events: 337 4.52/min

==> Count of Y [open] and N [close] records over the duration captured. The total count and rate per minute are displayed. These are usually far fewer than results, since most of the result rows are confirming existing open event conditions.

Total Results: 3507 47.01/min

==> These are result rows sent from the agents. Sampled situations send a “confirming” result every sampling interval. In fact, if such a confirming result is not observed for three sampling intervals, the TEMS will generate an automatic close of the event. This can be confusing, especially if the condition is supposed to be always true. It usually means there is some agent or TEMS or networking interference happening. Definitely not good; in well-running systems this never happens. If this is a hub TEMS, the results apply to the hub and all remote TEMSes together and thus do not reflect stress on the hub system only.

Total Non-Forwarded Results: 1444 19.36/min [41.17%]

==> This records how many of the result rows were for situations which were not configured to forward to an event receiver like Netcool/OMNIbus. This is only counted if some situations were configured to send data to an event receiver. These should be reviewed, because if no event receiver is processing the conditions, the issues will never be fixed and it is often a waste of time and resources. The last number is the percentage of non-forwarded results compared to total results. This impacts the agent and the remote TEMS but not the hub TEMS… the hub TEMS only sees the Y and N results.

Total Result Bytes: 3506892 45.91 K/min Worry[9.18%]

==> This is an estimate of the number of bytes sent from the agents, with the rate in Kbytes per minute. The Worry% is how close this is to the 500K/min level, which is where you should be worried. This is not an absolute number. The actual point where a remote TEMS starts to break down could be higher or lower depending on the system power, competing workload and network. Your own experience is the best guide. The largest number ever seen was 93 megabytes/min, and of course that remote TEMS was malfunctioning badly. Event Audit numbers are estimates only; you would need to use TEMS Audit and tracing to get more precise numbers.

Total Non-Forwarded Result Bytes: 744464 9.75/min [21.23%]

==> This shows the portion of arriving result data associated with non-forwarded situations. See above for more discussion.

Sampled Results Confirm: 3479 46.64/min

==> Many incoming result rows are used to confirm existing open situation events. This is often a high proportion of the total. If the remote TEMS is under stress, you can reduce the stress by increasing the sampling interval, by adding a new remote TEMS to divide the workload, or by using a more powerful system.

Sampled Results Confirm Bytes: 3421828 44.79 K/min, 97.57% of total results

==> These are the bytes required to confirm open situation events. The number at the end is the percent of total incoming result bytes.

Missing DisplayItem: 26 0.35/min

==> This counts the number of cases where a situation does not have a DisplayItem configured and this has hidden situation events. This degrades the monitoring performance by not creating the situation events needed to resolve problems.

Duplicate DisplayItem: 27 0.36/min

==> This counts the number of cases where a situation has a DisplayItem but at the agent multiple results were seen with identical DisplayItem values. This may degrade the monitoring performance by not creating situation events to resolve problems. Usually this means you cannot use that DisplayItem reliably. It may be considered an Agent issue since the Agent should not offer a DisplayItem that does not uniquely identify the result rows. The practical solution is to select another DisplayItem that does uniquely identify result rows. For example in Unix select Process_ID and not Process Command Name.

Null DisplayItem: 100 1133.57/min

==> This is much like the previous Duplicate case but it means Null values were returned. This has the same negative impact and the same considerations.

Pure Merged Results: 6 0.08/min

==> This reports on cases where multiple Pure situation results, with the same DisplayItem value, were processed in a single second at the TEMS and were suppressed by a different mechanism. If this is important, a per-TEMS configuration can be made to enforce one-result-equals-one-event logic.

Open/Open transitions: 0

==> This rare condition is when an open event is followed by a second open event. It is not well understood.

Close/Close transitions: 0

==> This rare condition is when a close event is followed by a second close event. It is not well understood.

Delay Estimate opens[334] over_minimum [114] over_average [1.81 seconds]   

==> This is a stress indicator of how the agent, network and TEMS handle incoming results. It can report as low as 0.0 seconds. If the TEMS is heavily loaded the number will tend to be larger. If you see a very large number, that often means there are one or more agents running on a system with a time vastly different from the TEMS time. That can be normal or abnormal, and there is a report section devoted to which agents are showing cases where the time is much larger than the minimum time observed.
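The headline figures in the summary are simple divisions over the capture duration. A small awk sketch reproduces them from the sample numbers above, which is a useful cross-check when reading an unfamiliar report:

```shell
# Recompute the summary rates from the sample report: 4476 second duration,
# 337 open/close events, 3507 results, 1444 non-forwarded, 3506892 result bytes
awk 'BEGIN {
  secs = 4476
  printf "Open/Close: %.2f/min\n", 337/secs*60
  printf "Results: %.2f/min\n", 3507/secs*60
  printf "Non-Forwarded: %.2f%%\n", 1444/3507*100
  kpm = 3506892/secs*60/1024          # result bytes as K/min
  printf "Result Bytes: %.2f K/min Worry[%.2f%%]\n", kpm, kpm/500*100
}'
```

The output matches the report lines shown earlier [4.52/min, 47.01/min, 41.17%, 45.91 K/min with a 9.18% Worry against the 500K/min level].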

Reporting problems

If things do not work as expected, please capture input LST files in a zip or compressed tar file and send to the author. I will endeavor to correct any issue promptly. Also please pass on any interesting ideas.

Summary

The information in the report will show cases where two or more TEMSes have differing information about particular agents. In the simplest cases that strongly suggests duplicate agents. You can expect to see other associated posts concerning different aspects of event history.

Sitworld: Table of Contents

History

This project is also maintained in github.com/jalvo2014/eventaud and will often be more up to date [and less tested] compared to the point releases. You can also use this github distribution to review history and propose changes via pull requests.

eventaud.1.39000
Add/correct some table row sizes

eventaud.1.38000
Add/correct some table row sizes

eventaud.1.37000
Add/correct some table row sizes

eventaud.1.36000
Add/correct some table row sizes

eventaud.1.35000
Add report033 on estimated TSITSTSC cache usage and constant on situations

eventaud.1.34000
Add/correct some table row sizes

eventaud.1.33000
Add/correct some table row sizes

eventaud.1.32000

Add/correct some table row sizes

Add predicate related attributes at start in full report

Note: Building a new Cruise Ship 2018

Sitworld: TEMS Audit Tracing Guide Appendix


Version 1.64000 31 May 2017

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Following is the information on tracing for TEMS Audit

There was a document size limit, so this appendix is published separately, but it is logically a part of that document.

Appendix 1

Some of the advisories and report sections require diagnostic tracing. We will use a standard workload tracing setup for these examples. Independent of the implementation, the diagnostic trace string looks like this:

error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)(unit:kshdhtp,Entry="getHeaderValue" all) (unit:kshreq,Entry="buildSQL" all)(unit:kfastpst,Entry="KFA_PostEvent" all er)(unit:kdssqprs in metrics er)(unit:kdsstc1,Entry="ProcessTable" all er)(unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)

You always need error. The rest are additions to trace more.

There are multiple ways to set this tracing:

Static Trace Definitions – requires a process recycle

Linux/Unix

The best way is to add a file ms.environment to the <installdir>/config directory which has the same attributes/owner/group as ms.ini. Use touch/chmod/chown/chgrp to create such a file. If one already exists use it. Add the following to that file – one long line.

KBB_RAS1= error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)(unit:kshdhtp,Entry="getHeaderValue" all) (unit:kshreq,Entry="buildSQL" all)(unit:kfastpst,Entry="KFA_PostEvent" all er)(unit:kdssqprs in metrics er)(unit:kdsstc1,Entry="ProcessTable" all er)(unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)

Windows

Using MTEMS, right click on the TEMS line, click Advanced, click Edit Trace Parms…

In the Ras1 Filter enter

error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)(unit:kshdhtp,Entry="getHeaderValue" all) (unit:kshreq,Entry="buildSQL" all)(unit:kfastpst,Entry="KFA_PostEvent" all er)(unit:kdssqprs in metrics er)(unit:kdsstc1,Entry="ProcessTable" all er)(unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)

Note there is no KBB_RAS1=  in this context.

z/OS

Add the following to the RKANPARU(KDSENV) file

KBB_RAS1= error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)(unit:kshdhtp,Entry="getHeaderValue" all) (unit:kshreq,Entry="buildSQL" all)(unit:kfastpst,Entry="KFA_PostEvent" all er)(unit:kdssqprs in metrics er)(unit:kdsstc1,Entry="ProcessTable" all er)(unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)

At that point recycle the TEMS and collect the data.

The ms.environment technique works from ITM 623 GA onward. Before that you can achieve the same goal by updating the TEMS config file

hostname_ms_temsnode.config

with the configuration string added in single quotes like this:

KBB_RAS1= 'error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)(unit:kshdhtp,Entry="getHeaderValue" all) (unit:kshreq,Entry="buildSQL" all)(unit:kfastpst,Entry="KFA_PostEvent" all er)(unit:kdssqprs in metrics er)(unit:kdsstc1,Entry="ProcessTable" all er)(unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)'

Such a temporary update will be lost during a TEMS configuration, so it is fine for cases like this.

Dynamic – tacmd settrace

The best modern way to run such a command is via tacmd settrace. The sequence looks like this when tacmd is run from Linux/Unix. Note that the <temsnodeid> is the TEMS nodeid and not the hostname of the system running the TEMS. This is what would be seen in a tacmd listsystems output.

Linux/Unix

cd <installdir>/bin

./tacmd login -s ….   [login to hub TEMS]

./tacmd settrace -m <temsnodeid>  -p KBB_RAS1 -o 'error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)(unit:kshdhtp,Entry="getHeaderValue" all) (unit:kshreq,Entry="buildSQL" all)(unit:kfastpst,Entry="KFA_PostEvent" all er)(unit:kdssqprs in metrics er)(unit:kdsstc1,Entry="ProcessTable" all er)(unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)'

Note the single quotes around the diagnostic trace parameters.

After some time – usually a couple hours – you can disable the trace like this:

cd <installdir>/bin

./tacmd login -s ….   [login to hub TEMS]

./tacmd settrace -m <temsnodeid>  -p KBB_RAS1 -r

Windows

The commands are largely the same but quoting is different. External quotes are double quotes and embedded double quotes are tripled.

cd <installdir>\bin

tacmd login -s ….   [login to hub TEMS]

tacmd settrace -m <temsnodeid>  -p KBB_RAS1 -o "error (unit:kpxrpcrq,Entry="""IRA_NCS_Sample""" state er)(unit:kshdhtp,Entry="""getHeaderValue""" all) (unit:kshreq,Entry="""buildSQL""" all)(unit:kfastpst,Entry="""KFA_PostEvent""" all er)(unit:kdssqprs in metrics er)(unit:kdsstc1,Entry="""ProcessTable""" all er)(unit:kraafira,Entry="""runAutomationCommand""" all)(unit:kglhc1c all)"

Note the outer double quotes and the tripled embedded double quotes around the diagnostic trace parameters.

After some time – usually a couple hours – you can disable the trace like this:

cd <installdir>\bin

tacmd login -s ….   [login to hub TEMS]

tacmd settrace -m <temsnodeid>  -p KBB_RAS1 -r

 

Example Trace Strings

Hub TEMS basic workload:

./tacmd settrace -m <temsnodeid>  -p KBB_RAS1 -o "error (unit:kpxrpcrq,Entry="""IRA_NCS_Sample""" state er)(unit:kshdhtp,Entry="""getHeaderValue""" all) (unit:kshreq,Entry="""buildSQL""" all)(unit:kfastpst,Entry="""KFA_PostEvent""" all er)(unit:kdssqprs in metrics er)(unit:kdsstc1,Entry="""ProcessTable""" all er)(unit:kraafira,Entry="""runAutomationCommand""" all)(unit:kglhc1c all)"

Remote TEMS basic workload:

error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er) (UNIT:kfaprpst ER ST) (UNIT:kfastinh,ENTRY:"KFA_InsertNodests" ALL)(unit:kdssqprs metrics in er)(unit:kdsstc1,Entry="ProcessTable" all er)(unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)

Hub TEMS Workload plus Heartbeat:

error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er) (UNIT:kfaprpst ER ST) (UNIT:kfastinh,ENTRY:"KFA_InsertNodests" ALL)(unit:kdssqprs metrics in er)(unit:kdsstc1,Entry="ProcessTable" all er)(unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)

Remote TEMS Workload plus Heartbeat:

error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er) (UNIT:kfaprpst ER ST) (UNIT:kfastinh,ENTRY:"KFA_InsertNodests" ALL)(unit:kdssqprs metrics in er)(unit:kdsstc1,Entry="ProcessTable" all er)(unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)(UNIT:kfaprpst ST ER)

Hub TEMS plus heartbeat plus KPX traces to watch input in detail:

error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)(unit:kshdhtp,Entry="getHeaderValue" all) (unit:kshreq,Entry="buildSQL" all)(unit:kfastpst,Entry="KFA_PostEvent" all er)(unit:kdssqprs in metrics er)(unit:kdsstc1,Entry="ProcessTable" all er)(unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)(UNIT:kfaprps ST ER)

Remote TEMS plus heartbeat plus KPX traces to watch input in detail:

error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er) (UNIT:kfaprpst ER ST) (UNIT:kfastinh,ENTRY:"KFA_InsertNodests" ALL)(unit:kdssqprs metrics in er)(unit:kdsstc1,Entry="ProcessTable" all er)(unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)(UNIT:kfaprps ST ER)(UNIT:kfastinh,ENTRY:"KFA_InsertNodests" ALL)(UNIT:kpxreq ALL)(UNIT:kpxreqds ALL)

Rarer dynamic options

There was an earlier way to make dynamic tracing changes documented here:

Dynamically modify trace settings for an IBM Tivoli Monitoring component

http://www-1.ibm.com/support/docview.wss?rs=0&uid=swg21266129

It is sometimes blocked by firewall restrictions and lack of login credentials to the system running the TEMS.

There is also a z/OS TEMS option which looks a bit like

CTDS TRACE ADD FILTER ID=001 UNIT=KOCACHE CLASS(ALL)

If you must use that, please contact the author for details.

Versions:

Here are recently published versions. In case there is a problem at one level you can always back up to an earlier one.

1.64000 – first publication of the trace appendix

Sitworld: Table of Contents

Photo Note: Art Deco Cat sculpture

 

Sitworld: ITM 6 Interface Guide Using KDEB_INTERFACELIST

MudCreekSlide

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Overview

ITM 6 processing uses TCP/IP communications. In a z/OS system ITM 6 can also use System Network Architecture [SNA] communications. That is a separate topic and will not be addressed in this technote.  A good starting point to understanding ITM communications technology is documented here:

ITM Port Usage and Limiting Port Usage

ITM 6 communications requires managing network interfaces. Network interfaces are identified by ip addresses. In many systems there is a Network Interface hardware device associated with one or more ip addresses. There can also be software-defined interfaces such as localhost [127.0.0.1] or those installed by virtualization software. Multiple ip addresses can be associated with a single Network Interface device.

The ITM 6 KDE component discovers the interfaces and creates a list of ip addresses that are suitable for connections. These addresses are registered into the Location Broker [think phone book]. ITM interface discovery logic ignores ip addresses like 127.0.0.1. It will pick an order or priority of registration based on the sequence seen during discovery.

Note that this only applies to ITM 6 server processes such as TEMS and WPA and KDE_Gateway. Agents use location broker information gathered at the services they connect to.

ITM 6 communications usually works very well using default processing. However in complex cases, you must control interface usage using these two environment variables:

KDEB_INTERFACELIST

KDEB_INTERFACELIST_IPV6

See Appendix 1 for adapting to the disruptive change that was temporarily introduced at ITM 623 FP1 and ITM 622 FP8 and was reverted at ITM 623 FP3.

This document ignores KDEB_INTERFACELIST_IPV6 – it works exactly the same but for IPV6 networks.

KDEB_INTERFACELIST Controls

This environment variable is a string having one or more segments separated by blanks. An individual segment looks like this: [control][ip_address]

control can be

absent meaning no operator

! meaning exclusive

- meaning subtractive

+ meaning additive

ip_address can be

absent

a name which resolves to an ip address

an asterisk which means the interface associated with hostname.

Here are specific usage cases.

KDEB_INTERFACELIST Simple Address Control (no operator)

KDEB_INTERFACELIST=192.168.1.1

After discovery this ip address will be placed first.

KDEB_INTERFACELIST=192.168.1.1 192.168.1.100

After discovery these two ip addresses will be placed first.

KDEB_INTERFACELIST Exclusive Bind (! Operator)

KDEB_INTERFACELIST=!192.168.1.1

In default configuration, ITM 6 communications will listen on all available interfaces for incoming work. That is called non-exclusive bind. When the exclamation mark prefixes the ip address, that single interface will be listened on and no other. That is called exclusive bind.

When exclusive bind is used, all ITM 6 processes must use exclusive bind. The usage must be coordinated. You must never mix exclusive and non-exclusive bind as the agents will overwrite each other’s connections constantly.  

KDEB_INTERFACELIST=!*

In some platforms, one interface will be specially tagged as “hostname”. The above setting will listen exclusively to that interface.

It is quite normal and useful to have multiple ITM processes running with different exclusive binds. For example on Linux/Unix you can install multiple remote TEMS each using a different exclusive ip address [and different install directories]. So the requirement is that the exclusive binds must be coordinated.

KDEB_INTERFACELIST Subtraction Control (- operator)

KDEB_INTERFACELIST=-192.168.1.100

This means that a normal survey of all interfaces is performed, but the 192.168.1.100 interface is removed from the list after discovery.

KDEB_INTERFACELIST_IPV6=-

This means that a normal survey of all interfaces is performed, but all IPV6 interfaces are ignored. That can be useful when the IPV6 is not ready for production usage.

KDEB_INTERFACELIST Addition Control (+ operator)

KDEB_INTERFACELIST=+192.168.1.100

This means that a normal survey of all interfaces is performed, but the 192.168.1.100 interface is added to the survey.

Practical examples

Case 1: The system has three interfaces. The first two can be reached by agents but the third is dedicated to a backup function and should not be advertised in the Location Broker (think phone book) nor listened on by the Tivoli process.

KDEB_INTERFACELIST=-192.168.1.100

This subtracts the ip address for the backup function.

Case 2: The system has IPV6 interfaces, but there is no working IPV6 networking in the environment.

KDEB_INTERFACELIST_IPV6=-

That could be done to simplify processing and avoid publishing an interface that is not in fact useful.

Case 3: The system has two interfaces which are valid for connection from agents.

Do not supply any controls unless required by other ITM processes on the system. The agents will connect to the location broker and then attempt a connection on each server interface. Whatever works, that will be the one the agent uses. There is no need to supply a priority order. For example one site had agents on two different sub-networks and each needed to connect to the TEMS via different ip addresses.

Case 4: The ITM process is an agent. That includes TEPS.

Do not supply any controls unless required by other ITM processes on the same server.

Case 5: Two remote TEMS on a single Linux/Unix system

One interesting example is on a Linux/Unix system where two different TEMSes are installed in different install directories.

This requires two separate dedicated interfaces. For each of them use the exclusive control:

KDEB_INTERFACELIST=!192.168.1.1

KDEB_INTERFACELIST=!192.168.1.100

In the same way you could set up two OS Agents, each using a separate exclusive interface… perhaps each connecting to a different ITM hub TEMS.

Case 6: Specified Interfaces

There are cases where a specified interface must be discovered first. For agents, this will be the address the agent puts into the Node Status as the ip address and port number to contact. One good example is a cluster configuration where the virtual IP address must be registered – so the contact information remains the same no matter what the actual system the agent is running on.

This is still easily accomplished:

A) If there is a single usable interface specify nothing.

B) If there are two or more interfaces, list the specified interface first and then specify any other interface. In this case the specified interface will be discovered first and used for registration.

Case 7: Too Many Interfaces

If you see this error message

Status 1DE00046 KDE1_STC_INTERFACELIMITREACHED error

That means there are too many interfaces to continue.

Disable interface discovery by adding the following parameter to the Agent / Server configuration file:

KDEB_NOIFEXAM=1

In the same configuration file, specify the interface to be used to connect to the TEMS:

KDEB_INTERFACELIST=x.x.x.x

where x.x.x.x is the interface address which will be used.
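Put together, the resulting agent configuration fragment might look like this sketch. The address 192.168.1.50 is an example only, not from any specific installation:

```shell
# Hypothetical agent environment file fragment for the "too many
# interfaces" case: skip interface discovery entirely and name the
# single interface the process should use.
KDEB_NOIFEXAM=1
KDEB_INTERFACELIST=192.168.1.50
```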

Pre ITM control – KDCB0_HOSTNAME

A closely related control looks like this

KDCB0_HOSTNAME=xx.xx.xx.xx

This control has the same meaning as

KDEB_INTERFACELIST=xx.xx.xx.xx

Most importantly KDCB0_HOSTNAME overrides any KDEB_INTERFACELIST setting. If you have

KDCB0_HOSTNAME=xx.xx.xx.xx

KDEB_INTERFACELIST=!xx.xx.xx.xx

That is the exact equivalent to

KDEB_INTERFACELIST=xx.xx.xx.xx

Which can be extremely surprising, especially if you expect exclusive bind across many agents on some system.

You should remove KDCB0_HOSTNAME and replace with KDEB_INTERFACELIST if it is really needed.

Summary

This technote demonstrates how to configure ITM communications to avoid problems.

Sitworld: Table of Contents

Photo Note: Mud Creek Slide: 1 million tons of earth and rocks 2017/5/20

Appendix 1:  KDEB_INTERFACELIST time of troubles.

At ITM 622 FP8 and ITM 623 FP1 the logic was changed in a problematic way.

KDEB_INTERFACELIST=192.168.1.1

Was interpreted to be the same as KDEB_INTERFACELIST=!192.168.1.1.

KDEB_INTERFACELIST=192.168.1.1 192.168.1.2

Was also interpreted to be the same as KDEB_INTERFACELIST=!192.168.1.1. The second interface was totally ignored.

At ITM 622 FP9 and ITM 623 FP2 the logic was somewhat restored but was still a problem.

KDEB_INTERFACELIST=192.168.1.1 192.168.1.2

Was interpreted just as it is stated, a simple ordering of which interface comes first, just as before [and afterwards].

However

KDEB_INTERFACELIST=192.168.1.1

Was still interpreted to be the same as KDEB_INTERFACELIST=!192.168.1.1.

At ITM 623 FP3 all those changes were completely reverted.

This might seem to be an ancient history footnote; however, the basic services component is installed along with the OS Agent and in many cases never updated. You may still need to take these limitations into account on agents or ITM 622 central services. That ITM 622 level is going End of Service on 28 April 2018 but history shows its use sometimes persists. Also ITM 623 GA/FP1/FP2 are still supported.

If you find yourself with a system at these problematic levels and you cannot upgrade – what do you do?

If you are not using KDEB_INTERFACELIST at all there is nothing to do.

If you are using exclusive binds or the + or - controls, nothing special is needed.

Managing ITM 622 FP8 and ITM 623 FP1

The easiest thing to do is switch the KDEB_INTERFACELIST to exclusive binds all the way across all agents on that system.

Managing ITM 622 FP9 and ITM 623 FP2

If you have a one element ordering statement

KDEB_INTERFACELIST=xx.xx.xx.xx

The best way is just to specify a following address:

KDEB_INTERFACELIST=xx.xx.xx.xx 127.0.0.1

Which will preserve the ordering definition. For an Agent that second ip address will never be referenced and so does no harm.

We strongly suggest you keep the OS Agents upgraded. If there are 32-bit agents, you will also have to run tacmd updateFramework to get all the pieces upgraded. The best reference I have found is here:

ITMAgents Insights: Upgrading back-level components using “tacmd updateFramework” or local silent install with response file.

 

Sitworld: ITM Agent Historical Data Export Survey

SunPillar

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 4 May – Level 0.51000

Follow on twitter

Introduction

The ITM Agent Historical Data Survey tool reports cases where the historical data export process has experienced a failure. The data is available for cases when the OS Agent on the system is at the ITM 630 maintenance level.

Background

ITM agents can collect historical data. At a user-configured time rate, the data for specific attributes is collected in Short Term Historical [STH] files. For most environments best practice is to collect the STH files at the agent. Periodically, the data is exported to a Warehouse Proxy Agent which in turn relays the data to a data warehouse like DB2 or Oracle. This generally works quite reliably. However there is a large collection of potential failure cases. The agent may not be able to connect to the WPA. The local file system may fill. The STH file might be broken after an unexpected system stoppage. There are 154 error codes and just one indicates success.

One recent project Discovering Historical Data Export Problems at Agent showed how to create a situation to alert on problem cases. That is quite useful in maintaining a stable environment. However when starting out to clear all issues, all the alerts can be inefficient. Much better would be a report on the problem cases. At ITM 630, access to this data was provided from the Agent Support Library or TEMA.

The following project presents a historical data export report for all the agents using an ITM 630 TEMA.

It is based on the Agent Health Survey project which identifies potential ITM agents which are unhealthy – appearing online but unable to provide real time data or even run situations. That is used in many large ITM installations. The basic framework was reused in this project.

ITM Agent Historical Data Export Survey Report

Here is an example of a test on a fairly large set of Windows OS Agents.

sth1

A review of the agents found there was some issue getting the data – the diagnostic log showed errors. As a result, on these OS Agents there was no NTPROCESS data. When the export process found no data, it recorded a Metafile not found error. Normally there would be a NTPROCESS.hdr file and a NTPROCESS file. Neither was present and so the error code was set. This was only 20+ of 5000 Windows agents but achieving 100% data capture is an excellent goal.

This post Sitworld: Discovering Historical Data Export Problems at Agent  includes a list of all the error codes. Involve IBM Support to determine the meaning and how to recover from a specific error.

ITM Agent Historical Data Export Survey Installation

The agent historical data export survey package includes one Perl program that uses CPAN modules. The program has been tested in several environments. Windows had the most intensive testing. It was also tested on AIX. Many Perl 5 levels and CPAN package levels will be usable. Here are the details of the testing environments.

The Activestate Perl used is 5.20.  If you make use of the blog CPAN library below, use the 5.20 version of that package.

  1. ActiveState Perl in the Windows environment which can be found here: http://www.activestate.com/activeperl/downloads

perl -v

This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread (with 1 registered patch, see perl -V for more detail)

  2. Perl on AIX 5.3

# perl -v

This is perl, v5.8.2 built for aix-thread-multi

(with 3 registered patches, see perl -V for more detail)

CPAN is a collection of free to use packages. In your Perl environment, there may be some installed CPAN modules and this survey tool may need more. Here are the modules used.

Getopt::Long              in CPAN Getopt-Long 2.42

LWP::UserAgent            in libwww-Perl 6.02

HTTP::Request::Common           in CPAN HTTP-Message  6.06

XML::TreePP               in CPAN XML-TreePP 0.43

You might discover the need for other CPAN modules as the programs are run for the first time. The programs will likely work at other CPAN module levels but this is what was most recently tested.

The Windows Activestate Perl environment uses the Perl Package Manager to acquire the needed CPAN modules. The Agent Survey technote has an appendix showing usage of that manager program with screen captures.

Please note!!: In some environments installing new CPAN packages is a major problem. Internet access may not be available or Perl may be a shared resource which you do not have the right to change. Changing such packages could negatively affect other programs.

To manage this case please see the CPAN Library for Perl Projects which has a package which can eliminate changing the installed Perl libraries.

Package contents

The supplied program is itm_sth_survey.pl and a model sthsurvey.ini file in a zip file itm_sth_survey.0.51000.

To install this package, unzip or untar the file contents into a convenient directory. The soap control is required [see later for discussion]. In this case the sthsurvey.ini file looks like this:

soap <server_name>

user <user>

passwd <password>

The user and password credentials may be supplied from standard input. This increases security by ensuring that no user or password is kept in any permanent disk file. In this case the sthsurvey.ini file would look like this:

soap <server_name>

std

The std option can also be supplied on the command line -std. In either case, a program must supply the userid and password in this form

-user <userid> -passwd <password>

The program invocation would be something like this

mycreds | perl …
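As a sketch, mycreds could be a tiny script that emits the credentials on stdout; the userid SYSADMIN and password shown are placeholders, not values from any installation:

```shell
#!/bin/sh
# Hypothetical "mycreds" helper: writes the -user/-passwd string to
# stdout so the survey program can read it via the std option, keeping
# the credentials out of any permanent disk file.
echo "-user SYSADMIN -passwd secret"
```

The script itself should of course be protected with restrictive file permissions, since it still contains the password.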

ITM Agent Historical Data Export Survey Configuration and Usage

The Agent Historical Data Export Survey package has controls to match installation requirements but the defaults work in most cases. Some controls are in the command line options and some are in the sthsurvey.ini file. Following is a full list of the controls.

The following table shows all options. All command line options except -h and -ini and three debug controls can be entered in the ini file. The command line takes precedence if both are present. In the following table, a blank means the option will not be recognized in that context. All controls are lower case only.

command         ini file        default          notes
-log            log             ./sthsurvey.log  Name of log file
-ini                            ./sthsurvey.ini  Name of ini file
-debuglevel                     90               Control message volume
-debug                          off              Turn on some debug points
-dpr                            off              Dump internal data arrays
-h                              <null>           Help messages
-v              verbose         off              Messages on console also
-vt             traffic         off              Create traffic.txt [large]
-pc             pc              <null>           Limit survey by agent types
-tems           tems            <null>           Limit survey by TEMSes
-agent          agent           <null>           Agents to survey
-agent_list     agent_list      <null>           Text file with agents to survey
-ignore_list    ignore_list     <null>           Text file with agents to ignore
-all            all             off              Produce report of all agents
-agent_timeout  agent_timeout   50               TEMS to Agent wait
n/a             soap_timeout    180              Wait for soap
-o              o               ./sthsurvey.csv  Output report file
-workpath       workpath        <null>           Directory to store output files
n/a             soap            <required>       SOAP access information
n/a             soapurl         <null>           Recognized - use soap
-std            std             off              Userid/password in stdin
-user           user            <required>       Userid to access SOAP
-passwd         passwd          null             Password to access SOAP

Many of the command line entries and ini controls are self-explanatory. The following options can be set multiple times: -pc, -tems and soap. All time-based settings are in seconds.

soap specifies how to access the SOAP process with the name or ip address of the server running the hub TEMS. See next section for a discussion.

soapurl specifies how to access the SOAP process including the protocol and port number and target.

soap_timeout controls how long the SOAP process will wait for a response. One of the agent failure modes is to not respond to real time data requests.  This default is 180 seconds. It might need to be made longer in some complex environments. A value of 90 seconds resulted in a small number of failures [2 agents] in a test environment with 6000 agents.

-agent specifies specific agents to survey and can be set multiple times. -agent_list gives a filename which contains agents to survey. If both are present in command and/or ini file the effect is cumulative.  If -agent or -agent_list is used, you usually do NOT want to use -tems or -pc since those will eliminate some of the specified agents.

If the -agent_list has an entry which begins with a circumflex ^ [shift 6], the entry is considered a regular expression. The ^ character is the beginning-of-line anchor. If you specify ^abc then the managed systems which begin with “abc” will be considered of interest. If you wanted Linux OS Agents which began with abc you would use ^abc.*:LZ. That allows you to create a report on agents of interest based just on the name.
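A small agent_list file mixing exact names and a regular expression might look like this sketch; all the managed system names here are hypothetical:

```shell
# Hypothetical agent_list file: plain lines name agents exactly;
# a line beginning with ^ is treated as a regular expression.
Primary:TESTHOST1:NT
abcserver01:LZ
^abc.*:LZ
```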

Controls to include [like -pc and -tems] and exclude [like -ignore_list] will operate independently. It is best to minimize the number of controls and test thoroughly so you can avoid surprising results.

Command lines supplied are printed in the report; however, the -user and -passwd values are replaced by UUUUUUUU and PPPPPPPP.

ITM Agent Historical Data Export Survey Package soap control

The soap control specifies how to access the SOAP process. For a simple ITM installation using default communication controls, specify the name or ip address of the server running the hub TEMS. If you know the primary hub TEMS a single soap control is least expensive.

If the ITM installation is configured with hot standby or FTO there are two hub TEMS. At any one time one TEMS will have the primary role and the other TEMS will have the backup role. If the TEMS maintenance level is ITM 622 or later, set two soap controls which specify the name or ip address of each hub TEMS server. The TEMS with the primary role will be determined dynamically.

Before ITM 622 you should determine ahead of time which TEMS is running as the primary and set the single soap control appropriately.

Connection processing follows the tacmd login logic. It will first use https protocol on port 3661 and then http protocol on port 1920. If the SOAP server is not present on that ITM process, a virtual index.xml file is retrieved, and the port SOAP is actually using is extracted and used if it exists.

Various failure cases can occur.

  1. The target name or IP address may be incorrect.
  2. Communication outages can block access to the servers.
  3. The TEMS task may not be running and there is no SOAP process.
  4. The TEMS may be a remote TEMS which does not run the SOAP process.
  5. The SOAP process may use an alternate port and firewall rules block access.

The recovery actions for the various errors are pretty clear. If (5) is in effect, consider running the survey package on a server which is not affected by firewall rules. Alternatively, always make sure that the hub TEMS is the first process started. If it must be recycled, then stop all other ITM processes first and restart them after the TEMS recycle. See this blog post which shows how to configure a stable SOAP port at the hub TEMS.

If the protocol is specified in the soap control only that protocol will be tried.

soap https://<servername>

When the port number is specified in the soap control, 3661 will force https protocol and 1920 will force http protocol.

soap <servername>:1920

The ITM environment can be configured to use alternate internal web server access ports using the HTTP and HTTPS protocol modifiers. For this case you can specify the ports to be used

soap https://<servername>:4661

or if both have been altered

soap https://<servername>:4661

soap http://<servername>:2920

The logic generally follows tacmd login processing. There are two differences: ipv6 is not supported and port following ITM 6.1 style is not included. SOAP::Lite does not support ipv6 at present. ITM 6.1 logic could be added but is relatively rare and was not available for testing.

ITM Agent Historical Data Export Survey Install Validation Test

Start with a short run. The goals here are

  1. Ensure Perl is installed with the needed CPAN packages
  2. Validate SOAP communication controls
  3. Access and review of the hub TEMS tables
  4. Access and review of agent operations logs.
  5. Clear observed problems

Here is an example command

perl itm_sth_survey.pl -v -tems <tems_name> -pc ux

The -v option writes all the log messages to the screen. The -tems option specifies a TEMS the agents report to. The -pc option says which agent types to study. Later on you can specify multiple -tems and -pc options.

Here is a second example command where the externally supplied CPAN modules have been installed in the directory inc. In addition all the output files are written into the /tmp directory.

perl -Iinc itm_sth_survey.pl -v -tems <tems_name> -pc ux -workpath /tmp

ITM Agent Historical Data Export Survey Intensive Debug Trace

When the itm_sth_survey.pl program does not produce correct results or stops unexpectedly, you should gather additional documentation. The -debuglevel 300 option will generate an extensive log trace. The sthsurvey.log will be much larger than normal and thus the survey should be limited.

The -vt or traffic option dumps the http data to a traffic.txt file. This can be extremely large and should be used only on a limited basis. In one case a 10,000 agent survey generated a 2 gigabyte file.

ITM Agent Historical Data Export Survey Limitations

http6 and http6s protocols are not yet supported.

Summary

The Agent Historical Data Export Survey tool was derived from Agent Health Survey.

Sitworld: Table of Contents

Feedback Wanted!!

Please report back experience and suggestions. If the Agent Historical Data Export Survey does not work well in your environment, repeat the test adding “-debuglevel 300” and send the sthsurvey.log [compressed] for analysis.

History and Earlier versions

If the current version of the tool does not work, you can try recently published binary object zip files. At the same time please contact me to resolve the issues. If you discover an issue, try intermediate levels to isolate where the problem was introduced.

itm_sth_survey.0.51000

Initial Level

Photo Note: Sun Pillar during sunset off Carmel Highlands [ref]. Credit to my neighbor  Stephen Adair who took the photograph.

 

Sitworld: Discovering Historical Data Export Problems at Agent

cruise1

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Introduction

ITM 6 has a marvelous ability to collect historical data. Best practice is to collect the historical data at the TEMA or Agent and then export the data to the Warehouse Proxy Agent, which then forwards the data to the data warehouse. With a large number of agents almost anything can go wrong and require fixing. Identifying the problem cases has been challenging. A few years ago a new TEMA attribute group was added to expose the last export status. This can be used in a situation formula to generate alerts on problem cases. This post shows exactly how to do that. An appendix at the end lists all the current status codes. In some cases you can resolve the problem yourself; in other cases IBM Support will be involved.

Step by Step Situation Development

Right click on a TEP navigation node such as Linux OS under a test Linux system. Select Situations… and click on the New Situation action. Enter a situation name in the dialog box. If the Monitored Application shown is not what you want, also set it here.

export1

Next click OK and define the attribute group [ITM Historical Exports] and attribute item [Last Export Status].

export2

Click OK. For the first experiment set the test to be == 0, meaning alert when things are working as expected.

export3

Click on Advanced, Display Item and select Collection Identifier.

export4

Now make sure the situation is distributed to your test system and OK out. The situation should start immediately and in the Situation Event Console you will see

export5

For the next steps you will likely want to test for Last Export Status not equal to zero. Next you will expand the distribution to more agents, like all Linux agents.
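Expressed symbolically, that production formula is roughly the following sketch, using the attribute group and item names selected in the dialogs above:

```
*IF *VALUE ITM_Historical_Exports.Last_Export_Status *NE 0
```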

Last Export Status – what it means

The appendix has a list of all currently known export status values. 0 means success and can be ignored for this purpose.

One common one is 26

CTX_MetafileNotfound,            26

In general this means that historical data was configured, but no data was ever collected. The example studied closely was a Linux LPAR attribute group, but the Linux system being looked at did not have any LPAR capability. For this site, we recommended that the attribute group should not be collected. Another way to avoid alerting would be to extend the formula to exclude the 26 status.

Other errors may lead to obvious conditions – like an inability for the agent to connect to the WPA or maybe a nearly full mount point. In any case you need to investigate and resolve… with IBM Support if needed.

Other possibilities

Another common issue involves unhealthy agents – online but not responding and not running situations. Here is a blog post and program to help track them down:

Sitworld: ITM Agent Health Survey

These unhealthy agents will not be running the historical data UADVISOR situations, which has the same effect; however they will not alert because no situations are running.

Summary

This shows how to create a situation to alert on some Historical Data collection and export problems.

Sitworld: Table of Contents

Appendix 1: Historical Data Export Error codes

KRAHIST – ITM History Export – Last Export Status

CTX_Success = 0,                     0

CTX_InvalidParameter,             1

CTX_InvalidOutputFormat,          2

CTX_NoMemory,                     3

CTX_InternalError,                4

CTX_LogonFailed,                  5

CTX_InvalidUserid,                6

CTX_InvalidPassword,              7

CTX_ConnectionFailed,             8

CTX_TargetTypeUndetectable,       9

CTX_EmailSendFailed = 10,        10

CTX_InvalidRecipType,            11

CTX_SMTPError,                   12

CTX_OLEInitializationFailed,     13

CTX_InvalidInitCall,             14

CTX_SessionStartupFailed,        15

CTX_CMSConnectFailed,            16

CTX_DSError,                     17

CTX_EndOfData,                   18

CTX_InvalidDataType,             19

CTX_ODBCError = 20,              20

CTX_TableNotFound,               21

CTX_ParmDataLengthError,         22

CTX_InvalidParameterFormat,      23

CTX_InvalidExportType,           24

CTX_MetafileFormatError,         25

CTX_MetafileNotfound,            26

CTX_MetafileIOError,             27

CTX_MetafileCloseError,          28

CTX_HistoryIOError,              29

CTX_HistoryFileNotfound = 30,    30

CTX_HistoryCloseError,           31

CTX_SocketIOError,               32

CTX_GetHostnameError,            33

CTX_SMTPConnectError,            34

CTX_GetHostByNameError,          35

CTX_GetServByNameError,          36

CTX_SocketError,                 37

CTX_SocketServerResponseError,   38

CTX_ColumnsNotBound,             39

CTX_ColumnsAlreadyBound = 40,    40

CTX_NotExporterMessage,          41

CTX_InvalidSocketBufferLength,   42

CTX_SocketBufferOverflow,        43

CTX_MessageParseError,           44

CTX_SchemaFormatError,           45

CTX_OLEError,                    46

CTX_RequestRouted,               47

CTX_NoListenTask,                48

CTX_RPCError,                    49

CTX_SpreadsheetNotFound = 50,    50

CTX_IncompatibleSpreadsheet,     51

CTX_NoObjectResolution,          52

CTX_ServerDied,                  53

CTX_PDSLoadError,                54

CTX_PDSNotAvailable,             55

CTX_RPCRequestHandleError,       56

CTX_PropertyNotFound,            57

CTX_NoProperties,                58

CTX_EnvNo,                       59

CTX_InitJVMError = 60,           60

CTX_JavaError,                   61

CTX_JDBCError,                   62

CTX_TempFileError,               63

CTX_ColumnLengthError,           64

CTX_PThreadError,                65

CTX_CfgFileError,                66

CTX_ConfigRecNotFound,           67

CTX_SkipRecord,                  68

CTX_InvalidRPCFunction,          69

CTX_ConfigOpenError = 70,        70

CTX_ConfigCloseError,            71

CTX_RedriveExport,               72

CTX_WarehouseProxyNotRegistered, 73

CTX_GLBUnavailable,              74

CTX_InitializationFailed,        75

CTX_ParseError,                  76

CTX_SQLFileNotFound,             77

CTX_RTNCustIDNotSet,             78

CTX_NotConnected,                79

   CTX_GetJavaVMInitArgsFailed = 80,80

CTX_CreateJavaVMFailed,          81

CTX_GetJavaMethodFailed,         82

CTX_FindJavaClassFailed,         83

CTX_CompressionError,            84

CTX_InvalidOracleODBCDriver,     85

CTX_DataTypenameUnavailable,     86

CTX_RPCInterfaceRegisterError,   87

CTX_RouteNotifyMismatch,         88

CTX_PDSOpenInputNoRecords,       89

   CTX_NotFound = 90,               90

CTX_NameConversionFailed,        91

CTX_NameCompatibility,           92

CTX_HistorySaveError,            93

CTX_HistoryDeleteError,          94

CTX_MetaSaveError,               95

CTX_MetaDeleteError,             96

CTX_PDSSetupError,               97

CTX_DecodeError,                 98

CTX_BeginExport,                 99

   CTX_ExportInProgress = 100,     100

CTX_NoDataFound,                101

CTX_ShutdownRequested,          102

   CTX_UCS2TranslationError = 200, 200

CTX_InvalidDatabaseEncoding,    201

CTX_ColumnNotFound,             202

CTX_ExistAlready,               203

CTX_InvalidList,                204

CTX_Reserved,                   205

CTX_Not_Reserved,               206

   CTX_No_More_Connection = 210,   210

CTX_Not_Initialized ,           211

CTX_DB_Not_Connected,           212

CTX_AggregationTable,           213

CTX_OracleInternalTable,        214

CTX_Cnx_Null,                   215

CTX_ServerTimeout,              216

CTX_ServerTimeout_BeforeCommit, 217

CTX_Table_Altered,              218

CTX_DBError,                    219

CTX_SampleCommitError,          220

CTX_SampleLogStatusError,       221

CTX_QueueNotInitialized,        222

CTX_QueueFull,                  223

CTX_QueueStopped,               224

CTX_JavaEnv_Null,               225

   CTX_NoNeedToRenameFile,         226

CTX_SkipWrite,                  227

CTX_InitializationWarning = 300,300

CTX_EncodingError,              301

CTX_AddressError,               302

CTX_InitFailedRecently,         303

CTX_Table_Deleted,              304

CTX_Wrong_Schema,               305

CTX_RenameConfigFileError,      306

CTX_RemoveConfigFileError,      307

CTX_ConfigFileEmptyError,       308

   CTX_SelectSiteError = 320,      320

CTX_GetCurrentCMSAddressError,  321

CTX_RemoveMetaFileError = 325,  325

CTX_MetaFileSizeError,          326

   CTX_Failed_Batch_Row_Not_Found=350, 350

CTX_RowExistAlready,            351

CTX_ODBC_DSN_Exceed_Max_Len,    352

CTX_Compress_Failed,            353

CTX_UnCompress_Failed,          354

CTX_Compression_Warning,        355

   CTX_TableSpaceNotFound=400,     400

CTX_TableSpaceTooSmall,         401

CTX_TableSpaceNotOnline,        402

CTX_IndexTableSpaceInvalid,     403

   CTX_SequenceNotFound=450,       450

   CTX_HistoryRenameError=500,     500

   CTX_Restore_Primary_WPA=550     550
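When a diagnostic log shows only the numeric return code, it can help to translate it back to the symbolic name in the listing above. A minimal sketch — the dictionary holds just a few of the codes as examples, not the full table:

```python
# Illustrative lookup of a few CTX return codes from the listing above.
CTX_CODES = {
    22: "CTX_ParmDataLengthError",
    30: "CTX_HistoryFileNotfound",
    50: "CTX_SpreadsheetNotFound",
    100: "CTX_ExportInProgress",
    200: "CTX_UCS2TranslationError",
    550: "CTX_Restore_Primary_WPA",
}

def ctx_name(code):
    """Return the symbolic name for a CTX return code, if known."""
    return CTX_CODES.get(code, "unknown CTX code %d" % code)

print(ctx_name(100))  # CTX_ExportInProgress
```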

Photo Note: New Cruise Ship Buffet 2016

 

Sitworld: FTO Configuration Audit

coastal

Version 0.81000 17 March 2017

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

On several recent cases, the hub TEMS randomly became inoperative. After long study and diagnostic data collection, the conclusion was that the Fault Tolerant Option [FTO or Hot Standby or Mirror] configuration was incorrect. In one case several z/OS remote TEMSes were missing the CMS_FTO=YES control. In another, the distributed remote TEMS glb_site.txt file had one entry that pointed to another remote TEMS instead of the two hub TEMSes as required. These efforts took several months to discover and test, so I decided this aspect was ripe for an audit tool. That way any customer can make sure their FTO setup is configured correctly.

Background

FTO works by having two hub TEMSes configured together. At any one time one hub TEMS takes the primary role [the first one to start] and the other hub TEMS takes a backup role. There is a TEMS-to-TEMS conversation, and new user data is propagated from the hub TEMS in primary role to the hub TEMS in backup role. The backup hub TEMS actually accepts remote TEMS and Agent connections, but shortly afterwards tells them to “find another TEMS” and disconnects. At the most recent levels it does not run any situations.

The remote TEMS logic is simpler. First, if FTO is not being used [CMS_FTO=NO or not defined], then at startup the glb_site.txt entries show what hub TEMSes might be there. Each one is tried in turn until a successful connection is made. From then on that is the only hub TEMS that will be connected to, until the next remote TEMS startup.

Second, if FTO is being used [CMS_FTO=YES], the same initial logic is followed to find a working hub TEMS. The difference comes after a loss of the hub TEMS connection: at that time the logic starts looking again for a working hub TEMS. In that way it will find the new hub TEMS in primary role after a switch-over.
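The two behaviors above can be sketched as follows. This is an illustrative simulation of the selection logic, not actual TEMS code; `try_connect` is a hypothetical callback standing in for the real connection attempt.

```python
def pick_hub(glb_site_entries, try_connect):
    """Try each glb_site.txt entry in turn; return the first hub that answers."""
    for hub in glb_site_entries:
        if try_connect(hub):
            return hub
    return None  # no hub TEMS reachable

def on_connection_lost(glb_site_entries, try_connect, cms_fto):
    """What happens after the hub TEMS connection drops.

    CMS_FTO=NO : no new search -- the remote TEMS waits for its next restart.
    CMS_FTO=YES: search the list again, finding the hub now in primary role.
    """
    if cms_fto == "YES":
        return pick_hub(glb_site_entries, try_connect)
    return None

# After a switch-over only nmp182 (the new primary) accepts connections:
up = {"nmp182"}
reachable = lambda entry: entry.split(":")[1] in up
print(on_connection_lost(["ip.pipe:nmp180", "ip.pipe:nmp182"], reachable, "YES"))
# ip.pipe:nmp182
```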

If the FTO configuration is not identical across all hub and remote TEMSes, things won’t work. The big surprise is how badly things fail, including hub TEMS breaking.

The rest of this post presents a new tool which performs all the needed checking and reports on discrepancies. The cases where a manual check is needed are also documented. By using this tool you can validate that the configurations are correct and fix any issues before experiencing outages. Or, if you suspect this issue, you can rule it in or out quickly.

Preparing for the install

Perl is usually pre-installed on Linux/Unix systems. For Windows you may need to install it from www.activestate.com or another source. The program uses only Perl core services; no CPAN modules are needed.

The FTO Audit tool has been tested with:

This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread

zLinux with Perl v5.8.7

This tool runs on the same system as a TEPS connected to the current hub TEMS.

A zip file is found at ftoaudit.0.81000. It contains one file, ftoaudit.pl.

Run Time Options

Options:

-h                           [optional] supply the ITM installation directory if not default. You can also set it before starting ftoaudit.pl [Windows: SET CANDLE_HOME=xxxxx; Linux/Unix: export CANDLEHOME=xxxxx].

-v                           show log messages during process

-debug                    run in debug mode

-debuglevel             default 99. If set to 300, the log file is more detailed.

-work                     default C:\TEMP or /tmp – where to store the report, log, and working files

-o                          default ftoaudit.csv – name of report file

Report Limitations

This logic will recover and cross-check all the environment variable CMS_FTO values.

The glb_site.txt checking works only on Windows/Linux/Unix remote TEMS and only when there is an OS Agent active on the same system.

Any z/OS remote TEMS will need manual checking. The KDCSSITE member is equivalent to the glb_site.txt. KDSENV will contain the CMS_FTO setting, if present.

FTO Configuration Audit Report

Here is a sample report, with interspersed comments.

FTO Configuration Audit Report – Version 0.80000

Primary Hub TEMS – HUB_NMP180

Backup Hub TEMS – HUB_NMP182

==> lists the detected primary and backup hub TEMSes. If this is wrong, maybe the TEPS is not connected to the FTO primary hub TEMS.

Impact,Advisory Code,Object,Advisory

100,CMSFTO1006E,HUB_NMP180,Hub TEMS running FTO some remote TEMS not using same glb_site.txt – see later report

===> See following for list of all advisory messages

Remote TEMS glb_site.txt report

remote_tems,product,osagent,glb_site.txt

REM_NMP183,LZ,nmp183:LZ,ip.pipe:nmp180x|ip.pipe:nmp182|,

REM_NMP184,LZ,nmp184:LZ,ip.pipe:nmp180|ip.pipe:nmp182|,

===> note how the NMP183 has an extra added “x” where I forced an error.

Elapsed Time report hub TEMS 2.82865595817566

tems,var_elapsed,glb_elapsed,

REM_NMP183,2.79777908325195,2.81095504760742,

HUB_NMP180,3.29944014549255,2.84404110908508,

REM_NMP184,2.90863513946533,2.69522094726562,

HUB_NMP182,2.78772282600403,3.08135104179382,

===> The above report section is interesting and may detect cases of high latency between the hub TEMS and other TEMSes. The elapsed time is larger than you might expect because there is a Java startup cost in the KfwSQLClient utility that gets used.

===> The end of the report contains an explanation of the advisory messages.

Advisory Trace, Meaning and Recovery suggestions follow

Advisory code: CMSFTO1006E

Text: Hub TEMS running FTO some remote TEMS not using same glb_site.txt – see later report

Impact: 100

Meaning: In FTO configuration remote TEMSes need to have a

configuration that specifies the two hub TEMS. These two hub

TEMSes are defined during configuration and the result is stored

in the glb_site.txt file.

These files will normally be identical. If they are not identical

then the FTO logic will break.

A following report section will detail the contents of each

glb_site.txt which should be thoroughly reviewed. It is possible

for differences to be present, such as one that uses resolvable

names and others that use ip addresses and all is well. More

commonly one or more is just referencing an incorrect address…

most are OK and some are wrong. In this case FTO logic will

break and this can cause hub TEMS instability and crashes.

Errors in the DNS resolving system or /etc/hosts file could make

the results inconsistent even though they look OK.

The data is available only if there is an OS Agent running on

the same system as the remote TEMS. Otherwise, the remote

TEMS glb_site.txt should be reviewed manually.

Recovery plan: Review the glb_site.txt report and reconcile

any differences. That usually means re-configuring the remote

TEMS.
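The core of the cross-check behind this advisory can be approximated with a simple majority comparison, as in the sketch below. The data mirrors the earlier glb_site.txt report; REM_NMP185 is a hypothetical third remote TEMS added so a majority exists.

```python
from collections import Counter

def check_glb_site(glb_by_tems):
    """Flag remote TEMSes whose glb_site.txt differs from the most common one.

    glb_by_tems maps a remote TEMS name to its glb_site.txt entries.
    Returns the TEMSes that disagree with the majority -- the same
    condition that advisory CMSFTO1006E reports.
    """
    majority, _ = Counter(tuple(v) for v in glb_by_tems.values()).most_common(1)[0]
    return sorted(t for t, v in glb_by_tems.items() if tuple(v) != majority)

# Sample data from the report above: NMP183 has the forced "x" error.
suspects = check_glb_site({
    "REM_NMP183": ["ip.pipe:nmp180x", "ip.pipe:nmp182"],
    "REM_NMP184": ["ip.pipe:nmp180", "ip.pipe:nmp182"],
    "REM_NMP185": ["ip.pipe:nmp180", "ip.pipe:nmp182"],  # hypothetical third remote TEMS
})
print(suspects)  # ['REM_NMP183']
```

Entries that merely differ in form — one using resolvable names, another IP addresses — would need the manual review the advisory describes, since a textual comparison cannot tell them apart.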

Advisory Messages

CMSFTO1001W – Hub TEMS running FTO but no Backup hub TEMS found

CMSFTO1002E – Hub TEMS running FTO but Backup hub TEMS [tems_nodeid] not running FTO

CMSFTO1003E – Hub TEMS running FTO but remote TEMS [tems_nodeid] not running FTO

CMSFTO1004W – Hub TEMS not running FTO but a Backup hub TEMS[tems_nodeid] was found

CMSFTO1005E – Hub TEMS not running FTO but remote TEMS [tems_nodeid] is running FTO

CMSFTO1006E – Hub TEMS running FTO some remote TEMS not using same glb_site.txt – see later report

CMSFTO1007E – TEMS running with KGLCB_FSYNC_ENABLED=0: risk of database file damage and TEMS outage

*note* This is unrelated to FTO but it is concerning on any Linux/Unix system.

In the report itself, if an advisory is produced, the end of the report includes the impact and a discussion and a recovery plan. If this is unclear you can always contact IBM Support.

Summary

Identify and correct FTO configuration issues. If you find any anomalies which are hard to correct, please contact the author.

Versions:

Here are recently published versions. In case there is a problem at one level you can always back up.

ftoaudit.0.81000
Add check for non-TEPS system

Sitworld: Table of Contents

Note: View from Nepenthe Restaurant, Big Sur California

 

Sitworld: Portal Client [TEP] on Windows Using a Private Java Install

MandiAndBanjo

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 28 December 2016 – Level 1.00000

Follow on twitter

Inspiration

Giving Credit! This blog post was created after a customer Nathan Posey at Citibank emailed me the general scheme. They deserve 95% of the credit and I will take any blame.

Once more a client called when the Portal Client [TEP] no longer worked because a new Java RTE was installed by system administrators. This usually takes several days to a week or more to resolve. The underlying issue is that TEP is developed and tested against specific Java RTE levels – usually the more current ones. As time passes new Java RTEs are published and they get installed en masse. When all goes well, no one notices. But when things break they break hard, and that causes a crisis and a support issue for the customer and for IBM. This document shows how to avoid that issue by using a non-installed Java RTE.

If you are interested in reviewing TEP/Java problems ahead of time: Master list of ITM TEP/Java issues.

Security Alert!!!

The technique described here documents a good way to run an ITM Portal Client in Java Web Start mode. That will be standard practice as time goes on because all the major browser vendors and even Oracle are eliminating the browser applet mode of operation from future Java versions.

This technique shows how to deliver and use Java without a Java Windows install. The result is a private version of Java used only for ITM Portal Client. In this mode ITM usage is effectively insulated against system level Java updates.

However this could be considered a security exposure. For Portal Client or TEP a private Java RTE is used to perform the Java Web Start process. It so happens that on Windows Java can be used in this way without going through the install process. The benefit is increased usage stability. On the other hand, this private version of Java may have security problems and it will continue to run even though the system Java has been upgraded. This *might* be considered a security exposure and any customer using this method should clear it with their security team and management. The balancing act here is better end user stability versus the theoretical security problems while this one particular application is running.

If your security folks do not approve this usage, then you will have to endure reduced end user stability.

Creating A Private Java zip file on Windows.

I expect most end user Windows environments will be 64 bit, so the instructions follow that path. The example uses a Java RTE which comes as part of the TEMS media image, but any Java RTE which works with TEP can be used. As always, testing thoroughly is suggested. After a TEPS install on Windows in the default C:\IBM\ITM directory, these files will be seen:

Directory of C:\IBM\ITM\CNB\java

07/04/2015  08:18 AM        85,476,984 ibm-java6.exe

07/04/2015  08:19 AM        65,526,355 ibm-java6.rpm

07/04/2015  08:19 AM        65,441,752 ibm-java6.tgz

07/04/2015  08:18 AM        97,035,904 ibm-java7.exe

07/04/2015  08:18 AM        77,682,924 ibm-java7.rpm

07/04/2015  08:20 AM        81,675,376 ibm-java7.tgz

07/04/2015  08:19 AM       111,716,040 ibm-java7_64.exe

07/04/2015  08:19 AM        84,052,070 ibm-java7_64.rpm

07/04/2015  08:19 AM        89,954,859 ibm-java7_64.tgz

In this case we will be using ibm-java7_64.exe: IBM version of java version 7 (build pwi3270sr9fp40-20160422_01 (SR9 FP40))

This is the exact version that would be installed on an end user system the first time TEP is made use of. Thus this version is definitely one that has been well tested and has a lot of customer experience in the user base.

On a 64 bit Windows system, you run the install by copying that file into the environment and double clicking on it. This could be performed on the system running the TEPS.

If you need to create a 32 bit version, this will need to be performed on a 32 bit Windows system using ibm-java7.exe.

You see the normal sort of dialog boxes for selecting language, target, etc. One early dialog looks like this:

tep1

Later a dialog box shows you the target install directory; note it down.

tep2

When asked

tep3

Click on No!!

When the install is complete, use Windows Explorer to locate the install directory

tep4

Now right-click on the Java70 directory and select Send to/Compressed (zipped) Folder. The compressed folder cannot be in that Program Files directory so select the default target of the desktop. It will look like this.

tep5a

Save that file somewhere because that is what the user will employ during the install.

Installing the TEP Java image – Performed by End User

You will likely tailor these instructions to your company standards… and also set actual target TEPS addresses.

On the end user Windows system, create a directory c:\apps_local. Copy the Java70.zip into that directory. If you can find a way to automate the following process all the better… and please tell me more about it.
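One possible automation of the copy-and-extract steps is sketched below in Python. This assumes Python is available on the end user workstation; the paths follow the instructions here and would need tailoring to company standards.

```python
import os
import shutil
import zipfile

def install_private_java(zip_source, target_dir=r"c:\apps_local"):
    """Copy Java70.zip into the target directory and extract it there,
    reproducing the manual copy and "extract to Java70\\" steps."""
    os.makedirs(target_dir, exist_ok=True)
    local_zip = os.path.join(target_dir, os.path.basename(zip_source))
    shutil.copy(zip_source, local_zip)
    # Equivalent of right-click / Extract to Java70\
    with zipfile.ZipFile(local_zip) as zf:
        zf.extractall(os.path.join(target_dir, "Java70"))
```

Creating the desktop shortcuts is Windows-specific and is left as the manual steps that follow.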

tep6

Next right-click on Java70.zip and select extract to Java70\

tep7

Next create one shortcut for the TEPS target and a second shortcut to access the Java Control Panel.

On the Desktop, right click on an open area and select new and then Shortcut. You will see this

tep8

If you are using exactly the same directory as these instructions, then enter

C:\apps_local\Java70\Java70\jre\bin\javaws.exe

And click Next and enter a name that is probably more meaningful than this example.

tep9

Click Finish and see this

tep10

To complete this process right click on that new icon and select Properties

tep11

To complete the process the Target setting needs to be updated to set the TEPS target. For my test system the result would look like this:

C:\apps_local\Java70\Java70\jre\bin\javaws.exe “http://nmp180.SVL.IBM.COM:1920///cnp/kdh/lib/tep.jnlp”

Then OK out.

There is a second shortcut that is needed. You start exactly the same way but give it the name Java CP – for Java Control Panel. In this case the Target setting is

C:\apps_local\Java70\Java70\jre\bin\javacpl.exe

This will be used to handle cases where the java cache needs to be cleared, or if tracing needs to be set.

If there is more than one TEPS target, create a separate icon for each.

Note: The cache seems to be saved here for the IBM Java if you need to look at traces/logs:

C:\Users\<USERID>\AppData\LocalLow\IBM

Summary

This document shows how to install a private version of the Java RTE to run a tested version of Portal Client or TEP. When this has been set up, the system Java can be updated without losing access to Portal Client.

Sitworld: Table of Contents

History and Earlier versions

1.00000

Initial publication

Photo Note: Mandi and Banjo – December 2016 – Mandi is short for Mandolin [thanks PV]

 

Sitworld: TEMS Database Repair

ginny1

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #5 – 10 December 2018 – Level 1.04000

Follow on twitter

Introduction

The TEMS database tables are used to store user data such as situation descriptions and distribution definitions. They also keep running data such as current situation status on agents. There are many more internal and functional tables.

When the files holding the data are damaged, the TEMS usually malfunctions. Over the years there have been many reasons for such damage. Here are some examples:

  1. TEMS exception and failure.
  2. File system full.
  3. Unwise manual changes or restoring from a backup that wasn’t taken correctly.
  4. Power outage without any UPS backup.
  5. SAN [Storage Area Network] device failure.
  6. System shutdown without stopping the TEMS.
  7. Many unexplained instances.

Hub TEMS Recovery Attempt WARNING!!!

A primary hub TEMS is the repository of fundamental user data, and any recovery of that is a delicate operation which can easily result in a reinstall and significant downtime. Please work with ITM support in planning a hub TEMS data recovery. A remote TEMS can be recovered quite simply, as can an FTO mirror hub TEMS.

In addition you should have a Backup/Recovery plan for hub TEMS data. See this document for five different ways to accomplish this goal. A simple backup of the files while the TEMS is running is inadequate and can lead to significant downtime. These are hot database files and many constantly change and are tightly connected.

Non-Hub TEMS Recovery

The process is very simple although it varies by platform [hardware and operating system] and by TEMS maintenance level. From a high level view you stop the TEMS [if running], replace the database files with emptytable files, then start up the TEMS and let the hub TEMS refill it with correct data naturally. References to the files follow. They are not exactly empty: at the very least they contain an “end of objects” record, and some are pre-loaded with data. The ones here were accumulated from install media builds for ITM 620, ITM 621, ITM 622, ITM 623 and ITM 630. They are the exact files you would lay down during a new TEMS install.

There are three types of files:

  1. Bigendian – for Unix [AIX/Solaris/HPUX] and Linux on Z
  2. Littleendian – for Linux/Intel and Windows
  3. VSAM – z/OS index sequential file

The references here are to a zip file for each maintenance level. Each zip file contains a bigendian.tar file [Unix and zLinux], a littleendian.tar file [for Linux/Intel] and a littleendian.zip file for Windows. The last two contain identical files but are packaged differently for convenience. With z/OS the story is quite different; see later.

  1. ITM620_emptytables
  2. ITM621_emptytables
  3. ITM622_emptytables
  4. ITM623_emptytables
  5. ITM630_emptytables

Windows Recovery for non-hub TEMS

  1. Select the correct maintenance level and load the proper zip file from the links above. Unzip that file and you will use the .zip file included.
  2. Unzip that file into some convenient directory – we will assume C:\TEMP but it can be anyplace. You will see a lot of QA1*.DB files and QA1*.IDX files.
  3. Stop the TEMS
  4. Copy the files, for example [adjust for actual install directory]

cd c:\IBM\ITM\cms
copy c:\temp\QA1*.*

You could also use Windows explorer. You may also wish to make a safety copy of those files.

  1. Start the TEMS
  2. Monitor for correct operation.
  3. Recovery complete

Linux/Unix Recovery for non-hub TEMS

  1. Select the correct maintenance level and load the proper zip file from the links above. Most environments will have a gunzip command. If not, you can unzip on some convenient Windows workstation.
  2. Select the proper endian type. Bigendian is for all Unix and Linux on z systems. Littleendian is for all Linux/Intel systems. For this example we use Linux at ITM 630; the file is ITM630_emptytables.littleendian_linux_intel.tar and it is assumed to be copied to /opt/IBM/ITM/tmp
  3. Move that littleendian file to the system where the TEMS runs and un-tar it:

    cd /opt/IBM/ITM/tmp

    tar -xf ITM630_emptytables.littleendian_linux_intel.tar

    This will create many QA1* files

  4. At this point you have to determine the attributes/owner/group of the current TEMS files. You could do that with this command:

    ls -l /opt/IBM/ITM/tables/<temsnodeid>/QA1CSTSH.DB

    which in my zLinux test environment looks like this:

    nmp180:~ # ls -l /opt/IBM/ITM/tables/HUB_NMP180/QA1CSTSH.DB

    -rwxr-xr-x 1 root root 35274789 Nov 14 21:03 … QA1CSTSH.DB

    [Above line shortened for display purposes.]

  5. Next change the un-tar’d files to match what is currently being used and what the TEMS expects. Remember the following is just an example from my environment; run the commands appropriate to your actual environment:

    cd /opt/IBM/ITM/tmp

    chmod 755 QA1*.*

    chown root QA1*.*

    chgrp root QA1*.*

  6. Next stop the remote TEMS or FTO mirror hub TEMS
  7. Next copy the emptytable files into the directory where the stopped TEMS expects them:

    cd /opt/IBM/ITM/tables/<temsnodeid>

    cp /opt/IBM/ITM/tmp/QA1*.* .

    Note the trailing period, which means copy to the current directory.
  8. Next start the remote TEMS or FTO mirror hub TEMS
  9. Monitor for normal operations
  10. End of recovery
  11. Warning for the FTO mirror hub TEMS: When performing this operation *always* start the primary hub TEMS first [if not already running]. The refreshed FTO mirror hub TEMS must be started second. If that rule is violated the primary hub TEMS will have all custom objects deleted. Don’t do that.
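Steps 4 and 5 — matching the mode, owner and group of an existing TEMS file — can be scripted rather than done by hand. A sketch only, assuming a Linux/Unix system and enough authority to chown; the paths in the comment are from the walkthrough above.

```python
import glob
import os
import stat

def match_attributes(reference_file, new_files_glob):
    """Give the un-tar'd emptytable files the same mode/owner/group as an
    existing TEMS database file (the manual chmod/chown/chgrp steps above)."""
    ref = os.stat(reference_file)
    mode = stat.S_IMODE(ref.st_mode)
    for path in glob.glob(new_files_glob):
        os.chmod(path, mode)
        os.chown(path, ref.st_uid, ref.st_gid)

# Example (adjust <temsnodeid> for your environment):
# match_attributes("/opt/IBM/ITM/tables/HUB_NMP180/QA1CSTSH.DB",
#                  "/opt/IBM/ITM/tmp/QA1*")
```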

z/OS recovery for non-hub TEMS

Please note: this is hardly ever needed. The last PMR I worked on *looked* like it was needed but the symptom was actually a harmless TEMS message [actually a defect] that complained about a table… and there was no actual problem at all! So I expect it is very rare to have to do this procedure.

Always involve IBM Support if you have any uncertainty at all in this process. Also, if you *think* you know more about z/OS than the author – you are very likely correct!!

z/OS recovery example with ICAT configuration

The following uses QA1CSTSH as an example.

1) Stop the TEMS task

2) Delete or rename the QA1CSTSH VSAM dataset. If unsure, examine the Joblog output to determine the complete dataset name.

3) Proceed to ICAT and navigate to the ‘Runtime Environments’ panel (KCIPRTE)

4)  Place a ‘B’ next to the RTE [Run Time Environment] that contains the TEMS that owns the file you wish to recreate.

5)  That will generate the DS#1xxxx job which should then be submitted.

6) The job will detect the file that is missing and recreate ONLY that file.

7) The job should complete with condition code zero

8) The TEMS can then be started.

z/OS recovery with PARMGEN configuration

The general idea is the same as ICAT.

For steps #3 – #7, you can substitute the similar instructions here. That documents how to reallocate PDS files, but the path followed is the same. Following are some notes from the PARMGEN expert.

The job would vary – you can use KDSDELJB as a model job that has the deletes but only make it specific for RKDSSTSH VSAM

(//QA1CSTSH DD DISP=SHR,DSN=&RVHILEV..&SYS..RKDSSTSH.)

Submit the composite KCIJPALO job same as in the doc., and for the standalone job, refer to the PARMGEN KDSDVSRF – needs to be modified of course.

Hub TEMS – if you absolutely have no choice

There are many TEMS hub database tables which you can reset only by losing significant data and undergoing a long manual reinstall and rebuild. This could mean a week or more of outage. It is very important to involve IBM Support if you have any doubts at all.

However there are a few tables which can be reset with no real impact. These 5 sets of tables contain internal processing data, not user data.

  1. TSITSTSC – QA1CSTSC: The Situation Status Cache which is reused every time the TEMS starts.
  2. TSITSTSH – QA1CSTSH: The Situation Status History. This is an intermediate file where situation event status collects. It is a wraparound table and defaults to 8192 rows. At hub TEMS startup all the remote TEMSes and agents [if directly connected] send current status, so you only miss situation status history after a reset. Since there are no ITM functions which display or use the history, nothing much is lost by resetting it to emptytable status.
  3. [several tables] – QA1CDSCA: This is the combined catalog table. If this is reset to emptytable status, at TEMS startup the pre-defined data is updated based on the existing package [like klz.cat] files. Therefore it can be reset to emptytable status and nothing is lost. As a minor point, TEMS has an extremely hard limit of 512 packages. At 513 the TEMS will crash and not come up. It is pretty rare but definitely something to keep aware of. Should you encounter this issue, you will have to remove one or more .cat [and the paired .atr] file to get the total down to 512 packages or below. If you encounter this limit see Sitworld: Attribute and Catalog Health Survey which will calculate what packages are no longer being used.
  4. SITDB/TOBJCOBJ – QA1CRULD/QA1CCOBJ: These tables are created dynamically as situations are started. SITDB contains the SQL representing the situation. TOBJCOBJ records how situations are related to each other. In any case the data is created dynamically as situations started. Both need to be reset to emptytable status at the same time.
  5. TNODESAV – QA1DNSAV: This records the current agent registrations – the node or managed system names. When agents connect the data is rebuilt, along with any missing data in the TNODELST table. This sometimes shows as advisories in Database Health Checker reports, and the agents affected do not actually run situations. One factor to consider is that agents which are temporarily offline will no longer be in the table. When they do connect again they will be present as usual. If that is important you should capture that information before performing the replacement.

In each case you would do the same as a complete replacement but only handle the QA1*.DB and QA1*.IDX file.

Backup/Recovery best practice

The following document was co-authored with an L3 TEMS engineer and represents the best current thinking. It gives five ways to create a valid, useful and reliable backup of the TEMS database files.

Best Practice TEMS Database Backup and Recovery

Summary

This document shows how to repair many cases of damaged TEMS database files.

Sitworld: Table of Contents

History and Earlier versions

1.00000
Initial publication

1.01000
Correct credit name for photo

1.02000
Add information about two more tables that can be reset to emptytable status at the hub TEMS.

1.03000
Add warning about not starting the refreshed FTO mirror hub TEMS first.

1.04000
Rename the emptytable files including the platform type – to reduce mistakes.

Photo Note: Ginny – A magnificent Maine Coon Cat that lives in Germany [thanks to IBMer Jens Helbig]

 

Sitworld: The Encyclopedia of ITM Tracing and Trace Related Controls

JealousOctopus (1)

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #7 – 7 September 2017 – Level 1.07000

Follow on twitter

Introduction

ITM tracing is at once the most tiresome of topics and sometimes the most important. This post gathers everything I have collected and discovered over the years. Expect many additions and corrections over time.

Chapter 1 is dedicated to controlling diagnostic log file sizes.

Chapter 2 is dedicated to describing operation log files – what they are and where they are found.

Chapter 3 defines the diagnostic string.

Chapter 4 defines communication tracing with an important warning about that.

Chapter 5 presents defining static traces.

Chapter 6 presents tacmd settrace.

Chapter 7 presents the Service Console method of setting and removing diagnostic traces.

Chapter 8 presents a z/OS only method of setting and removing diagnostic traces.

Chapter 1 – Control of diagnostic log file size and location.

ITM diagnostic log file size and location controls differ by platform [Linux/Unix, Windows, z/OS and i/5]. The diagnostic log contains detailed process information. By default – when the control is set to ERROR – you see error messages and any information level messages. The error messages do not always mean an actual error condition. The entire goal is to help IBM Support understand product problems. This chapter shows how to control the size and location.

All of the examples assume the default install location. In practical usage you will specify the directories actually chosen for the particular installation.

Linux/Unix

Following is the best practice from ITM 623 GA onward. Earlier best practice is described at the end.

  1. In the /opt/IBM/ITM/config directory create a file xx.environment which has the same attributes/owner/group as xx.ini. For example, lz.environment and lz.ini or ms.environment and ms.ini. If one already exists, just use it. Here are example commands to create a new such environment file.

    • cd /opt/IBM/ITM/config
    • ls -l ms.ini which showed  -rw-rw-rw-    1 itmuser  staff          2565 Aug  5 21:17 ms.ini
    • touch ms.environment
    • chmod 666 ms.environment
    • chown itmuser ms.environment
    • chgrp staff ms.environment
  2. These environment variables are applied just before the start process completes and will take effect without any reconfiguration. A recycle is needed of course. When testing is over they can be commented out or deleted, or the whole file deleted.
  3. Other environment variables can be specified, but we are working here on just KBB_RAS1_LOG, where characteristics like diagnostic log segment sizes are defined. If this is absent, ITM has some built-in defaults.
  4. Following are the environment variable lines to enable KBB_RAS1_LOG. Further explanations follow. The lines are separated by blank lines here for clarity, but blank lines are not required.

CTIRA_LOG_PATH=${CANDLEHOME}/logs

KBB_VARPREFIX=%

RHN=`hostname|cut -d. -f1`

PRODUCTCODE=ms

KBB_RAS1_LOG=%(CTIRA_LOG_PATH)/${RHN}_${PRODUCTCODE}_%(syspgm)_%(sysutcstart)-.log INVENTORY=%(CTIRA_LOG_PATH)/${RHN}_${PRODUCTCODE}_%(syspgm).inv COUNT=16 LIMIT=20 PRESERVE=1 MAXFILES=64

Explanations:

  1. CTIRA_LOG_PATH needs to be set to the directory where the logs will be kept. One large customer uses /var/opt/IBM/ITM/logs
  2. KBB_VARPREFIX needs to be set to allow ITM basic services to recognize substitutions. If % is already in use [such as in LDAP environment variables] you can use ! or %%.
  3. RHN calculates the hostname the same way normal configuration does. It avoids the need for a reconfigure in the process. By rights this would be RUNNINGHOSTNAME but one problem management system complained about that usage. Since it is calculated dynamically the actual name doesn’t matter.
  4. PRODUCTCODE should be set to the product code, like lz for Linux OS Agent or ms for TEMS.
  5. Following are the elements of the KBB_RAS1_LOG. Some are the environment variables specified above and are not commented on further.
  6. COUNT: number of diagnostic log segments, 16 maximum. I consider 3 a logical minimum.
  7. LIMIT: size in megabytes of each diagnostic log segment. I once had to use LIMIT=200 and COUNT=16 to capture 24 hours of logging.
  8. PRESERVE=1 makes sure the first segment is preserved
  9. MAXFILES: total number of files to be kept. This *must* be larger than COUNT
  10. %(syspgm) will be replaced by the main program name like klzagent or kdsmain
  11. %(sysutcstart) will be replaced with the start epoch UTC time in seconds
  12. The maximum space used will be MAXFILES*LIMIT in megabytes.
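Putting those pieces together, here is a sketch that creates such an ms.environment file. It writes to a /tmp/demo-config stand-in directory rather than /opt/IBM/ITM/config so it can be tried anywhere; with the COUNT=16 LIMIT=20 MAXFILES=64 values above, the maximum space used would be 64*20 = 1280 megabytes.

```shell
# Sketch only: write the environment file into a stand-in directory.
# In a real install the target is /opt/IBM/ITM/config/ms.environment.
CONFIGDIR=/tmp/demo-config
mkdir -p "$CONFIGDIR"
# Quoted heredoc: ${...} and %(...) reach the file unexpanded,
# exactly as ITM basic services expect to see them at startup.
cat > "$CONFIGDIR/ms.environment" <<'EOF'
CTIRA_LOG_PATH=${CANDLEHOME}/logs
KBB_VARPREFIX=%
RHN=`hostname|cut -d. -f1`
PRODUCTCODE=ms
KBB_RAS1_LOG=%(CTIRA_LOG_PATH)/${RHN}_${PRODUCTCODE}_%(syspgm)_%(sysutcstart)-.log INVENTORY=%(CTIRA_LOG_PATH)/${RHN}_${PRODUCTCODE}_%(syspgm).inv COUNT=16 LIMIT=20 PRESERVE=1 MAXFILES=64
EOF
chmod 666 "$CONFIGDIR/ms.environment"
wc -l < "$CONFIGDIR/ms.environment"
```

After placing the real file in the config directory, a recycle of the process puts it into effect.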

Before ITM 623 GA, best practice was as follows:

  1. Create a file xx.override in /opt/IBM/ITM/config with the same attributes/owner/group as xx.ini.
  2. Put within that file the change you want to introduce. Compared to modern setups the values need to be specified within single quotes, such as KBB_VARPREFIX='%'
  3. Instead of calculating RUNNINGHOSTNAME and PRODUCT on the fly, instead figure out what they should be and add them to the KBB_RAS1_LOG definition.
  4. In the xx.ini file add the following line: . /opt/IBM/ITM/config/xx.override – This is known as a source include definition.
  5. For many agents this is sufficient. TEMS is an instanced process and more work is needed.
  6. The first method is to do a ./itmcmd config -S -t <temsname> and accept all defaults.
  7. The second method is to add the source include line into the hostname_ms_temsname.config file at the end.
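As a sketch of this older method (the host name myhost and product code lz are made-up values you would determine for your own system, and /tmp/demo-config-622 stands in for /opt/IBM/ITM/config):

```shell
# Sketch only: pre-ITM-623 override file with single-quoted values.
CONFIGDIR=/tmp/demo-config-622
mkdir -p "$CONFIGDIR"
cat > "$CONFIGDIR/lz.override" <<'EOF'
KBB_VARPREFIX='%'
KBB_RAS1_LOG='/opt/IBM/ITM/logs/myhost_lz_%(syspgm)_%(sysutcstart)-.log COUNT=16 LIMIT=20 PRESERVE=1 MAXFILES=64'
EOF
# The source include line added at the end of lz.ini:
echo ". $CONFIGDIR/lz.override" >> "$CONFIGDIR/lz.ini"
tail -1 "$CONFIGDIR/lz.ini"
```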

Windows

Windows diagnostic log segments have the same form as Linux/Unix.

Tracing is defined via the Manage Tivoli Enterprise Monitoring Services [MTEMS] GUI.

  1. Right click on agent [such as TEMS or Windows OS Agent]
  2. Select Advanced
  3. Select Edit Trace Parms…

There are entry areas for

Maximum Log Size Per File [LIMIT]

Maximum Number of Log Files Per Session [COUNT]

Maximum Number of Log Files Total [MAXFILES]

As usual a recycle is needed to implement.

z/OS

The diagnostic log information is included in the RKLVLOG SYSOUT file.

When intensive tracing is configured this can grow to a large size. If that happens, the following command will close the current SYSOUT file and start another. In that way the just-closed log can be captured to a disk file, often performed using SDSF to “print” it to a disk file after the switch.

    /f cmstask,TLVLOG SWITCH

The CLASS option can be used to specify the output class of the new sysout file.

    /f cmstask,TLVLOG SWITCH CLASS=W

z/OS logs are not configurable with KBB_RAS1_LOG. Configuration variables are kept in the RKANDATU PDS member KxxENV.

i/5

The i/5 platform [previously AS/400] uses the following form for KBB_RAS1_LOG. This environment variable is placed in QAUTOTMP/KMSPARM file member KBBENV.

 KBB_RAS1_LOG=

 (QAUTOTMP/KA4AGENT01 QAUTOTMP/KA4AGENT02 QAUTOTMP/KA4AGENT03)

 INVENTORY=QAUTOTMP/KA4RAS.INV

 COUNT=3

 LIMIT=5

 PRESERVE=1

 MAXFILES=20

KBB_RAS1_LOG is specified as one long line although it is presented here on separate lines for clarity.

The meaning of the controls is identical to Linux/Unix.
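Reassembled onto the single line actually coded in the KBBENV member, the example above reads:

```
KBB_RAS1_LOG=(QAUTOTMP/KA4AGENT01 QAUTOTMP/KA4AGENT02 QAUTOTMP/KA4AGENT03) INVENTORY=QAUTOTMP/KA4RAS.INV COUNT=3 LIMIT=5 PRESERVE=1 MAXFILES=20
```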

Chapter 2 Operations log

The operations log is a high level view of ITM operations. It bears a close relationship to the ITM Universal Message Console. Universal Messages are written into a wrap-around in-storage table and are also written to the disk log for later analysis. The interesting benefit here is that you can write situations against the Universal Message Console. Each ITM process has a universal message console and if you want more details see this document:

Viewing the Universal Message Console (UMC) in ITM 6.x

Linux/Unix ITM Operations Log

The TEMS operations log is contained in the /opt/IBM/ITM/logs directory and has the name format

<hostname>_ms_<epoch>.log

Where

<hostname> names the server the TEMS is running on

<epoch> is a decimal number corresponding to the POSIX epoch – the number of seconds since 1 January 1970, not counting leap seconds – at the time the log started. Diagnostic logs use a hex representation of the same epoch value.
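For example, a hypothetical log named myhost_ms_1385496500.log can be decoded with standard tools, and the same instant rendered in the hex form used by diagnostic log names:

```shell
# Extract the decimal epoch from a (made-up) operations log name
logname="myhost_ms_1385496500.log"
epoch=$(echo "$logname" | sed 's/.*_ms_\([0-9]*\)\.log$/\1/')
echo "$epoch"
# Hex form of the same epoch, as seen in diagnostic log names
printf '%X\n' "$epoch"
# Human readable time (GNU date syntax; BSD date uses -r instead of -d)
date -u -d "@$epoch" 2>/dev/null || date -u -r "$epoch"
```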

ITM Agent operations logs on Linux/Unix are in the same /opt/IBM/ITM/logs directory and have the name

<agent_name>.LG0

Where

<agent_name> is the managed system name [or Agent Name]

During each startup, the current <agent_name>.LG0 is copied to <agent_name>.LG1.

The Warehouse Proxy Agent operations log is not kept as a disk file. Those logs are preserved in a data warehouse table.

Windows ITM Operations Log

The Windows TEMS operations log is in the C:\IBM\ITM\cms directory and has the name kdsmain.msg.

The Windows Agent operations log has the same name format as Linux/Unix except that any colons [:] in the agent name are converted to underscores. That is required because the Windows operating system does not permit colons in file names. The location depends on many factors including 32 versus 64 bit agent instance.

z/OS TEMS and Agent operations log

There is no separate operations log on z/OS. Instead it is inter-mixed with diagnostic log lines in the RKLVLOG SYSOUT file.

i/5 Agent operations log

The operations log information is kept in a message queue

QAUTOMON/KMSOMLOG

Chapter 3  ITM Diagnostic String Format and Explanation

In the next chapter we show how to implement diagnostic traces but first let’s talk about the strings that define the diagnostic trace.

Every diagnostic string should start with

error

That is the default and is enough to understand many issues. In the diagnostic log you will see informational and “error” class messages. Text not within quotes is case independent: you can code that as error or ERROR and it means the same exact thing.

When you need more information one or more series of trace controls are added. Here is one example

(unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)

Each control extends how much will be traced. The unit: setting relates to the source name from which the binary was compiled. There is a kpxrpcrq.cpp source program. The name can be contracted – for example unit:kpx would mean all source units that start with “kpx”. The Entry restricts the tracing to one function name. The strings that follow represent the sort of messages desired. More about that in a minute.

There is a rarely used control called comp: for component. Here is an example:

(comp:kde,unit:kdeprxi,Entry="KDEP_ReceiveXID" all er)

Some source units are tagged with a component. The above means component kde [remote procedure call functions] and the rest as usual. Incidentally this last could also be coded

(comp:kde,unit:kdeprxi,Entry:"KDEP_ReceiveXID" all er)

I just happen to view = as more pleasing in this context.

Error classes are the following:

All, Error, Flow, Time, State, Input, Output, Metrics, Detail, Any

Each can be contracted to the first two characters.

These are not documented completely in any product documentation. The exact usage depends entirely on how the developers used the setting in creating the product code. The following is very general and close to reality.

All means the maximum number of messages.

Any means none of the messages.

Error means an error condition was observed. That might or might not mean a product error! I often explain to people that diagnostic and operations logs are used to do diagnosis on customer problems. The logs themselves rarely mean much without knowledge of the customer observed issue. There is a project TEMS Audit which suggests possible issues based on TEMS diagnostic logs.

To conclude this chapter here is a hub TEMS workload/heartbeat diagnostic string:

error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er) (unit:kshdhtp,Entry="getHeaderValue" all) (unit:kshreq,Entry="buildSQL" all) (unit:kfastpst,Entry="KFA_PostEvent" all er) (unit:kdssqprs in metrics er) (unit:kdsstc1,Entry="ProcessTable" all er) (unit:kraafira,Entry="runAutomationCommand" all) (unit:kglhc1c all)(UNIT:kfaprpst ST ER) (comp:kde,unit:kdebp0r,Entry="receive_vectors" all er) (comp:kde,unit:kdeprxi,Entry="KDEP_ReceiveXID" all er)

Each section adds an additional source section and type of diagnostic message. They are all ORed together.

You can put spaces between the () sections or not depending on what you prefer.

Efficiency

In general tracing is very efficient. That said, one customer set KBB_RAS1=ALL on a remote TEMS and the remote TEMS became unstable. Usually IBM Support will specify a trace for a particular service. Tracing also gets used to capture performance data.

In some appendixes near the end there are multiple examples of useful traces.

Chapter 4 Communications Tracing – and Why to Avoid!!

This deserves a separate chapter for two reasons.

1) Communications tracing is usually misused and often delays diagnostic progress for days, weeks or months.

2) The style of setting communications tracing is wildly different than the style just documented.

Never do Communication Tracing Unless Specifically Instructed

Many ITM issues present as communication issues. Remote TEMSes disconnect from hub TEMS, agents switch from one remote TEMS to another, agents go on and offline, new agents cannot connect to a TEMS, and the list goes on. These cases are almost never a communication issue!! Most of the time these are workload or configuration issues. That may seem counter-intuitive but it is an everyday fact.

There have been some thousands of ITM PMRs a year since 2004. As a guess without research perhaps 70,000 from 2004 to 2017.  Perhaps 5,000 have had some apparent communications issue. And yet I have been aware of maybe 50 actual communications problems. If you double that to include cases I was not aware of, that means 100 – maybe one a month. That means 98% of the time there are no underlying communications issues and that is probably an underestimate.

When you add communications tracing beyond the default, most of the diagnostic trace log slots are used for that purpose. You cannot see diagnostic log entries that might inform on what is happening. The diagnostic log segments wrap around, and the communications trace entries often cause the segments to wrap in 10-20 seconds or less. You have a much smaller window of time and that might not include the time of most interest.

In proper communications tracing you need tracing at each point of the ITM conversations – like the agent side and the TEMS side. You also usually need a TCP packet trace on each side, a substantial effort. Turning on communication tracing on one side rarely adds useful information and prevents recording of useful information. It is 100% wrong to turn on extended communication tracing unless it has been specifically requested by IBM Support L3 or Development or by an experienced support IBM engineer.

Turn added communication tracing on and you delay resolution 98% or more of the time. Please take this into consideration before wasting your and IBM’s valuable time and delaying resolution of your problem. If another IBMer asks you to do that, push back unless the request comes from an experienced source. If a colleague suggests it, resist unless you can do all the needed work and get the go-ahead from experienced IBM Support engineers.

Communications tracing Details

Tracing communications involves a large suite of interconnected functions. Because of this the usual practice is to turn them on at different levels at once. There are 3 environment variables involved and one super environment variable that can turn them all on or off together.

KDC_DEBUG: Low level TCP logic

KDE_DEBUG: Remote Procedure Call logic

KDH_DEBUG: Internal Web Server logic

KBS_DEBUG: the logical combination of the above three at once. This is not available in all circumstances. I think it was introduced in ITM 621 and some cases still don’t have it.

Possible Settings

N – Normal. This default level gives enough information about status changes and errors to track what is happening.

Y – Yes. Adds more details to understand the sequence of logic.

I – Inhibit. No communication tracing at all.

V – Verbose. More details

S – State. State changes

T – Trace.

M – Maximum. Adds more. This can be useful if you have a mysterious “failure to resolve DNS name” condition. It will show you that name and then you can track it down in the configuration.

D – Detail.  Adds dumps of internal control blocks and packets

A – All. Adds “private” functions which are normally suppressed as trivial and uninteresting.

In later chapters we will see how to change these settings statically or dynamically.

Chapter 5 Implementing Diagnostic Trace Controls – Statically

Modern environment files – Linux/Unix

This was introduced at ITM 623 GA – 2011/8/31.

To implement this tracing control, which tracks incoming agent results

error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)

Using the technique from Chapter 1 create a file xx.environment. Add this line

KBB_RAS1=error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)

Note that no quoting is needed around that string.

The one exception is the Warehouse Proxy Agent – HD – before ITM 630 FP7. In that case the ini file is needed.

If you are using communication tracings, this is where they are added, like

KBS_DEBUG=Y

Update ini files – Linux/Unix

At all levels you can add the line

KBB_RAS1=error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)

to the xx.ini file. That is the only thing needed for normal agents. Each time the agent is started [or stopped] the .config file is recreated.

If you are using communication tracings, this is where they are added, like

KBS_DEBUG=Y

For instanced processes like TEMS you need to update the config file. Do that by updating the xx.ini file and reconfiguring the process. You can also update the config file directly for instanced agents. For example, on a recent TEMS you can add this to ms.config or, on earlier levels, hostname_ms_temsname.config

KBB_RAS1='error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er) '

Note the single quotes. That is because this is logically a Korn shell file inclusion. Make sure it follows any existing KBB_RAS1 setting.

If you are using communication tracings, this is where they are added, like

KBS_DEBUG='Y'

The advantages of updating xx.environment are 1) no reconfigure is needed, just a process recycle and 2) such updates are automatically preserved across any ITM maintenance process.

Windows Static Trace Setting

Using MTEMS, right click on the process line, select Advanced, select Edit Trace Parms…

In the dialog box enter this in the Ras1 Filter box

error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)

Note that there is no explicit KBB_RAS1 here. If you include it things won’t work.

In that dialog box you can alter KDC_DEBUG settings. For the others, add them using Advanced/Edit Variables…

i/5 Agents

Add the following line to  QAUTOTMP/KMSPARM file member KBBENV and recycle the agent

KBB_RAS1=error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)

If you are using communication tracings, this is where they are added, like

KBS_DEBUG=Y

z/OS

Add the following line to RKANDATU PDS member KxxENV, where xx represents the ITM process. For example KMSENV for TEMS or KD5ENV for the DB2Plex agent

KBB_RAS1=error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)

If you are using communication tracings, this is where they are added, like

KBS_DEBUG=Y

Chapter 6 Implementing Diagnostic Trace Controls – settrace

From ITM 623 FP2 new tacmd functions were introduced:

tacmd settrace – set trace at a managed system

tacmd listtrace – display current settrace settings

Benefits:

  1. No need for direct access to the system where the ITM process is running
  2. The initial userid validation serves instead of needing a per system validation.
  3. Works transparently across all firewalls, KDE_Gateways and ephemeral connections.

Requirements: All links must be at ITM 623 FP2 or later.

With this you can alter six tracing properties

KBB_RAS1

KDC_DEBUG

KDE_DEBUG

KDH_DEBUG

KLX_DEBUG [z/OS only]

KBS_DEBUG

First do a tacmd login to the hub TEMS. A typical command is

Linux/Unix

cd <installdir>/bin

./tacmd settrace -m <managed_system_name> -p KBB_RAS1 -o 'error '

Windows

cd <installdir>\bin

tacmd settrace -m <managed_system_name> -p KBB_RAS1 -o "error "

The reset option changes the tracing property back to what it was before the most recent settrace

Linux/Unix:  ./tacmd settrace -m <managed_system_name> -p KBB_RAS1 -r

Windows:  tacmd settrace -m <managed_system_name> -p KBB_RAS1 -r

Here is an example from earlier:

Linux/Unix

./tacmd settrace -m <managed_system_name> -p KBB_RAS1 -o 'error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)'

Windows looks more complicated. The surrounding quotes are double quotes " and any embedded double quotes must be tripled.

tacmd settrace -m <managed_system_name> -p KBB_RAS1 -o "error (unit:kpxrpcrq,Entry="""IRA_NCS_Sample""" state er) "

Here is a basic workload trace with heartbeat observation at hub TEMS

./tacmd settrace -m <hubtemsnodeid> -p KBB_RAS1 -o 'error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)(unit:kshdhtp,Entry="getHeaderValue" all) (unit:kshreq,Entry="buildSQL" all)(unit:kfastpst,Entry="KFA_PostEvent" all er)(unit:kdssqprs in metrics er)(unit:kdsstc1,Entry="ProcessTable" all er)(unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)(UNIT:kfaprpst ST ER)(comp:kde,unit:kdebp0r,Entry="receive_vectors" all er)(comp:kde,unit:kdeprxi,Entry="KDEP_ReceiveXID" all er) '

The tacmd settrace is a tremendous improvement compared to what was done earlier. If you need continued tracing, set a static trace and also issue a tacmd settrace to start tracing immediately. That way tracing will continue after a process restart.

The major case where tacmd settrace is not useful is when a condition occurs so early there is no time to get the trace specified before the condition occurs. In that case a static trace is the only possible method.

Communication Trace controls change dynamically

Linux/Unix: ./tacmd settrace -p KBS_DEBUG -o 'Y'

Windows: tacmd settrace -p KBS_DEBUG -o "Y"

Chapter 7 Implementing Diagnostic Trace Controls – earlier methods

The major earlier method is “Monitoring Service Console”. You start the process by a web browser session to the system where the ITM process of interest is running.

http://hostname:1920

https://hostname:3661

This will present a list of services. The one of interest is IBM Tivoli Monitoring Service Console. Click on that line. At that point you need to enter a userid/password which is validated using “native” services – appropriate to the system platform. For example, if the system is Windows, it must be a valid userid/password on that system.

When you arrive at the service console screen there is an area to compose or copy/paste in commands and then press the Submit button. The commands are

ras1 – help function

ras1 set – turn on/off tracing

The format looks a lot like above. Here is an example

ras1 set error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)

One thing that is different is how to turn such a trace off: “any” turns everything off, and then er [error] is added to turn error tracing back on.

ras1 set error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" any er)

ras1 list – list current tracing

You can also set KDC_DEBUG/KDE_DEBUG/KDH_DEBUG. It does NOT recognize KBS_DEBUG.

bss1 config KDC_DEBUG=Y

bss1 config KDC_DEBUG=N

Problems with the Service Console

First you need password and userid credentials on the target system. In large environments that may not be available.

Second, you need communication access. If firewalls are involved that can be quite complicated. If there are several ITM processes, each one will have an internal web server. Only one gets to own port 1920 [default]. The other web servers register with the port owner. When you click on the Monitoring Web Server, that might well refer to another listening port like 57246. Firewalls might be open to 1920 but not to that other port and suddenly you cannot get to the service console.

Third, z/OS agents can connect via VTAM/SNA [not TCP] and so cannot get to that internal web server usually.

Fourth, turning traces on and off is crazy hard because you need to track what was turned on and then use ANY to turn it back off later.

Chapter 8 z/OS dynamic trace changing

Another way of dynamically changing traces exists only on z/OS ITM processes. At this writing it has been largely supplanted by tacmd settrace, but it could still be useful in a largely z/OS ITM environment. Following is an example taken from a technote published a few years ago.

1) Create a member in &rhilev.&rte.RKANCMDU called TRACEON and place these statements in it:

CTDS TRACE ADD FILTER ID=BL1 UNIT=KFAOT CLASS(ALL)

CTDS TRACE ADD FILTER ID=BL2 UNIT=KO4RULEX CLASS(ST,ER)

CTDS TRACE ADD FILTER ID=BL3 UNIT=KO4SITMA CLASS(ST,ER)

CTDS TRACE ADD FILTER ID=BL4 ENTRY:IBInterface::sendAsyncRequest (FL,ER)

CTDS TRACE ADD FILTER ID=BL5 UNIT=KO4LODGE CLASS(ST,ER)

2) Create a member in &rhilev.&rte.RKANCMDU called TRACEOFF and place these statements in it:

CTDS TRACE REMOVE FILTER ID=BL1

CTDS TRACE REMOVE FILTER ID=BL2

CTDS TRACE REMOVE FILTER ID=BL3

CTDS TRACE REMOVE FILTER ID=BL4

CTDS TRACE REMOVE FILTER ID=BL5

3) Issue the following command from SDSF or your system console:

/f <tems_task>,TRACEON

4) Recreate the problem and then:

/f <tems_task>,TRACEOFF

Remember that you can use

    /f cmstask,TLVLOG SWITCH

to close the current RKLVLOG sysout file and start a new one. The just closed one can be “printed” to a disk file in SDSF in preparation for sending to IBM support.

Summary

This concludes the discussion on how to define, start and stop ITM dynamic tracing. ITM development worked hard to make the product traceable to enable effective support.

Sitworld: Table of Contents

History and Earlier versions

1.00000

Initial publication

1.01000

Chapter 2 on Operations Log added

1.02000

Correct Linux/Unix KBB_RAS1_LOG example

1.03000

Chapter 3 on Diagnostic Trace Strings

1.04000

Chapter 4 on Communications tracing

Chapter 5 on Static diagnostic trace settings

1.05000

Chapter 6 on modern dynamic trace changes

1.06000

Chapter 7 on older dynamic trace change methods

1.07

Chapter 8 on z/OS only tracing

Photo Note: Jealous Octopus Holding On to Jade Sculpture

 

Sitworld: ITM2SQL Database Utility

zucchini

Version 1.37000 11 November 2016

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

I have long envied the TEPS facility called migrate-export, which takes the TEPS database and creates a file of SQL commands to recreate the database.

Recently I found a way to accomplish this on a live ITM system using the KfwSQLClient utility. It does not have an obvious use case to solve a specific problem but can be useful in viewing TEMS database contents and producing data for ad-hoc reports.

The result is not suitable for a backup – see Best Practice TEMS Database Backup and Recovery – because tables are often jointly updated. A capture of a table and a capture of a second table a few seconds later can miss the combined update. I have even seen cases where a capture of a single table gets inconsistent results. That happened getting the node status table from a remote TEMS when a terrific number of duplicate agent name cases were constantly changing the table.

More than anything, I had wanted to do it for a long time and after years figured out a way to accomplish the goal. If nothing else there is a pleasure in satisfying that desire and sharing the results.

Overview

The itm2sql.pl utility uses the TEMS catalog file kib.cat and the running system to capture a TEMS table’s current contents and produce a file of

1) INSERT SQL statements

2) Tab Separated Variables suitable for a spreadsheet program

3) Text file with fixed length columns for easy reference, sorting and searching

4) An index only file which is useful when comparing two tables for differences

ITM2SQL Package Installation

The package is itm2sql.1.37000. It contains

1) Perl program itm2sql.pl.

I suggest itm2sql.pl be placed in some convenient directory, perhaps an itm2sql directory under the installation tmp directory. That is what the examples will assume. For Windows you need to create the <installdir>\tmp directory. You can of course use any convenient directory.

Linux/Unix:  /opt/IBM/ITM/tmp

Windows: c:\IBM\ITM\tmp

Linux and Unix almost always come with Perl installed. For Windows you can install a no cost Community version from www.activestate.com if needed.

The hub TEPS should be connected to the current primary hub TEMS. That can be running on any platform: Linux/Unix/Windows/zOS.

ITM2SQL and Catalog files

The itm2sql.pl processing requires a matching catalog file. This will be found in a TEMS install at

Linux/Unix:  <installdir>/tables/<temsnodeid>/RKDSCATL

Windows: <installdir>\cms\RKDSCATL

z/OS: RKANDATV

For Linux/Unix/Windows the usual catalog name will be kib.cat. For z/OS the member name will be KIBCAT.

Copy that file to the directory where itm2sql.pl will be used – for z/OS the name will have to be changed of course. If the TEPS is running on the same system as the TEMS, you can just supply the fully qualified name and not use a copy.

The kib.cat gives you references to most of the tables that TEMS uses. There are other TEMS components – e.g. remote deploy – which use other catalogs. Remote deploy uses kdy.cat. This document does not mention such other catalogs further.

ITM2SQL Usage

The ITM2SQL package has parameters to match installation requirements. Following is a complete list.

The following table shows all options.  Extensive notes follow the table.

command default notes
-d off produce debug messages on STDERR
-help off Produce help summary
-home default install location Directory for TEPS install
-ix off Produce a show keys only output
-l off INSERT SQL output with prefix count/keys
-o [file] STDOUT Output to a named file or by report type
-qib off Do not ignore the columns starting QIB
-s key off Name a key, can have more than one
-si file off Process only named keys in index file
-sx file off Exclude named keys in index file
-testf file off Process a previously captured listing file
-txt off Fixed column text file
-tc columns to process with -txt
-f Favorite columns to process with -txt
-tlim 256 maximum column bytes to display, 0=all
-tr off translate tab/carriage return/line feed to blank
-v off Tab Separated Variable
-work directory TMP or TEMP where to create work files

Following the option parameters are two positional arguments. The first is the catalog file – often kib.cat. The second is the name of the table to be processed.

Notes

1) -home if unspecified use environment variable [Windows CANDLE_HOME] or [Linux/Unix CANDLEHOME]. If those are absent use default install locations [Windows C:\IBM\ITM] or [Linux Unix /opt/IBM/ITM].

2) -ix is used to create a show keys only output. You must specify at least one key using -s and the combination of keys must make the reference unique. The resulting file can be used in -si or -sx to include or exclude those keys. This is extremely useful when comparing a capture at one time with a later time – or if comparing one hub TEMS with another hub TEMS.

3) -ix and -txt and -v and -l are mutually exclusive. If all are absent, the INSERT SQL output format is produced.

4) -o with no output file [and followed by another – option] will pick a name based on table name and period and [txt=TXT, v=TSV, ix=IX, l=LST, default=SQL]. -o with a following name will use that as output file. If -o is absent, results are printed to standard output.

5) -qib will include columns beginning with QIB, which are not represented in the disk files and thus relatively uninteresting.

6) -s key – internally these are known as “show” keys because they will be presented at the beginning of the -l output type. In combination they should uniquely identify the object.

7) -txt – output report in a fixed column width presentation. The width of a column also depends on the length of the column name and a blank is left between columns. This can be useful to feed into your own ad hoc reports.

8) -tc column – list of the columns to display. You can repeat the option or specify multiple columns separated by commas: -tc col1,col2,col3

10) -tlim  – maximum size of txt display columns. If -tlim 0, size chosen is the maximum size in the catalog file.

11) -tr – some columns have spacing controls like tab, carriage return or line feed. This can make the -txt output look strange. With -tr they are replaced by blanks.

12) -v – produce .TSV or tab separated variable output format. This can be opened with a spreadsheet program

13) -work – specify a work directory for temporary files. If -work is not specified, the environment variables [Windows TEMP] or [Linux/Unix TMP] are used. If those are absent, [Windows C:\TEMP] or [Linux/Unix TMP] is used; failing that, the current directory is used.
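The compare workflow from note 2 can be sketched with the standard sort and comm utilities once you have two -ix captures. The file names and situation keys below are made up for illustration:

```shell
# Two hypothetical -ix key captures, e.g. from two hub TEMS
printf 'SIT_ALPHA\nSIT_BETA\nSIT_GAMMA\n' > hub1.ix
printf 'SIT_BETA\nSIT_GAMMA\nSIT_DELTA\n' > hub2.ix
sort -o hub1.ix hub1.ix      # comm requires sorted input
sort -o hub2.ix hub2.ix
echo "only in hub1:"; comm -23 hub1.ix hub2.ix
echo "only in hub2:"; comm -13 hub1.ix hub2.ix
```

The keys reported on each side are candidates for -si or -sx processing.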

TEMS tables

Here is a list of TEMS tables of some interest. There are over 50 such tables but the following are the ones I find most useful. The columns are not listed here; you could do an SQL type report to get a list of the table columns.

CCT Portal Client Take Action
EVNTMAP Event Mapping
EVNTSERVER Event Server
INODESTS In core node status
ISITSTSH In core Situation Status History
TACTYPCY Workflow Policy Activities
TAPPLPROPS SDA Support – 623 FP1
TCALENDAR Calendar definitions
TGROUP Group
TGROUPI Group Entries
TNAME Long Situation Name and index
TNODELST Node List – online and MSLs
TOBJACCL Object Access – distribution
TOVERITEM Override Items
TOVERRIDE Override definitions
TPCYDESC Workflow Policy Description
TSITDESC Situation Description
TSITSTSC  Situation Status Cache
TSITSTSH Situation Status History

Example usage

To make maximum usage of this utility, you would need to know the TEMS schema and logic. Since that is not published, the process will be more experiment and discovery. Some tables need different catalog files, such as kdy.cat for tables connected with remote deploy.

1) Produce INSERT SQL report for Situation Description and Fullname Table

perl itm2sql.pl kib.cat TSITDESC

perl itm2sql.pl kib.cat TNAME

2) Produce fixed column report for Situation Description table for specified columns

perl itm2sql.pl -txt -s SITNAME -tc SITNAME,AUTOSTART,LSTDATE,LSTUSRPRF,PDT  kib.cat TSITDESC

3) Produce fixed column report for Situation Description table for specified columns and exclude AUTOSTART=*NO

perl itm2sql.pl -txt -s SITNAME -x AUTOSTART=*NO -tc SITNAME,AUTOSTART,LSTDATE,LSTUSRPRF,PDT  kib.cat TSITDESC

Summary

The itm2sql program produces reports on TEMS database tables.

Sitworld: Table of Contents

History

itm2sql.1.37000
Correct some -help and filename detection issues

Note: Eighteen Inch Zucchini From Garden 19 June 2016 – End Section Used For Vegetable Stew

 

Sitworld: Policing the Hatfields and the McCoys

Moonset2016

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 2 May 2016 – Level 0.5000

Follow on twitter

Inspiration

One more time I had to explain to a customer that you cannot have a situation formula that includes more than a single multi-row attribute group. They had a worthy goal: they wanted to test for a missing process – but only if that process was installed on the system being monitored. The Process attribute group is multi-row and the File Information attribute group is multi-row, so combining them makes an illegal formula. The Portal Client Situation editor would have foiled them, since after the first multi-row attribute group is selected, only single-row attribute groups are offered when adding the next attribute test. However, like many customers, they used tacmd editSit to update the formula to what they wanted. I have seen this a couple of times a year “forever”.

I was involved because that “monster” ITM situation flooded a remote TEMS with results and did not even achieve the desired effect. The Missing Process situation fired even though the software was not installed. In any event, the remote TEMS overload was so severe that the remote TEMS failed after a few hours. The Situation Audit static analysis tool pointed to the issue and the TEMS Audit tool reported on the massive workload caused by the errant situation. The remote TEMS overload would have been an amazing 100 times more severe except that 100+ such situations had a syntax error which prevented them from running. That is all too common when manually creating situation formulas. [One review showed 30% of ITM environments having at least one situation with a syntax error.]

On the other hand, the need was real and had been available in a previous monitoring solution. Two multi-row attribute groups are like two feuding clans – like the legendary Hatfields and McCoys. They just don’t get along at all and there is a lot of collateral damage.

Background

ITM situations are represented by SQL. To make this more concrete here is a simple situation formula for an Agent Builder Agent

*IF *VALUE K08_FILESYSTEMMONITOR.Comments *EQ ‘NO PARAM’

Here is the SQL that represents the situation

SELECT ATTRIBUT10, ATTRIBUT13, ATTRIBUTE0, ATTRIBUTE1, ATTRIBUTE2, ATTRIBUTE3, ATTRIBUTE4, ATTRIBUTE5, ATTRIBUTE6, ATTRIBUTE7, ATTRIBUTE8, ATTRIBUTE9, HIGHTHRESH, IFREE, INODES, IUSED, IUSEDPCT, LOWTHRESHO, MBUSED, MEDTHRESHO, MINORTHRE0, MINORTHRES, MONITORING, ORIGINNODE, PATTERN, TAG, TIMESTAMP

FROM K08.K08K08FIL0

WHERE SYSTEM.PARMA(“SITNAME”, “test_to_check_group_linux”, 25) AND

SYSTEM.PARMA(“NUM_VERSION”, “0”, 1) AND

SYSTEM.PARMA(“LSTDATE”, “1160315090525000”, 16) AND

SYSTEM.PARMA(“SITINFO”, “TFWD=N;OV=N;”, 12)

AND K08K08FIL0.ATTRIBUT13 = N’NO PARAM’ ;

It is a fact of ITM life that the SQL for a situation will only reference a single table [equivalent to an attribute group at this level]. The TEMA or Agent Support library only handles a single table.

If multiple attribute groups were available, logic would have to be prepared to define a key connecting the two attribute groups, something like this

WHERE … K08K08FIL0.ATTRIBUT13 = K09K09MEM0.ATTRIBUT9 ..

However, ITM has no place to make that definition and no logic to process it correctly if it were present. This is a clear product limitation no matter which way you look at it.

TEMS does handle the case of a single multi-row attribute group and a single row attribute group. It creates one or more invisible sub-situations and knits the results together. It does not have the logic at the TEMS to manage the two multi-row attribute group case.

There is a Light Over Here!

Given the extreme customer need, I searched for alternatives and found a way forward in the world of Mathematical Logic and Set Theory. A long time ago I was a math wonk in graduate school and still retain some of the training.

The goal is to calculate a useful result

A and B

for two multi-row attributes even though ITM does not support that.

ITM does have this construction

A *UNTIL/*SIT B

which you specify using the UNTIL tab in the Situation Editor. The logic is that if B is true [on the same managed system or Agent as A]  then any situation event for A is closed and any future Situation Result for A is ignored. In set theoretic terms that is

A and (~)B

or A and not B. You can easily validate that by running through some examples on paper.

The first breakthrough idea is that A and B can use different attribute groups in Base and Until situations. Situation B cannot usually have DisplayItem set, but A can use DisplayItem and there is considerable value in that mixture.

A second set theory logic rule can now be employed

B is the same as  (~)(~)B

Most people have heard it explained that a double negative is the same as a positive. That is one example.

Suppose we were looking at integers from 1 to 20. And then suppose that B had the formula that value > 10.

After B the integers in the result set would be 11,12,13,14,15,16,17,18,19,20.

In this case (~)B would be the test that value <= 10, and the results would be 1,2,3,4,5,6,7,8,9,10.

Now the reverse again: (~)(~)B would again be the test that value >10 and the results would be 11,12,13,14,15,16,17,18,19,20.

So B and (~)(~)B have exactly the same results.
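The integer walk-through above can be checked mechanically; a small Python illustration:

```python
universe = set(range(1, 21))                 # integers 1 to 20

B = {v for v in universe if v > 10}          # formula B: value > 10
not_B = universe - B                         # (~)B : value <= 10
not_not_B = universe - not_B                 # (~)(~)B

assert B == {11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
assert not_B == {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
assert not_not_B == B                        # double negation gives back B
```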

The original goal was to evaluate

A and B

As seen above this is identical to

A and (~)(~)B

and also from above that is now equivalent to

A  *UNTIL/*SIT (~)B

Finally, (~)B will have the same result sets as a variant of B where the formula is reversed – say B_rev. So the following

A *UNTIL/*SIT B_rev  is identical in function to A and B

You may want to work through some examples before continuing – in order to convince yourself.
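If you prefer to let a machine do the checking, here is an illustrative Python verification of the identity over a small universe (the sample formulas are chosen purely for the demonstration):

```python
universe = set(range(1, 21))

A = {v for v in universe if v % 2 == 0}   # sample formula A: even values
B = {v for v in universe if v > 10}       # sample formula B: value > 10
B_rev = universe - B                      # B with its formula reversed, i.e. (~)B

goal = A & B                              # what we want: A and B
until_form = A - B_rev                    # A *UNTIL/*SIT B_rev: A rows not suppressed by B_rev

assert goal == until_form                 # the two constructions agree
```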

Practical example

I titled this blog post thinking of two feuding clans – in reference to how hard it is to get two different multi-row attribute groups working together. However, by building a wall between them [BASE/UNTIL] and just referencing each other’s presence we can achieve some valuable results.

There is a zip file attached – HMC_examples – with model Linux OS Agent situations which demonstrate this working. Following is a presentation of the model situations.

For this example, we may have a shell file installed in a directory /tmp/lpp and the shell file is run with this command “sh /tmp/lpp/testsl.sh”. The goal is to have a situation event that fires if the command is installed but is not running.

Until Situation

First is the Until clause. The formula is against the Linux File Information attribute group and the test is whether the /tmp/lpp path is missing. When it is missing, the situation will be true and that will allow the base situation to be suppressed.

HMC1

Base Situation

Next is the base situation which tests if the expected process is running. It uses the Linux Process attribute. The test is whether the process “sh /tmp/lpp/testsl.sh” is missing.

HMC2

In The Advanced button we see Persistence is set to 2

HMC3

And DisplayItem is specified. Proc_CMD_Line happens to be the internal attribute name for Command Line. This is not strictly needed here, but is vital if more than one process was defined in the *MISSING clause.

HMC4

Finally the Until tab

HMC5

This is the linkage between the base situation and the until situation.

Limitations

DisplayItem cannot usually be set in the *UNTIL situation. APAR IV74758 – delivered in ITM 630 FP6 – can allow Base/Until DisplayItems in limited cases. This requires a TEMS manual configuration and a precise knowledge that the two DisplayItems are in the same internal format.

Persist=2 must be set on Base situation to avoid race conditions between base results and until results.

If the Base situation could return multiple results, DisplayItem must be defined so that multiple events can be created.

Summary

How to get two multi-row attribute groups to influence each other to gain useful information.

Sitworld: Table of Contents

History and Earlier versions

If the current example situations do not work, you can try previously published binary object zip files. At the same time please contact me to resolve the issues. If you discover an issue, try intermediate levels to isolate where the problem was introduced.

HMC_examples

Initial release

Photo Note: Moon-set over the Pacific Ocean 20 April 2016

 

Sitworld: Real Time Detection of Duplicate Agent Names

litez
Version 0.56000 2 July 2017

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

One more time I worked on a case where ITM misbehaved because some agents used duplicate names. This particular case involved “false alerts” where a situation event was observed – a missing process case on a Linux system. When investigated, the Linux system did have that process running and so it was a false positive alert. These cases waste everyone’s time and degrade the monitoring experience. After considerable time this was determined to be a duplicate agent name case: there were two different systems – one had a missing process and the other one did not. Each agent had the same name and so the investigation was against the wrong system. There were 100+ such cases. The effort consumed meetings over several months and wasted time and energy.

Here is a list of observed problems over the last few years collected by a colleague:

Agents going offline

Agents going offline and online repeatedly

Agents switching back and forth between TEMS’

Situation does not fire as expected

Situation fires unexpectedly

Situation does not start as expected

The data in the situation is not correct

Agent does not respond to requests

RTEMS does not respond to requests

RTEMS is hung

RTEMS is disconnected

HUB does not respond to requests

HUB is hung

Unstable ITM environment

SLOW TEP

TEP shows many navigator updates pending

TEP agent positioning flipping around

HIGH CPU or network usage related to TEPS

And more…

Duplicate Agent Name Progress up to now

There has been work ongoing to identify and resolve these cases. Here are useful tools.

The TEPS Audit blog post is a good first line of detection. You set a trace at the TEPS and then get a report with everything that TEPS sees.

The TEMS Audit blog post has some good reports – such as agents that repeatedly show online or reports at remote TEMS where the arrival of heartbeats is irregular.

The Database Health Checker blog post has a report section based on TEIBLOGT where you can see things like multiple additions to system generated MSLs, which can imply duplicates.

We expect future progress in this area, including advanced tracing and reports which identify cases where two agents with the same name are connecting to the same remote TEMS.

This post discusses a new cross TEMS check report on current live data.

Node Status Table Correlation Report

Each TEMS has an in-storage table INODESTS or Node Status table. A remote TEMS has entries corresponding to the nodes [agents] that are connected to it. In ideal cases, the hub TEMS and the remote TEMSes will contain the same information. If there are differences, such as the same agent name present in two different remote TEMSes, that is a very strong signal of a duplicate agent name. Detecting that is the goal of the current project.
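The cross-TEMS comparison can be sketched as follows (hypothetical sample data; the real report is produced by the inodests_sum.pl tool described below):

```python
# Node status rows as (source_TEMS, agent_name, thrunode, hostaddr) tuples.
# Hypothetical sample: agent lx001:LZ appears under two different remote TEMSes.
rows = [
    ("HUB",     "lx001:LZ", "RTEMS_A", "ip.spipe:#10.0.0.5[34567]"),
    ("RTEMS_A", "lx001:LZ", "RTEMS_A", "ip.spipe:#10.0.0.5[34567]"),
    ("RTEMS_B", "lx001:LZ", "RTEMS_B", "ip.spipe:#10.0.0.9[41234]"),
    ("HUB",     "lx002:LZ", "RTEMS_A", "ip.spipe:#10.0.0.6[34568]"),
    ("RTEMS_A", "lx002:LZ", "RTEMS_A", "ip.spipe:#10.0.0.6[34568]"),
]

# Collect the distinct (thrunode, hostaddr) views seen for each agent name.
views = {}
for source, agent, thrunode, hostaddr in rows:
    views.setdefault(agent, set()).add((thrunode, hostaddr))

# An agent seen with more than one view is a duplicate-name suspect.
suspects = [agent for agent, seen in views.items() if len(seen) > 1]
```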

This package uses a TEPS utility to get the TEMS data for the report. Therefore it is run on the same system as the TEPS.

Package Installation

The following assumes TEPS was installed in the default directory. The data collection work is done on the system which runs the TEPS.   If you are using a non-default install directory then you will need to set an environment variable or specify the install directory in a parameter.

The package is  inodests_sum.0.56000. It contains

1) Perl program inodests_sum.pl.

I suggest inodests_sum.pl be placed in an installation tmp directory. For Windows you need to create the <installdir>\tmp directory. For Linux/Unix create the tmp directory. You can of course use any convenient directory.

Linux/Unix: /opt/IBM/ITM/tmp

Windows: c:\IBM\ITM\tmp

Linux and Unix almost always come with the Perl shell installed. For Windows you can install a no cost Community version from http://www.activestate.com if needed.

Parameters for running inodests_sum.pl

All parameters are optional if defaults are taken

-h home installation directory for TEPS. Default is

Linux/Unix: /opt/IBM/ITM

Windows: c:\IBM\ITM

This can also be supplied with an environment variable

Linux/Unix: export CANDLEHOME=/opt/IBM/ITM

Windows: set CANDLE_HOME=c:\IBM\ITM

-o Output file name

 default is inodests_sum.csv in current directory

-h Help display

-work where to store TEMS database files, default is temp directory, period means current directory

-all record results for all agents, not just problem cases, default show only problem cases

-off include offline agents, usually not much value

-redo perform the report logic using the existing files. The hub.lst file must be manually determined and renamed. This is mostly for reporting defects to the author.

-aff handle one case of lst data from an older TEMS database level

-thrunode create thrunode.csv file for use in a TEMS Database File restoration project. These are consensus thrunodes based on hub and remote TEMSes. The new project recreates missing TNODELST NODETYPE=V records and TNODELST NODETYPE=M system generated Managed System List entries – which are sometimes missing.

Running the inodests_sum.pl

In the temporary directory

perl inodests_sum.pl

Report format

See below for comments.

dup_rep2

Rows 49/50 are identical in meaning. Column B is the source – which TEMS supplied the data. Column C is the THRUNODE – where the agent connected. Column D is the HOSTADDR – what system the agent was on and what was the listening port.

Row 48 shows the same agent name reporting to another remote TEMS and using a different ip address.

The conclusion here is that two agents are running on two different systems with the same name. This causes problems and should be stopped.

See below for comments on second report snippet.

dup_rep3

Rows 7/8 are identical in meaning. Column B is the source – which TEMS supplied the data. Column C is the THRUNODE – where the agent connected. Column D is the HOSTADDR – what system the agent was on and what was the listening port.

Row 6 shows the same agent name reporting to another remote TEMS from the same system using a different listening port.

The conclusion here is that two agent instances are running on the same system. That is unusual and it should be stopped.

Correcting Problems

The general procedure is to investigate and resolve. In the same-system case, log in to the system and see why two different agent instances are running. Perhaps one was supposed to shut down and the shutdown failed. Perhaps there are actually two different agents installed. In the two-system case, the agents likely each have CTIRA_HOSTNAME configured but accidentally with the same value. One of the agents needs to be reconfigured.

Thrunode Report file

The -thrunode option creates the thrunode.txt file in the current directory. This file reports the calculated valid remote TEMS each agent is configured to. If there is a conflict [reporting to multiple remote TEMS] that agent is left out of the report. The thrunode.txt report is planned for use in a new project to restore some cases of missing TNODELST objects.

Reporting problems

The program captures the TEMS output of the Node Status table at each TEMS. If things do not work as expected, please capture those files in a zip or compressed tar file and send them to the author. I will endeavor to correct any issue promptly.

Summary

The information in the report will show cases where two or more TEMSes have differing information about particular agents. In the simplest cases that strongly suggests a case of duplicate agents.

Sitworld: Table of Contents

History

inodests_sum.0.56000

option to export known good thrunodes – remote TEMSes that agents connect to

Note: Overhead Lights on New Cruise Ship

 

Sitworld: TEPSI Interface Guide

fireplace

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 18 March – Level 1.00000

Follow on twitter

Inspiration

When you use Portal Client sessions which connect to a Portal Server [TEPS] with the default interfaces, everything is simple. You sometimes have to clear a Java Cache but otherwise Things Just Work. That is when you are connecting via the http protocol to the default port [15200 from ITM 623] and not using any secure communications. When you stray from the default settings things get more complex. This document explains the background and implementation to handle non-default configurations.

The TEPS Interface [TEPSI] represents how a TEP user connects. Many users will use the default definition, however some sites will require TEPSI definitions, each with their own name. For example one TEPSI could represent a user connecting beyond a firewall which needs a different ip address and port. Another TEPSI could represent a secure communication connection. A third TEPSI could represent a connection which is limited to using one specific TEPS ip address or interface. There are many possibilities and many can be in service at the same time.

Background

Default TEP to TEPS Communication

In all cases the TEP to TEPS communication architecture uses two independent communication methods. At first connection, the TEP uses HTTP(S) to a web server. The URL needs a hostname or ip address for the TEPS and port 15200. Before ITM 623, the connection port was always the default 1920. From ITM 623 on there is a default IHS web server at port 15200, although the TEPS can be configured to the earlier port if desired. During this first startup phase the needed java files are loaded and then given control.

After startup, the portal client uses CORBA IIOP communications. During the startup phase, the client retrieves an Interoperable Object Reference [IOR] object. The IOR contains the ip address and port number to be used to connect to the server for IIOP work. It also has many other values. This allows the Java user interface logic at the TEP to work smoothly with the C/C++ TEPS logic. The IOR is generated by the TEPS Interface definition.

You can specify to use just http protocol. In that case you communicate only with the web server. However the web server currently uses the default iiop to work with the TEPS.

Variation 1 – Secure Communications

The initial stage can use https [secure communications] by connecting to the default https/15201 ports.

The second IIOP stage can also be configured to use secure communications. This could be a changed default IOR or by a second IOR for the purpose.

Variation 2 – NATing firewall

In this setup the TEP is behind a Network Address Translation firewall. In this configuration the TEP connects to a given ip address and port which communicates to the Portal Server and appears in the usual way, as if it were arriving at the normal port like 15200 or 15201.

The related TEPSI also needs to have the TEP side ip address and port set in the IOR so the Java/C++ communications connection will work smoothly.

Variation 3 – Fixed TEPS interface

The TEPS IIOP communications can be forced to one specific interface or ip address. This might be for a TEPS which was working with multiple companies as a service. Each company might require that its communications be transmitted on its own communications path. That does not affect the initial startup phase, where java program objects are being set up, but when company data is being transmitted, they want isolation.

Variation 4 – HTTP mode

A recent addition is where the TEP to TEPS is pure HTTP [or HTTPS]. In this mode, TEP communicates to the web server at 15200 [for example] and the web server does the IIOP communications to the TEPS.

Future Variations

There will very likely be future variations.

Background Summary

Whenever you need non-default TEP connections, you will need multiple TEPSIs. One easy example is if you are using default configuration and one set of users ask to use secure communications and you need to support both secure and default. In that case you create a second TEPSI with a separate name. Maybe a third group requires a NATing firewall TEP. In that case you need a third TEPSI with another name.

The rest of the document shows how you create TEPSI objects in Linux/Unix and in Windows. There is a second piece of work required to make use of TEPSI objects. Jar description files are used to get the jar files loaded. There will be one set of those jar description files for each of the TEPSI objects. See this blog post for a convenient and mostly automated way of creating those files based on the default files.

Linux/Unix TEPSI objects

The following is a quote from the installation guide. It is followed by explanations to cover all the possible uses and clarify usages.

Begin quote from manual:

“Defining a Tivoli Enterprise Portal Server interface on Linux or UNIX”

To define an additional Tivoli Enterprise Portal interface on Linux or UNIX, edit the install_dir/config/cq.ini file as described in this section.

Procedure

    Locate the KFW_INTERFACES= variable and add the one-word name of the new interface, separating it from the preceding name by a space. For example:

    KFW_INTERFACES=cnps myinterface

    Following the entries for the default cnps interface, add the following variables as needed, specifying the appropriate values:

    KFW_INTERFACE_interface_name_HOST=

        If you are defining an interface for a specific NIC or different IP address on this computer, specify the TCP/IP host address.

    KFW_INTERFACE_interface_name_PORT=

        Type a port number for the Tivoli Enterprise Portal Server. The default 15001 is for the server host address, so a second host IP address or a NATed address requires a different port number.

    KFW_INTERFACE_interface_name_PROXY_HOST=

        If you are using address translation (NAT), type the TCP/IP address used outside the firewall. This is the NATed address.

    KFW_INTERFACE_interface_name_PROXY_PORT=

        If the port outside the firewall will be translated to something different than what is specified for Port, set that value here.

    KFW_INTERFACE_interface_name_SSL=Y

        If you want clients to use Secure Sockets Layers (SSL) to communicate with the Tivoli Enterprise Portal Server, add this variable.

End quote from manual.

1) KFW_INTERFACES=cnps myinterface

This environment variable specifies all the TEPSI interface names. The default one is “cnps” and in this case “myinterface” represents the new TEPSI you are defining. You do not need to specify the cnps values. These are followed by TEPSI name specific environment variables.

 

2) KFW_INTERFACE_<interface_name>_PORT=

This environment variable defines the port number for incoming IIOP [Corba] communications traffic at the TEPS.  The default cnps uses 15001. By convention these values are typically set to 15002, 15003, etc. Because of the embedded interface_name there will be one such for each non-default TEPSI mentioned in KFW_INTERFACES.

3) KFW_INTERFACE_<interface_name>_HOST=

This environment variable is used when there is more than one Network interface [NIC] on the server running TEPS and you wish to restrict this TEPSI Portal Client communications to a single specific network interface. It is typically given as a numeric ip address.

One use case would be a server that has an interface or NIC that allows communication with the public internet. Someone outside the company network running portal client must use only that network interface and all responses must go through that NIC. Another example would be an outsourcing company providing services for several other companies. Each company could want to have its own protected communications.

4) KFW_INTERFACE_<interface_name>_PROXY_HOST= and KFW_INTERFACE_<interface_name>_PROXY_PORT=

These two environment variables are used when the portal client is behind a NATing firewall. In this case the client must use different ip address and port number which will be translated to the TEPS system and port number expected. PROXY_HOST must be a numeric address.

5) KFW_INTERFACE_<interface_name>_SSL=N/Y

This environment variable controls whether communications security is used for this TEPSI. If the value is N, no security is used. If Y, the second stage CORBA communications uses a certificate which is built into one of the jar files transmitted during the initial connection. If you are using a non-default security certificate, the documentation shows how to accomplish that.

When the TEPS is restarted, all the needed <interface_name>.ior records will be created as necessary.
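Putting the variables together, a cq.ini fragment for a hypothetical NATed interface named outside might look like this (illustrative values only):

```ini
KFW_INTERFACES=cnps outside
KFW_INTERFACE_outside_PORT=15002
KFW_INTERFACE_outside_PROXY_HOST=203.0.113.10
KFW_INTERFACE_outside_PROXY_PORT=35002
KFW_INTERFACE_outside_SSL=N
```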

Windows TEPSI Objects

In a Windows environment, you configure TEPS Interfaces in an MTEMS dialog. Right click on the Portal Server, then select Advanced, Configure, TEPS Interfaces:

Windows_TEPSI

In this dialog you will create new TEPSIs as required.

1) Name supplies the TEPSI name

2) Host is used when there is more than one Network interface [NIC] on the server running TEPS and you wish to restrict Portal Client communications to a single specific network interface. It must be supplied as a numeric ip address.

3) Port defines the port number for incoming IIOP [Corba] communications traffic at the TEPS.  The default cnps uses 15001. By convention these values are typically set to 15002, 15003, etc.

4) Proxy Host and Proxy Port are used when the portal client is behind a NATing firewall. In this case the client must use different ip address and port number which will be translated to the TEPS system and port number expected. PROXY_HOST must be a numeric address.

5) SSL is used to define communication security for IIOP. If the value is N, no security is used. If Y, the second stage CORBA communications uses a certificate which is built into one of the jar files transmitted during the initial connection. If you are using a non-default security certificate, the documentation shows how to accomplish that.

When the TEPS is restarted, the  Name.ior records will be created as necessary.

Troubleshooting

1) The Proxy host is coded as a numeric value since it is encoded into the IOR record during portal server startup. If it is blank, then the local TEPS server address is used. In one memorable case, the server running the Portal Server had an entry in /etc/hosts that specified the host as having an ip_address of 127.0.0.1. The initial connection worked fine, the ior record was transferred, and then the Java CORBA attempted to communicate to 127.0.0.1 port 15001 – which was localhost on the client machine. The testing worked fine on the server running the portal server. You can do a ping hostname on both the machine running the portal server and the client machine to make sure it resolves to the expected address.
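A quick way to spot that condition is to check what address the hostname resolves to; here is an illustrative Python check (substitute your own TEPS hostname):

```python
import socket

def looks_like_loopback(hostname):
    # True if the hostname resolves to a loopback address, which would
    # make the generated IOR useless to a remote Portal Client.
    try:
        addr = socket.gethostbyname(hostname)
    except socket.gaierror:
        return False
    return addr.startswith("127.")
```

Run it with the hostname that appears in the IOR; a True answer on the client machine means the /etc/hosts entry needs attention.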

2) Having two or more NICs requires careful evaluation if they are not equal in function. You need to specify the Host variable if there is a specific interface to be used for the needed communications. In addition, you may need to add KDEB_INTERFACE=ip_address for the TEPS communications with the TEMS in such cases. Usually things work with defaults, but when there are NICs with unequal scope you need to watch carefully. This can also be a problem with virtual NICs such as those created by some products.

3) If you are experiencing connection problems, here is a ==>URL<== to parse the ior record. Edit the IOR record and copy/paste the text to display the contents. Use it to verify that the ior record is presenting the ip address and port number that is required at the Portal Client endpoint.

Summary

This post explains how to create TEPS Interfaces in Linux/Unix and Windows.

Sitworld: Table of Contents

Feedback Wanted!!

Please report back experience and suggestions.

History and Earlier versions

There are no binary objects connected with the post. This will record summary of changes.

1.00000

Initial publication

Photo Note: Stone Fireplace in Big Sur 2009

 

Sitworld: Portal Client Java Web Start JNLP File Cloner

newcruise

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #3 – 15 April 2020 – Level 0.56000

Follow on twitter

Inspiration

In the next year or so, most ITM Enterprise customers will be converting to Java Web Start Portal Clients. The major browser vendors are abandoning java applets – see Appendix 1 below for three references. Oracle Java itself is abandoning applets. There is no purpose in arguing or worrying about it; the change is coming. Happily the ITM Portal Client has Java Web Start [JWS] as a fully supported option. This post is about how to make the transition and how to do it efficiently. A new tool is made available here to reduce the work involved. The Java Web Start Portal Client actually performs better because it is not burdened with an ever changing browser software environment.

The following assumes you have an existing environment with existing TEPSIs and ior files. Converting to JWS is one transition. This post is also important if you need to extend or change an existing JWS environment, such as converting from port 1920 to 15200 or switching from http/15200 to https/15201 or adding an access for a set of users beyond a NATed firewall. It can also be used for tasks like having clients that need to use a Fully Qualified Domain Name instead of a shortname.

If you want to learn more about TEPS Interfaces – TEPSIs – see this blog post for a full background on TEPSIs.

Background – Java Web Start JNLP Files

JNLP stands for Java Network Launching Protocol.

A  default JWS TEP session is started with a command something like this

     javaws “http://<portal_server_host_name>:15200/tep.jnlp”

The actual command line form will often be different. For example the javaws program object may have a full path definition so a known good [for TEP] java level is used instead of the changeable system level. Some environments may require quotes and others not.

In that initial tep.jnlp file, early on you will see

codebase=http://<portal_server_host_name>:15200/

Later there will be a series of extension tags like this

<extension href=”ka4_resources.jar.jnlp” name=”ka4_resources.jar”/>

The extension tags are used to define the extension jar files that need loading. There will be more extension jar files with every new agent and sometimes with maintenance. In the above line, there is a file ka4_resources.jar.jnlp and in the first few lines you will see

codebase=http://<portal_server_host_name>:15200/

Each set of base file [like tep.jnlp] and extension files must be consistent. If you want to change to https and 15201, every one of these files must be changed. The codebase= value on the base and each extension file needs to be altered.
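The substitution itself is simple; here is an illustrative Python sketch of the codebase rewrite the cloning process performs on each file (the jwsbuild.pl tool described below automates the whole job):

```python
import re

def rewrite_codebase(jnlp_text, new_codebase):
    # Replace the codebase= URL in one jnlp file's text.
    return re.sub(r'codebase="[^"]*"',
                  'codebase="%s"' % new_codebase, jnlp_text)

# Hypothetical one-line sample of a jnlp file's opening tag.
sample = '<jnlp spec="1.0+" codebase="http://myhost:15200/" href="tep.jnlp">'
fixed = rewrite_codebase(sample, "https://myhost:15201/")
```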

Changing the Default tep.jnlp and component.jnlp

If you are changing the default for all users, the solution is pretty easy. There are two files tep.jnlpt  and component.jnlpt

Windows: <installdir>\config

Linux/Unix: <installdir>/config

Early on in each file you will see this line

  codebase=”http://$HOST$:$PORT$/”> 

What you do is replace that with what you need like

   codebase=”https://portal_server_hostname:15201/”>

and then when you reconfigure the browser, the tep.jnlp and all the extension files will get written

Windows: MTEMS right click on TEP/Browser, select reconfigure…, click OK

Linux/Unix: ./itmcmd config -A cw

Java Web Start and JNLP File Cloning

The big work comes when you have more than a single connection. For every non-default user connection you need a cloned and renamed base file [like tep.jnlp] and all the extension files [like ka4_resources.jar.jnlp]. All the names have to be changed, the codebase= always needs changing, and sometimes you need to add protocol or databus property values. This cloning work must be repeated after any maintenance or after a new agent is installed – including agents you write with Agent Builder.

This manual work is lengthy and hard to perfect. The purpose of this Clone Tool project and blog post is to speed the process and make it more reliable.
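The per-file rewrite the Clone Tool automates amounts to two substitutions: set the new codebase= value and point every extension href at the cloned _suffix names. A sketch, with a hypothetical helper name (the actual jwsbuild.pl logic may differ):

```python
import re

def clone_jnlp_text(text, suffix, new_codebase):
    """Rewrite one jnlp file's contents for a named clone: swap in the
    new codebase= and point extension hrefs at the _<suffix> copies."""
    text = re.sub(r'codebase="[^"]*"', 'codebase="%s"' % new_codebase, text)
    # ka4_resources.jar.jnlp becomes ka4_resources_<suffix>.jar.jnlp
    return re.sub(r'href="([^"]+)\.jar\.jnlp"',
                  r'href="\1_%s.jar.jnlp"' % suffix, text)
```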

Project Installation

The project is delivered as a zip file: jwsbuild.0.56000. There are two files, jwsbuild.pl and jwsbuild.ini. Install them somewhere convenient.

Perl is usually pre-installed on Linux/Unix systems. For Windows you may need to install it from www.activestate.com or another source. The program only uses Perl core services; no CPAN modules are needed.

jwsbuild.pl has been tested with

Windows: This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread

Linux: zLinux with Perl v5.8.7

Initialization file – default jwsbuild.ini

Here is an example jwsbuild.ini file

basename tep.jnlp

base C:\IBM\ITM\CNB

target C:\projects\Packages\jwsbuild\target

clone outside https://ADMINIB-DN65QI8:15201/ iiop outside.ior

There are four controls

basename: optional, default is tep.jnlp

base: directory where the existing jnlp files are stored

target: directory where the cloned jnlp files are stored

clone: clone name, codebase value, optional protocol and optional ior file name

The clone control has two to four parameters: 1) clone name, 2) codebase URL value, 3) an optional protocol specification and 4) if the protocol is iiop, the ior record name. This control file assumes that there is a TEPSI defined with the name outside. The codebase is the URL needed from the TEP environment to access the TEPSI.
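Parsing the four controls is straightforward. Here is a sketch assuming the whitespace-separated format shown above (the real jwsbuild.pl may handle more cases):

```python
def parse_ini(lines):
    """Parse jwsbuild.ini-style control lines into a settings dict.
    basename/base/target take one value; clone may repeat and carries
    2 to 4 parameters."""
    cfg = {"basename": "tep.jnlp", "clones": []}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        key = parts[0]
        if key == "clone":
            cfg["clones"].append(parts[1:])  # name, codebase[, protocol[, ior]]
        elif key in ("basename", "base", "target"):
            cfg[key] = parts[1]
    return cfg
```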

First Step – Create the jwsbuild.ini file

You may not need a TEPSI. For example some users may need to use a fully qualified name for the TEPS host while other users can use a shortname. Just create a clone control that looks like this:

clone miami http://miami.us.ibm.com:15200/

The same technique could switch some users to secure http

clone miami https://miami.us.ibm.com:15201/

Or you might want that connection to use http instead of iiop

clone miami http://miami.us.ibm.com:15200/ http

If your URL will use TEPSIs, inventory the TEPSIs you have or will create. If you are unsure read ==>this<==.

The existing ior files are found here

Windows: <installdir>\CNB

Linux/Unix: <installdir>/<arch>/cw

Windows: MTEMS, right click on TEPS, select Advanced, select Configure TEPS Interfaces…

Linux/Unix: view <installdir>/config/cq.ini and get data from the environment variables

Second Step – Create the cloned JNLP files and Test

Make a safety copy of the base directory files for use in a back out.

After having written the jwsbuild.ini file, run the cloning tool

perl jwsbuild.pl

This will create cloned jnlp files in the target directory. In the above example there will be tep_miami.jnlp, ka4_resources_miami.jar.jnlp and one for every other extension file.

Copy the cloned files into the base directory. The original base files are not changed. However if you are doing this a second time, you may be altering existing JNLP files. The first few times it would be good to view the cloned files manually and validate them.

You can test immediately after the files are copied – no TEPS recycle is needed. Test out the default and each access as needed.

JWSBUILD Parameters

-ini   specify an alternate initialization file

Summary

This document explains how to use jwsbuild to create the cloned jnlp files needed to start Java Web Start Portal Client sessions.

Sitworld: Table of Contents

Feedback Wanted!!

Please report back experience and suggestions.

Not all combinations have been tested. If you use the jnlp cloning tool and experience problems, please let the author know.

Appendix 1 Java Applet Future References

Firefox dropping Support of NPAPI Plugins

Oracle dropping Support in Java SDK 9

How will Java be supported in Chrome after Chrome drops the NPAPI support?

History and Earlier versions

There is a zip file with program objects. This list references historical levels in case the current level shows a problem.

jwsbuild.0.56000

Handle cnp.http.url.host

Photo Note: New Cruise Ship Under Construction Winter 2016

 

Sitworld: Diagnostic Snapshort Utility

Radar

Version 0.50000 4 January 2016

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

I get called into a lot of unusual diagnostic cases. ITM has a terrific ability to capture detailed diagnostics and that works great for issues that can be recreated. However rarely occurring conditions – perhaps happening every month or so at random times – are a lot harder to capture. The diagnostic logs are a fixed maximum size [thank goodness] and by the time the condition is noticed and the logs captured, the relevant data is often lost. The condition is even tougher if you need to collect diagnostic data from multiple locations – say hub TEMS and remote TEMS and an agent and TEPS – at roughly the same time.

I was reflecting about this one day while working on the case of a situation event that should not have occurred. I realized that if there was a workflow policy waiting for that situation result then some logic could be performed. [Workflow policies work on results and not the TEMS concept “situation events”.] Also since the Workflow Take Action activity can be performed at any managed system, the flexibility exists to grab data from anywhere.

The next step was to create a “capture diagnostic data now” program that could be run multiple times without overlaying data. The following project was created in 2011, before the current blog, and it still has a lot of value for advanced diagnostic capture.

Overview

The snapshot.pl utility captures the current diagnostic log segments and the operations log and stores them in a compressed, time stamped file. The utility can be triggered by time, by a situation event, by a workflow policy or by an external program. In this way the current diagnostic information can be collected even when substantial tracing is required. The utility has been tested on AIX, Windows, Linux, SunOS, and HP-UX.
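The time stamped names shown later [like nmp180_ms_snapshot_20110425160252.tar.gz] follow a simple pattern. A sketch, with illustrative defaults:

```python
import time

def snapshot_name(host, product, base="snapshot", when=None, ext="tar.gz"):
    """Build a time stamped archive name in the style the utility produces,
    e.g. nmp180_ms_snapshot_20110425160252.tar.gz; the extension varies
    by platform (tar.gz, tar.Z or zip)."""
    stamp = time.strftime("%Y%m%d%H%M%S", time.localtime(when))
    return "%s_%s_%s_%s.%s" % (host, product, base, stamp, ext)
```

Because each name carries a full timestamp, repeated captures never overlay one another.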

Preconditions – Things to test and know before using snapshot.pl

1)         The environment must have the Perl script interpreter installed. All the testing has been at Perl 5.8.2 and higher. One test at Perl 4 failed completely. The easiest way to discover this is to login to the server where you plan to run snapshot.pl and enter the command “perl -v”. In one of the test environments, Perl 4 was the default install but Perl 5 was installed on another directory path. That was resolved by changing the first line of the Perl program where the install target can be recorded.

2)         The Windows ITM environment must have a temporary directory such as c:\ibm\itm\tmp. The same logic is running in Windows/Linux/Unix and so having a uniform directory structure avoids platform specific controls.

3)         The snapshot.pl default environment usually has the CANDLEHOME [Linux/Unix] or CANDLE_HOME [Windows] environment variables set. This represents the install directory. This environment variable will always be present when the utility is executed as a situation action command. It may have to be added when the utility is invoked manually or from a non-ITM environment such as a Windows AT command. Alternatively, the -ch option can be used to supply that information externally.

4)         The snapshot.pl utility recovers information from the <installdir> logs directory and for Windows agents from the <installdir>\tmaitm6\logs directory. There are certain conditions that must hold before the recovery will work. These can be tested for by running the utility manually and correcting any problem conditions. Following are some checks that can be made ahead of time.

5)         For all cases, examine the inventory files [*.inv] and see if they make sense. Inventory files keep track of the diagnostic log segments. For TEMS you should see a single file <hostname>_ms.inv. The user has some control and that could be different. For some agents there are multiple inventory files such as <hostname>_lz_klzagent.inv and <hostname>_lz_kcawd.inv.  If the environment has been installed for some time, there could be outdated inventory files with different hostnames if, for example CTIRA_HOSTNAME began to be used or if the system hostname had been changed. If you discover such a case, then delete the unused inventory files.

6)         The Linux/Unix TEMS operations log name is recovered from the logs directory ms.env file. You can verify this by doing

grep “Running: ” <installdir>/logs/ms.env

On Windows the operations log is a file with a fixed name in the cms directory – kdsmain.msg. For Windows the kdsmain.ras file is also captured if present, which records exception callstacks.

7)         The agent operations log has a filename that usually ends

            [Linux/Unix] <initial>:<uppercased product code>:LG0

[Windows]      <initial>_<product code>.LG0

Look for duplicate LG0 files and delete the outdated ones. The snapshot.pl utility will fail if there are no such files or multiple files. Note that some captured files cannot be unpacked on a Windows system, since filenames there cannot include a colon except in the drive specification.

8)   On Linux/Unix, tar or tar/compress command(s) are used to create the compressed file. While Windows has the capacity to create zip compressed folders, there is no command line to do that work. The snapshot.pl logic overcomes this limitation by dynamically generating a .vbs file to do the needed work. This is created in the tmp directory and has the filename snapshot_<product>_zip.vbs. The base name used can be altered with the -base option.

9)         The snapshot.pl utility creates a folder in the tmp directory named “snapshot_workdir”. The base name used can be altered with the -base option.

First Step – A Manual test

The snapshot.pl must be installed into the ITM environment. A zip file of that file is snapshot.0.50000.  All testing was performed using the <installdir> bin directory, however almost any location will do. For Windows copying the snapshot.pl file is sufficient. For Linux/Unix the file must be prepared before use. First determine the needed attributes by doing a

ls -al <installdir>/bin/tacmd

Here is an example output

-rwxrwxrwx  1 root root 6464 Jul 16  2010 tacmd*

Use the following commands to make snapshot.pl have the same characteristics

chmod 777 snapshot.pl

chown <owner> snapshot.pl

chgrp <group> snapshot.pl

Make the current directory be <installdir>/bin and run the command by entering

perl snapshot.pl

Review any errors and correct as needed. For example, you might have to do a command

export CANDLEHOME=/opt/IBM/ITM

since you are not running in an ITM Action Command environment.

If this completes successfully, there will be a new file in the logs directory with a name like this

nmp180_ms_snapshot_20110425160252.tar.gz        [Linux]

nmp180_ms_snapshot_20110425160252.tar.Z         [Unix]

nmp180_ms_snapshot_20110425160252.zip                        [Windows]

Unpack the file and verify that the expected diagnostic and operational files are present.

If you will be capturing agent logs, test with the product code or -t option

perl snapshot.pl -t ux

Automated Usage

The goal is to run the snapshot.pl utility to collect diagnostic data near the time when the problem condition occurs. In every case, the Situation action command will look like this:

/usr/bin/perl $CANDLEHOME/bin/snapshot.pl                    [Linux/Unix]

C:\perl\bin\perl $CANDLE_HOME\bin\snapshot.pl             [Windows]

The fully qualified name of the Perl executable is needed because the ITM action command environment may not have the expected PATHs.

See section at end for complete parameter documentation.

Before doing any actual data capture, do a test with an Always True situation [LocalTime < 250000]. After distributing to the target agent, start the situation and verify the tar/zip file has been created and has the expected results. In that way you can verify that the target environment has Perl installed in the expected location and that the parameters are correct. If the tar/zip is not created, there will likely be an explanation in the TEMS or Agent operations log, which collects standard and error outputs from programs run in this way.

Use by invoking periodically

One way to capture data long term is to run the snapshot periodically, such as once an hour or less often. The invocation can come from an ITM always true situation [LocalTime.TIME < 250000] with a sampling interval of, say, one hour. The action command is the same as shown above. The situation action command must be configured to run at each interval.

The utility could also be performed with an external process such as a Unix crontab entry or a Windows AT command. In that case the -ch option would be used to set the install directory.

Use from a Situation – Universal Messages example

The condition might be detectable by a situation. In one case messages were written to the TEMS operations log like this

KO41039    Error in request compileOnDemand. Status= 1157. Reason= 155.

KO41039    Error in request sqlRequest. Status= 1102. Reason= 155

Of 13 examples of the KO41039 message [over 4 months], in four cases it was followed by the message indicating the problem:

KDS9142I   The TEMS HUB_xxxx is disconnected from the hub TEMS

To capture this condition, snapshot.pl was installed and a situation was created against the Universal Messages attribute group. The formula used

(  Category == KO41039 AND SCAN(Message Text) == 155)

The snapshot.pl -delay option was used to delay capture for 60 seconds. That way the following logic could be traced.

The situation was distributed to *ALL_CMS.

After that some intensive tracing was installed as defined by IBM Support.

When the condition occurred again, the diagnostic and operational logs were captured and progress was made.

Use from a Situation – unexpected true event

A situation was used to alert on missing processes. On rare occasions the alert contained invalid data – for example a mount point with low free space but the name of the mount point was blanks.

To capture this condition, a duplicate situation was created with snapshot.pl as the action command. The needed tracing was also installed. When the false event occurred, the snapshot diagnostic data was recorded.

Use from a Workflow Policy

Sometimes data needs to be captured from multiple ITM services such as a remote TEMS, a hub TEMS and an agent. In this case a situation will usually provide the triggering event. The snapshot.pl utility must be installed and tested on all servers where the command will be run. The Workflow Take Action can be set to execute

– The agent

– The TEMS the agent reports to

–  *any* managed system.

For this case, the snapshot.pl should be run as a separate process using a trailing ampersand for Linux/Unix or via a “start /min cmd /c …” for Windows. The reason is that remote commands have a hardcoded timeout of 50 seconds. This time limit will not apply to processes running in the background.
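The detach-from-the-caller idea can be sketched in Python. The Take Action returns as soon as the child is launched, so the ~50 second remote command limit never applies to the capture itself; the post's own advice (a trailing & or start /min) achieves the same thing without any extra program.

```python
import subprocess
import sys

def spawn_detached(cmd):
    """Start a capture command without waiting on it, so a Workflow Take
    Action returns before the hardcoded ~50 second remote command timeout.
    Sketch only; cmd is the full argument list for the capture program."""
    if sys.platform == "win32":
        # Detach on Windows so the child outlives the action command.
        flags = subprocess.DETACHED_PROCESS | subprocess.CREATE_NEW_PROCESS_GROUP
        return subprocess.Popen(cmd, creationflags=flags)
    # New session on Linux/Unix, equivalent in spirit to a trailing '&'.
    return subprocess.Popen(cmd, start_new_session=True)
```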

Use from an external Process

Any program running can issue the snapshot.pl utility. If you want to run it remotely, use SOAP to create a universal message at the target ITM service and have a situation waiting for that universal message. See an example of such usage in this technote:

Starting and Stopping ITM situations using external operations

http://www-01.ibm.com/support/docview.wss?uid=swg21462251

Other use cases

There are probably more ways to use the snapshot.pl utility.

Reference – Command line options for the Snapshot.pl Utility

    -h                 Produce help message and exit

    -ch               Install directory

    -t                 Product code, default “ms”

    -host            Hostname, default the result of platform hostname command

    -base           Base name of snapshot and file work directory, default “snapshot”

    -max            Maximum number of snapshots, default 32

    -nz               No compression, compression on by default

    -delay          Delay seconds before capture, default 0 or no delay

    -idir             Sub-path of install directory where inserted files are found

    -i                File specification for inserted files. More than one specification can be used and wildcards [*] are processed. If -idir is not specified, the full path below the install directory must be specified.

   -n                  A comment which will be recorded in a note.txt along with diagnostic files. -n must be the last option and the rest of the argument line is the comment.

If -ch is not supplied on the command line, this environment variable is used

CANDLEHOME         [Linux/Unix]

CANDLE_HOME       [Windows]

Support Statement

The snapshot.pl is *not* an officially supported part of the ITM product. Use it at your own risk. If problems arise, the author will work to resolve the issue. At the same time, if you have suggestions or feature requests or improvements, please communicate those to the author.

Summary

The snapshot.pl utility captures operations and diagnostics logs and added files as needed.

Sitworld: Table Of Contents

History

snapshot.0.50000

Initial release

Note: Radar Bubble on new Cruise Ship

 

Sitworld: tacmd logs summary

Sitworld: tacmd logs summary

RascalJr
Version 0.50000 31 December 2015

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

I recently worked a case where a suspicion arose that some tacmd functions were causing problems at a hub TEMS. The information was available in a big collection of tacmd diagnostic trace files, like kuiras1_sysadmin_hextime-01.log. However there were thousands of them at the hub TEMS and a manual review was just too much to think about.

Background

Like many such projects I had another program to work as a template. In my Situation Audit tests, I have a regression suite and a summary program to report on all the regression tests. That had the “search a collection of files and make a list of file names” logic. Scraping through logs just needed some decisions about what to capture and I picked

1) Start Date and Time

2) userid used to authenticate the request

3) Whatever was recorded about the function

4) elapsed time estimate

5) Log Name

I simplified the logic by just working on the current directory.
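The hextime portion of names like kuiras1_jga12624_564f3714-01.log is a Unix epoch time in hex, which is how the start date/time can be recovered from the file name alone. A sketch (decoded in UTC here for determinism; the actual report uses local time):

```python
import re
import time

def log_start_utc(logname):
    """Decode the hex epoch in a tacmd trace name like
    kuiras1_<user>_<hextime>-01.log; returns None when the
    name does not match the pattern."""
    m = re.search(r'_([0-9a-f]{8})-\d+\.log$', logname)
    if not m:
        return None
    return time.strftime("%Y/%m/%d %H:%M:%S", time.gmtime(int(m.group(1), 16)))
```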

Preparing for the install

Perl is usually pre-installed on Linux/Unix systems. For Windows you may need to install it from www.activestate.com or another source. The program only uses Perl core services; no CPAN modules are needed.

tacmd summary has been tested with

This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread

zLinux with Perl v5.8.7

The tool is delivered as a zip file: kuiras1_sum.0.50000. There is one file, kuiras1_sum.pl. Install it somewhere convenient.

Run Time Options

Options:

-h                           display help information

-v                           Produce some progress messages in STDERR

The report is written to kuiras1_sum.csv in the current directory.

Run command example:

perl kuiras1_sum.pl -v

Sample Report

2015/11/20 10:07:00,jga12624,[executecommand],1800,kuiras1_jga12624_564f3714-01.log,

2015/11/20 10:08:31,jga12624,[executecommand],1801,kuiras1_jga12624_564f376f-01.log,

2015/11/20 10:16:34,jga12624,[executecommand],0,kuiras1_jga12624_564f3952-01.log,

2015/11/20 10:21:36,jga12624,[executecommand],0,kuiras1_jga12624_564f3a80-01.log,

2015/11/20 10:24:07,jga12624,[executecommand],1,kuiras1_jga12624_564f3b17-01.log,

2015/11/20 10:38:11,jga12624,[executecommand],0,kuiras1_jga12624_564f3e63-01.log,

2015/11/20 10:45:49,jga12624,[executecommand],1801,kuiras1_jga12624_564f402d-01.log,

Note the report is in ascending sorted order by the start date/time. The interesting thing here is that 3 of the 7 executeCommands took about 1800 seconds to complete. That is suspiciously close to a timeout condition and might need further investigation. That function executecommand actually requires 9 SQLs and can be quite intensive.
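Scanning the CSV for entries close to a suspected timeout can be automated. A sketch assuming the five-field layout shown above, with an illustrative 1800 second limit:

```python
def near_timeout(csv_lines, limit=1800, slack=5):
    """Return the log names of report rows whose elapsed time (field 4)
    is within `slack` seconds of the suspected timeout `limit`."""
    hits = []
    for line in csv_lines:
        fields = line.split(",")
        if abs(int(fields[3]) - limit) <= slack:
            hits.append(fields[4])
    return hits
```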

In real usage, I correlated this data with a Unix performance capture script and found one function that sometimes caused TEMS process size growth.

Summary

Summarize tacmd diagnostic logs.

Versions:

Here are recently published versions. In case there is a problem at one level you can always back up.

kuiras1_sum.0.50000
Initial Release

Sitworld: Table of Contents

Note: Rascal, Jr – waiting in April 2004

 

Sitworld: Restore Usability to ITCAM YN Custom Situations

MagicAndTiger

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #3 – 7 June 2016 – Level 0.52000

Follow on twitter

Inspiration

Recently an ITCAM YN Agent [IBM Tivoli Composite Application Manager Agent for WebSphere Applications] upgrade [to 7.1.0.3.8 or 7.2.0.0.5] caused existing custom situations to be unusable. An IBM colleague spent a week on site at one customer to recover access to the situations. I was asked to create a tool for performing that work with less effort. Here is the result. This only needs to be done once since new custom situations will automatically use the new dynamic affinities. The attribute changes are downward compatible. That means if you only update some agents to the new level [including installing the application support] you can update the user objects with confidence with this tool.

At level 0.52000 a new tool was added to update the TEPS KFWQUERY database objects when necessary.

Background – Affinities

Affinities are a part of the ITM design. They are 43 character strings and have three sections.

1) Affinity string proper – 32 characters

2) Version number – 2 characters

3) Feature – 9 characters

When an agent registers at the TEMS, it supplies an affinities string. This is a definitive statement about what the agent is. The Feature section is mostly about the capabilities of the TEMA or Agent Support library.

Agents give access to attribute groups and the attributes within them. Attributes also have affinities. That is critical because an agent at version 2 might have more attributes than at version 1. The new attributes are tagged with version 2 and that lets TEMS and TEPS make sense of it all.

When a situation formula is composed, the attribute affinities are composed into a joint affinity which defines what agents the situation can run on. For example a Situation with a version 2 attribute could not run successfully on a version 1 agent level.

Originally ITM used static affinities, which were [get out your calculator] base 64 coded, each character representing 6 bits. When the Universal Agent was designed a new dynamic affinities scheme was created. Over time most agents have switched to dynamic affinities. One major dividing point is that static affinities could only represent versions 1 to 12. With dynamic affinities the limit is 65K.

Affinities are used everywhere within ITM. One important place is the Portal Client, which represents visually the connections which can be made between agents and situations.

Humorous War Story

Once upon a time in the middle 1990s at Candle Corporation, much of the development was done on OS/2 workstations. The affinities data [and other table details] was kept in an ODI [Object Definition Interface] file. A REXX program CATNIP was used to generate the attribute files, the catalog and the TEMS doc files. That is still true.

One day the development environment was switched to Windows NT. Many workstations were replaced by new workstations. The old workstations were donated to a nearby school district and re-imaged for a new purpose.

A few weeks later someone noticed that the REXX ODI processing program stopped working. I was eventually asked to work on the issue. I quickly determined that the error occurred when an REXEC call was made to run a C++ binary program on a workstation at a specific IP address. The REXEC call uses TCP communications to run a command on another machine and return the output. That specific workstation had belonged to an [apparently] unaware development manager. The workstations had been replaced, the old workstation had been sent to the school district and that machine had been wiped and reconfigured.

I tracked down the C++ source program – about 250 lines of code. It was doing some base 64 arithmetic – ORing and ANDing the 32 byte affinities. In a couple of hours I redid the work in straight REXX – perhaps 4-5 lines of code. I just had to translate the base 64 characters to 6 bit strings, OR or AND the result, and translate back. That logic replaced the REXEC call and everything started working again. Very low tech, but it made everybody happy that a prime component of the development stack was working again.
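The REXX trick translates to a few lines in any language. A Python sketch; note the 64-character alphabet below is a stand-in assumption for illustration, not the actual ITM encoding:

```python
# Stand-in 64-character alphabet; the real ITM encoding is not shown here.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+*"

def or_affinities(a, b):
    """OR two equal-length base 64 affinity strings character by character:
    decode each character to its 6 bit value, OR the values, re-encode."""
    return "".join(ALPHABET[ALPHABET.index(x) | ALPHABET.index(y)]
                   for x, y in zip(a, b))
```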

ITCAM YN Switches to Dynamic Affinities

When the ITCAM YN Agent was upgraded [going to version 13], the user created [custom] situations were left with the original static affinities. As a result, the Portal Client could no longer process them. If you brought them into the Situation Editor, the Distribution tab would have no candidates for distribution. In addition, Portal Client user created queries against the earlier agent levels lost the ability to do Create Another… which was very painful to users that created a lot of ITCAM YN Agent workspaces.

The current project is designed to cure this issue by updating the situations that have the original static affinities to the new dynamic affinities. This distribution contains a program for the YN agent. It also includes a program to update the TEPS KFWQUERY. Later project releases will be made as needed for other such agents.

Package Installation

The following assumes ITM was installed in the default directory. The data collection work is done on the system which runs the TEPS.  You can certainly do this any number of ways. For example you could capture the data at the TEPS and then copy the files somewhere else to process. If you are using a non-default install directory then you will need to set an environment variable. If needed, IBM support can do the work for you.

The package is itcamyn.0.52000. It contains

1) Perl program itcamyn.pl.

2) An itcamyn.cmd [Windows] command to run the SQL statements.

3) An itcamyn.tar file which contains the Linux/Unix shell command file itcamyn.sh. This avoids problems storing the line endings. Just untar it into the install directory.

4) Perl Program itcamyn_teps.pl.

I suggest these all be placed in a single directory. For Windows you need to create the tmp directory and the sql subdirectory. For Linux/Unix create the sql directory. You can of course use any convenient directory.

Linux/Unix: /opt/IBM/ITM/tmp/sql

Windows: c:\IBM\ITM\tmp\sql

Linux and Unix almost always come with the Perl shell installed. For Windows you would install a no cost Community version from Activestate if needed. You could also move the files to another system where Perl is installed.

Running the ITCAMYN Program.

The following work is performed on the system where the TEPS is running and connected to the hub TEMS

1) Linux/Unix

a) cd /opt/IBM/ITM/tmp/sql

b) If not using default install directory specify install directory like this: export CANDLEHOME=/opt/IBM/ITM

c) sh itcamyn.sh

d) perl itcamyn.pl -lst

2) Windows

a) cd c:\IBM\ITM\tmp\sql

b) If not using default install directory specify install directory like this: SET CANDLE_HOME=c:\IBM\ITM

c) itcamyn.cmd

d) perl itcamyn.pl -lst

The data capture step creates several QA1*.DB.LST files. The perl program creates three files: itcamyn.log, itcamyn.sql and itcamyn.sql.undo.

The itcamyn.log file shows you what is going to be done… the situation names, the old affinities, etc.

The itcamyn.sql file is a file of UPDATE SQL lines to implement the change.

The itcamyn.sql.undo file is a file of UPDATE SQL lines to reverse the change.

*Note: If there is more than one TEPS in the ITM environment, repeat for each TEPS.

Running the ITCAMYN and ITCAMYN_TEPS Programs.

The following work is performed on the system where the TEPS is running and connected to the hub TEMS. The instructions assume the default install directory [Windows: C:\IBM\ITM, Linux/Unix: /opt/IBM/ITM].

1) Get a copy of the current ITCAM YN attribute file

Linux/Unix: /opt/IBM/ITM/tables/<temsnodeid>/ATTRLIB/kyn.atr

Windows: C:\IBM\ITM\cms\ATTRLIB\kyn.atr

z/OS: RKANDATV(KYNATR) – copy to workstation and change the name to kyn.atr

2) Linux/Unix

a) cd /opt/IBM/ITM/tmp/sql

b) copy the kyn.atr from step 1 to this directory

c) run this command: /opt/IBM/ITM/bin/itmcmd execute cq "runscript.sh migrate-export.sh"

this will create an export of the TEPS database: /opt/IBM/ITM/<arch>/cq/sqllib/saveexport.sql

<arch> depends on the architecture you are running on. If you are uncertain, run that command and then locate the file with: find /opt/IBM/ITM -name saveexport.sql

d) run this command

perl itcamyn_teps.pl /opt/IBM/ITM/<arch>/cq/sqllib/saveexport.sql

3) Windows

a) cd C:\IBM\ITM\tmp\sql

b) copy the kyn.atr from step 1

c) cd C:\IBM\ITM\cnps

d) run this command: migrate-export

this will create an export of the TEPS database: C:\IBM\ITM\cnps\sqllib\saveexport.sql

e) cd C:\IBM\ITM\tmp\sql

f) run this command

perl itcamyn_teps.pl C:\IBM\ITM\cnps\sqllib\saveexport.sql

The itcamyn_teps.log file shows you what is going to be done… the query id, when last changed and by whom, old and new affinities

The itcamyn_teps.sql file is the UPDATE SQL lines to implement the change.

*Note: The itcamyn_teps.pl has an optional parameter -atr which you can use to specify the fully qualified name of the kyn.atr file. That can be more efficient if the TEMS and TEPS are running on the same system. The default is that the kyn.atr file is in current directory.

Installing the Changes

The following work is performed on the system where the TEPS is running and connected to the hub TEMS. The example commands assume TEPS is installed in the default location. Adjust that if you have installed in another directory.

1) Linux/Unix

a) make the location of the sql files the current directory

cd /opt/IBM/ITM/tmp/sql

b) run the utility to make the changes to TEMS

/opt/IBM/ITM/bin/itmcmd execute cq "KfwSQLClient /v /f itcamyn.sql"

c) run the utility to make the changes to TEPS

/opt/IBM/ITM/bin/itmcmd execute cq "KfwSQLClient /v /d KFW_DSN /f itcamyn_teps.sql"

2) Windows

a) make the location of the sql files the current directory

cd c:\IBM\ITM\tmp\sql

b) run the utility to make the changes

c:\IBM\ITM\cnps\KfwSQLClient /v /f itcamyn.sql

c) run the utility to make the changes to TEPS

c:\IBM\ITM\cnps\KfwSQLClient /v /d KFW_DSN /f itcamyn_teps.sql

After making the changes, recycle the hub TEMS and the TEPS. Then bring up a Portal Client session and verify that things are going normally now.

One Big Change with Dynamic Affinities

Before dynamic affinities, a situation could be distributed to multiple agent attribute groups. That is no longer the case and you will need to write separate situations. This is a necessary side effect of the conversion to dynamic affinities and there are no plans to resurrect the previous behavior.

Summary

These tools will ease the ITCAM YN agent transition process. There are several other agents being worked on that will need the same work later using a revision to this project. If you feel uncomfortable doing this work yourself, IBM Support will work with you.

Sitworld: Table of Contents

History and Earlier versions

If the current version of the ITCAMYN tool does not work, you can try previous published binary object zip files. At the same time please contact me to resolve the issues.  If you discover an issue try intermediate levels to isolate where the problem was introduced.

itcamyn.0.52000

Add itcamyn_teps.pl – create SQL to update TEPS KFWQUERY database objects

Photo Note: Magic and Miss Tiger – December 2015

 

Sitworld: TEPS Audit

cruise_ship_101

Version 0.53000 15 December 2015

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

A recent case involved very slow Portal Client response time and general TEMS instability including failures. After some study the root cause turned out to be duplicate agent names.

ITM architecture depends on unique agent names – no duplications. However, there is no logic that enforces that rule, and indeed duplicate agent names occur in normal processing. For example, if an agent with two remote TEMS configured loses contact with one remote TEMS because of a communications outage, it can connect via the secondary remote TEMS. During that reconnection logic, the hub TEMS will note the duplicate agent name and accept it. Later on, the first remote TEMS will notice the agent going offline [node status update missing] and tell the hub TEMS the agent is offline; the hub TEMS does the right thing and ignores it. So that duplicate agent name case is a transient normal condition.

However, this case involved some 15 MQ agents all of which had been configured with the same name. This causes the duplicate name logic to occur steadily, 15 times every 10 minutes by default. This project reports duplicate names using a very simple TEPS trace and a report program.
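
The detection idea behind the report can be sketched in a few lines. This is a simplified illustration, not the tool's actual trace parsing; the sample records are invented for the example.

```python
from collections import defaultdict

# Hypothetical node-status updates: (agent name, thrunode, ip address).
# tepsaud.pl derives these from TEPS trace lines; these records are invented.
updates = [
    ("mqagent:MQ", "REMOTE_A", "1.2.3.4"),
    ("mqagent:MQ", "REMOTE_B", "5.6.7.8"),
    ("mqagent:MQ", "REMOTE_A", "1.2.3.4"),
    ("lzagent:LZ", "REMOTE_A", "9.9.9.9"),
]

# Collect the distinct connection instances seen per agent name.
instances = defaultdict(set)
for name, thrunode, addr in updates:
    instances[name].add((thrunode, addr))

# A name seen with more than one instance suggests a duplicate-name configuration.
suspects = [name for name, seen in instances.items() if len(seen) > 1]
print(suspects)  # ['mqagent:MQ']
```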

Background

The TEMS must do work to manage the duplicate agent condition: stopping situations on the earlier agent instance and starting them on the new agent instance. The TEPS maintains a topology of the ITM Agents, and that topology map must be recalculated with every new arrival. Navigator Updates Pending alerts in Portal Client are also seen. In this case the effect was so strong that Portal Client experienced very poor performance. Worse still, only one agent at a time was actually running situations. As the agents swapped in and out of connection, situations kept restarting, and that produced a steady stream of situation events all apparently from the “same” agent, duplicating earlier events.

I worked on this once before in Diagnosing the “Navigator Updates Pending” message, and that technote has been very effective. However, it requires manual work and analysis to correct.

I also added a potential duplicate agent report to the most recent 1.16000 level of Sitworld: ITM Database Health Checker. That report identifies some cases [agents reporting to different remote TEMSs and agents reporting with different maintenance levels] but not others [agents with the same maintenance level reporting to the same remote TEMS]. Therefore something more was needed.

This project works with a TEPS trace and produces a full report on duplicate agent names. That works because the TEPS receives a stream of Situation Status Updates as each agent goes online and perhaps later offline.

At 0.53000 a summary of incoming events is provided. This can be very useful in diagnosing a TEPS that is overloaded with “too many events”. More importantly, when there are event floods that usually means a violation of the prime ITM design: situation events should reflect rare and unusual events which can be worked on to prevent in the future. A violation causes excess work at the agent, the TEMS, the TEPS and the people who attempt to manage the event workload.

Preparing for the install

Perl is usually pre-installed on Linux/Unix systems. For Windows you may need to install it from www.activestate.com or another source. The program uses only Perl core services; no CPAN modules are needed.

TEPS Audit has been tested with

This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread

zLinux with Perl v5.8.7

A zip file is found at TEPS_Audit.0.57000. There is one file, tepsaud.pl. Install it somewhere convenient.

Run Time Options

Options:

-h        display help information
-v        produce some progress messages in STDERR
-log      produce a log file, default yes
-logpath  directory path to TEPS logs – default current directory
-nohdr    remove variable header lines – used for regression testing
-istone   show instance=1 cases

The remaining parameter [if present] is a log file specification. The log files to study can be specified as a partial name such as HUB_ibms98htems_ms_52052f96. That setting analyzes multiple log sections with the same file name starting string.

In hands-free operation do not specify a specific file. Instead use -logpath to specify the path to a normal TEPS logs directory, or leave it off for the current directory. In this mode the directory is examined to determine the current active log segments, and the segments are processed in sequential time order.

Note: If there is a one hour or higher time gap between the first and next segment, the first segment is processed only for certain error messages.
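
That segment-selection rule can be illustrated with a small sketch. The file names and timestamps here are invented; the real tool reads the times out of the log segments themselves.

```python
from datetime import datetime, timedelta

# Hypothetical (segment name, first timestamp) pairs for one TEPS log set.
segments = [
    ("cq_01.log", datetime(2015, 12, 15, 8, 0)),
    ("cq_02.log", datetime(2015, 12, 15, 12, 30)),
    ("cq_03.log", datetime(2015, 12, 15, 12, 45)),
]
segments.sort(key=lambda s: s[1])

# A gap of one hour or more between the first and next segment means the
# first segment is scanned only for certain error messages.
gap = segments[1][1] - segments[0][1]
first_errors_only = gap >= timedelta(hours=1)
print(first_errors_only)  # True
```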

The report is written to tepsaud.csv in the current directory.

Run command examples:

perl tepsaud.pl cmsall.log 

perl tepsaud.pl -v

Configuring TEPS trace for Duplicate Agent Report and Situation Event report

The diagnostic trace required is this: ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)

See Appendix 1 on how to enable that trace. Generally a one hour trace is sufficient for this purpose.

In addition, if you have any of the following set in cq.ini

KFW_CMW_DETECT_AGENT_ADDR_CHANGE=N

KFW_CMW_DETECT_AGENT_HOSTNAME_CHANGE=N

KFW_CMW_DETECT_AGENT_PROPERTY_CHANGE=N

remove or comment them out and restart the TEPS before beginning the capture. They suppress the raw data we need to create the report.

TEPS Audit Report

The report has a lot of details, so a screen image is tough to understand.

There are two sections. The first is the advisory messages [eventually there will be more messages].

Column   | Description                     | Example
Impact   | Importance                      | 100
Advisory | code documentation              | TEPSAUD1002E
Object   | relates to – agent in this case | ibm_test:LZ
Message  | advisory text                   | Node has connected via 62 ip addresses or versions or thrunodes: Likely duplicate name configuration error

The TEPS Node Status Update Report follows. Each agent has a first line

Column              | Description         | Example
Agent Name          | Managed System Name | ibm_test:LZ
Node Status Count   |                     | 291
Different Instances |                     | 62

Instances represent how the agent presented itself, including four pieces of data:

Agent Name

Thrunode [often remote TEMS]

Affinities [type of agent]

Hostaddr [including ip address]

Column           | Description                 | Example
(blank)          | to line up with header line |
Node Status      | Count for this instance     | 7
Online           | Count for this instance     | 7
Offline          | Count for this instance     | 0
Product          | Two character Agent Code    | LZ
Version          | of Agent                    | 06.23.03
Version addendum | Other version information   | A=00:lx8266;C=06.23.03.00:lx8266;G=06.23.03.00:lx8266;
Thrunode         | How the agent connected     | REMOTE_ibmtest101
HostAddr         | ip address and hostname     | ip.spipe:#1.2.3.4[7757]<NM>ibm_test</NM>
HostInfo         | type of host                | Linux~
Affinities       | Type of agent and TEMA      | %IBM.STATIC134 000000000I000Jyw0a7
Timestamps       | When update occurred        | 1150911113710000:1150911114710000: …
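
The Timestamps values use the ITM internal format: a century digit followed by YYMMDDHHMMSSmmm. A small decoder can make them readable; this is my own sketch, not part of the tool.

```python
from datetime import datetime

def itm_timestamp(ts):
    """Decode an ITM CYYMMDDHHMMSSmmm timestamp (century digit 1 = 20xx)."""
    century = 19 + int(ts[0])          # 0 -> 19xx, 1 -> 20xx
    return datetime(
        century * 100 + int(ts[1:3]),  # year
        int(ts[3:5]),                  # month
        int(ts[5:7]),                  # day
        int(ts[7:9]),                  # hour
        int(ts[9:11]),                 # minute
        int(ts[11:13]),                # second
        int(ts[13:16]) * 1000,         # milliseconds -> microseconds
    )

print(itm_timestamp("1150911113710000"))  # 2015-09-11 11:37:10
```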

IP and HOSTNAME and Property change report

There are cases where the TEPS produces a shorter statement about conditions where an agent changes ip address, host name, or affinities. If these are spotted, a report section is generated for each. For example:

The IP Change report starts with a header line

Column | Description         | Example
Node   | Managed System Name | ibm390:LZ

The first line is followed by instances to represent the observed change

Column     | Description                | Example
(blank)    | to line up with first line |
IP Address | ip address of agent        | 127.0.0.1
Count      | number of times seen       | 2

In these cases, the from ip address and the to ip address are counted separately. Both are involved and need to be investigated. If the from ip address is blank, the entry is ignored, since that is the common condition when an agent first connects.

There are similar reports for hostname changes and property changes. These often overlap and solving one case will solve all of them.

Situation Event Report

The TEPS Situation Event Report follows. Each situation name has a first line

Column                | Description     | Example
Situation Name        | Situation name  | ibm_test_situation
Situation event count | Count of status | 291

The first line is followed by instances to represent how and when the situation event status was observed

Column      | Description                   | Example
(blank)     | to line up with header line   |
Node        | Agent where result calculated | Primary:IBM-XXX1397-S:NT
Type        | Sampled=0, Pure=1             | 0
Atomize     | option distinguishing element | arc_IBM_22
Count       | number of status              | 9
Open Count  | situation event opens         | 5
Close Count | situation event closes        | 4
FullName    | Long situation name           | ibm_test_situation
Sitmon Node | which TEMS made event         | REMOTE_-IBM0067-S
Time_Status | Time and event status         | 1151221143708000_Y:1151221143938003_N

The Time_Status is the ITM time stamp when the status was processed by the TEMS, plus the situation event status: Y=Open, N=Close, S=Start, P=Stop, and some rarer cases.
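
Counting opens and closes out of a Time_Status field is a simple split. A sketch using the example value above:

```python
from collections import Counter

# Each entry is timestamp_status; the status letters follow the text:
# Y=Open, N=Close, S=Start, P=Stop.
time_status = "1151221143708000_Y:1151221143938003_N"
statuses = Counter(entry.split("_")[1] for entry in time_status.split(":"))
print(statuses["Y"], statuses["N"])  # 1 1
```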

Recovery Action Plan – Duplicate Agents

The first case was extreme. There were 115 agent names which showed with more than one connection instance. The next worst two were 62 and 54 connection instances.

The recovery action was very straightforward although laborious. Proper running of ITM requires that each agent have a different agent name. Each case needs to be investigated and corrected. For example, in the case above – ibm_test:LZ – each of the 62 systems must be logged on to, the configuration manually changed, and the agent recycled. In most cases, CTIRA_HOSTNAME is not needed: the hostname of each system is naturally different, and without CTIRA_HOSTNAME specified the agent name will reflect the system hostname. If you do need to pick a CTIRA_HOSTNAME, it must be selected so all agents have unique names. In this case it is probably specified in the <installdir>/config directory in the lz.ini file or possibly in the lz.environment file. Edit that file and delete that CTIRA_HOSTNAME line, or specify it so the names are unique.

Do that for every single case of duplicate names… using the report.

When you think you have made progress, redo the TEPS trace and see if any items remain. As the duplicate agent conditions are resolved, ITM will begin running better and better.

The final step is to determine why this duplication exists. Often a “golden system image” is created and then cloned. Part of that process *MUST* be to configure the new agent so it has a unique name. Investigate that process thoroughly to avoid future problems.

This case was extreme, but I have seen a few such cases in almost every large ITM environment.

Recovery Action Plan – Excess Situation Events

In another case, the external symptom was Portal Client giving bad response times and sometimes failing. The Situation Status report showed that 70% of the situation event workload was coming from a single situation, which alerted on cases where Windows services were in a non-running state. The summary showed that the events were opening and closing quite often. The situation was running on a lot of Windows systems, and the combined effect was to bury the TEPS with situation event opens followed quickly by closes. The recovery plan was to rethink that situation so it only alerted on rare and exceptional cases: split it into multiple simpler situations, increase the persist count, and add more tests to ignore common conditions that did not need an alert.

Like many real life cases, once you have the information about what is driving the workload you can figure out how to make things better. In this case there were 550+ active situations and just a single one was causing the problem.

Summary

Identify and correct agent duplicate name configuration issues. If you find any anomalies which are hard to correct, please contact the author.

Versions:

Here are recently published versions. In case there is a problem at one level you can always back up.

TEPS_Audit.0.57000

Avoid a divide by zero condition

Sitworld: Table of Contents

Appendix 1 – Turning Tracing On and Off

A trace filter looks like this

ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)

The first token is a general level of tracing. The unit: specifies a local source unit. An optional Entry= specifies a function name.

From ITM 623 FP2 and ITM 630 GA there is a tacmd function to turn on tracing. You must do a tacmd login first.

Linux/Unix

Turn on trace:

Linux/Unix: ./tacmd settrace -m <tepsname> -p KBB_RAS1 -o 'ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)'

Windows: tacmd settrace -m <tepsname> -p KBB_RAS1 -o "ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)"

[one long line]

Turn off trace

Linux/Unix: ./tacmd settrace -m <tepsname> -p KBB_RAS1 -r

Windows: tacmd settrace -m <tepsname>  -p KBB_RAS1 -r

A second way is to use the service console. If you are unfamiliar see this technote:

Dynamically modify trace settings for an IBM Tivoli Monitoring component 

Turn On:

ras1 set ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)

Turn off:

ras1 set ERROR (UNIT:ctcmw ANY ER ) (UNIT:kv4 ANY ER)

The general rule is that an “any” specification will remove the effect of any prior set(s). The er [error] should be added back.

Sometimes you cannot use the service console. Here is what you do for the different platforms.

Linux/Unix: 

Add this to the TEPS ini – which has the name cq.ini

KBB_RAS1=ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)

Windows:

In MTEMS GUI

– right click on the TEPS line

– Select Advanced

– Select Edit Trace Parms

– In the “Enter RAS1 Filters:” input area add 

ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)

To remove the traces, restore the line to “error” in all cases and recycle the TEPS.

Note: Electrical Heart of a New Cruise Ship

 

Sitworld: Re-re-re-mem-ember Situation Status Cache Growth Analysis


John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #2 – 26 October 2015 – Level 0.52000

Follow on twitter

Inspiration

Recently I had two cases where a remote TEMS process size grew and grew and performance was horrible. To speed up analysis of such cases the following project and tool were developed, and now anyone can diagnose one common case.

Background

When an agent connects to a TEMS [hub or remote], the TEMS gets the right situations running at the agent and evaluates the agent results to determine if a situation event should be opened or closed. The agent sends results and has no awareness of “events”. In fact the results sent to the TEMS might be feeding a workflow policy or *UNTIL *SIT processing. The agent sends one set of results and the TEMS makes copies for all the different purposes.

The TEMS uses an in-storage table called the Situation Status Cache – TSITSTSC – and you can view it as a disk table QA1CSTSC.DB/IDX. For example, if a result with data arrives and the situation event has already been presented, a second event must be avoided; this table maintains the current status. If a situation is sampled and has Persist=4 configured, four true results must be returned in a row before a situation event is presented. If three in a row arrive and then a 0-row or false result, no situation event is ever created.
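
The Persist gating just described can be sketched as a small function. This illustrates the rule only; it is not TEMS code.

```python
def fires(results, persist):
    """True when `persist` consecutive true samples arrive in a row.

    A single false sample resets the count, so three trues followed
    by a false never fires with Persist=4.
    """
    streak = 0
    for sample_true in results:
        streak = streak + 1 if sample_true else 0
        if streak >= persist:
            return True
    return False

print(fires([True, True, True, False, True], 4))  # False
print(fires([True, True, True, True], 4))         # True
```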

Here are the major data items preserved in the in-storage copy of TSITSTSC:

LCLTMSTMP | Time situation was evaluated at Agent
NODE      | Managed System Name or Agent Name
SITNAME   | Situation Name [can be index if long name]
DELTASTAT | Status – Y for open, N for close and others
ATOMIZE   | DisplayItem if configured
SITCOUNT  | Current Persist Count

This is a very ordinary sort of processing assist table.

In one specific case, this in-storage table can grow “forever” or until the TEMS is recycled.

  1. Pure Situation – monitoring a log like Windows NT Event Log
  2. Large volume of results
  3. DisplayItem used and constantly changes – such as a Description that has an embedded date/time stamp

When TEMSes run as 32-bit program objects [kdsmain], the upper limit is somewhere under 2 gigabytes. There is one Linux configuration which allows somewhat under 3 gigabytes. The storage growth from the Situation Status cache eventually causes a TEMS failure. It also forces higher and higher CPU Resource consumption because the in-storage table is searched linearly.

These days many TEMSes run as 64 bit program objects. The failure mode now is that TEMS size and resource consumption rises until someone notices and recycles the TEMS. On one memorable occasion an AIX LPAR actually exhausted system paging space and experienced a forced shutdown.

Why Create Situations which Cause such problems?

One reason is that it is convenient to have the DisplayItem filled in. In the Portal Client, the Situation Event Console will show more about the impact or issue. For example, a full mount point situation can show which mount point. That can also be useful in programming event receiver logic. However, for the problem case above with very high volume and long Descriptions, that reason is hard to justify.

A second reason involves a little documented optimization whereby situation events can be merged. If a situation result with no DisplayItem arrives with 1) same node, 2) same situation and 3) same time to the second, that result can be logically merged with the matching result. Within Portal Client, the multiple events can be displayed [to a maximum of 10]. From an event receiver [e.g. Omnibus] standpoint, the second and subsequent results in the same second are never seen. If the DisplayItem is specified and different, each result will cause a separate situation event. In many cases this allows the event receiver to see all the events.
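
The merge condition can be illustrated as a key comparison. The field names here are my own for illustration, not the TEMS internals.

```python
# Events sharing node, situation, and timestamp-to-the-second merge when no
# DisplayItem (atomize) is set; a distinct atomize value keeps them separate.
def merge_key(node, sitname, second, atomize=""):
    return (node, sitname, second, atomize)

a = merge_key("ibm_test:NT", "win_service_sit", "1151221143708")
b = merge_key("ibm_test:NT", "win_service_sit", "1151221143708")
c = merge_key("ibm_test:NT", "win_service_sit", "1151221143708", "svc2")
print(a == b, a == c)  # True False
```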

This merging can happen with Sampled situations but it is very rare and almost never causes a problem.

Since 2010 there has been a TEMS configuration to prevent such merging. See this technote for full details and implementation instructions.

ITM Pure Situation events and Event Merging Logic

You configure the TEMS receiving results from an agent type to perform one-result-creates-one-row logic for a specific attribute group.

An even better solution is to use a modern agent like the Tivoli Log Agent – which has been part of ITM for many years. That agent can be configured to send results directly to the event receiver and thus not burden the TEMS at all.

Identifying the Situation Status Cache Issue

The easiest way to recognize the issue is to check the size of the TEMS QA1CSTSC.DB file. If it is more than 32 megabytes *and* keeps growing over time, the problem may exist. If that file grows into the hundreds of megabytes, the TEMS is trending toward failure. You might have to check more than one remote TEMS depending on how the agent workload is configured.
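
A quick size check can be scripted. The path below is hypothetical; substitute your own TEMS tables directory.

```python
import os

THRESHOLD = 32 * 1024 * 1024  # the 32 megabyte guideline above

def cache_suspect(size_bytes):
    """Flag a QA1CSTSC.DB size that warrants watching for growth."""
    return size_bytes > THRESHOLD

# Hypothetical location; adjust for your installation and TEMS name.
path = "/opt/IBM/ITM/tables/HUB_TEMS/QA1CSTSC.DB"
if os.path.exists(path):
    print("suspect" if cache_suspect(os.path.getsize(path)) else "ok")
```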

Until now, identifying the specific situations causing the issue has been extremely technical. This blog post and project will let you do it yourself anytime you want. This project contains a data capture command and an analysis program which will show you which situations are contributing to TSITSTSC growth in bytes per day. That report can be used to make the needed configuration changes and thus make monitoring stable and more efficient. This can also be done by IBM Support if needed.

Package Installation

The following assumes the default install directory. The data collection work is done on the system which runs the TEPS. You can certainly do this any number of ways; for example, you could capture the data at the TEPS and then copy the files somewhere else to process. If you are using a non-default install directory, the shell files will need to be modified. The choice of where to store the program objects is arbitrary – pick whatever you want.

The package is sitcache.0.52000. It contains

1) Perl program sitcache.pl – standing for Situation Status Cache.

2) A  sitcached.cmd [Windows] shell command to run the SQL statements.

3) A sitcached.tar file which contains Linux/Unix versions of the SQL files and a sitcached.sh file. This avoids problems with the line endings. Just untar that into the install  directory.

I suggest these all be placed in a single directory. For Windows you need to create the tmp directory and the sql subdirectory. For Linux/Unix create the sql directory.

Linux/Unix: /opt/IBM/ITM/tmp/sql

Windows: c:\IBM\ITM\tmp\sql

4) Most often you want to investigate a specific remote TEMS. The sitcached shell/cmd file takes an optional parameter of the TEMS nodeid [not the hostname].

Running the Program.

1) Linux/Unix

a) cd /opt/IBM/ITM/tmp/sql

b) If not using the default install directory, specify it like this: export CANDLEHOME=/opt/IBM/ITM

c) sh sitcached.sh – if interested in a specific remote TEMS: sh sitcached.sh <temsnodeid>

d) perl sitcache.pl -lst

2) Windows

a) c:\IBM\ITM\tmp\sql

b) If not using the default install directory, specify it like this: SET CANDLE_HOME=c:\IBM\ITM

c) sitcached.cmd – if interested in a specific remote TEMS: sitcached.cmd <temsnodeid>

d) perl sitcache.pl -lst

One file is created – sitcache.csv.

Screen shot

Here is a view of the CSV file from LibreOffice Calc. Some rows were deleted for this presentation

[screen image: sitcache1]

The Situation name presented is the FullName – as would be seen in the Situation Editor. The report is shown in descending order by an estimate of the number of bytes in storage and the growth in bytes per day. The Total line shows the number of seconds covered [about 12 days in this case].
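
The growth-per-day figure is simple arithmetic over the trace window. The numbers below are invented for illustration.

```python
# Convert observed cache growth over the measured window into bytes per day,
# matching the report's growth column. Values here are made up.
elapsed_seconds = 1_036_800          # exactly 12 days, as in the Total line
bytes_grown = 120 * 1024 * 1024      # observed TSITSTSC growth over the window

per_day = bytes_grown / (elapsed_seconds / 86_400)
print(round(per_day))  # 10485760
```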

This specific case showed two situations which composed 80% of the storage growth. Both came from the Unix Log Agent; when the remote TEMS was configured for pure-result one-row processing and the DisplayItem was removed, the problem was resolved.

Summary

The Situation Cache tool was derived from  Situation Distribution Report.

Sitworld: Table of Contents

History and Earlier versions

If the current version of the Situation Cache tool does not work, you can try previous published binary object zip files. At the same time please contact me to resolve the issues.  If you discover an issue try intermediate levels to isolate where the problem was introduced.

sitcache.0.52000
parse_lst handle null chunks correctly

Photo Note: Attaching a Propulsion Unit to a New Cruise Ship in Italy

 

Sitworld: Attribute and Catalog Health Survey


John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #5 – 12 February 2016 – Level 0.83000

Follow on twitter

Inspiration

Recently I worked with a customer that ran into a rarely seen ITM limit. ITM uses catalog and attribute files to define the data that agents can process from their monitored environments. The TEMS reads the catalog files into a combined catalog table and the attribute files into an in-storage attribute collection. These get used in Situations, Historical Data, Real Time data displays and more. This customer had added the 513th catalog file and the TEMS failed during startup. Internally, .cat files are known as package files and there is an absolute limit of 512 packages. With IBM Support help, the customer removed a few .cat and .atr files, reset the combined catalog file to empty, and the TEMS started up just fine.

However this meant the customer was unable to install certain types of maintenance or new applications. There was an urgent need for a reliable way to identify unused catalog and attribute files.

The result is this package, which calculates the unused catalog and attribute files. It also produces a health report which identifies error cases such as an attribute group used in a situation but missing from all attribute files.

Data Sources

The Attribute files are taken from the hub TEMS environment:

Linux/Unix: <installdir>/tables/<temsname>/ATTRLIB

Windows: <installdir>\cms\ATTRLIB

The Catalog files are taken from the hub TEMS environment:

Linux/Unix: <installdir>/tables/<temsname>/RKDSCATL

Windows: <installdir>\cms\RKDSCATL

The Situation definition is taken either directly fr