Sitworld: ITM Database Health Checker

Phoenix

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #58 – 23 August 2019 – Level 1.71000

Follow on twitter

Inspiration

In Spring 2014 I was tasked to repair a customer database that experienced a newly identified issue. If an agent was in the node status table [TNODESAV] and the related object in the NODELIST table [TNODELST NODETYPE=V] was absent, the agent dropped into a never never land where no situations would run and no real time data could be retrieved. After preparing a recovery action plan and the customer site was again working properly, I challenged myself to discover this issue from the TEMS Database files [EIB] alone. In a few days I was able to replicate the list of problem agents that had taken quite some weeks to calculate using usual methods.

From  level 1.38000 onward, the end of the report includes an explanation of each advisory message. That was previously in a word document but it was more convenient to include the text in the report itself.

Background

I searched for more such cases of EIB inconsistency and other discoverable problems. The goal was to accelerate problem diagnostic time by providing advisories on customer problems. After a year of effort this project can now discover 52 issues. The issues range from the most severe to the totally innocuous. Now it is time to make this project available to a wider audience. In some cases the issues can be handled independently. In other cases the issues require IBM Support to cure. The issues are ranked in impact from 0 to 110. At 100 or higher, monitoring is not happening in some cases, which is the severest case. There are false positives, for example if you accidentally examine a remote TEMS and the missing node status information produces masses of uninteresting advisories.

The publishing delay was caused by new issues which appeared. A second reason for delay was a desire that the recent maintenance – ITM 630 FP4 – contained fixes for almost all observed issues. In some cases there is repair needed for defects experienced earlier. However in all known serious issues, new problems will not arise when running at latest maintenance levels. There may be error conditions new examples will not arise. Some cases arise from hardware or human error but speeding diagnosis minimizes the impact.

I chose to capture the live system data using a TEPS utility KfwSQLClient. It gets up to date information and it is also a process which uses the existing TEPS to TEMS connection. The retrieval process is rapid and reliable. If needed the resulting files can be moved to another system for analysis. This document assumes you install and run on the same system, however that is not required.

The output report is a .csv report file. The binary package includes a Word document with each issue explained and a recovery plan suggested.

Version 1.02000 now checks all tables which have user data and are synchronized with an FTO hub TEMS. The issue count is now 64.

Version 1.07000 adds the first Case Study to the documentation. This documents a case where Database Health Checker diagnosed a new error case and how that case was promptly resolved. With this Case Study any user could do the same and achieve an immediate recovery.

Version 1.16000 adds an important section which identifies possible duplicate agent names. This is usually an agent configuration mistake. Duplicate agent names cause severe TEMS and TEPS stress. In addition it reduces monitoring since only one agent name at a time will be show events or real time data

Version 1.21000 adds reports about a sampling of situation event history objects. These can be almost invisible stress factors.

Version 1.23000 handles Agent CF better, which is a non-TEMA managed system and so cannot run situations and is not expected to have MSLes etc.

Run via IBM Support!!

If you create a support CASE and upload a hub TEMS pdcollect, a datahealth.csv report is automatically generated. You can ask the support person to make the report file available to you via a FTP server. There may be rare times when this process fails to product a report, for example if there are no advisories above impact zero, but that is extremely rare,

Package Installation

The default install directory is used in the document, however you can make any installation choices. The EIB data is captured on the system which runs the TEPS.  For example you could capture the data at the TEPS and then copy the files somewhere else to process. The choice of where to store the program objects is arbitrary.

Linux/Unix systems come with Perl installed. Windows may need it installed and I use http://www.activestate.com/activeperl, community edition 5.20.No CPAN modules are needed for this package. It will likely work on many different levels.

The package is datahealth.1.71000. It contains

1) A Perl program datahealth.pl and control file datahealth.ini – standing for Database Health Checker. The ini file is not used in the package.

2) A Windows datasql.cmd file to run the SQL statements

3) A datasql.tar file which contains Linux/Unix version of the datasql.sh. This avoids problems with line endings. To use untar datasql.tar into the installation directory.

4) If the install directory is not the default, set the required environment variable before capturing data

I suggest these all program objects be placed in a single directory. For Windows you can create the tmp directory and sql subdirectory. For Linux/Unix create the sql directory.

Linux/Unix:  /opt/IBM/IBM/tmp/sql

Windows: c:\IBM\ITM\tmp\sql

You can run this in any directory, of course.

Running the Database Health Checker Program

Linux/Unix

a) cd /opt/IBM/IBM/tmp/sql

b) If not using default install directory specify like this: export CANDLEHOME=/opt/IBM/ITM

c) sh datasql.sh

d) perl datahealth.pl -lst

Windows

a) c:

b) c:\IBM\ITM\tmp\sql

c) If not using default install directory specify like this: SET CANDLE_HOME=c:\IBM\ITM

d) datasql.cmd

e) perl datahealth.pl -lst

A report file is created: datahealth.csv which is the report. There is a log file which is not much used at present. The rest of the document shows some screen shots.

Screen shots of Database Health Checker Report

The beginning of the report contains the version number, a summary of the hub and remote TEMSes and a record if FTO is configured.

data1

The Hub row has the total agents. The TEMS rows listsTEMS nodeid,  ONLINE/OFFLINE status, the agent count and the version number.

Next is a view of the TEMA Deficit report. [See Appendix for Technical Deficit explanation] TEMA [or the Agent Support Library] is a part of the TEMS and it is bundled with the OS Agent. When this TEMS logic is running at a lower version then the rest of the TEMS, APAR fixes are not present. This is called a TEMA deficit and the ITM environment is exposed to problems that would otherwise never be seen. There have been 122 APAR fixes since the first release in 2005. For an exact analysis of what agents are missing which fixes, please see ITM Agent Census Scorecard.

data2

Average APARS TEMA version less the TEMS version: The average number of APAR fixes missing on each agent compared to all OS Agents upgraded to existing TEMS maintenance level.

Average APARS TEMA version less the latest TEMS version: The average number of APAR fixes missing on each agent compared to  TEMS and all OS Agents upgraded to highest TEMS maintenance level of ITM 630 FP4.

These averages are calculated from every Agent that records a TEMA version [not all do]. Some could be at level and others way behind. One TEMA can be used by many agents, so updating a single OS Agent can improve many other agents.

 The last section contains advisory messages.

data3

The first part is the count of advisory messages and that is followed by the messages. Note that the impact can range from 0 to 110. At 100 and over, some monitoring may not be running. That means no situations, no real time data and no historical data. To increase your interest, here are examples of two cases.

The advisory code DATAHEALTH1037W is fully described in the Word file which is included with the programs. That one looks like

 25 DATAHEALTH1037W CLO_DB2:ibmw1dbs02:LO Agent at version [06.30.00] using TEMA at lower release version [06.23.03] 

So impact is 25 – not too terrible. The third column lists the object involved, this is often an agent name but not always.

This particular case involves an agent at ITM 630 level which is connecting to a TEMS at agent ITM 623 FP3. This is not a supported level and is not tested and is certainly not recommended. It may work or it may cause problems. It is impossible to predict. You will experience fewer problems if you stay within the product guidelines and avoid being one of the “far away from normal” environments.

Another interesting one – which was quite a surprise when I started seeing reports from real life environments.

80 DATAHEALTH1010W ip.pipe:#153.40.36.48[6015]<NM>ibmaswdbc2</NM> TNODESAV duplicate hostaddr in [ibmaswdbc2.nam.nsroot.net:KUX[rtems-gt3-ibm1][Y] ibmaswdbc2:KUX[rtems-gt3-ibm1][Y] ] 

That is two agents, both on the same system, both appearing online both claiming to use the same listening port.  [That is literally impossible from a TCP Socket usage standpoint.] Likely one agent was stopped and never actually completed shutting down and the other came up and reused the listening port. That is confusing and wrong and needs investigation. It isn’t horrible because monitoring is continuing but it is a waste of resources and a waste of human analysis time. In some cases it can cause performance problems at the TEMS and the TEPS.

Always remember there can be false positives and so each case needs to be handled separately.

Report Layout

First there is a general list of hub and remote TEMSes, TEPSes, i/5 agents and a FTO present statement.

Following that are advisory messages.

Following are many potential report sections. Some are shown only if the condition exists.

1) Top 20 most recently added or changed Situations

2) TEMA APAR Technical deficit – How far behind these Agents are compared with latest maintenance

3) Flapper Situations – situations which go on and off a lot

4) Pure Situations with DisplayItems Report – potential for serious storage growth

5) End of Service TEMA – Agents running past out of service levels

6) Future End of Service TEMA – Agents which will be out of service at an announced future date.

7) Maximum Top 20 agents showing online status more than the most common number. This is often a sign of duplicate agent name cases,

8) Top 20 Situation Event Report

9) TEMS Situation Load Impact Report – How heavily the hub and remote TEMS are loaded.

10) TEMA Agent(s) in APAR IZ76410 danger – agents can be connected to two remote TEMS at same time

11) TEMA Agent(s) in APAR IV18016 danger – agent looping during TEMS connection

12) TEMA Agent(s) in APAR IV30473 danger – KDEB_INTERFACELIST issues

13) Systems with Multiple TEMA levels

More will be added over time.

Don’t Panic!!!

The first time you run the report you may see many many advisories. Remember that the higher impact ones are the most important. You may have a new case where the analysis routine makes a false estimate of the condition. Some are relatively easy to resolve and some require IBM Support to resolve… you are not alone. Read through the documentation on messages and clear things up. As you correct problems your monitoring environment will run smoother.

These advisory messages are a personal judgment based on working with ITM for many years. There are no absolutes here and if one of the advisories seems incorrect then just ignore it and move on. Or contact the author and ask questions.  For ITM environments that have been running for many years, you will likely see a lot of 0 impact advisories. Usually these are things like Managed System Lists for products no longer installed or references to agents that have been decommissioned. It is sort of nice to clean these up, but the effect is usually minor. [I had one case where a customer had 100,000 Nodelist table objects and fully 35% were for MSLs no longer used at all. Removing them did improve TEMS performance.]

There are very rare cases where problems are caused by TEMS database file problems. One big help is the  division between data .DB file and index .IDX file. This Health Checker project uses the index file and when problems occur the index file is most commonly broken. The data file is rarely damaged* because all data is written into new areas and deleted records are simply updated with a “delete” flag added. Thus at any time the disk data file is consistent before and after an update.

IBM Support has tools to work directly with the data .DB files and can often recover data and then use that to rebuild the table. This can also be done with backup files. Every customer should have a best practice backup and recovery plan so please review this blog post co-written with L3 TEMS:

Sitworld: Best Practice TEMS Database Backup and Recovery.

I have worked on rare cases where the lack of a reliable backup caused a week+ outage and much manual rework. That happened maybe every other year. When making plans, please review the technote and chose one of the five ways to create a reliable backup. A simple copy of the files – manual or automatic – is useful in some ways but is not reliable enough for a restoration. If you have a reliable backup you can often use the Recovery process by yourself and minimize outage time. The document also has reference to needed emptytable files used during a recovery.

*Note: One memorable case had the data on a large Network Access Storage device that lost power and lots of files were seriously damaged.

Summary

This report shows problems in TEMS database tables.

This is a work in progress and will be updated periodically. If you have any feedback or insights about missing features, please communicate with the author.

Sitworld: Table of Contents

History and Earlier versions

If you wish to access more recent updates than the point releases below, access https://github.com/jalvo2014/datahealth.

If the current version of the Database Health Checker tool does not work, you can try previous published binary object zip files. At the same time please contact me to resolve the issues.  If you discover an issue try intermediate levels to isolate where the problem was introduced.

datahealth.1.71000

Put Sampload at start

Add duplicate historical data collection report/advisory

add advisory on KBB_RAS1 trailing single quote

clarify some Agent APAR danger report titles

 

Appendix 1: Technical Deficit/Debt Explanation

Technical debt is a well known concept in program development and support over time.

My interpretation is: As systems evolve over time the need for changes arise and there is always a choice between

1) Ignoring the issue until you are forced to change and take the hit of losing customers or reputation or product benefit.

2) Making the simplest possible change with minimal effort disregarding design and good practice.

3) Redesigning the system to avoid the problem and adding regression tests to detect similar issues in the future.

Any of the choices are legitimate business responses. Choice (3) means you are investing in the future to reduce future costs and enhance the customer experience.    (1) and (2) limit current costs while postponing changes and costs into the future. (2) vastly increases the cost of later changes since it is a patch or quirk which usually makes redesign much more costly.

Technical Debt is something like a credit card purchase: You get the benefit now and postpone costs to the future.

One author related it to a stock market call option. In that transaction you sell the right to a future purchase of a commodity at a fixed cost. You get money now and if the commodity drops in price you can deliver the goods and make money. If the commodity increases in price too much you lose money. (1) and (2) is like selling a call and expecting that the future costs will not be too high later on. Hope springs eternal but is rarely rewarded.

This TEMA Deficit report section judges failure to update OS Agents as an example of (1) – Ignoring the issues until you are forced to make a change. Making updates to handle specific issues is an example of (2). Updating agents to prevent known defects is an example of (3) – adapting to avoid future problems.

Not updating agents is taking on a Technical Debt. That path increases costs and reduces product benefits – and should be avoided. The report section identifies how many APAR fixes you are in deficit. For a report with more detail please see ITM Agent Census Scorecard, which lists the name and abstract of each APAR fix affecting agents and how many agents are affected. Following that are more detailed reports so you can drill down to your hearts content. That can be highly valuable when determining when an agent upgrade makes sense.

Photo Note: Wood Carving of the Mythical Phoenix – Big Sur, California  Nepenthe Restaurant

 

7 thoughts on “Sitworld: ITM Database Health Checker

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: