Sitworld: TEMS Audit Process and Tool

ArtDecoFeline

Version 2.15000 11 September 2019

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

There have been cases every year where a TEMS was running with high enough CPU/Storage resource usage that the customer was concerned. In some cases, the TEMS experienced a steady storage growth and failure after some days. This condition is usually caused by workloads, configuration or environmental conditions. Situations are the most common workload but SOAP calls, historical data collection and Portal Client workspaces can be involved. In addition, there may be important error messages in the diagnostic log.

Background

In 2008 I began work on a year-old customer problem that took until Spring 2010 to resolve. The conclusion was that a certain type of situation caused a severe storage fragmentation and a TEMS failure in 10-12 days. The customer decided to recycle the affected remote TEMSes once a week. That was probably the single most expensive PMR [for IBM and customer] I ever worked on.

Some 6 months later in Dec 2010 I had a customer with six AIX servers running at 95% utilization and all from remote TEMS processing. The solution was presented but the customer was unconvinced. I wrote a Perl program to summarize the results in a spreadsheet file. The customer was convinced, made the changes and those six systems dropped to 10% utilization. In March 2011, I published the process and tool as a technical note and it is now widely used.

Recent changes

Since the last published version 1.37000 [28 advisories] 38 new advisories have been added and many report sections to explain the environment further. At level 1.64000 the documentation has been reworked for simplicity and clarity.

  1. The TEMS Audit report itself includes an explanation of each advisory and that is not duplicated in this document.  The temsaud.pl program can be viewed to see each advisory explanation.
  2. The separate word file documents have been eliminated and information added here, revised and extended.
  3. Each report section is documented along with needed traces.

These changes will continue. The source in every case is based on diagnosing a specific customer issue and when possible adds an advisory or report section which would speed the diagnostic effort – even allowing the customer to correct the condition whenever possible. In recent months there have been several advisories and/or report sections added each month.

TL;DR – Too Long; Didn’t Read

Sometimes you want immediate results and to heck with the details. Here are the instant use instructions. Unpack the most recent zip file and copy the temsaud.pl and testaud.ini files to a convenient location where Perl is installed [all Unix and Linux environments] and on the system where a TEMS is running. Let’s pick /tmp and the TEMS is installed at the default directory. Now run this command

perl temsaud.pl -v -logpath /opt/IBM/ITM/logs

and if there should happen to be more than one candidate inventory file,

perl temsaud.pl -v -logpath /opt/IBM/ITM/log hostname_ms_kdsmain.inv

and then view the report file temsaud.csv. The end of that file includes an explanation of each advisory. If you want more information about the TEMS workload then add the needed traces documented later.

Linux and Unix systems usually have Perl installed.

On Windows, you can install the community edition at www.activestate.com. You can also just copy the needed files to a system where Perl is installed.

For a z/OS TEMS trace, capture the RKLVLOG sysout log into a file and run temsaud.pl with the -z option. I worked to make that automatic but have not found the secret sauce.

Easy access to TEMS Audit report file – Just ask for it!

If you have a Case or PMR and send in a pdcollect of a TEMS, a TEMS Audit report will usually be created. [There are rare cases where files are missing and the report cannot be created]. You can request that the TEMS Audit report be sent to you by the support engineer working on the Case/PMR.

Installation

Perl is usually preinstalled in Linux/Unix systems. . For Windows you may need to install from www.activestate.com or any other source. Alternatively you can copy the needed log files to a system where Perl is installed. The program only uses Perl core services, No CPAN modules are needed.

TEMS Audit has been tested with

This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread

zLinux with Perl v5.8.7

Files

There are two TEMS Audit program files in a zip file here.

1) temsaud.pl                                      Perl program

2) temsaud.ini                                     Option file for setting nominal values

Install them somewhere convenient.

Run Time Options.

There are many but temsaud.pl does not require any of them.

Options:

-b                                 Show HEARTBEATs in Managed System report

-cmdall                        When cmd trace is present, report all commands

-expslot <mins>          Historical export report, display slot size – default 60 mins

-flip                              Produce detailed report on agent location flipping, ITM 630 FP7

-h                                 display help information

-jitall                             produce advisory on minor level heartbeat jitter

-jitter                             produce jitter reports, default no jitter reports

-logpath                       Directory path to TEMS logs – default current directory

-o                                 Specify the output report filename

-odir                             Specify directory where output files will be stored

-ossdir                         What directory to write send status report

-nohdr                          Remove variable header lines – used for regression testing

-rd                                Result detail reports

-ri <secs>                    result workload report by interval, default 60 seconds

-rdslot                          Seconds in result detail slots, default 60 seconds

-sqldetail                      produce a SQL detail report by source and table involved

-sr                                SOAP Report minute by minute – soap_detail.txt

-ss                               produce a report send status – used for multiple remote TEMSes

-sum                            produce a one line summary file

-v                                 Produce some progress messages in STDERR

-work                           Copy files to work directory before analyzing.

Environment variable Windows – TEMP, Linux/Unix tmp

-workpath                    Directory path to work directory, default is the system

-z                                 z/OS RKLVLOG log   

-eph                             Capture ephemeral file report

-ephdir                         Where to store ephemeral file report

The remaining parameter [if present] is a log file specification. Often no specification is needed. It will look at the -logpath setting and determine the current logs from the inventory file. Sometimes there is more than one candidate and you can specify it like

Hostname_ms_kdsmain.inv

You can also specify log files to study as a partial name such as

ibms98htems_ms_52052f96

That setting analyzes multiple log sections with the same file name starting string. That can be extremely useful of the .inv file is missing or was not sent with the diagnostic files to be studied. It is also the way to analyze a diagnostic log set that is not the most current.

In hands-free operation, do not specify a specific file. Instead use -logpath to specify a path to a normal TEMS logs directory or leave it off for the current directory. In hands-free mode the directory is examined to determine the current active log segments. If there are multiple .inv files the one with the most recent diagnostic logs is chosen. The log segments are processed in the sequential time order. When -work option is set, the diagnostic logs are copied to a work directory to avoid the issue where a log is deleted while being used. Specify the. inv file in case there are more than one candidate.

Note: If there is a one hour or higher time gap between the first and next segment, the first segment is processed only for certain messages. That said there are times when you will need to manually assemble a single combined log to analyze. That can be true if traces are turned on dynamically. Some of the reports use the time to calculate rates and isolating just the traced log makes that more accurate.

The default output report is temsaud.csv. Use the -o option to specify your own name for the report.

One section of the report is advisories. If the maximum advisory impact level is equal or greater than 50, the program ends with exit code 1 and otherwise 0. That can be used in shell files to control other actions – like sending email.

Run examples:

perl temsaud.pl cmsall.log 

perl temsaud.pl -v

Reports and Diagnostic tracing needed

The reports are intended to be self-contained and easy to use. With no tracing beyond the default ERROR you get a lot of useful information. To get full value you need to set tracing. See post on how to set tracing. The following shows a report and how to get extra value by setting diagnostic tracing.

Header

This labels the report and TEMS Audit version number, when run, the runtime parameters, the logs used [if -v is specified] and a line which is the TEMS nodeid and *LOCAL for hub TEMS or *REMOTE for remote TEMS. Occasionally the last line is missing because not in the logs. The header can be suppressed using the -nohdr option.

TEMS Audit report v1.63

Start: 2017-05-24 15:44:38

Runtime parameters: -sum -rd -flip -odir /ecurep/tmp/ITM_Health/Awn0gXM2ky/ -logpath /ecurep/pmr/4/8/48002,080,678/2017-05-24/48002.080.678.pdcollect-w94486.tar.Z_unpack/autopdzip/autopd/IBM/ITM/logs/logs w94486_ms_59243128

“working on 0 /ecurep/pmr/4/8/48002,080,678/2017-05-24/48002.080.678.pdcollect-w94486.tar.Z_unpack/autopdzip/autopd/IBM/ITM/logs/logs/w94486_ms_59243128-01.log”

HUB_w94486 *LOCAL

Advisories

The header is followed by advisory messages. These are indications of errors and warnings from the diagnostic log. The advisories are sorted in reverse order by impact. These impact values selected based on years of experience however feel free to use you own judgement. These are Advisory and not The Law. The following is just one recently created. There are 56 advisories at ths\is writing and more are added all the time.

Advisory Message Report – *NOTE* See advisory notes at report end

Impact,Advisory Code,Object,Advisory,

200,TEMSAUDIT1052E,HUB,Hub TEMS has lost connection to HUB 84 times

100,TEMSAUDIT1050W,PCB,Agent connection churning on [1] systems total[5] – See following report

90,TEMSAUDIT1055W,Agent,Agent Location Flipping Changes Detected [11] – See following report

90,TEMSAUDIT1061E,TEMS,Situation [ddb_sanprb_x07c_aix_gen] with unknown attribute [K07_SANPATH.MonitoringSolution] – [*IF ( ( *VALUE K07_SANPATH.RC *EQ ’20’ ) *OR ( *VALUE K07_SANPATH.RC *EQ ’21’ ) *OR ( *VALUE K07_SANPATH.RC *EQ ’22’ ) *OR ( *VALUE K07_SANPATH.RC *EQ ’23’ ) ) *AND *VALUE K07_SANPATH.MonitoringSolution *EQ ‘aix’]

80,TEMSAUDIT1016W,Onlines,3 Agents with repeated onlines

80,TEMSAUDIT1035W,diagnostic,7 worrying diagnostic messages – see later report section

80,TEMSAUDIT1054W,SOAP,SOAP Errors Types [2] Detected – See following report

50,TEMSAUDIT1021W,timeout,4 Agent time out messages

0,TEMSAUDIT1005W,Stack,ulimit stack [4294967296] is above nominal [10240] (kbytes) 

The pattern of each advisory is an impact level, an advisory identification, a category and a text message. Some reference later report sections. The first one above

200,TEMSAUDIT1052E,HUB,Hub TEMS has lost connection to HUB 84 times

Impact: 200 The highest current level.

ID: TEMSAUDIT1052E a sequential number

Category: HUB the hub TEMS is involved

Text: Hub TEMS has lost connection to HUB 84 times

At the end of the temsaud.csv report you find the following explanations of each advisory. Here is the one for the issue above.

Advisory code: TEMSAUDIT1052E

Impact:200

Text: Hub TEMS has lost connection to HUB count times

Tracing: error

(58D77E39.0002-11A0:ko4mgtsk.cpp,133,”ManagedTask::sendStatus”) Connection to HUB lost – stopping situation status insert

Meaning: This is produced when one hub TEMS task can no longer connect with the dataserver function of the hub TEMS. This is a severe error which is only recovered with a hub TEMS recycle.

The hub TEMS will not recover without a recycle.

This is a most severe issue and needs prompt diagnosis. It can be caused by many reasons including TEMS database problems, network problems and excessive workload.

Recovery plan: Contact IBM support for aid in diagnosis. 

<end of advisory explanation>

Reports

The reports are documented in the temsaud.csv report at the end. Here is an example:

TEMSAUDIT1067E

Text: Unable to get attributes for table <table> [<count> times]

Tracing: error

(591ACD2E.0000-7C:kshcat.cpp,296,”RetrieveTableByTableName”) Unable to get attributes for table tree TOBJACCL

Meaning: The SOAP request failed. This hints of an out of storage

issue, a broken attribute collection or mssing application

support files.

Recovery plan: Work with IBM Support to resolve issue. 

Tracing Guide in a Separate Document

The Tracing guide is documented in a separate post because of a document size limit:

sitworld: TEMS Audit Tracing Guide Tracing Guide Appendix

The tracing appendix is in a separate post because of a document size limit. It shows exactly how to set diagnostic traces for TEMS Audit both statically and dynamically.

Summary

This document shows how to collect performance data from a TEMS. As a result you can make operations more efficient and complete.

  Versions:

If you wish to track recent updates access the github project  https://github.com/jalvo2014/temsaud.

Here are recently published versions, In case there is a problem at one level you can always back up.

2.15000

Add better duplicate agent report082

Add report about rpc errors

Summary

Identify and correct workload and configuration problems. I encourage anyone to share success stories, enhancement requests or problems found.

Sitworld: Table of Contents

Note: Art Deco Cat sculpture

 

17 thoughts on “Sitworld: TEMS Audit Process and Tool

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: