John Alvord, IBM Corporation
Draft #5 – 7 September – Level 1.07000
I recently worked with a customer who had an unstable TEMS environment. There were 3152 agents and 13 remote TEMS. The workload wasn’t high. The instability manifested as remote TEMS missing heartbeats, which resulted in remote TEMSes going offline and was very disruptive. In one 14-hour period the hub TEMS noted 60 missed heartbeats.
One unusual observation in the diagnostic log was that the Listen pipes count was 21, much higher than normal. I had seen that once before and documented it in the ITM and the 1997 Kasparov vs. Deep Blue Chess Match blog post. In that case there were SP2OS UADVISOR situations that were updating virtual tables at the hub TEMS and those tables were unused and/or unusable.
I saw a mental image of that virtual table update process as termites – silently digesting the ITM infrastructure until a sudden collapse here and there. The parallel is inexact since the building [TEMS] can be rebuilt [restarted] at any time but the image felt right.
Suspecting the SP2OS type I reviewed the Agent types in that environment. There were 200 Unix OS Agents which would fire off virtual table updates every 3 minutes. There were 80 database [mostly MS-SQL] agents that would fire off every 2 minutes. The actual data volume arriving at the hub TEMS wasn’t high but the impact of the agents sending data in at the same time could destabilize ITM communications. Look at this timeline of theoretical update arrivals over time.
12:02 – 80 DB vtable updates arrive
12:03 – 200 UX vtable updates arrive
12:04 – 80 DB vtable updates arrive
12:06 – 280 UX/DB vtables arrive.
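The arrival pattern above can be reproduced with a short Python sketch. The agent counts and intervals come from the text; the function name and the simplification that every agent of a type fires at the same instant, counted from 12:00, are assumptions made only for illustration.

```python
# Agent counts and update intervals (in minutes) taken from the text above.
AGENT_TYPES = {"UX": (200, 3), "DB": (80, 2)}

def arrival_timeline(minutes, agent_types=AGENT_TYPES):
    """Return [(minute_offset, total_updates, type_names)] for each minute
    in which at least one agent type sends its virtual table updates.

    Simplifying assumption: all agents of one type fire together, every
    `interval` minutes counted from offset 0.
    """
    timeline = []
    for offset in range(1, minutes + 1):
        names = [name for name, (_, interval) in sorted(agent_types.items())
                 if offset % interval == 0]
        if names:
            total = sum(agent_types[name][0] for name in names)
            timeline.append((offset, total, names))
    return timeline

# Print the first six minutes after 12:00.
for offset, total, names in arrival_timeline(6):
    print(f"12:{offset:02d} - {total} {'/'.join(names)} vtable updates arrive")
```

Extending the window shows that the two agent types coincide every 6 minutes, which is where the largest bursts land.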
Since its initial design, ITM defaulted to 16 communication services threads. At ITM 623 FP5 the default was set to 32. At ITM 630 FP6, the default was set to 64. This new case was an ITM 623 FP5 case.
By reviewing this ProcessTable trace
error (unit:kdsstc1,Entry="ProcessTable" all er)
I could see that only 33 of 200 UX table updates were being processed. Only 22 of the 80 DB vtable updates were being processed. That discovery proved to me that the vtable updates were seriously impacting TEMS processing. I imagined a parallel where 200 text messages would arrive at a cellphone in a single second and how that might destabilize the cellphone software.
Recovery Action and Successful Result
SQL was manually prepared to delete all of the objects relating to these virtual table updates. The customer ran the SQLs using KfwSQLClient and then restarted all the hub and remote TEMS. There was no loss of function because some of the tables were unused and in any case the tables are partial and thus at best confusing.
The difference was astonishing. In the next 30 hours there was just a single missed heartbeat, and that one occurred during scheduled maintenance of a remote TEMS. The peak Listen Pipe count dropped from 21 to 2 [two!!], and most things were running smoothly. The hub TEMS CPU time rose slightly [0.92% to 1.07%], mainly because all the remote TEMS were staying connected and sending work instead of spending significant time in Offline status.
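To put the before and after figures on the same scale, the missed heartbeat counts can be converted to hourly rates. This is simple arithmetic on the numbers quoted in this post (60 in 14 hours before, 1 in 30 hours after); the helper name is only illustrative.

```python
def per_hour(count, hours):
    """Average number of events per hour over an observation window."""
    return count / hours

before = per_hour(60, 14)   # missed heartbeats before the cleanup
after = per_hour(1, 30)     # missed heartbeats after the cleanup

print(f"before: {before:.2f} missed heartbeats per hour")
print(f"after:  {after:.3f} missed heartbeats per hour")
print(f"ratio:  about {before / after:.0f} to 1")
```

The single remaining missed heartbeat coincided with scheduled maintenance, so the real improvement is, if anything, understated by this ratio.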
Self Help Recovery Tool
This post presents a recovery tool which creates two SQL files and two command files
show.sql – display virtual table update objects
delete.sql – delete virtual table update objects
recycle.sh – shell command to recycle affected agents [included in recycle.tar file]
recycle.cmd – Windows command to recycle affected agents
The tool is based on the TEMS Agent Health survey and uses the same proven logic. The binary objects are here: sitvtbl.1.08000.
ITM Virtual Table Recovery Installation
The virtual table recovery package includes one Perl program that uses CPAN modules. The program has been tested in several environments. Windows had the most intense testing. It was also tested on zLinux. Many Perl 5 levels and CPAN package levels will be usable. Here are the details of the testing environments.
1) ActiveState Perl in the Windows environment, which can be found at www.activestate.com
This is perl 5, version 16, subversion 3 (v5.16.3) built for MSWin32-x86-multi-thread (with 1 registered patch, see perl -V for more detail)
2) Perl on zLinux
# perl -v
This is perl, v5.10.0 built for s390x-linux-thread-multi
CPAN is a collection of free to use packages. In your Perl environment, there may be some installed CPAN modules and virtual table recovery may need more. Here are the modules used.
Getopt::Long in CPAN Getopt-Long 2.41
SOAP::Lite in CPAN SOAP-Lite 1.06
HTTP::Message in CPAN HTTP-Message 6.06
XML::TreePP in CPAN XML-TreePP 0.41
You might discover the need for other CPAN modules as the programs are run for the first time. The programs will likely work at other CPAN module levels but this is what was most recently tested.
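One way to find out ahead of time which of the listed modules are missing is to ask Perl to load each one. The sketch below wraps the standard `perl -MModule -e 1` idiom, which exits with a non-zero status when a module cannot be loaded. The function name and its parameters are illustrative and not part of the recovery package.

```python
import subprocess

# Module names as listed above.
REQUIRED_MODULES = ["Getopt::Long", "SOAP::Lite", "HTTP::Message", "XML::TreePP"]

def missing_modules(modules=REQUIRED_MODULES, perl="perl", extra_args=(),
                    run=subprocess.run):
    """Return the modules that `perl -MName -e 1` fails to load.

    `extra_args` allows extra perl switches (for example an include
    directory); `run` is injectable so the logic can be tested without Perl.
    """
    missing = []
    for name in modules:
        cmd = [perl, *extra_args, f"-M{name}", "-e", "1"]
        result = run(cmd, capture_output=True)
        if result.returncode != 0:
            missing.append(name)
    return missing
```

Calling `missing_modules()` on the target system returns an empty list when everything needed is already installed. It raises `FileNotFoundError` if no `perl` executable is on the path.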
The Windows Activestate Perl environment uses the Perl Package Manager to acquire the needed CPAN modules. The Agent Survey technote has an appendix showing usage of that manager program with screen captures.
Please note: In some environments it is a major problem to install the required up-to-date CPAN packages. Internet access may not be available, or Perl may be a shared resource which you do not have the right to change. Changing such packages could negatively affect other programs. To manage this case a zip file is included: the URL is sitvtbl.1.08000. See the History section following the summary if you need earlier levels. The zip file is useful for both Windows and Linux/Unix. For Windows the zip file contains a directory “inc” which contains the needed CPAN packages. For Linux/Unix the zip file contains a tar file unix-cpan-inc.tar. Transfer that to the Linux/Unix system and untar it like this:
tar -xf unix-cpan-inc.tar
That will create a directory “inc”.
If you need this CPAN directory, add the following parameter to the program invocation “-Iinc” – which is dash then capital I followed by the directory name inc.
perl -Iinc sitvtbl.pl <rest of parms>
In that way the CPAN packages are used only for this one program.
The supplied files are
1) sitvtbl.pl and a model sitvtbl.ini file.
2) sitvtbl.cmd, sitvtbl.sh and sitvtbl.tar files to get the needed data.
To install the virtual table recovery package, unzip or untar the file contents into a convenient directory. The package also includes a model sitvtbl.ini file. The soap control is required [see later for discussion] for the SOAP option. The userid and password may be supplied in the sitvtbl.ini file. In this case the sitvtbl.ini file looks like this
The user and password credentials may be supplied from standard input. This increases security by ensuring that no user or password is kept in any permanent disk file. In this case the sitvtbl.ini file would look like this:
The std option can also be supplied on the command line -std. In either case, a program must supply the userid and password in this form
-user <userid> -passwd <password>
The program invocation would be something like this
mycreds | perl …
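The credential-supplying program can be very small. Here is a minimal Python sketch of what a helper like mycreds might do: it takes the userid and password from environment variables and writes them to standard output in the form shown above. The variable names ITM_USER and ITM_PASSWD and the function names are hypothetical, and a real helper would more likely pull the values from a password vault.

```python
import os
import sys

def credential_line(user, password):
    """Format credentials in the '-user <userid> -passwd <password>' form."""
    return f"-user {user} -passwd {password}"

def main(environ=None, out=sys.stdout, err=sys.stderr):
    """Write the credential line to `out`; return 0 on success, 1 otherwise."""
    environ = os.environ if environ is None else environ
    # ITM_USER and ITM_PASSWD are hypothetical names chosen for this sketch.
    user = environ.get("ITM_USER")
    password = environ.get("ITM_PASSWD", "")
    if not user:
        err.write("ITM_USER is not set\n")
        return 1
    out.write(credential_line(user, password) + "\n")
    return 0

if __name__ == "__main__":
    main()
```

The output of such a script is what gets piped into the perl invocation, so nothing sensitive needs to be stored in sitvtbl.ini.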
ITM Virtual Table Recovery Configuration and Usage
The Agent virtual table recovery package has controls to match installation requirements but the defaults work in most cases. Some controls are in the command line options and some are in the sitvtbl.ini file. Following is a full list of the controls.
The following table shows all options. All command line options except -h and -ini and three debug controls can be entered in the ini file. The command line takes precedence if both are present. In the following table, n/a means the option will not be recognized in that context. All controls are lower case only.
|command||ini file||default||notes|
|-lst||n/a||off||sitvtbl.cmd or sitvtbl.sh|
|-log||log||./sitvtbl.log||Name of log file|
|-ini||n/a||./sitvtbl.ini||Name of ini file|
|-debuglevel||n/a||90||Control message volume|
|-debug||n/a||off||Turn on some debug points|
|-dpr||n/a||off||Dump internal data arrays|
|-v||verbose||off||Messages on console also|
|-vt||traffic||off||Create traffic.txt [large]|
|n/a||soap_timeout||180||Wait for soap|
|n/a||soap||<required>||SOAP access information|
|-std||std||off||Userid/password in stdin|
|-user||user||<required>||Userid to access SOAP|
|-passwd||passwd||<null>||Password to access SOAP|
|-recycle||n/a||1||create recycle command/shell files|
Many of the command line entries and ini controls are self explanatory.
soap specifies how to access the SOAP process with the name or ip address of the server running the hub TEMS. See next section for a discussion.
soap_timeout controls how long the SOAP process will wait for a response. One of the agent failure modes is to not respond to real time data requests. This default is 180 seconds. It might need to be made longer in some complex environments. A value of 90 seconds resulted in a small number of failures [2 agents] in a test environment with 6000 agents.
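The precedence rule for these controls (defaults first, then the ini file, then the command line) can be illustrated with a small Python sketch. This is not the logic of sitvtbl.pl itself. The "keyword value" line layout, the treatment of a bare keyword as "on", the comment syntax and the function names are all assumptions made for illustration. The default values are taken from the options table.

```python
# Defaults copied from the options table.
DEFAULTS = {
    "log": "./sitvtbl.log",
    "soap_timeout": "180",
    "verbose": "off",
    "traffic": "off",
    "std": "off",
}

def parse_ini(text):
    """Parse 'keyword value' lines (assumed layout) into a dict.

    Blank lines and lines starting with '#' are skipped (also an assumption).
    A bare keyword with no value is recorded as "on". A repeated keyword
    overwrites the earlier one, so repeated soap controls would need a list.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key] = value.strip() or "on"
    return settings

def effective_settings(ini_settings, cli_settings, defaults=DEFAULTS):
    """Merge settings so the command line wins over the ini file,
    and the ini file wins over the built-in defaults."""
    merged = dict(defaults)
    merged.update(ini_settings)
    # Command line values of None mean "not given on the command line".
    merged.update({k: v for k, v in cli_settings.items() if v is not None})
    return merged
```

For example, an ini file that sets soap_timeout to 300 keeps that value unless the command line supplies its own.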
ITM Virtual Table Recovery Package KfwSQLClient or lst Option
This method is easier to implement than the SOAP option because it does not need any CPAN modules. The work is done on the same system that has a TEPS connecting to the hub TEMS. See the following section “ITM Virtual Table Recovery Outputs” for how to use the results.
Linux/Unix:
a) Save files in the directory where it will be run, like /opt/IBM/ITM/tmp
b) If not using default install directory configure like this: export CANDLEHOME=/opt/IBM/ITM
c) Run this command: tar -xf sitvtbl.tar
d) Run this command: sh sitvtbl.sh
e) Run this command: perl sitvtbl.pl -lst
Windows:
a) Save files in the directory where it will be run, like c:\ibm\ITM\tmp [which might have to be created]
b) If not using default install directory configure like this: SET CANDLE_HOME=c:\IBM\ITM
c) Run this command: sitvtbl.cmd
d) Run this command: perl sitvtbl.pl -lst
At this point the show.sql and delete.sql files have been created.
The -v option produces progress and result messages.
If needed you could run sitvtbl.sh or sitvtbl.cmd at the TEPS system and copy the .LST file to a system where Perl is installed.
ITM Virtual Table Recovery Package soap control
The soap control specifies how to access the SOAP process. For a simple ITM installation using default communication controls, specify the name or ip address of the server running the hub TEMS. If you know the primary hub TEMS a single soap control is least expensive.
If the ITM installation is configured with hot standby or FTO there are two hub TEMS. At any one time one TEMS will have the primary role and the other TEMS will have the backup role. If the TEMS maintenance level is ITM 622 or later, set two soap controls which specify the name or ip address of each hub TEMS server. The TEMS with the primary role will be determined dynamically.
Before ITM 622 you should determine ahead of time which TEMS is running as the primary and set the single soap control appropriately.
Connection processing follows the tacmd login logic. It will first use https protocol on port 3661 and then use http protocol on 1920. If the SOAP server is not present on that ITM process, a virtual index.xml file is retrieved and the port that SOAP is actually using is retrieved and used if it exists.
Various failure cases can occur.
1) The target name or IP address may be incorrect.
2) Communication outages can block access to the servers.
3) The TEMS task may not be running and there is no SOAP process.
4) The TEMS may be a remote TEMS which does not run the SOAP process.
5) The SOAP process may use an alternate port and firewall rules block access.
The recovery actions for the various errors are pretty clear. If (5) is in effect, consider running the recovery package on a server which is not affected by firewall rules. Alternatively, always make sure that the hub TEMS is the first process started. If it must be recycled, then stop all other ITM processes first and restart them after the TEMS recycle. See this blog post which shows how to configure a stable SOAP port at the hub TEMS.
If the protocol is specified in the soap control only that protocol will be tried.
When the port number is specified in the soap control, 3661 will force https protocol and 1920 will force http protocol.
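The connection rules described so far can be summarized in a short Python sketch that lists which protocol and port combinations would be tried, in order. It assumes the soap control has already been split into a host plus an optional protocol and optional port. The function name and that pre-parsed form are illustrative and not how sitvtbl.pl is written. A non-default port given without a protocol is not covered by the rules stated above, so the sketch rejects it rather than guess.

```python
# Default attempt order, following the tacmd login logic described above.
DEFAULT_ORDER = [("https", 3661), ("http", 1920)]
# Ports that force a particular protocol.
FORCED_PROTOCOL = {3661: "https", 1920: "http"}

def candidate_endpoints(host, protocol=None, port=None):
    """Return the (protocol, host, port) combinations to try, in order."""
    if protocol not in (None, "http", "https"):
        raise ValueError(f"unsupported protocol: {protocol}")
    if port in FORCED_PROTOCOL:
        # 3661 forces https and 1920 forces http.
        return [(FORCED_PROTOCOL[port], host, port)]
    if protocol is None and port is None:
        # Nothing specified: https on 3661 first, then http on 1920.
        return [(p, host, n) for p, n in DEFAULT_ORDER]
    if port is None:
        # Protocol specified: only that protocol is tried.
        return [(p, host, n) for p, n in DEFAULT_ORDER if p == protocol]
    if protocol is None:
        raise ValueError("a non-default port needs an explicit protocol here")
    return [(protocol, host, port)]
```

For a plain host name this yields the https attempt followed by the http attempt, which matches the default behavior described above.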
The ITM environment can be configured to use alternate internal web server access ports using the HTTP and HTTPS protocol modifiers. For this case you can specify the ports to be used
or if both have been altered
The logic generally follows tacmd login processing. There are two differences: IPv6 is not supported and port following ITM 6.1 style is not included. SOAP::Lite does not support IPv6 at present. ITM 6.1 logic could be added but is relatively rare and was not available for testing.
ITM Virtual Table Recovery Outputs
There are two files of SQL produced: show.sql and delete.sql. The easiest way to use them is documented here Do It Yourself TEMS Table Display. Use the KfwSQLClient method which is the second one listed.
First run the show.sql and determine if the problem objects are present. If not you can stop here. Some may be missing because the SELECT SQLs look at all possible problem objects and you may not have all such agents installed.
Second run the delete.sql. In general there will be no error messages.
Third run the show.sql again. This should return no data.
If some remote TEMS are offline, this may need to be repeated. There is no harm in deleting objects that do not exist.
If a new remote TEMS is created, rerun this process.
If TEMS maintenance is performed, rerun this process.
When delete.sql is not Enough
The tacmd restartAgent function used to restart most agents is sometimes unable to function. First, it depends on having an OS Agent on the same system as the target agent. Second, the agent name must have a proper suffix. For example an Agent for Microsoft MS-SQL usually has a suffix of “.MSS”. When those rules are violated, usually because of a too-long hostname, the restartAgent function cannot operate. In that case recycling the hub and remote TEMSes will be the only way to implement the recovery action for those agents.
The recycle.sh/recycle.cmd shell files will present those agent conditions.
Consider reconfiguring the agents so they have properly formed names, and make sure an OS Agent is running on the same system.
Removing the effects of the Virtual Hub Table Updates
After the problem objects are deleted, there are two ways to proceed.
1) In many ways the easiest is to recycle the hub TEMS [and the backup hub TEMS if FTO is used] and all the remote TEMSes. This does not need to be done all at once, but the benefit will be gradually seen as each one is recycled.
2) Another way is to use the recycle Command/Shell files. Here is an example assuming where the files were stored.
tacmd login -s … login to hub TEMS
./tacmd login -s … login to hub TEMS
In the long term, the agents involved will have new application support files which will not have these virtual table update objects. This is now true for recent Unix OS Agents.
The virtual table recovery tool was derived from Agent Survey.
Please report back experience and suggestions. If the virtual table recovery package does not work well in your environment, rerun it with “-debuglevel 300” added and send the resulting sitvtbl.log [compressed] for analysis.
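If no compression tool is handy on the system where the log was produced, the log can be compressed with a few lines of Python before sending it. This is only a convenience sketch using the standard gzip module. The function name is illustrative and the default file name assumes the log was written to the current directory.

```python
import gzip
import shutil

def compress_log(path="sitvtbl.log"):
    """Write a gzip-compressed copy of `path` next to it and return its name."""
    target = path + ".gz"
    with open(path, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

The original log is left in place, so it can still be inspected locally.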
History and Earlier versions
If the current version of the virtual table recovery tool does not work, you can try older published binary object zip files. At the same time please contact me to resolve the issues. If you discover an issue, test intermediate levels to isolate where the problem was introduced.
Identify cases where restart agents do not have online OS Agents
Warn for cases where agent name does not have proper suffix
Consolidate duplicated logic
Correct sitvtbl.cmd and sitvtbl.sh and sitvtbl.tar