John Alvord, IBM Corporation
A customer hub TEMS was crashing on startup. That is a painful moment because monitoring is impossible until the hub is running again.
A review of the logs showed that one particular situation was starting, that a large number of action commands were then recorded in the hub TEMS operations log, and that the failure followed.
At startup the TEMS does many things, including starting situations. If a situation causes an immediate failure, it must be prevented from starting. This particular situation was an MS_Offline type. The product-provided situation formula looks like this:
( Status == *OFFLINE AND Reason != 'FA' )
The formula checks the node status table rows for offline status. The event is not true during startup or other periods when offline status is not meaningful, which is when the Reason attribute is set to FA [Framework Architecture].
In this case the customer-created situation formula was:
( Status == *OFFLINE AND PRODUCT == UX )
The added test was for a particular type of agent – UX for Unix OS Agent. They had almost 2000 Unix OS Agents. The FA test was missing.
In addition the customer created an action command to send an email when that condition was true. That was the root cause of the failure.
See the post Sitworld: MS_Offline: Myth and Reality for a more in-depth discussion of MS_Offline situations.
The new project Sitworld: ITM Situation Audit detects this case, a missing Reason NE FA test, in a static analysis of all situations.
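The idea of a static check for the missing FA test can be sketched in a few lines of Python. This is only an illustration and not how Situation Audit is implemented: the function name missing_fa_guard and the list of accepted spellings for the not-equal operator are assumptions made for the example.

```python
import re

# Hypothetical helper for illustration; not part of ITM or Situation Audit.
QUOTE_CHARS = "'\"\u2018\u2019"  # straight and curly quotes seen in pasted formulas

OFFLINE_TEST = re.compile(r"\*OFFLINE", re.IGNORECASE)
# Assumed spellings of "not equal": !=, <>, *NE, NE
FA_GUARD = re.compile(r"\bReason\s*(?:!=|<>|\*NE|NE)\s*FA\b", re.IGNORECASE)


def missing_fa_guard(formula: str) -> bool:
    """Return True if the formula tests for *OFFLINE but never excludes Reason FA."""
    text = formula
    for quote in QUOTE_CHARS:
        text = text.replace(quote, "")  # drop quoting so 'FA' and FA look alike
    return bool(OFFLINE_TEST.search(text)) and not FA_GUARD.search(text)


product = "( Status == *OFFLINE AND Reason != 'FA')"
customer = "( Status == *OFFLINE AND PRODUCT == UX )"
print(missing_fa_guard(product))   # False: the FA test is present
print(missing_fa_guard(customer))  # True: the FA test is missing
```

A check like this only looks at the formula text, so it flags the customer formula above without needing a running TEMS.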
Functioning of TEMS failure
At TEMS startup the user's MS_Offline type situation was started when SITMON began working. Most agents were not yet connected and were recorded as offline. As a result, about 2000 situation events occurred during startup. Each event triggered an action command to send email. All these line mode commands ran in the same process space as the TEMS, at the “same time”.
This exceeded the process size limits and the TEMS failed. In other cases the system paging space was exceeded instead. That triggered a different type of failure, and a worse one, since all processes on the system were affected.
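To see why the missing FA test matters at startup, here is a small simulation in Python. It is a simplified model and not TEMS code: the NodeStatus class, the node names, and the assumption that every row carries Reason FA during startup are made up for illustration, following the description of the FA reason given earlier.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Simplified stand-in for one node status table row (hypothetical)."""
    node: str
    product: str
    status: str
    reason: str


def product_formula(row: NodeStatus) -> bool:
    # ( Status == *OFFLINE AND Reason != 'FA' )
    return row.status == "OFFLINE" and row.reason != "FA"


def customer_formula(row: NodeStatus) -> bool:
    # ( Status == *OFFLINE AND PRODUCT == UX )
    return row.status == "OFFLINE" and row.product == "UX"


# Assumed startup picture: 2000 Unix OS Agents, none connected yet, Reason FA.
rows = [NodeStatus(f"host{i:04d}:KUX", "UX", "OFFLINE", "FA") for i in range(2000)]

product_events = sum(product_formula(r) for r in rows)
customer_events = sum(customer_formula(r) for r in rows)

print(product_events)   # 0
print(customer_events)  # 2000
```

Under these assumptions the product formula stays quiet during startup, while the customer formula produces one event, and therefore one email command, per agent.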
Recovery Action Part 1 – run the TEMS without SITMON
The file to change is KBBENV, which is in this directory:
z/OS: RKANPARU – KDSENV member
1) Stop the TEMS – if not already stopped
2) Copy the KBBENV file to KBBENVS
3) Edit the KBBENV file
4) Locate the KDS_RUN= line and remove the “KSMOMS.KSMOMS;” entry.
5) Add the following to the end of KBBENV – CMS_SKIP_SITMON=Y
6) Start the hub TEMS
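Steps 2 through 5 above can also be scripted. The following is a minimal Python sketch, assuming a Linux/Unix or Windows hub TEMS where KBBENV is an ordinary text file. The helper names disable_sitmon and patch_kbbenv are hypothetical, and the KDS_RUN value in the usage lines is a made-up placeholder, not a real TEMS setting.

```python
import shutil
from pathlib import Path

SITMON_ENTRY = "KSMOMS.KSMOMS;"   # entry removed from KDS_RUN in step 4
SKIP_FLAG = "CMS_SKIP_SITMON=Y"   # line added in step 5


def disable_sitmon(text: str) -> str:
    """Return KBBENV content with SITMON removed from KDS_RUN and the skip flag added."""
    lines = []
    for line in text.splitlines():
        if line.startswith("KDS_RUN="):
            line = line.replace(SITMON_ENTRY, "")
        lines.append(line)
    if SKIP_FLAG not in lines:
        lines.append(SKIP_FLAG)
    return "\n".join(lines) + "\n"


def patch_kbbenv(path: str) -> None:
    """Save a copy as KBBENVS (step 2), then rewrite KBBENV in place (steps 3 to 5)."""
    env = Path(path)
    shutil.copy2(env, env.with_name(env.name + "S"))
    env.write_text(disable_sitmon(env.read_text()))


# Placeholder content for demonstration only.
sample = "KDS_RUN=AAA.AAA;KSMOMS.KSMOMS;BBB.BBB;\nOTHER=1\n"
print(disable_sitmon(sample), end="")
```

Keeping the text transformation separate from the file handling makes it easy to check the result on a copy before touching the real KBBENV. The TEMS must still be stopped first and started afterwards, as in steps 1 and 6.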
Recovery Action Part 2 – New Simpler Style
Once the hub TEMS is started, start a TEP session. TEP will be operational, although not much will be going on. In that TEP session open the Situation Editor on the problem situation and, on the Formula page, clear the “Run At Startup” check box. Click Apply.
Recovery Action Part 3 – Restore Normal Activity
Stop the TEMS, restore the saved KBBENV, and start the TEMS. The TEMS should then be running normally. At this point the problem situation needs to be rethought.
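Restoring the saved file is a single copy. As a minimal Python sketch, assuming the backup was saved next to KBBENV under the name KBBENVS as in Part 1 (the helper name restore_kbbenv is hypothetical):

```python
import shutil
from pathlib import Path


def restore_kbbenv(path: str) -> None:
    """Copy the saved KBBENVS back over KBBENV. Run only while the TEMS is stopped."""
    env = Path(path)
    backup = env.with_name(env.name + "S")
    if not backup.exists():
        raise FileNotFoundError(f"no saved copy found at {backup}")
    shutil.copy2(backup, env)
```

The existence check is there so that a missing backup stops the script instead of leaving KBBENV in its SITMON-disabled state unnoticed.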
This shows how to make manual configuration changes so that TEMS will temporarily start up without SITMON.
Photo Note: New Bedroom Under Construction in Carmel Highlands – February 2013
Appendix 1: The original instructions which are of some interest historically, but take much more time and effort to accomplish.
Recovery Action Part 2 – Change the problem situation to not run at startup
In most cases you will be working with IBM Support to determine the problem situations.
In some cases you may be able to start TEMS and the TEPS and make a Portal Client session. If so you can just delete the situation or at least change it so Run at Startup is clicked off. If that works continue to Recovery Action Part 3.
The general instructions for Linux/Unix look like this:
1) Login to system running the hub TEMS
2) cd <installdir>/tmp
3) Make a new directory named sqllib
4) In the new sqllib directory create a file runoff.sql with the following contents
SET AUTOSTART = '*NO'
WHERE SITNAME = '<sitname>';
5) In the <installdir>/tmp directory create a new file runoff.txt with these contents
You would need to tailor this with the correct protocol, IP address, port, and SQL file name for your environment.
6) That is the end of the preparation.
To run this first locate the TEMS config file
7) ls <installdir>/config/*.config
In the test case it was /opt/IBM/ITM/config/nmp182_ms_HUB_NMP182.config
9) Source that file into the current shell. In my test case that is
. /opt/IBM/ITM/config/nmp182_ms_HUB_NMP182.config
That is a period, then a blank, and then the TEMS config file. This prepares the environment to run kdstsns.
10) run the SQL doing these two commands
For Windows the instructions are about the same. You do the work in a newly created <installdir>/tmp. kdstsns runs more simply with a