Sitworld: Event History #1 The Situation That Fired Oddly


John Alvord, IBM Corporation

Draft #1 – 4 April 2018 – Level 1.00000



The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

A Situation That Created Too Many Events

Title: False ESX reboot alert

Description: … We are getting false alert for ESX up time monitoring, it is happening randomly for different servers . … I found Alert get triggered when system up time show values as 4294967295 . it is happening for all those false triggered alert…

The situation formula as seen from a tacmd viewsit was simple.

Formula        : *IF *VALUE KVM_SERVER.System_up_time *LT 600

Sampling Interval   : 0/0:5:0

Just after startup the system being monitored has an up time below 600 seconds, so the formula creates a situation event. A while later the up time goes above 600 seconds and the event closes.

Event History Audit

The new tool was run on the remote TEMS database files which were sent with the problem report. If you were doing this yourself you would use the script. The example follows that end user path.

perl -lst -allresults

The -allresults means the last report section displays all the results in an easier to understand format. Usually it shows only the situations that triggered advisories.

That final report section is displayed in order by 1) node or agent name, 2) Thrunode [usually remote TEMS], 3) Situation Name, 4) DisplayItem, 5) by TEMS processing second. There is a massive amount of information present and I will show one small snippet of the report relating to the problem at hand.

Here are the lines of interest:

IBM_ESXReboot_W_Test,VM:XXXX232V-ibmesxcdc030:ESX,REMOTE_IBM010,1180327071516999,1180327071516000,Y,300,1,KVMSERVERG.SH,,1891,*IF *VALUE KVM_SERVER.System_up_time *LT 600,



This is a little messy to view but it is simple compared to some. The first line is a header summary. It starts with the situation name, the agent name, the remote TEMS, the agent time, the TEMS time, the status [Y=open], the sampling interval [300 seconds], the number of results, the DisplayItem value, and the situation formula or PDT. In the full report section there is a header title line.
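To make the header layout easier to follow, here is a minimal Python sketch that splits the header line shown above into named fields. It is not part of the Event History Audit tool. The field labels and the helper names `parse_header` and `decode_timestamp` are hypothetical, only the first eight columns are labeled because those are the ones described in the text, and the timestamp layout (CYYMMDDHHMMSSmmm, with C=1 meaning the 2000s) is an assumption.

```python
import csv
from datetime import datetime

HEADER_LINE = (
    "IBM_ESXReboot_W_Test,VM:XXXX232V-ibmesxcdc030:ESX,REMOTE_IBM010,"
    "1180327071516999,1180327071516000,Y,300,1,KVMSERVERG.SH,,1891,"
    "*IF *VALUE KVM_SERVER.System_up_time *LT 600,"
)

# Hypothetical labels for the first eight columns, following the prose description.
FIELD_NAMES = [
    "situation", "agent", "remote_tems", "agent_time",
    "tems_time", "status", "sampling_interval", "result_count",
]


def parse_header(line):
    """Split one header summary line into a dictionary of named fields."""
    fields = next(csv.reader([line]))
    record = dict(zip(FIELD_NAMES, fields))
    record["sampling_interval"] = int(record["sampling_interval"])
    record["result_count"] = int(record["result_count"])
    # Locate the formula by its *IF prefix instead of by position, because
    # not every intermediate column is described in the text.
    record["formula"] = next(f for f in fields if f.startswith("*IF"))
    return record


def decode_timestamp(stamp):
    """Decode a 16-digit stamp, assuming the layout CYYMMDDHHMMSSmmm."""
    year = 1900 + 100 * int(stamp[0]) + int(stamp[1:3])
    return datetime(
        year, int(stamp[3:5]), int(stamp[5:7]),
        int(stamp[7:9]), int(stamp[9:11]), int(stamp[11:13]),
        int(stamp[13:16]) * 1000,  # milliseconds to microseconds
    )


record = parse_header(HEADER_LINE)
print(record["situation"], record["status"], record["sampling_interval"])
print(decode_timestamp(record["tems_time"]))
```

Under that timestamp assumption the TEMS time decodes to 27 March 2018 at 07:15:16.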

The second line is the predicate or formula summary. The names here all use the Attribute Group Table and Attribute Column. It is exactly parallel to the first line PDT:

The third line shows the attributes sent with the open situation event. In particular, note KVMSERVERG.SUT=4294967295. A decimal-to-hex web calculator shows that this number is X'FFFFFFFF', which in signed 32-bit arithmetic is -1. Numeric attributes are usually kept as signed 4-byte integers.

In summary, the situation fired because -1 < 600 is a true statement. The agent also continued to send the identical information every 300 seconds, which is how situation processing works. This happened at 22 agents [connecting to this remote TEMS] and comprised 4.27% of the estimated situation workload.
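The conversion and comparison described above can be checked without a web calculator. The following is a minimal Python sketch, assuming the attribute is stored as a 4-byte integer as stated; the helper name `as_signed_32` is made up for illustration.

```python
import struct


def as_signed_32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


raw = 4294967295                 # value reported in KVMSERVERG.SUT
print(hex(raw))                  # 0xffffffff
print(as_signed_32(raw))         # -1
print(as_signed_32(raw) < 600)   # True, so the *LT 600 test passes
```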

What is -1?

Much of the data gathered by the agent comes from API calls to the monitored system. Negative values typically mean the API call could not return the data for some reason. Sometimes that is documented in the Agent Manual, sometimes not. It cannot really be an actual count of seconds because 4294967295 seconds would be roughly 136 years.
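As a quick check on the size of that number, here is a small Python sketch that converts seconds to years. It assumes an average year of 365.25 days, and the function name is illustrative only. It prints the result for the full unsigned 32-bit value and, for comparison, for the largest signed 32-bit value.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumed average year length


def seconds_to_years(seconds):
    """Convert a count of seconds to years using an average year."""
    return seconds / SECONDS_PER_YEAR


unsigned_max = 2**32 - 1   # 4294967295, the value seen in the report
signed_max = 2**31 - 1     # 2147483647, largest signed 4-byte integer

print(round(seconds_to_years(unsigned_max), 1))  # 136.1
print(round(seconds_to_years(signed_max), 1))    # 68.0
```

Either way, the value is far too large to be a plausible server up time.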

Situation Formula reworked

The rework is rather easy.

*IF *VALUE KVM_SERVER.System_up_time *LT 600 *AND *VALUE KVM_SERVER.System_up_time *GE 0

will screen out the negative values.

It might be that -1 is a signal of a true error condition. That would probably need to be worked out with the agent support people or perhaps the vendor and API usage. In that case you could have a separate situation to track them down and fix them.

*IF *VALUE KVM_SERVER.System_up_time *LT 0
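To see how the three formulas behave side by side, here is a minimal Python sketch that models each one as a simple predicate on the signed up time value. This is only an illustration of the logic, not how the TEMS or the agent evaluates situations, and the function names and sample values are made up.

```python
def original_formula(up_time):
    """*IF *VALUE KVM_SERVER.System_up_time *LT 600"""
    return up_time < 600


def reworked_formula(up_time):
    """... *LT 600 *AND *VALUE KVM_SERVER.System_up_time *GE 0"""
    return up_time < 600 and up_time >= 0


def error_formula(up_time):
    """*IF *VALUE KVM_SERVER.System_up_time *LT 0"""
    return up_time < 0


# Illustrative samples: failed API call, fresh boot, boundary values, one day up.
samples = [-1, 0, 120, 599, 600, 86400]

for up_time in samples:
    print(
        up_time,
        original_formula(up_time),
        reworked_formula(up_time),
        error_formula(up_time),
    )
```

Only the -1 sample changes between the original and reworked formulas: it no longer raises a reboot alert and is picked up by the separate error situation instead.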


Tale #1 of using Event History Audit to diagnose a situation mystery.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.


initial release

Photo Note: Between Deck Vent on Cruise Ship Build 2018

