Sitworld: Event History #6 Lost events with Multiple Results with same DisplayItem at same TEMS second

whatsnext

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 17 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

This was seen in the Event Audit History Advisory section:

50,EVENTAUDIT1013W,TEMS,Situations [9] lost [merged] events Multiple Results  with same DisplayItem at same TEMS second – see EVENTREPORT004

This particular case showed up when the DisplayItem was configured however two results were produced and processed at the TEMS at the same TEMS second. The Atomize value was identical however and therefore the second result did not produce an event. Therefore an event was lost.

And in that report section:

EVENTREPORT004: Situations with Multiple results at TEMS with same DisplayItem at same second

Situation,Type,Agent_Second,Results,Agent,Atomize,Atom,

bnc_errpt_xulm_aixgr_01,Pure,1180410020221000,2,bnc_viomtl18x:KUL,ULLOGENT.UENTRYDESC,DC73C03A: SOFTWARE PROGRAM ERROR,

There is a situation bnc_errpt_xulm_aixgr_01. It is a Pure situation and at Agent second  1180410020221000 on agent bnc_viomtl18x:KUL there were 2 results.  The DisplayItem was ULLOGENT.UENTRYDESC and the Atomize value was “DC73C03A: SOFTWARE PROGRAM ERROR”. As a result an event was hidden.

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted by first node, then situation, then by Time at the TEMS. I will first describe what you see and the guidance from the column description line.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can be different from the Full Name that you see in situation editor, like too long or other cases.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time is seen in multiple TEMS seconds because the Agent can produce data faster than then TEMS can process it at times. Simple cases have a last three digits of 999. Other cases will have tie breakers of 000,001,…,998 when a lot of data is being generated. This the UTC [earlier GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This the UTC [earlier GMT] time.

Deltastat – event status. You generally see Y for open and N for close. There are more not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest cases are 1 and you would see that if you used -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.

The Descriptor line – before we see the results.

bnc_errpt_xulm_aixgr_01,bnc_viomtl18x:KUL,REMOTE_camtram070nbfra,1180410020217999,1180410020221000,Y,0,2,ULLOGENT.UENTRYDESC,DC73C03A: SOFTWARE PROGRAM ERROR,2227,*IF *VALUE Log_Entries.Log_Name_U *EQ ‘errlog’ *AND *VALUE Log_Entries.Class *EQ Software *AND *SCAN Log_Entries.Description *NE ‘A924A5FC:’ *AND *SCAN Log_Entries.Description *NE ‘813FE820:’ *AND *SCAN Log_Entries.Description *NE ‘8FED25B9:’ *AND *SCAN Log_Entries.Description *NE ‘C5C09FFA:’ *AND *SCAN Log_Entries.Description *NE ‘1BA7DF4E:’ *AND *SCAN Log_Entries.Description *NE ‘A6DF45AA’ *AND *VALUE Log_Entries.Type *IN (‘P’,’T’,’U’) *AND *SCAN Log_Entries.Description *NE 0873CF9F *AND *SCAN Log_Entries.Description *NE 2F64580C *AND *SCAN Log_Entries.Description *NE 573790AA *AND *SCAN Log_Entries.Description *NE FE2DEE00,

,

Situation was bnc_errpt_xulm_aixgr_01, agent was bnc_viomtl18x:KUL. thrunode was REMOTE_camtram070nbfra. Agent_time was 1180410020217999 and TEMS_time was 1180410020221000, 4 seconds later. It was an Open event [Y], the DisplayItem ULLOGENT.UENTRYDESC with a value of  “DC73C03A: SOFTWARE PROGRAM ERROR”. There is a line number 2227 of the input. The PDT is shown also,re was no DisplayItem, sampling interval was 300 seconds was and the situation formula checked for the existence of a certain mount point.

Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=ULLOGENT.ULOGNAME = N’errlog’ AND ULLOGENT.ENTRYCLASS = ‘S’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘A924A5FC:’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘813FE820:’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘8FED25B9:’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘C5C09FFA:’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘1BA7DF4E:’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘A6DF45AA’) = 0 AND ( ULLOGENT.ENTRYTYPE = ‘P’ OR ULLOGENT.ENTRYTYPE = ‘T’ OR ULLOGENT.ENTRYTYPE = ‘U’ ) AND STRSCAN(ULLOGENT.ENTRYDESC, ‘0873CF9F’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘2F64580C’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘573790AA’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘FE2DEE00’) = 0,

Following the predicate is one or more result lines. These are all in the form of Attribute=value in the Table/Column=raw_data form. There is a leading count of the index of this result line. In this case there were many P lines and many result lines. More comments follow. Ignore the funny emoticons that some Browsers convert  equal [=] followed by  semicolon [;] into. If needed you can copy/paste the line into a line mode editor for study. Clearly the results were coming in very fast, but apparently they arrived in three separate bundles of 4 total results.

,,,,,,,0,ULLOGENT.ENTRYCLASS=S;ULLOGENT.ENTRYDESC=DC73C03A: SOFTWARE PROGRAM ERROR;ULLOGENT.ENTRYSRC=fscsi1;ULLOGENT.ENTRYSYS=viomtl18x;ULLOGENT.ENTRYTIME=1180410020000000;ULLOGENT.ENTRYTYPE=T;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=bnc_viomtl18x:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180410020043000;ULLOGENT.UENTRYDESC=DC73C03A: SOFTWARE PROGRAM ERROR;ULLOGENT.UENTRYSRC=fscsi1;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,

,,,,,,,1,ULLOGENT.ENTRYCLASS=S;ULLOGENT.ENTRYDESC=DC73C03A: SOFTWARE PROGRAM ERROR;ULLOGENT.ENTRYSRC=fscsi1;ULLOGENT.ENTRYSYS=viomtl18x;ULLOGENT.ENTRYTIME=1180410020000000;ULLOGENT.ENTRYTYPE=T;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=bnc_viomtl18x:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180410020045000;ULLOGENT.UENTRYDESC=DC73C03A: SOFTWARE PROGRAM ERROR;ULLOGENT.UENTRYSRC=fscsi1;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,

What is the problem and How to fix it?

As can be seen, the attributes are identical except for ULLOGENT.TIMESTAMP which is 1180410020043000 and 1180410020045000. It might not much difference that the same identical result is seen once or twice.  However if it does make a difference, the TEMS can be configured for Pure Situations to force a separate situation event for each result row.

ITM Pure Situation events and Event Merging Logic

On the other hand, loading down the agent and the TEMS processing the same data over and over seems wasteful. The Tivoli Log Agent can be configured to filter duplicate events and thus save resources at the Agent and the TEMS. Or the situation formula can be altered to exclude this case.

A Warning Note

This situations may violate the general guidance that situations should be Rare, Exceptional and Fixable. It certainly doesn’t seem rare. It feels like someone should fix the issue and thus avoid the monitoring overhead completely.

Summary

Tale #6 of using Event Audit History to understand and correct a  type of Incorrect DisplayItem conditions and thus get more results.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: What is Next? – Cruise Ship Build 2016

 

One thought on “Sitworld: Event History #6 Lost events with Multiple Results with same DisplayItem at same TEMS second

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: