Sitworld: Event History #12 High Impact Situations And Much More

OpenAirPool

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 1 May 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

This was about diagnosing a situation that produces a lot of results and what that means and what to do. Then it became more complicated and interesting.

This was seen in the Summary Report

Total Result Bytes: 11698309 49.38 K/min Worry[9.88%]

49.38K bytes per minute incoming to a hub TEMS is not usually something to worry about. The standard worry point is 500K bytes per minute. Even so it is always interesting to review situations which are dominating and can be rethought. If case you ever wondered, the maximum incoming result rate ever seen was 93 megs/minute and 128 remote TEMS were crippled.

This was seen in the Report011: Event/Results Budget Situations Report by Result Bytes

EVENTREPORT011: Event/Results Budget Situations Report by Result Bytes

Situation,Table,Rowsize,Reeval,Event,Event%,Event/min,Results,ResultBytes,Result%,Miss,MissBytes,Dup,DupBytes,Null,NullBytes,SampConfirm,SampConfirmBytes,PureMerge,PureMergeBytes,transitions,nodes,PDT

wlp_fretbsp_grzw_std,KRZTSOVEW,384,180,13,7.26%,0.06,14415,5535360,47.32%,13986,5370624,0,0,0,0,422,162048,0,0,13,1,*IF *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘TEMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDO’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘RBS’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘ROLLBACK’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘FNTMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘2013’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS1’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS2’ *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *GT 5.00 *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *LE 10.00 *AND *SIT wlp_tbspro_grzc_std *EQ *TRUE,

For this discussion, we look at column 8,9 and 10.

8 Results                             14415

9 ResultBytes                     5535360

10 Result%                        47.32%

22 Nodes                           1

These are estimated based on the Event History Status table. If the TEMS has been running a long time and the event history table has wrapped, there could well be unseen workload components. You need a TEMS Audit report to directly measure incoming result workload to be accurate.

This one situation wlp_fretbsp_grzw_std was observed running on a single agent and was producing almost half the incoming result workload. Again the impact isn’t high but it does seem unusual and needs further analysis. The formula excludes a lot of table types and then looks for a percent free to allocated  between 5% and 10%. That is the sort of formula that may violate the general rule of good situations – Rare/Exceptional and Fixable. The fact there are so many positive results strongly suggest that it is neither Rare nor Exceptional  and that no one is fixing the condition. More Reports that will supply more details.

This was seen in the Report018: Situations processed but not forwarded

EVENTREPORT018: Situations processed but not forwarded

Situation,Count,Nodes,

wlp_fretbsp_grzw_std,13,1,

This means that some events were sent to an event receiver, but not this one. That means that 47% of the TEMS workload provided monitoring for a condition that was not sent to an event receiver. That might explain why no one has fixed the condition. There was a very bad condition once where the hub TEMS was suffering and 80% of the incoming events created by one situation on one AIX Unix OS Agent running on a decommissioned AIX system that was supposed to be powered off. This isn’t always a wrong condition, of course, but anytime you have a non-forwarded situation event it needs to be reviewed and justified.

This was seen in the Report004:  Situations with Multiple results at TEMS with same DisplayItem at same second

EVENTREPORT004: Situations with Multiple results at TEMS with same DisplayItem at same second

Situation,Type,TEMS_Second,Results,Agent,Atomize,Atom,

wlp_fretbsp_grzc_std,Sampled,1180429225812000,5,RZ:multip-multip-vaathmr406:RDB,,,

wlp_fretbsp_grzc_std,Sampled,1180429231916000,5,RZ:multip-multip-vaathmr406:RDB,,,

Here we see that Situation Events are being produced and each time there are 5 results. The Atomize and Atom are null which means that no DisplayItem has been set. This also means that events are being lost and never seen. Of course since the situation is not forwarded to an event receiver it may not matter. More information in the detail report later.

This was seen in the Report026:  Situations showing high Open<->Close rate

EVENTREPORT026: Situations showing high Open<->Close rate

Situation,Reeval,Rate,Node_ct,PDT

wlp_fretbsp_grzw_std,180,3.37,1,*IF *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘TEMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDO’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘RBS’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘ROLLBACK’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘FNTMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘2013’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS1’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS2’ *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *GT 5.00 *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *LE 10.00 *AND *SIT wlp_tbspro_grzc_std *EQ *TRUE,

We see here that the situation has a re-evaluation rate of 180 seconds, 3 minutes. Every hour there are an average of 3.37 open->close or close->open transitions [per agent]  and  only one agent/node is involved.

Situations that show a rapid rate of opening and closing are suspect. It usually means they fail the basic test of Rare/Exceptional. How can a condition be rare if it shows a new open event multiple times an hour. This needs to be examined for reasonableness. The lack of DisplayItem may be causing this effect. When DisplayItems are configured, each result can separately control a situation event and that means there is less internal TEMS confusion. However since the Situation is not forwarded, this may be just internal friction, wasting resources to no benefit.

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted by first node, then situation, then by Time at the TEMS. I will first describe what you see and the guidance from the column description line.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can be different from the Full Name that you see in situation editor, like too long or other cases.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time is seen in multiple TEMS seconds because the Agent can produce data faster than then TEMS can process it at times. Simple cases have a last three digits of 999. Other cases will have tie breakers of 000,001,…,998 when a lot of data is being generated. This the UTC [earlier GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This the UTC [earlier GMT] time.

Deltastat – event status. You generally see Y for open and N for close. There are more not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest cases are 1 and you would see that if you used -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.

The Descriptor line – before we see the results.

wlp_fretbsp_grzw_std,RZ:multip-multip-vaathmr406:RDB,REMOTE_va10plvtem021,1180429225812000,1180429225815145,Y,180,5,,,8103,*IF *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘TEMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDO’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘RBS’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘ROLLBACK’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘FNTMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘2013’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS1’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS2’ *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *GT 5.00 *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *LE 10.00 *AND *SIT wlp_tbspro_grzc_std *EQ *TRUE,

This is mostly what we have seen before. We do get to see the TEMS and Agent Time. We can also see that this was seen at line 8103 of the input.

Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=STRSCAN(KRZTSNLUE.TNAME, N’TEMP’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’UNDO’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’RBS’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’ROLLBACK’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’FNTMP’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’2013′) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’UNDOTBS1′) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’UNDOTBS2′) = 0 AND KRZTSNLUE.PCIFREE > 00500 AND KRZTSNLUE.PCIFREE <= 01000  AND KRZTSOVEW.STATUS <> N’READ’,

,

Following the predicate is one or more result lines. These are all in the form of Attribute=value in the Table/Column=raw_data form. There is a leading count of the index of this result line. In this case there was one P line and 5 result lines.

,,,,,,,0,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=7168000;KRZTSNLUE.KBUSED=131072000;KRZTSNLUE.KBYTES=138240000;KRZTSNLUE.KLARGEST=7168000;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=7000;KRZTSNLUE.MBUSED=128000;KRZTSNLUE.MBYTES=135000;KRZTSNLUE.MLARGEST=7000;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9902;KRZTSNLUE.PCIFREE=519;KRZTSNLUE.PCIMAX=103;KRZTSNLUE.PCIUSED=9481;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201605_DAT;KRZTSNLUE.TSFGNUM=1,

,,,,,,,1,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=26214400;KRZTSNLUE.KBUSED=466227200;KRZTSNLUE.KBYTES=492441600;KRZTSNLUE.KLARGEST=26214400;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=25600;KRZTSNLUE.MBUSED=455300;KRZTSNLUE.MBYTES=480900;KRZTSNLUE.MLARGEST=25600;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9653;KRZTSNLUE.PCIFREE=532;KRZTSNLUE.PCIMAX=367;KRZTSNLUE.PCIUSED=9468;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201605_IDX;KRZTSNLUE.TSFGNUM=1,

,,,,,,,2,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=17408000;KRZTSNLUE.KBUSED=234496000;KRZTSNLUE.KBYTES=251904000;KRZTSNLUE.KLARGEST=17408000;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=17000;KRZTSNLUE.MBUSED=229000;KRZTSNLUE.MBYTES=246000;KRZTSNLUE.MLARGEST=17000;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9825;KRZTSNLUE.PCIFREE=691;KRZTSNLUE.PCIMAX=188;KRZTSNLUE.PCIUSED=9309;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201612_DAT;KRZTSNLUE.TSFGNUM=1,

,,,,,,,3,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=11264000;KRZTSNLUE.KBUSED=206848000;KRZTSNLUE.KBYTES=218112000;KRZTSNLUE.KLARGEST=11264000;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=11000;KRZTSNLUE.MBUSED=202000;KRZTSNLUE.MBYTES=213000;KRZTSNLUE.MLARGEST=11000;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9846;KRZTSNLUE.PCIFREE=516;KRZTSNLUE.PCIMAX=163;KRZTSNLUE.PCIUSED=9484;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201702_DAT;KRZTSNLUE.TSFGNUM=1,

,,,,,,,4,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=25395200;KRZTSNLUE.KBUSED=462131200;KRZTSNLUE.KBYTES=487526400;KRZTSNLUE.KLARGEST=25395200;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=24800;KRZTSNLUE.MBUSED=451300;KRZTSNLUE.MBYTES=476100;KRZTSNLUE.MLARGEST=24800;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9656;KRZTSNLUE.PCIFREE=521;KRZTSNLUE.PCIMAX=363;KRZTSNLUE.PCIUSED=9479;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201705_IDX;KRZTSNLUE.TSFGNUM=1,

Reviewing the attributes, it appears that the differentiating attribute  is  KRZTSNLUE.TNAME which takes on five different values – the last is CASOL_CR201705_IDX. 

What are the problems and How to fix them?

The first problem is that there are five results and they are being merged – so four are lost. That could be resolved by configuring DisplayItem to KRZTSNLUE.TNAME.

The second problem is that the Situation event goes open and close 3.17 times per hour. This means it is not Rare/Exceptional and so the formula needs to be reviewed for reasonableness. Most important who is going to fix the condition. If not one is going to fix the condition, the situation should be stopped and deleted. The best performance gain comes from not doing unnecessary work.

The third problem is that the Situation is not forwarded. This is only alerted when some situations are forwarded but not this. This is often seen when a situation is being developed [and tested on a single system] and then forgotten. There can also be good reasons for not-forwarding such as when used by a Workflow Policy or in an *UNTIL/*SIT control to block other situations. However it is most often an oversight and should be stopped and deleted.

Summary

Tale #12 of using Event Audit History is about reviewing a case where half the TEMS workload is coming from a single situation

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Open Air Pool – Cruise Ship Build 2015

 

One thought on “Sitworld: Event History #12 High Impact Situations And Much More

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: