Sitworld: Event History #15 High Results Situation to No Purpose


John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 25 May 2018 – Level 1.00000


Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

This was seen in the Summary Section

Total Result Bytes: 1023369249 989.47 K/min Worry[197.89%]

This environment is receiving almost one megabyte of result data per minute. Experience has shown that problems often occur when the result rate exceeds 500K per minute, and that is the source of the “worry” percentage. Your mileage may vary depending on the server running the workload, but even if the server can handle the load, it is never a good idea to perform useless work.
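As a rough sketch of the arithmetic behind the Worry figure: the capture window length below (~1010 minutes) is my assumption, back-computed from the numbers shown, not stated in the report.

```python
# Back-of-envelope Worry% computation. The capture window length is an
# assumption inferred from the summary numbers, not stated in the report.
total_result_bytes = 1023369249
capture_minutes = 1010

rate_k_per_min = total_result_bytes / 1024 / capture_minutes
# Experience says trouble often starts above 500 K/min, hence:
worry_pct = rate_k_per_min / 500 * 100
```

With those inputs the rate works out to roughly 989 K/min and the worry figure to roughly 198%, matching the summary line.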

This was seen in Report011: Event/Results Budget Situations Report by Result Bytes

EVENTREPORT011: Event/Results Budget Situations Report by Result Bytes

Situation,Table,Rowsize,Reeval,Event,Event%,Event/min,Results,ResultBytes,Result%,Miss,MissBytes,Dup,DupBytes,Null,NullBytes,SampConfirm,SampConfirmBytes,PureMerge,PureMergeBytes,transitions,nodes,PDT

deb_prccpu_xuxw_aix,UNIXPS,2784,60,978,13.46%,0.97,269576,750499584,73.34%,0,0,0,0,0,0,269576,750499584,0,0,978,39,*IF *VALUE Process.CPU_Pct *GE 1.00 *AND *VALUE Process.Process_Command_U *EQ ‘/opt/BESClient/bin/BESClient’ *AND *VALUE Process.CPU_Pct *LT 4.00,

So there is a situation deb_prccpu_xuxw_aix which runs every 60 seconds, checks for one process, and alerts when that process's CPU% is between 1% and 4%. It runs on 39 agents connected to this remote TEMS.

Remarkably, this one situation accounts for an estimated 73.34% of the total workload. It is an estimate because the data includes no information about situations which started before the Event Status History capture began. The actual result volume can be higher because of real-time data requests.

Deep dive into the report details

Scan or search ahead for Report 999. It is sorted first by node, then by situation, then by time at the TEMS. I will first describe what you see, following the guidance from the column description line.

Here I show only a single open event followed by a close event; there were many more listed in the full report.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can differ from the Full Name you see in the Situation editor, for example when the Full Name is too long.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time recorded at the Agent during TEMA processing. You will see cases where the same Agent time appears in multiple TEMS seconds, because the Agent can at times produce data faster than the TEMS can process it. Simple cases have 999 as the last three digits; when a lot of data is being generated, other cases use tie-breakers of 000, 001, …, 998. This is the UTC [formerly GMT] time at the agent.

TEMS_Time – The time recorded at the TEMS during processing. This is the UTC [formerly GMT] time.

Deltastat – Event status. You generally see Y for open and N for close. There are more statuses not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds; 0 means a pure event.

Results – How many results were seen. The simplest case is 1. You would see every result if you used the -allresults control; in this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the attribute used for Atomize. It can be null, meaning not used.

DisplayItem – The value of the Atomize attribute in this instance. Atomize is simply the first [up to] 128 bytes of another string attribute.

LineNumber – A debugging aid that records which line of the TSITSTSH data dump supplied this information.

PDT  – The Predicate or Situation Formula as it is stored.

First comes the descriptor line, before we see the results.

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410002104999,1180410001843000,Y,60,1,,,3850,*IF *VALUE Process.CPU_Pct *GE 1.00 *AND *VALUE Process.Process_Command_U *EQ ‘/opt/BESClient/bin/BESClient’ *AND *VALUE Process.CPU_Pct *LT 4.00,

The situation was deb_prccpu_xuxw_aix, the agent was deb_gb02cap070debx7:KUX, and the thrunode was REMOTE_gbnhham080tmsxm. Agent_Time was 1180410002104999 and TEMS_Time was 1180410001843000, so the agent clock is running a few minutes ahead. It was an open event [Y], the sampling interval was 60 seconds, there was one result, and there was no DisplayItem. The record came from line number 3850 in the input. The PDT is shown at the end.
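As a minimal sketch, the timestamp arithmetic can be checked like this, assuming the usual ITM CYYMMDDHHMMSSmmm encoding (leading century digit 1 meaning 20xx):

```python
from datetime import datetime

def parse_itm_time(ts: str) -> datetime:
    """Parse an ITM CYYMMDDHHMMSSmmm timestamp; the trailing three
    digits (milliseconds / tie-breakers) are ignored here."""
    year = 1900 + 100 * int(ts[0]) + int(ts[1:3])
    return datetime(year, int(ts[3:5]), int(ts[5:7]),
                    int(ts[7:9]), int(ts[9:11]), int(ts[11:13]))

agent = parse_itm_time("1180410002104999")
tems = parse_itm_time("1180410001843000")
ahead_seconds = (agent - tems).total_seconds()  # agent clock ahead of TEMS
```

For this record the agent clock comes out 141 seconds (about two and a half minutes) ahead of the TEMS clock.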

Following the descriptor line are one or more P [Predicate/formula] lines showing the formula as used by the Agent logic, followed by the results that feed the TEMS logic.

,,,,,,,P,*PREDICATE=UNIXPS.CPUPERCENT >= 100 AND UNIXPS.UCOMMAND = N’/opt/BESClient/bin/BESClient’ AND UNIXPS.CPUPERCENT < 400,
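Note how the internal predicate stores CPU_Pct scaled by 100: display value 1.00 becomes 100, and 4.00 becomes 400. A small sketch of that range check (the function name is mine, for illustration):

```python
SCALE = 100  # Process.CPU_Pct is stored as an integer scaled by 100

def cpu_pct_in_range(raw_cpupercent: int,
                     lo: float = 1.00, hi: float = 4.00) -> bool:
    """True when the raw CPUPERCENT value satisfies lo <= CPU% < hi,
    mirroring *GE 1.00 ... *LT 4.00 in the PDT above."""
    return lo * SCALE <= raw_cpupercent < hi * SCALE
```

So a raw CPUPERCENT of 144 (1.44%) matches, while 99 (0.99%) and 400 (4.00%) do not.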

Following the predicate are one or more result lines. These are all in Attribute=value form, i.e. Table.Column=raw_data. Each result line has a leading count giving the index of that result. In this case there was one P line and one result line; sometimes there are many more, but not this time.



,,,,,,,0,UNIXPS.ADDR=b424b459;UNIXPS.BCMD=BESClient;UNIXPS.CHILDSTIME=0;UNIXPS.CHILDTIME=0;UNIXPS.CHILDUTIME=0;UNIXPS.CMD=/opt/BESClient/bin/BESClient;UNIXPS.COMMAND=/opt/BESClient/bin/BESClient;UNIXPS.CONTSWIT64=460694;UNIXPS.CONTSWITCH=460694;UNIXPS.CPU=120;UNIXPS.CPUID=-1;UNIXPS.CPUPERCENT=144;UNIXPS.CPUTIME=62;UNIXPS.EGID=0;UNIXPS.EGRPN=system;UNIXPS.ELAPTIME=000d04:25:13;UNIXPS.EUID=0;UNIXPS.EUSERN=root;UNIXPS.EVENT=*;UNIXPS.EXECSTATE=A;UNIXPS.FLAG=   40001;UNIXPS.GID=0;UNIXPS.GRPN=system;UNIXPS.HEAP=-1;UNIXPS.INVCONTS64=11490;UNIXPS.INVCONTSWT=11490;UNIXPS.MAJORFAU64=9;UNIXPS.MAJORFAULT=9;UNIXPS.MEMPERCENT=100;UNIXPS.MINORFAU64=65168;UNIXPS.MINORFAULT=65168;UNIXPS.NICE=22;UNIXPS.ORIGINNODE=deb_gb02cap070debx7:KUX;UNIXPS.PGID=8847390;UNIXPS.PID=11206814;UNIXPS.PPID=1;UNIXPS.PRIORITY=64;UNIXPS.PROCCOUNT=1;UNIXPS.PSU=10792;UNIXPS.RDS=10792;UNIXPS.READWRI64=610219238;UNIXPS.READWRITE=610219238;UNIXPS.RTS=5177;UNIXPS.SCHEDCLASS=N/A;UNIXPS.SESSIONID=8847390;UNIXPS.SIZE=63876;UNIXPS.STACK=-1;UNIXPS.STARTTIME=1180409195521000;UNIXPS.SYSTEMTIM=000d00:00:11;UNIXPS.SYSTEMTYPE=AIX;UNIXPS.TEXT_SIZE=15207;UNIXPS.THREADCNT=4;UNIXPS.TIME=00001:02;UNIXPS.TIMESTAMP=1180410002034000;UNIXPS.TOTALTIME=000d00:01:02;UNIXPS.TOTCPUPERC=38;UNIXPS.TTY=-;UNIXPS.UCMD=/opt/BESClient/bin/BESClient;UNIXPS.UCOMMAND=/opt/BESClient/bin/BESClient;UNIXPS.UID=0;UNIXPS.UPROCFILT=;UNIXPS.USERNAME=root;UNIXPS.USERTIME=000d00:00:51;UNIXPS.UUSERNAME=root;UNIXPS.VSIZE=58372;UNIXPS.WAITCPUTIM=0;UNIXPS.WAITLKTIME=0;UNIXPS.WLM_NAME=Unclassified;UNIXPS.WPAR_NAME=Global;UNIXPS.ZONEID=-1;UNIXPS.ZONENAME=-1,

Here is the value I extracted from the result. This is raw data, scaled by 100, so 144 represents 1.44%.



==> UNIXPS.CPUPERCENT=144  
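A minimal sketch of pulling one attribute out of such a result line (the helper name is mine, for illustration; the real reports come from the Event Audit tooling):

```python
def parse_result_line(line: str) -> dict:
    """Split a semicolon-separated Table.Column=value result string
    into a dict; values are kept as raw strings."""
    attrs = {}
    for pair in line.rstrip(",").split(";"):
        key, sep, value = pair.partition("=")
        if sep:
            attrs[key.strip()] = value
    return attrs

# A shortened sample of the result line above
sample = ("UNIXPS.CPUPERCENT=144;"
          "UNIXPS.COMMAND=/opt/BESClient/bin/BESClient;"
          "UNIXPS.USERNAME=root")
attrs = parse_result_line(sample)
cpu_pct = int(attrs["UNIXPS.CPUPERCENT"]) / 100  # raw 144 -> 1.44%
```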

I will skip repeating the full details. Next you see the result going false and then true again. Each time it is true I record the value.

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410003604999,1180410003343000,N,60,0,

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410013304999,1180410013043000,Y,60,1

==> UNIXPS.CPUPERCENT=336

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410013504999,1180410013243000,N,60,0,

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410020704999,1180410020444000,Y,60,1

==> UNIXPS.CPUPERCENT=110

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410020804999,1180410020543000,N

deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410043404999,1180410043143000,Y,60,1

==> UNIXPS.CPUPERCENT=121

What is the problem and how to fix it?

From this capture, 13.46% of the events and 73.34% of the result workload came from this single situation, running on only 39 agents!

Doing that work constitutes a substantial investment, and it fails the basic test of a good situation: to be Rare, Exceptional, and Fixable. This is certainly not a rare condition; it seems to be happening all the time, and no one is “fixing” it.

Therefore the situation should be rethought and reworked until it is rare, exceptional, and fixable. If that is not possible, the situation should be stopped and deleted to make room for other useful work at the agents, the TEMS, and the event receivers. Simply stopping it would drop the workload substantially. At minimum, the situation should be reviewed and justified.

Summary

This is tale #15 of using the Event History Audit to understand and review a high-overhead situation and thus potentially save resources.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Back Deck – Big Sur 1999

 
