Sitworld: Efficient Situation for Two Missing Processes


By John Alvord, IBM Corporation

jalvord@us.ibm.com

Inspiration

I was working through a remote TEMS performance problem using TEMS Audit [see extract below]. The #1 and #3 top situation impactors alerted if two different Linux processes were missing. If one or the other or both were present, that was OK. Otherwise a situation event would be generated. The remote TEMS was receiving 8 megabytes a minute in results sent from 974 agents. The total from those two situations was 4.2 megabytes received per minute. They were distributed to 605 Linux agents.

The following is taken from a TEMS Audit report for a 21 minute period. Columns have been deleted for ease of presentation. We will be talking about situations 1/5 and 3/6. The 2/4 case is left for another post.

Rank  Situation                 Rows   Result/Min  Fraction  Cumulative%
1     XXX_PRCMissing_NSLCD_C    25543  1873153     23.62%    23.62%
2     XXX_SYSLoadAvg15Min_C     51507  1836008     23.15%    46.77%
3     XXX_PRCMissing_Rpcbind_C  21132  1549680     19.54%    66.31%
4     XXX_SYSLoadAvg5Min_C      14857  528568      6.66%     72.97%
5     XXX_PRCMissing_NSCD_C     4600   337333      4.25%     77.22%
6     XXX_PRCMissing_Portmap_C  4284   314160      3.96%     81.19%

Overview

The general approach is to review the situation definitions. Next the diagnostic log will be viewed for details coming from one agent. Lastly an alternative approach is suggested which will be more efficient.

One reason I wrote this up is to give a detailed example of a performance audit analysis and how it is resolved. There are a lot of details and this can give you a head start in doing your own analysis.

Business Rule For Condition

If certain Linux systems are running neither nslcd nor nscd, then create an alert.

Here is how the issue was originally implemented:

XXX_PRCMissing_NSLCD_C

pdt: *IF *MISSING KLZ_Process.Process_Command_Name *EQ ( 'nslcd' )

XXX_PRCMissing_NSCD_C

pdt: *IF *MISSING KLZ_Process.Process_Command_Name *EQ ( 'nscd' )

The base situations are used in two functional situations:

XXX_PRCMising_nscd_nslcd_W

XXX_PRCMising_nscd_nslcd_C

Let’s look at one of the 605 Linux OS Agents for 60 seconds. There were 8 results received in one minute. I will ignore the results from these other situations:

3 - XXX_PRCMissing_Rpcbind_C

1 - PrimeShift

1 - XXX_CPUSMPBusy_90_W

We will review only the three result rows from XXX_PRCMissing_NSLCD_C. Evidently this Linux system is running nscd, and nslcd is missing.

(517EC904.0025-5F5:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 1 rows sz 1540 tbl *.KLZPROC req XXX_PRCMissing_NSLCD_C <1343381874,3977249565> node <xxxgtps2d:LZ>

(517EC91E.000B-5BC:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 1 rows sz 1540 tbl *.KLZPROC req XXX_PRCMissing_NSLCD_C <1341302277,3985638231> node <xxxgtps2d:LZ>

(517EC928.0016-5A8:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 1 rows sz 1540 tbl *.KLZPROC req XXX_PRCMissing_NSLCD_C <1347595196,3979346861> node <xxxgtps2d:LZ>

The results arrived at relative seconds 0/26/36 within the minute. Initially it seems curious that three results arrive when the “nslcd” process is not running. It seems that only one should arrive at most.
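The 0/26/36 spacing can be read straight off the log line headers. Here is a minimal Python sketch, assuming the first eight hex digits of each header are a seconds-since-epoch timestamp. The helper name arrival_offsets is my own and not part of any ITM tooling, and the log lines are shortened to the part that matters.

```python
import re

# Shortened copies of the three log lines shown above.
log_lines = [
    '(517EC904.0025-5F5:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 1 rows sz 1540',
    '(517EC91E.000B-5BC:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 1 rows sz 1540',
    '(517EC928.0016-5A8:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 1 rows sz 1540',
]

# Assumption: the leading eight hex digits are epoch seconds.
HEADER = re.compile(r"^\((?P<secs>[0-9A-Fa-f]{8})\.")


def arrival_offsets(lines):
    """Return each line's arrival time in seconds relative to the first line."""
    stamps = [int(HEADER.match(line).group("secs"), 16) for line in lines]
    first = stamps[0]
    return [stamp - first for stamp in stamps]


print(arrival_offsets(log_lines))  # [0, 26, 36]
```

The same approach works on a full diagnostic log if you first filter the lines by node name and situation name.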

The XXX_PRCMissing_NSLCD_C situation has no distribution. However two functional situations made use of that base situation. These other situations had distributions to MSLs that included this specific agent:

XXX_PRCMising_nscd_nslcd_W

XXX_PRCMising_nscd_nslcd_C

At the agent the two situations start up. If not otherwise started, the base situations will also be automatically started.

XXX_PRCMissing_NSLCD_C

XXX_PRCMissing_NSCD_C

The XXX_PRCMissing_NSCD_C never sends results. However each of the other three situations will send results.

XXX_PRCMissing_NSLCD_C

XXX_PRCMising_nscd_nslcd_W

XXX_PRCMising_nscd_nslcd_C

The situation XXX_PRCMissing_NSLCD_C sends results because that is what started situations do at the agent. A diagnostic trace of the remote TEMS would show the results are discarded because they are not associated with any task.

The two situations are of this form:

pdt: *IF *SIT XXX_PRCMissing_NSCD_C *EQ *TRUE *AND *SIT XXX_PRCMissing_NSLCD_C *EQ *TRUE

Results are sent to TEMS because this form of situation always requires TEMS evaluation.

The situation XXX_PRCMising_nscd_nslcd_W had the identical formula and was in fact an experiment which was not supposed to be running. It was stopped immediately.

That explains the 3 results being sent.

There were 605 Linux OS Agents. If they were all running the situations and also the nscd process, the number of rows in 21 minutes would be

3*605*21 = 38115 rows

Now some Linux OS Agents run nscd and some run nslcd. The two corresponding rows in the audit extract (ranks 1 and 5) total 30143 result rows. This is smaller because the situations are not distributed to all Linux OS Agents. Those rows amount to 2.2 megabytes per minute, or 28% of the total.

This explains the results observed. The existing situations require 2.2 megabytes per minute to figure out if both processes are missing.
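The arithmetic can be checked with a few lines of Python. This is only a sketch using numbers already shown: the row counts and fractions come from the audit extract, and the 1540 byte row size comes from the "sz 1540" field in the log lines. The variable names are my own.

```python
AGENTS = 605             # Linux OS Agents
MINUTES = 21             # length of the audit period
RESULTS_PER_MINUTE = 3   # results per agent per minute seen in the log
ROW_SIZE = 1540          # bytes per KLZPROC result row ("sz 1540")

# Upper bound if every agent ran the situations and the nscd process.
upper_bound_rows = RESULTS_PER_MINUTE * AGENTS * MINUTES

# Observed rows: XXX_PRCMissing_NSLCD_C plus XXX_PRCMissing_NSCD_C.
observed_rows = 25543 + 4600

# Bytes per minute arriving at the remote TEMS for those two situations.
bytes_per_minute = observed_rows * ROW_SIZE / MINUTES

# Share of the total workload, summed from the Fraction column.
fraction_percent = 23.62 + 4.25

print(upper_bound_rows, observed_rows)
print(f"{bytes_per_minute / 1_000_000:.1f} MB/min, {fraction_percent:.0f}% of total")
```

Multiplying Rows by 1540 and dividing by 21 also reproduces the Result/Min column for each of the four PRCMissing situations, which is a handy sanity check on the report.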

Alternative Situation

First, make the situation name self-explanatory, such as XXX_NSxCD_MISSING_C.

Make the formula be

Process Filter == '.*(nscd|nslcd).*' AND MISSING(Command Line) == ('*')

Skip this if you are familiar with Regular expressions. I am a novice after 15 years of creating them. Regular expression detail…

.*                     Match any number of characters

(                       Match for alternative strings and place in capture buffer

nscd                 Match “nscd” and place in capture buffer

|                        Alternatively…

nslcd                Match “nslcd” and place in capture buffer

)                       End of capture buffer

.*                     Match any number of characters

When a match is made, the capture buffer is placed into the Command Line attribute.

When the nscd or nslcd process is present a result row will be included. The MISSING value of asterisk [*] means match any string. When either process is present, MISSING will not be true and so no results are sent. When neither process is present, the MISSING will create an alert. There is no DisplayItem, but the name of the situation explains the issue.
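To make the logic concrete, here is a small Python emulation of the filter plus MISSING combination. It is only a sketch: the agent uses its own regular expression engine, which may differ from Python's re module, and the function name situation_fires and the sample command lines are hypothetical.

```python
import re

# Same pattern as the proposed Process Filter.
PROCESS_FILTER = re.compile(r".*(nscd|nslcd).*")


def situation_fires(command_lines):
    """Emulate the formula: fire only when no process passes the filter."""
    surviving = [cmd for cmd in command_lines if PROCESS_FILTER.match(cmd)]
    # MISSING with '*' is true only when the filtered list is empty.
    return len(surviving) == 0


# Hypothetical process lists for four kinds of system.
only_nscd = ["/usr/sbin/sshd -D", "/usr/sbin/nscd"]
only_nslcd = ["/usr/sbin/sshd -D", "/usr/sbin/nslcd"]
both = ["/usr/sbin/nscd", "/usr/sbin/nslcd"]
neither = ["/usr/sbin/sshd -D", "/usr/sbin/crond -n"]

for name, procs in [("only nscd", only_nscd), ("only nslcd", only_nslcd),
                    ("both", both), ("neither", neither)]:
    print(name, "-> alert" if situation_fires(procs) else "-> no results sent")
```

Only the "neither" case produces an alert, which matches the business rule.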

When things are normal, the remote TEMS will receive no data at all. Thus we have reduced the impact of the situation from 2.2 megabytes per minute to zero on the average.

If there is a need for both a Critical [_C] and a Warning [_W] for different end users, the situation can just be cloned. Since there is no result workload under normal conditions, it really doesn’t matter if two are running.

Warning: the expression '.*(nscd|nslcd).*' might not be enough to distinguish the processes uniquely. You might need to add additional expressions to the string or test another attribute, like this:

Process Filter == '.*(nscd|nslcd).*' AND

Process Parent ID == 1 AND

MISSING(Command Line) == ('*')

Do thorough testing before settling on a final test!!
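Here is the kind of false match to test for, again emulated in Python rather than in the agent itself. The process list, the parent process IDs, and the tighter pattern are all illustrative assumptions of mine. The tighter pattern is one possible way to anchor on the executable name and has not been verified against the agent's regular expression engine.

```python
import re

LOOSE = re.compile(r".*(nscd|nslcd).*")
# Hypothetical tighter pattern: optional path, then the exact program name.
TIGHT = re.compile(r"^(\S*/)?(nscd|nslcd)(\s|$)")

# Hypothetical (parent PID, command line) pairs. Neither daemon is running,
# but someone is editing the nscd configuration file.
processes = [
    (1, "/usr/sbin/sshd -D"),
    (4711, "vi /etc/nscd.conf"),
]


def fires(procs, pattern, require_parent_1=False):
    """Fire when no process survives the filter (and the parent ID test)."""
    surviving = [
        cmd for ppid, cmd in procs
        if pattern.match(cmd) and (not require_parent_1 or ppid == 1)
    ]
    return len(surviving) == 0


print(fires(processes, LOOSE))                         # False: alert is masked
print(fires(processes, LOOSE, require_parent_1=True))  # True: alert fires
print(fires(processes, TIGHT))                         # True: alert fires
```

With the loose pattern alone, the editor session hides the fact that both daemons are missing. Either the parent ID test or a tighter expression restores the alert in this example, but your own systems may need a different refinement.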

Parallel case – Business Rule

If certain Linux systems are running neither rpcbind nor portmap, then create an alert.

This is exactly parallel to the first case. Create a situation

XXX_rpcbind_OR_portmap_MISSING_C

Use the formula

Process Filter == '.*(rpcbind|portmap).*' AND MISSING(Command Line) == ('*')

It will behave in exactly the same way, firing only when both are missing.

Summary

TEMS Audit report showed several situations generating a heavy load of result data.

A situation of the type “at least one process running” on Linux/Unix can be run far more efficiently than the original design. This document shows you how.

Sitworld: Table of Contents

Note: My Two Burmese Kittens in a Bed over the HP Printer – NOT Missing.

 
