John Alvord, IBM Corporation

jalvord@us.ibm.com

Inspiration

I started work on the Situation Audit project and verified which column functions forced TEMS filtering. I set up sample situations and then did the workload trace

tacmd settrace -m <temsname> -p KBB_RAS1 -o ‘error (unit:kpxrpcrq,Entry=”IRA_NCS_Sample” state er)’

The diagnostic log showed what was happening. Most cases were perfectly clear although there were surprises. For example *STR was Agent filtered but *SCAN was TEMS filtered. However *MIN and *MAX did not create a situation event, even though rows were returned to the TEMS. That was very puzzling.

Background

Here is an example of a situation formula using *MIN:

( MIN(Cumulative Busy CPU (Percent)) == *TRUE AND

Cumulative Busy CPU (Percent) > 0.00)

The idea is to examine all Linux processes with Cumulative Busy CPU above zero, select the one with the minimum such value and create an event based on that data. We will discuss later whether that can produce anything interesting. However from the workload trace there were 6 rows of data returned. Clearly one had to be a minimum but there was no event created.

I retested on a ITM 623 system and found a case where there was an event created but the data was not right. The minimum attribute was displayed correctly, but looking at all rows the rest of the data was from a non-minimum row. The Process ID, the command name etc was all wrong.

Defect or design problem?

I talked with our chief dataserver developer guru and he told me that area had not changed substantially since the Candle acquisition in June 2004. In looking closely the situation he spotted a fundamental problem in the actual SQL.

Before the agent can work with a situation the predicate must be converted to PREDICATE or SQL that represents the needed logic. That is kept in the SITDB table. Here is what the test SQL above looks like

SELECT BUSYCPU, CMAJFLT, CMD, CMDLINE, CMINFLT, CPUAFF, CPUPERCENT, CPUSECONDS, DIRTPG, DRS, INTPRI, LRS, NICE, ORIGINNODE, PBUSYNORM, PGID, PID, PPID, PROCCOUNT, PROCTHRD, PSYSCPU, PSYSNORM, PUSRCPU, PUSRNORM, RSS, SESSIONID, SHAREMEM, SIZE, STATE, SYSTEMTIM, TBUSYCPU, TIME, TIMESTAMP, TOTALTIME, TRS, TSYSCPU, TUSRCPU, UPROCFILT, USERTIME, USRSYSCPU, VMDATA, VMDATAMB, VMEXESZ, VMEXESZMB, VMLIBSZ, VMLIBSZMB, VMLOCK, VMLOCKMB, VMSIZE, VMSIZEMB, VMSTACK, VMSTACKMB, MIN(KLZPROC.CPUSECONDS)

FROM KLZ.KLZPROC

WHERE

SYSTEM.PARMA(“SITNAME”, “IBM_TEMS_AGENT_min”, 18) AND SYSTEM.PARMA(“NUM_VERSION”, “15”, 2) AND

SYSTEM.PARMA(“LSTDATE”, “1140314191604000”, 16) AND SYSTEM.PARMA(“SITINFO”, “TFWD=N;OV=N;”, 12) AND KLZPROC.CPUSECONDS > 0

GROUP BY KLZPROC.ORIGINNODE

HAVING MIN(KLZPROC.CPUSECONDS) = KLZPROC.CPUSECONDS ;

The columns represent all the attributes which are captured when the situation formula is true [ see History Note 1 below]. All the names are in Table and Column form and not the attribute names. You can find the correspondence in the Object Definition Interface [ODI] file which the portal client uses. This would be in the docklz file and it is for the Linux OS Agent.

The SYSTEM.PARMA is how the TEMS lets the agent know Situation information like the name etc.

The situation logic is here

KLZPROC.CPUSECONDS > 0

GROUP BY KLZPROC.ORIGINNODE

HAVING MIN(KLZPROC.CPUSECONDS) = KLZPROC.CPUSECONDS ;

The last two lines are supposed to represent the *MIN logic. The GROUP BY doesn’t make much sense. It is supposed to be limited to columns which have been grouped before. In this case the ORIGINNODE or agent name is all the same so all the rows are to be examined. The HAVING BY in SQL can only reference the grouped value. This would be legal

HAVING MIN(KLZPROC.CPUSECONDS) > 10

But the comparison to current row is totally illegal SQL. At that point, you only have grouped rows and so you cannot compare row by row as this supposedly does.

Further Testing

Our Dataserver guru created a table in the TEMS and ran some SQL to populate the table and then run SQL parallel to the above. Sure enough, depending on the initial rows contents you could get 1) no rows or 2) rows with bad data or 3) sometimes rows with the right data.

The firm conclusion was that this was a design defect, but how to proceed was controversial.

Next Steps

The initial thought was that this defect should be corrected. However practical concerns weighed heavily. We would need to add a non-SQL quirk to the dataserver. That change could be expensive, since a two pass logic was required. That was a new area for the dataserver and few were comfortable with predicting how much time would be required to develop the change. There was also the serious risk of affecting existing SQL usage.

A second question was how common the usage was. Here came the biggest surprise. For the Situation Audit project I could add logic to count such cases. Redoing over 50,000 situations I found just one – not production – situation. That was a PMR July 2013 where the customer discovered it wasn’t working. In 2014 I worked with a second client that wanted to use it and experienced problems. That is a very small number of potential users.

Question on Usefulness and Efficiency

How useful would such a situation be? It is hard to make a case for usefulness. The data returned would be “on all processes the one with maximum CPU usage is xxxxx”. That might be valuable for a time-sharing environment but that is rare these days. The same data can be gotten by a Portal Client workspace view on demand if it is just for interest sake. Such a situation might be interesting but not perhaps something you want a problem ticket issued on.

How efficient would such a situation be? In most cases, the situation will be true every time… because something will be minimum or maximum. Since it is TEMS Filtering, many rows will be sent to the remote TEMS on each cycle. Imagine that happening for hundreds of agents every couple minutes. This could severely impact the TEMS. Other such cases have destabilized the remote TEMS and caused outages.

In the end we decided unanimously that we should just document the current condition and leave it alone. I did add it as a Situation Audit advisory message as a warning.

Summary

This document explains recent discoveries about *MIN and *MAX column functions

Sitworld: Table of Contents

History Note 1:

You won’t always see all the attributes. Each attribute has a function called “cost”. The maximum cost of the attributes used in the formula are used to limit the attributes asked for to attributes with that cost or lower. This was because in the initial OS/400 implementation [January 1997] some of the attributes took many minutes to retrieve. To achieve a timely response, we avoided retrieving them unless the situation formula called it out.

This is mostly ancient history since most agents set the cost at zero. However the LocalTime attribute group has an attribute Time with a cost of 9. If you set a formula for Hours *GT 00 then the results will NOT include the Time attribute. I suspect if you closely examined the ODI for the i Series agent you would see several examples also.

Sitworld: MIN and MAX – the Little Column Functions That Couldn’t

John Alvord, IBM Corporation

jalvord@us.ibm.com

Inspiration

Background

Defect or design problem?

Further Testing

Next Steps

Question on Usefulness and Efficiency

Summary

Sitworld: Table of Contents

History Note 1:

Note: Wild Turkey visiting a friend in Fremont – Winter 2014

Leave a comment Cancel reply

John Alvord, IBM Corporation

jalvord@us.ibm.com

Inspiration

Background

Defect or design problem?

Further Testing

Next Steps

Question on Usefulness and Efficiency

Summary

Sitworld: Table of Contents

History Note 1:

Note: Wild Turkey visiting a friend in Fremont – Winter 2014

Share this:

Related

Leave a comment Cancel reply