Sitworld: Sampling Interval Limits Time Tests

barnswallowbabiesinnest

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

A customer created a time test on a sampled situation and it was not behaving as expected. After resolving three preliminary issue, I found a hard limit between time tests and sampling interval. To speed up or prevent future such cases, here is the analysis.

Overview

A customer had a process that periodically created a log file on a Linux/Unix environment. The log file was monitored with a custom Agent Builder agent and with a sampled situation formula like this:

(  SCAN(Label) == ‘Heartbeat’ AND TIME(ITM Timestamp) < ‘#’Local_Time.Timestamp’ – 10M’)

The sampling interval was 10 minutes. Note that the TIME test forces all rows of data to be sent to the TEMS for evaluation.

The question of when a situation is TEMS evaluated deserves a whole separate post, but anytime you see a TIME test there is no doubt.

Skirmish One

I had the customer set a workload analysis and run TEMS Audit. The diagnostic log looked unusual. The KBB_RAS1_LOG setting [used for such purposes] had been set up incorrectly and two TEMS tasks [cms and kdsmain] were writing into it and overlaying data. I had them install a proper definition with this technote Using KBB_RAS1_LOG to Configure ITM Diagnostic and Operations Logs

The problem here was that the model segment name and inventory file needed to have the %(systask) included in the name. That way each of cms and kdsmain use their own logs.

Skirmish Two

When the data was properly captured, I saw that the trace volume was extremely high. Individual log segments of 50 megabytes were being finished in 10-15 seconds.  After enhancing TEMS Audit to record trace volume, I saw that trace data was being created at 194 megabytes a minute and 1.2 million lines per minute. The trace at TEMS startup was the default “ERROR”. Therefore someone in the past must have issued the service console command

ras1 set all

I have seen cases where a hub TEMS could not finish initializing with that level of tracing. Normally it is used only when there is a predictable very early TEMS crash. The system running the remote TEMS was powerful and it handled the post startup tracing OK with CPU at 5-10%. The system I/O capacity was not so strong and the 3+ megabytes of trace data per second written to disk seriously impacted remote TEMS work.

The client had no record of that being set up. In any case, a remote TEMS recycle cured that problem.

No one ever admitted it, but perhaps that came from a moment of frustration. The customer had one more trace request. So he set the diagnostic log size high [16 segments of 50 megs each] and set tracing to all. Sure THAT would answer any question from IBM Support!! That failed to recognize the performance impact and also that the log would wrap so quickly a particular issue would be over written before any data could be collected. That is rampant speculation but fits human nature.

Skirmish Three

The next TEMS Audit showed 5.5 megabytes of result data incoming – about 10 times the nominal stability rate. This was largely due to ten situations with Filter object too big. Those are cases where the situation formula compiled objects are too large to transmit to the agent. The agent runs in non-filtered mode – sending ALL data to the TEMS for evaluation.

I had the customer split each situation into multiple situations that could be transmitted to the agents.

Future note: APAR IV46572 in the upcoming ITM 623 FP4 and ITM 630 FP3 has a new product provided feature – TEMS_Alert. When that is running it will create a situation event alert on the “Filter object too big” condition. Then you can eliminate this previously hidden performance problem with minimal fuss. Depending on how you do the install, you may have to set the distribution to *ALL_CMS after the upgrade.

Skirmish Four

There were still cases where the situation did not behave as expected.

I pointed out that the problem would be much better handled using the Derivative Log Pattern.In that design the customer agent would have a situation to identify the heartbeat and write the attributes to a secondary log. A pure situation would monitor that secondary log [using another attribute group] and immediate notice would be sent when the problem occurred.

Regardless, the customer wanted a sampled situation.

I reflected on timing examples. In a previous blog post on Sampling Intervals one section explains how for sampled situations the maximum delay between a condition occurring and the event being raised is twice the sampling interval.

Now the issue became obvious. Assume that the log heartbeat time is set at the time the log is created. Look at this timeline

12:00:00          Situation Starts at TEMS

12:01:00          Situation Starts at Agent and is evaluated and there is no data

12:01:45          new log is produced for analysis heartbeat is at 12:01:45

12:10:00          TEMS situation evaluation proceeds – no results from Agent

12:11:00          Situation evaluation at Agent processes log and sends all results to TEMS

12:20:00          TEMS situation evaluation proceeds – and alerts because time >10min

All sampled TEMS evaluated situations go through this cycle. Anything with a TIME test against LocalTime needs TEMS evaluation and thus this situation.

Conclusion

Because of the different Agent and TEMS processing cycles, any situation author must be aware of this constraint. The general rule must be

TIME difference must be greater than two times the sampling interval.

However there may be arbitrary delays based on competing processes and network delays. So I suggest a stronger rule for fewer false alerts:

TIME difference must be greater than three times the sampling interval.

Customer results

The sampling interval was set to 4 minutes – with a time difference test at 10 minutes and no further problems were seen.

I believe a derivative log and pure situation provides better information. However the customer was reasonably happy with the results.

Summary

This shows a new constraint on sampled situations which have TIME tests against LocalTime in the formula.

History

2015/10/26 – corrected time in hypothetical analysis

Sitworld: Table of Contents

Photo Notes: Barn Swallow Babies in Nest Two Days before Fledging – 2013/08/19

For the fourth year we had Barn Swallows nesting in the eves. Barn Swallows migrate to deep Patagonia in Argentina, South America each year.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: