Sitworld: Detecting and Recovering from High Agent CPU Usage

sunset-fire

John Alvord. IBM Corporation

jalvord@us.ibm.com

Revised 27 July 2013

Inspiration

In a customer environment of 10,000+ Unix/Linux systems, the OS agents were observed running at high CPU occasionally. The particular issue was diagnosed as APAR IV18016 – a problem where some rare types of events created during autonomous mode [not connected to a TEMS] cause high CPU when the agent eventually connects to the TEMS. The customer was at ITM 623 FP1 and the problem was resolved at ITM 623 FP2. An upgrade was some time away and the customer needed a workaround much sooner.

Solution overview

The obvious solution was to use a situation based the Process.CPU_Pct [Unix] or KLZ_Process.Busy_CPU_Pct [Linux] value to trigger a recovery action. That did not work because until ITM 6.3 the Process CPU Busy per cent value always shows as zero for the klzagent and kuxagent. A more basic algorithm was required. I am sure that in any real world situation, changes would be needed but these are tested example situations. These are example situations and action commands and will always need testing and tailoring for usefulness in a specific environment. These are a starting point for something useful.

First part – an always true situation that uses a detector action command to calculate the OS Agent CPU percent. The action command uses Linux/Unix commands present on most systems. The action command will create or erase a marker file to represent a high CPU at OS Agent condition based on a threshold.

Second part – a situation to detect the marker file and trigger an OS Agent recycle. It also deletes files to complete the recovery action.

The action commands are 95% of the interesting work and so that is presented first.

Action Command – Detector for Unix OS Agent

The design is relatively simple:

  1. capture start CPU time of Unix OS Agent
  2. wait 180 seconds
  3. capture end CPU time of Unix OS Agent
  4. calculate CPU per cent over 180 seconds
  5. create marker file if over threshold else delete marker file

There are some compromises because the length is near the size limit of an action command. This could be avoided by working the logic into a script that runs on each system. This pure situation approach is simpler and can be turned on and off conveniently. Here is the example command followed by a full explanation.

cd $CANDLEHOME/logs;((echo `ps -e -o cputime,args | grep -v grep |grep kuxagent`);sleep 180;(echo `ps -e -o cputime,args | grep -v grep | grep kuxagent`)) |awk ‘{ss=substr($1,length($1)-7,8);getline;ee=substr($1,length($1)-7,8);}END {m=split(ss,s,”:”);st=(s[1]*3600) + (s[2]*60) + s[3];n=split(ee,e,”:”);et=(e[1]*3600) + (e[2]*60) + e[3];pct=(et-st)/1.8;print pct; if (pct>=25) system(“touch agt_cpu.high”);else system(“rm -f agt_cpu.high”)}’

Here is a detailed explanation of the action command:

cd $CANDLEHOME/logs;

==> make the logs directory be current directory where the marker file will be stored

(

==> Open sub shell level 1

(echo `ps -e -o cputime,args | grep -v grep |grep kuxagent`);

==> Within a sub shell, capture the current CPU time for the OS Agent. The values captured are the cputime in day-hh:mm:ss form and the args. The grep –v eliminates the line associated with the grep itself.

sleep 180;

==> wait for 180 seconds

(echo `ps -e -o cputime,args | grep -v grep | grep kuxagent`)

==> Capture CPU time after the delay

) |

==>After the sub shell finishes there is a standard output file with two lines and that is fed to the standard input of the next stage.

awk

==> This runs an inline program to do calculations.

‘{

==> The inline awk program is delimited with single quotes. The unlabeled function reads a file – in this case the two standard input lines.

ss=substr($1,length($1)-7,8);

==> ss contains the last 8 characters of the first blank delimited word. This removes the day value in order to fit within the action command length limitations. When the CPU time crosses a day boundary, that creates a negative value later on, so one sample could be wrong. However these sort of problems are usually constant and the result is only a short delay.

getline;

==> Read the next line of input.

ee=substr($1,length($1)-7,8);

==> ee contains the last 8 characters of the first blank delimited word;

}

==> End of the “read file function”.

END

==> Next is the function performed after the files are read.

{

==> Start of  the end process function

m=split(ss,s,”:”); st=(s[1]*3600) + (s[2]*60) + s[3];

==> Split hh:mm:ss into separate variables and calculate total seconds for start time

n=split(ee,e,”:”); et=(e[1]*3600) + (e[2]*60) + e[3];

==> Split hh:mm:ss into separate variables and calculate total seconds for end time

pct=(et-st)/1.8;

==> Calculate CPU percent used

if (pct>=25) system(“touch agt_cpu.high”);

==> If per cent is equal or more then threshold create a file.

else system(“rm -f agt_cpu.high”)

==> Else if lower, deleted the file

}’

==>End of awk inline program and end of action command.

Action Command – Recycler for Unix OS Agent

The recycler action command is more straightforward.

(sleep 30; rm -f  $CANDLEHOME/logs/agt_cpu.high; $CANDLEHOME/bin/itmcmd agent stop ux; sleep 30; rm -f `find $CANDLEHOME | grep psit_.*_KUX.str`; $CANDLEHOME/bin/itmcmd agent start ux; ) &

Here is a detailed explanation of the command line.

(

==> Open sub shell. The command line will be run as a separate process from the agent because of the trailing ampersand at the end. If the command ran in the agent process, it would stop after the agent stopped/

sleep 30;

==> Wait 30 seconds. This allows time for the event from the recycler situation to be sent to the TEMS.

rm -f  $CANDLEHOME/logs/agt_cpu.high;

==> Delete the marker file. Without this the recycler situation would constantly recycle the OS Agent.

$CANDLEHOME/bin/itmcmd agent stop ux;

==> Stop the Unix OS Agent

sleep 30;

==> Sleep 30 seconds. The itmcmd process may need time to finish cleanup.

rm -f `find $CANDLEHOME | grep psit_.*_KUX.str`;

==> Delete the persistent situation file. The back quotes run the embedded command first and so this dodges the issue of what architecture directory the file is present in.

$CANDLEHOME/bin/itmcmd agent start ux

==> Start the Unix OS Agent

) &

==> End of subshell and the trailing ampersand to cause command line to run in the background.

Action Command – Detector for Linux OS Agent

This is identical in form to the Unix version.

cd $CANDLEHOME/logs;((echo `ps -e -o cputime,args | grep -v grep |grep klzagent`);sleep 180;(echo `ps -e -o cputime,args | grep -v grep | grep klzagent`)) |awk ‘{ss=substr($1,length($1)-7,8);getline;ee=substr($1,length($1)-7,8);}END {m=split(ss,s,”:”);st=(s[1]*3600) + (s[2]*60) + s[3];n=split(ee,e,”:”);et=(e[1]*3600) + (e[2]*60) + e[3];pct=(et-st)/1.8; if (pct>=25) system(“touch agt_cpu.high”);else system(“rm -f agt_cpu.high”)}’

Action Command – Recycler for Linux OS Agent

This is identical in form to the Unix version.

(sleep 30; rm -f  $CANDLEHOME/logs/agt_cpu.high;  $CANDLEHOME/bin/itmcmd agent stop lz; sleep 30; rm -f `find $CANDLEHOME | grep psit_.*_LZ.str`; $CANDLEHOME/bin/itmcmd agent start lz; ) &

Action command testing

Never [!!!] just put an action command into a situation and expect it to work. Test it thoroughly in an actual command line first. Test it in segments and echo intermediate results to the screen or a log file. When you are doing situation testing, view the Agent operations log. That is the *.LG0 file. As each command runs you will the initial characters. The status code should be zero and not anything else. Finally observe to make sure the action command is doing the right thing.

With a preexisting s action command like this, review it careful and see what changes might be needed. For example, you could chose a longer or shorter sample interval [sleep 180 and the later divide by 1.8. You also might chose a different threshold instead of 25. You might chose to store the marker file in a different location. Test thoroughly and never assume anything is correct until tested.

Situations to drive the Detector/Recycler commands

The actions commands compose 95% of the interesting ideas. The example situations are available here. I will give here the example situation name and some comments.

situation: ibm_detect_linux_time  

pdt: *IF *VALUE Local_Time.Time *LT 250000

Sampling Interval 15 minutes, command runs every cycle

Notes: Simplest case and can run on any agent.

situation: ibm_detect_linux_version

pdt: *IF *VALUE KCA_LZ_Agent_Availability_Management_Status.Agent_Version *EQ ‘06220700’ *AND

*VALUE KCA_LZ_Agent_Availability_Management_Status.PAS_Agent_Name *EQ ‘Monitoring Agent for Linux OS’

Sampling Interval 15 minutes, command runs every cycle

Notes: Specific as to agent, Depends on AMS running – for example not true by default on zLinux

situation: ibm_recycle_linux

pdt: *IF *VALUE Linux_File_Information.Path_U *EQ ‘/opt/IBM/ITM/logs’ *AND *VALUE Linux_File_Information.File_Name_U *EQ ‘agt_cpu.high’

Sampling interval: 2 minutes

situation: ibm_detect_unix_time

Sampling Interval 15 minutes, command runs every cycle

pdt: *IF *VALUE Local_Time.Time *LT 250000

Notes: Simplest case and can run on any agent.

situation: ibm_detect_unix_version  

pdt: *IF *VALUE KCA_UX_Agent_Availability_Management_Status.Agent_Version *EQ ‘0623010’ *AND

*VALUE KCA_UX_Agent_Availability_Management_Status.PAS_Agent_Name *EQ ‘Monitoring Agent for Unix OS’

Sampling Interval 15 minutes, command runs every cycle

Notes: Specific as to agent, Depends on AMS running

situation: ibm_recycle_unix  

pdt: *IF *VALUE File_Information.Path *EQ ‘/opt/IBM/ITM/logs’ *AND *VALUE File_Information.File *EQ ‘agt_cpu.high’

Sampling interval: 2 minutes

situation: IBM_detect_solaris_global

pdt: *IF *VALUE Machine_Information.Hardware_Manufacturer *EQ ‘Sun Microsystems’ *AND *VALUE Machine_Information.Virtual_Machine_Identifier *EQ global

situation: IBM_detect_solaris

pdt: *IF *VALUE Machine_Information.Hardware_Manufacturer *EQ ‘Sun Microsystems’ *AND *VALUE Machine_Information.Virtual_Machine_Identifier *NE global

situation: IBM_detect_hpux

pdt: *IF *VALUE Machine_Information.Hardware_Manufacturer *EQ ‘Hewlett-Packard’

situation: IBM_detect_aix_wpars

pdt: *IF *VALUE Machine_Information.Hardware_Manufacturer *EQ IBM

Usage Notes

When first used about 10% of the OS Agents was recycled once and then not again. This was viewed as an excellent outcome.

A small number of agents were constantly recycling. When examined closely the systems were in serious trouble. For example one Linux system which normally had 250-300 processes was running with 6600 processes. There was a Solaris system where the Unix OS Agent was experiencing unknown problems and IBM Support had to get involved. This identification of problem systems was also viewed as an excellent outcome.

Some weeks after these situations were adapted and implemented, the system administrators of the AIX and Linux systems noticed a significant improvement and expressed pleasure at the changes.

The detector situation events were not forwarded to any event receivers and were not associated with any Portal Client navigator nodes when running in production. That is typical of helper situations. The recycler situations events were displayed as usual.

Performance Notes

The detector situations will cause open events whenever started. With a 15 minute sampling interval this will be somewhat lower impact than the node status updates. There will be a TEMS process size increase from the open events. In this case, the balance clearly tilts to the pro side in favor since the cost of a high CPU OS Agent is severe. The workaround is temporary of course since the particular APAR fix will eventually resolve the issue.

Once things are stabilized,  the detector/recycler commands might be run once a day or a week to look for new issues.

History

27 July 2013 – added example situations with action commands for

  1. Solaris Global Zone
  2. Solaris Local Zone and Solaris [same example]
  3. HP/UX
  4. AIX LPAR Global with WPARs

Note that Unix OS Agents are not supported for data collection on WPARs. They are supported for remote deploy operations only. Because of that support limitation there is no AIX/LPAR/WPAR example.

The related action commands use several added techniques to keep the size within situation action command limits. For example an awk function was used for cpu time to seconds conversion. Each environment needed slightly different command lines to achieve the desired goal but all follow the original pattern. The recycler situation commands were identical in form.

Summary

This post shows a general design to detect a specific issue on system with an ITM OS Agent and then execute a recovery action in a second situation. It is especially convenient as no external scripts needed to be added to the environment. I have used similar logic in the past to get many agents recycled with little manual effort to implement at configuration change.

Sitworld: Table of Contents

Credit

Thanks to Nathan Posey at Citibank. twitter [@nathanposey]. He helped make some final adjustments and lots of excellent testing.

Notes: An airplane contrail coinciding with a midsummer sunset.

 

4 thoughts on “Sitworld: Detecting and Recovering from High Agent CPU Usage

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: