By John Alvord, IBM Corporation
A customer hub TEMS was unstable, crashing a few hours after startup. The user had created an MS_Offline type situation without the critical Reason != FA test and had an action command that sent an email. See this post for gory details about MS_Offline. That situation was turned off, but a week later they reported another incident of instability. The time between failures had improved from several hours to several days, but that was still very bad. The environment needed a fresh look.
I reviewed the situations running at the customer site. There were about 850 situations and over 200 had email action commands. I asked the customer about that and learned there was no event receiver such as Omnibus installed. Email was the method used for alerting IT staff of conditions to be looked at. It was a relatively small environment and all emails were sent from the hub TEMS, which was the only TEMS. By reviewing the TEMS operations log I could see there were occasional bursts of 50-100 emails in 5-10 seconds.
Action commands run simultaneously in the TEMS [or Agent] process space. That means they consume both process space and CPU time. A large enough burst of commands takes a lot of CPU and process space, and as a direct result the TEMS can fail.
I’ve known about the issue for a long time and warned about it in the Action Command Tips and Techniques technote.
To recover from a case where a situation with action commands was crashing the TEMS on startup, I documented the Running TEMS without SITMON technique. With SITMON not running, you can run some SQL using kdstsns to change the problem situation so that it does not run at startup.
Later on I developed a more complex technique of storing the commands and later running one at a time. That is documented in this technote: Running Situation Action commands one at a time.
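The store-then-run idea can be sketched in a few lines of shell. This is not the technote's implementation, only a minimal illustration under assumptions: the spool directory, the enqueue and drain function names, and the idea of running drain from something like cron are all hypothetical.

```shell
#!/bin/sh
# Hypothetical spool directory for this sketch; not an ITM path.
SPOOL=$(mktemp -d)
QUEUE="$SPOOL/queue"
LOCK="$SPOOL/lock"

# The action command calls enqueue instead of mail: one append, then return.
enqueue() {
    printf '%s\n' "$1" >> "$QUEUE"
}

# drain runs outside the TEMS (from cron, for example) and executes the
# stored commands one at a time. mkdir is atomic, so only one drain runs.
drain() {
    mkdir "$LOCK" 2>/dev/null || return 0
    if [ -s "$QUEUE" ]; then
        mv "$QUEUE" "$QUEUE.work"
        while IFS= read -r cmd; do
            # </dev/null keeps the command from eating the rest of the queue
            sh -c "$cmd" </dev/null
        done < "$QUEUE.work"
        rm -f "$QUEUE.work"
    fi
    rmdir "$LOCK"
}

# Demo with harmless commands. A real entry might look like:
#   enqueue 'echo "bad bad bad" | mail -s subject email@example.com'
enqueue "echo first >> $SPOOL/log"
enqueue "echo second >> $SPOOL/log"
drain
cat "$SPOOL/log"
```

The point of the design is that the work done inside the TEMS shrinks to a single file append, while the expensive part runs serially somewhere else. A production version would need to handle an append that races with the mv, which this sketch ignores.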
However this was a site that really had no practical alternative.
The Simple Solution
As anyone who reads this blog knows, I have been surfing through action commands recently. In one recent post I ran an action command in a different process. The reason there was to allow the command to continue while the agent stopped and started. Reflecting on that case – a simple solution came clear: run the email commands as separate processes.
A typical email command looks roughly like
echo "bad bad bad" | mail -s subject email@example.com
Linux/Unix will run this in the background if you wrap the command in parentheses and add a trailing ampersand
(echo "bad bad bad" | mail -s subject email@example.com) &
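Windows will run a command in the background using the start command, which launches a new process, followed by cmd /c, which runs the command and then exits. The /min option starts the window minimized so no application window pops up. Quote the whole pipeline so that the pipe is handled by the new cmd process rather than by the calling shell.
start /min cmd /c "echo bad bad bad | mail -s subject email@example.com"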
In each case the command will be run as a separate process. That means that the TEMS process space will not suffer a catastrophic increase in size and potential failure.
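To see why the caller is freed up, here is a small sketch that can be run on any Linux/Unix system. It assumes nothing about ITM: slow_mail is a hypothetical stand-in for a mail command that takes two seconds, and OUT is a scratch file used only to show that the message still gets "sent".

```shell
#!/bin/sh
OUT=$(mktemp)

# Hypothetical stand-in for a slow mail command; reads the body from stdin.
slow_mail() {
    body=$(cat)
    sleep 2
    echo "sent '$1': $body" >> "$OUT"
}

start=$(date +%s)
# Same shape as the wrapped action command: subshell plus trailing ampersand.
# Output is sent to /dev/null so the background job does not hold on to the
# caller's output stream.
(echo "bad bad bad" | slow_mail "subject") >/dev/null 2>&1 &
end=$(date +%s)

elapsed=$((end - start))
echo "caller waited ${elapsed}s"

# Only for the demo: wait for the background job so the result can be shown.
wait
cat "$OUT"
```

The caller returns almost immediately even though the stand-in takes two seconds to finish, which is the behavior wanted from the process that launches the action command.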
This is not a perfect solution. For example, there is no way to send a “condition is resolved” email. A proper event receiver like Omnibus is always preferred.
Possible Serious Side Effects
You might experience performance problems or worse.
If there are 1000 agents connected through a remote TEMS and that remote TEMS goes offline, 1000 agents are declared offline immediately. If there is an email action command on an MS_Offline type situation, that would mean 1000 processes starting on the system running the TEMS. If the system paging space is not configured to handle the new worst-case virtual storage usage, a system failure might occur. Any situation author might create a situation which accidentally creates a lot of events. If that situation has an action command, many commands would run at the same time.
If the system has a limit on the total number of processes running, that might cause some action commands to not run.
The action commands would all use some CPU. That might temporarily cripple the TEMS and cause unpredictable bad effects.
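One way to limit these risks on Linux/Unix is to cap how many background commands may run at once. The following is only a sketch under assumptions, not something from ITM: run_capped, MAX, the slot directory, and the dropped file are all hypothetical names. Commands that arrive while every slot is busy are recorded in a file instead of being started.

```shell
#!/bin/sh
MAX=3                      # hypothetical cap on simultaneous commands
SLOTS=$(mktemp -d)         # scratch directory holding one subdirectory per slot

# Run "$1" in the background if a slot is free; otherwise record it and return 1.
run_capped() {
    i=1
    while [ "$i" -le "$MAX" ]; do
        # mkdir is atomic, so two callers cannot claim the same slot.
        if mkdir "$SLOTS/slot.$i" 2>/dev/null; then
            ( sh -c "$1"; rmdir "$SLOTS/slot.$i" ) >/dev/null 2>&1 &
            return 0
        fi
        i=$((i + 1))
    done
    printf '%s\n' "$1" >> "$SLOTS/dropped"
    return 1
}

# Demo: a burst of 8 commands against a cap of 3.
n=1
while [ "$n" -le 8 ]; do
    run_capped "sleep 2" || echo "burst item $n not started"
    n=$((n + 1))
done
wait
echo "dropped: $(grep -c . "$SLOTS/dropped")"
```

In an action command, the call would wrap the mail pipeline, for example run_capped 'echo "bad bad bad" | mail -s subject email@example.com'. The trade-off is that alerts beyond the cap are not emailed, so the dropped file would need to be reviewed or replayed later.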
This post shows how to run action commands outside of the TEMS process and thus avoid cases which might otherwise cause a TEMS failure.
Note: A Salmon my brother caught in Alaska – 1 June 2013
He shipped it to Maine for my mother’s 90th birthday celebration.