Sitworld: SOAP Flash Flood

LookingSouth

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

I was working at a customer site with occasional SOAP issues and ITM instability. There was large hub TEMS diagnostic tracing in place including SOAP request activity. One Monday afternoon about 1pm the ITM environment went unstable. All eight remote TEMS went offline and thus all agents showed as offline. An attempt to do a tacmd login did not complete. About an hour later remote TEMS and the connected agents started coming back online. After 2 hours everything was running normally.

Diagnostics

As it happened the hub TEMS was running with this trace

+52CEEACF.0000         KBB_RAS1: ERROR (UNIT:kpxrpcrq,Entry:”IRA_NCS_Sample” ER ST)  (UNIT:ko4sitma,Entry:”IBInterface::processOneResult” ER DET) (UNIT:kdsstc1,Entry:”ProcessTable” ER ST) (UNIT:kshreq,ENTRY:”buildSQL” ER DET)  (UNIT:kshprs,ENTRY:”WSQL_StartSoapRequest” ER FLOW)  (UNIT:kshstrt,ENTRY:”default_service” ALL) (UNIT:kdsruc1 ST ER)  (UNIT:kfaadloc ST ER) (UNIT:kdssqprs IN ER METRICS)  (UNIT:kfaprpst ST ER)(UNIT:kfansins FLOW ST ER)

In particular the SOAP request SQLs were captured. The log lines looked like this

(52CEF7ED.0015-ACC:kshreq.cpp,955,”buildSQL”) Using pre-built SQL: SELECT NODELIST, NODE FROM O4SRV.TNODELST WHERE NODELIST=’*HUB’

The SOAP request count totaled 10,097 in 244,194 seconds [roughly 2.8 days]. That is about one SOAP per 24 seconds. That is low compared to what I have seen,

I manually reviewed the time range when the instability was observed. There seemed to be a sudden burst of SOAP activity about that time.  I updated TEMS Audit in two ways

  1. Add a SOAP Burst advisory – if there were more then 300 requests in a minute
  2. Add a SOAP Detail report on a minute by minute basis [ignoring zero count minutes]

There had been a burst at that time. Here is the advisory.

Advisory: SOAP Burst requests per minute 1049% higher then nominal 300 at line 280017 in C:/pmrdata/59608,113,848…

Here is what that section of the SOAP detail report looked like [edited for space reasons].

Secs   Count Line   Log-segment

…………………

114080     5  196164 C:/pmrdata/59608,113,848

114120  2053  227910 C:/pmrdata/59608,113,848

114180  3447  280017 C:/pmrdata/59608,113,848

117000  1439  345137 C:/pmrdata/59608,113,848

117001     1  345161 C:/pmrdata/59608,113,848

Looking at the log segments in detail, there were almost 7000 SOAP requests in 133 seconds.

Detective work

I first imagined an evil actor. However the SQL did not look strange or unusual.

SELECT FULLNAME

FROM O4SRV.TNAME

WHERE ID = “XX_NT0_EVL_WIE_ID4624_ADMIN

SELECT  ID,OBJCLASS,SEQUENCE,TYPE,LSTDATE,LSTUSRPRF,LSTRELEASE,

LOCFLAG,RESERVED,MAP FROM O4SRV.EVNTMAP

WHERE ID=’XX_NT0_EVL_WIE_ID4624_ADMIN’

SELECT NODEL

FROM O4SRV.TOBJACCL

WHERE OBJNAME=’XX_NT0_EVL_WIE_ID4624_ADMIN’ AND

OBJCLASS = ‘5140’ AND

SYSTEM.PARMA(“QIBNODE”, “QOMEGAVIEW”, 32)

Almost half were TOBJACCL requests… about 3000. That happened to be the number of situations in the environment. TOBJACCL is the TEMS table that defines distribution – what objects should run at what managed systems.

Later in the week, the customer reported back that someone had routinely done a

tacmd login …

tacmd bulkExportSit -d

And now it made sense. That bulkExportSit function got a list of all the situations. The -d option meanst to include the distribution in the xml capture of the situation object.

The TOBJACCL table is kept in a TEMS database table – QA1DOBJA.DB/IDX. It was pretty large – about 8 megabytes. In a 133 second period, that table needed to be read over 3000 times. That totals 24 gigabytes of data. The TEMS got so backlogged with work that it did not communicate with the remote TEMS. The remote TEMS tasks went into offline status. The ITM environment did recover in 2 hours or so,

Results

The tacmd logic could do things more efficiently – like reading TOBJACCL and other tables just once. There is IBM Support work underway to determine if an APAR fix is warranted. The customer started doing that work on quiet times in the weekend,

Summary

A puzzling issue was fully diagnosed.

Sitworld: Table of Contents

Note: Looking South from Big Sur – Los Angeles is 300 miles away

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: