Sitworld: TEMS Database Repair


John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #5 – 10 December 2018 – Level 1.04000


Introduction

The TEMS database tables are used to store user data such as situation descriptions and distribution definitions. They also keep running data such as current situation status on agents. There are many more internal and functional tables.

When the files holding the data are damaged, the TEMS usually malfunctions. Over the years there have been many reasons for such damage. Here are some examples:

  1. TEMS exception and failure.
  2. File system full.
  3. Unwise manual changes or restoring from a backup that wasn’t taken correctly.
  4. Power outage without any UPS backup.
  5. SAN [Storage Area Network] failure.
  6. System shutdown without stopping the TEMS.
  7. Many unexplained instances.

Hub TEMS Recovery Attempt WARNING!!!

A primary hub TEMS is the repository of fundamental user data, and any recovery of it is a delicate operation which can easily result in a reinstall and significant downtime. Please work with ITM support in planning a hub TEMS data recovery. A remote TEMS can be recovered quite simply, as can an FTO mirror hub TEMS.

In addition you should have a Backup/Recovery plan for hub TEMS data. See Best Practice TEMS Database Backup and Recovery, referenced near the end of this post, for five different ways to accomplish this goal. A simple backup of the files while the TEMS is running is inadequate and can lead to significant downtime. These are hot database files; many change constantly and they are tightly connected.

Non-Hub TEMS Recovery

The process is very simple although it varies by platform [hardware and operating system] and by TEMS maintenance level. From a high level view you stop the TEMS [if running], replace the database files with emptytable files, and then start up the TEMS and let the hub TEMS refill them with correct data naturally. References to the files follow. They are not exactly empty. At the very least they contain an “end of objects” record and some are pre-loaded with data. The ones here were accumulated from install media builds from ITM 620, ITM 621, ITM 622, ITM 623 and ITM 630. They are the exact files you would lay down during a new TEMS install.

There are three types of files:

  1. Bigendian – for Unix [AIX/Solaris/HPUX] and Linux on Z
  2. Littleendian – for Linux/Intel and Windows
  3. VSAM – z/OS index sequential file
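As a cross-check when choosing between the bigendian and littleendian files, the byte order of the target system can be probed directly. The following is a minimal sketch, assuming a POSIX shell with the standard printf and od utilities. The function name host_endian is illustrative only and is not part of ITM; the platform list above remains the authority.

```shell
#!/bin/sh
# Sketch only: report the byte order of the machine this runs on.
# host_endian is a hypothetical helper name, not an ITM command.
host_endian() {
  # Write the two bytes 0x01 0x00 and let od read them as one 16-bit word
  # in host byte order: little-endian shows 0001, big-endian shows 0100.
  word=$(printf '\001\000' | od -An -tx2 | tr -d ' \n')
  case "$word" in
    0001) echo littleendian ;;
    0100) echo bigendian ;;
    *)    echo unknown ;;
  esac
}
```

Running host_endian on a Linux/Intel system should report littleendian, which matches the littleendian files described above.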

The references here are to a zip file for each maintenance level. Each zip file contains a bigendian.tar file [Unix and zLinux], a littleendian.tar file [for Linux/Intel] and a littleendian.zip file for Windows. The last two contain identical files but are packaged differently for convenience. With z/OS the story is quite different, see later.

  1. ITM620_emptytables
  2. ITM621_emptytables
  3. ITM622_emptytables
  4. ITM623_emptytables
  5. ITM630_emptytables

Windows Recovery for non-hub TEMS

  1. Select the correct maintenance level and download the proper zip file from the links above. Unzip that file; on Windows you will use the littleendian .zip file it contains.
  2. Unzip that inner file into some convenient directory – we will assume C:\TEMP but it can be anyplace. You will see a lot of QA1*.DB files and QA1*.IDX files.
  3. Stop the TEMS
  4. Copy the files, for example [adjust for actual install directory]

cd c:\IBM\ITM\cms
copy c:\temp\QA1*.*

You could also use Windows Explorer. You may also wish to make a safety copy of the existing files before overwriting them.

  5. Start the TEMS
  6. Monitor for correct operation.
  7. Recovery complete

Linux/Unix Recovery for non-hub TEMS

  1. Select the correct maintenance level and download the proper zip file from the links above. Most environments will have an unzip command. If not you can unzip on some convenient Windows workstation.
  2. Select the proper endian type. Bigendian is for all Unix and Linux on z systems. Littleendian is for all Linux/Intel systems. For this example we use Linux/Intel at ITM 630, the file is ITM630_emptytables.littleendian_linux_intel.tar, and it is assumed to be copied to /opt/IBM/ITM/tmp
  3. Move that littleendian file to the system where the TEMS runs and un-tar it.

    cd /opt/IBM/ITM/tmp

    tar -xf ITM630_emptytables.littleendian_linux_intel.tar

    This will create many QA1* files

  4. At this point you have to determine the attributes/owner/group of the current TEMS files. You could do that with this command

    ls -l /opt/IBM/ITM/tables/<temsnodeid>/QA1CSTSH.DB

    which in my zLinux test environment looks like this:

    nmp180:~ # ls -l /opt/IBM/ITM/tables/HUB_NMP180/QA1CSTSH.DB

    -rwxr-xr-x 1 root root 35274789 Nov 14 21:03 … QA1CSTSH.DB

    [Above line shortened for display purposes.]

  5. Next change the un-tar’d files to match what is currently being used and what the TEMS expects. Remember the following is just an example that would be used in my environment. You will run the commands appropriate to your actual environment.

    cd /opt/IBM/ITM/tmp

    chmod 755 QA1*.*

    chown root QA1*.*

    chgrp root QA1*.*

  6. Next stop the remote TEMS or FTO mirror hub TEMS
  7. Next copy the emptytable files into the directory where the stopped TEMS expects them.

    cd /opt/IBM/ITM/tables/<temsnodeid>

    cp /opt/IBM/ITM/tmp/QA1*.* .

    Note the trailing period which means to copy to the current directory.
  8. Next start the remote TEMS or FTO mirror hub TEMS
  9. Monitor for normal operations
  10. End of recovery
  11. Warning for the FTO mirror hub TEMS: When performing this operation *always* start the primary hub TEMS first [if not already running]. The refreshed FTO mirror hub TEMS must be started second. If that rule is violated the primary hub TEMS will have all custom objects deleted. Don’t do that.
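The attribute matching and copy steps above can be sketched as one shell function. This is a minimal sketch under stated assumptions, not a supported procedure: it assumes GNU stat [as found on Linux; on AIX, Solaris or HP-UX read the values from ls -l by hand], the function name refresh_tables and the backup directory argument are hypothetical, and stopping and starting the TEMS is deliberately left to the operator. Run it only while the TEMS is stopped.

```shell
#!/bin/sh
# Sketch only: make staged emptytable files match the live files, keep a
# safety copy of the live files, then copy the staged files into place.
# Assumes GNU stat. refresh_tables is a hypothetical helper, not an ITM command.
refresh_tables() {
  stage=$1    # directory holding the un-tar'd QA1* emptytable files
  tables=$2   # live directory, e.g. /opt/IBM/ITM/tables/<temsnodeid>
  backup=$3   # existing directory for the safety copy (hypothetical)
  ref="$tables/QA1CSTSH.DB"   # same reference file as in the ls -l step

  [ -f "$ref" ] || { echo "reference file $ref not found" >&2; return 1; }
  [ -d "$backup" ] || { echo "backup directory $backup not found" >&2; return 1; }

  # Read the octal mode and numeric owner:group from the live file.
  mode=$(stat -c '%a' "$ref") || return 1
  owner=$(stat -c '%u:%g' "$ref") || return 1

  # Safety copy of the current [possibly damaged] files.
  cp -p "$tables"/QA1*.* "$backup"/ || return 1

  # Apply the live attributes to the staged files, then copy them in.
  chmod "$mode" "$stage"/QA1*.* || return 1
  chown "$owner" "$stage"/QA1*.* || return 1
  cp -p "$stage"/QA1*.* "$tables"/
}
```

A call might look like refresh_tables /opt/IBM/ITM/tmp /opt/IBM/ITM/tables/<temsnodeid> /some/backup/dir, issued between the stop and start steps. For an FTO mirror hub TEMS the start order in the warning above still applies.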

z/OS recovery for non-hub TEMS

Please note: this is hardly ever needed. The last PMR I worked on *looked* like it was needed but the symptom was actually a harmless TEMS message [actually a defect] that complained about a table… and there was no actual problem at all! So I expect it is very rare to have to do this procedure.

Always involve IBM Support if you have any uncertainty at all in this process. Also, if you *think* you know more about z/OS than the author – you are very likely correct!!

z/OS recovery example with ICAT configuration

The following uses QA1CSTSH as an example.

1) Stop the TEMS task

2) Delete or rename the QA1CSTSH VSAM dataset. If unsure, examine the Joblog output to determine the complete dataset name.

3) Proceed to ICAT and navigate to the ‘Runtime Environments’ panel (KCIPRTE)

4) Place a ‘B’ next to the RTE [Run Time Environment] that contains the TEMS that owns the file you wish to recreate.

5) That will generate the DS#1xxxx job which should then be submitted.

6) The job will detect the file that is missing and recreate ONLY that file.

7) The job should complete with condition code zero.

8) The TEMS can then be started.

z/OS recovery with PARMGEN configuration

The general idea is the same as ICAT.

Steps #3 – #7 can be replaced with the similar instructions here. That documents how to reallocate PDS files but the path followed is the same. Following are some notes from the PARMGEN expert.

The job would vary. You can use KDSDELJB as a model job that has the deletes, but make it specific to the RKDSSTSH VSAM dataset only

(//QA1CSTSH DD DISP=SHR,DSN=&RVHILEV..&SYS..RKDSSTSH.)

Submit the composite KCIJPALO job the same way as in the doc. For the standalone job, refer to the PARMGEN KDSDVSRF, which needs to be modified of course.

Hub TEMS – if you absolutely have no choice

There are many TEMS hub database tables which you can reset only by losing significant data and undergoing a long manual reinstall and rebuild. This could mean a week or more of outage. It is very important to involve IBM Support if you have any doubts at all.

However there are a few tables which can be reset with no real impact. These 5 sets of tables contain internal processing data, not user data.

  1. TSITSTSC – QA1CSTSC: The Situation Status Cache which is reused every time the TEMS starts.
  2. TSITSTSH – QA1CSTSH: The Situation Status History. This is an intermediate file where situation event status collects. It is a wraparound table and defaults to 8192 rows. At hub TEMS startup all the remote TEMSes and agents [if directly connected] send current status. Therefore you only miss situation status history after a reset. Since there are no ITM functions which display or use the history, nothing much is lost by resetting it to emptytable status.
  3. [several tables] – QA1CDSCA: This is the combined catalog table. If this is reset to emptytable status, at TEMS startup the pre-defined data is updated based on the existing package [like klz.cat] files. Therefore it can be reset to emptytable status and nothing is lost. As a minor point, TEMS has an extremely hard limit of 512 packages. At 513 the TEMS will crash and not come up. It is pretty rare but definitely something to be aware of. Should you encounter this issue, you will have to remove one or more .cat [and the paired .atr] files to get the total down to 512 packages or below. If you encounter this limit see Sitworld: Attribute and Catalog Health Survey which will calculate what packages are no longer being used.
  4. SITDB/TOBJCOBJ – QA1CRULD/QA1CCOBJ: These tables are created dynamically as situations are started. SITDB contains the SQL representing the situation. TOBJCOBJ records how situations are related to each other. In any case the data is created dynamically as situations are started. Both need to be reset to emptytable status at the same time.
  5. TNODESAV – QA1DNSAV: This records the current agent registration – the nodes or managed system names. When agents connect the data is rebuilt, along with any missing data in the TNODELST table. Such missing data sometimes shows up as advisories in Database Health Checker reports, and the agents affected do not actually run situations. One factor to consider is that agents which are temporarily offline will no longer be in the table. When they do connect again they will be present as usual. If that is important you should capture that information before performing the replacement.
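For the 512 package limit mentioned under QA1CDSCA, a quick count of the .cat files can show how close a TEMS is to the limit. This is a minimal sketch assuming a POSIX shell. The directory holding the .cat files is passed in as an argument because its location is not given here, and the function names count_packages and check_package_limit are hypothetical.

```shell
#!/bin/sh
# Sketch only: count *.cat package files in one directory and compare
# the total against the 512 package limit described above.
# Both function names are hypothetical helpers, not ITM commands.
count_packages() {
  dir=$1
  n=0
  for f in "$dir"/*.cat; do
    # The test also skips the unexpanded pattern when nothing matches.
    if [ -f "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

check_package_limit() {
  n=$(count_packages "$1")
  if [ "$n" -gt 512 ]; then
    echo "over limit: $n packages, remove at least $((n - 512))"
    return 1
  fi
  echo "within limit: $n packages"
}
```

The count only tells you that packages must be removed. Deciding which ones are no longer used is what the Attribute and Catalog Health Survey mentioned above is for.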

In each case you would follow the same procedure as a complete replacement but handle only the .DB and .IDX files for the tables being reset.
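A selective copy on Linux/Unix could be guarded so that only the file names listed above are ever touched. The following is a minimal sketch assuming a POSIX shell and a stopped hub TEMS. The function names are hypothetical, and the script only copies files; setting attributes, stopping and starting the TEMS are as in the non-hub procedure.

```shell
#!/bin/sh
# Sketch only: copy emptytable .DB/.IDX pairs for the resettable hub tables.
# File base names are the ones listed in this article; the helper
# names is_safe_table and copy_hub_tables are hypothetical.
SAFE_HUB_TABLES="QA1CSTSC QA1CSTSH QA1CDSCA QA1CRULD QA1CCOBJ QA1DNSAV"

is_safe_table() {
  case " $SAFE_HUB_TABLES " in
    *" $1 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

copy_hub_tables() {
  stage=$1   # directory holding the un-tar'd emptytable files
  dest=$2    # hub TEMS tables directory [TEMS must be stopped]
  shift 2
  want=" $* "

  # Refuse anything outside the list before copying a single file.
  for base in "$@"; do
    is_safe_table "$base" || {
      echo "refusing $base: not in the resettable list" >&2
      return 1
    }
  done

  # QA1CRULD and QA1CCOBJ must be reset together.
  ruld=0; cobj=0
  case "$want" in *" QA1CRULD "*) ruld=1 ;; esac
  case "$want" in *" QA1CCOBJ "*) cobj=1 ;; esac
  [ "$ruld" = "$cobj" ] || {
    echo "QA1CRULD and QA1CCOBJ must be reset at the same time" >&2
    return 1
  }

  for base in "$@"; do
    cp "$stage/$base.DB" "$stage/$base.IDX" "$dest"/ || return 1
  done
}
```

For example, copy_hub_tables /opt/IBM/ITM/tmp /opt/IBM/ITM/tables/<temsnodeid> QA1CSTSH would replace only the Situation Status History files. Validating the whole request before copying keeps a typo from leaving the directory half updated.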

Backup/Recovery best practice

The following document was co-authored with L3 TEMS and represents the best current thinking. It gives five ways to create a valid, useful and reliable backup of the TEMS database files.

Best Practice TEMS Database Backup and Recovery

Summary

This document shows how to repair many cases of damaged TEMS database files.

Sitworld: Table of Contents

History and Earlier versions

1.00000
Initial publication

1.01000
Correct credit name for photo

1.02000
Add information about two more tables that can be reset to emptytable status at the hub TEMS.

1.03000
Add warning about not starting refreshed FTO mirror hub TEMS first.

1.04000
Rename the emptytable files including the platform type – to reduce mistakes.

Photo Note: Ginny – A magnificent Maine Coon Cat that lives in Germany [thanks to IBMer Jens Helbig]

 
