Sitworld: Best Practice TEMS Database Backup and Recovery


John Alvord, IBM Corporation

jalvord@us.ibm.com

20 June 2016 – Version 1.2

Inspiration

I was working with a customer with a TEMS Database File problem. In this case some of the situations had been deleted. In other cases over the years the Database files were not accessible because the index file was inconsistent. These cases are very rare but the results can be disruptive. The hub TEMS or some remote TEMS cannot start or are running without all situations and other objects.

This document presents five best practice procedures for creating reliable database backups. It is not a reference for creating a full and complete backup including configuration and application support files. See TivoliEnterpriseMonitoringServerbackup for that reference. This document is dated. Based on history there will be more changes in the future.

At version 1.2, five empty table zip file references were added. See Note 2 at the end.

Background for Distributed Linux/Unix/Windows platform

There are 50+ TEMS database tables and most of them are represented by indexed sequential files [QA1*.DB/IDX]. 16 of those tables contain user data such as situation definitions and distribution configuration. The IDX file links the keys of a table to the location of the related objects in the DB file. If there is an interruption in the update process, the IDX file may become inconsistent and the data unavailable.

Here are some cases which caused an interruption in the past.

  1. TEMS crash
  2. System shutdown without stopping TEMS [AIX system before ITM 623 FP3]
  3. Mount point or disk full
  4. Hardware failure where system or SAN lost power
  5. Networking outage when writing to an NFS mount

There are certainly many more possible causes. These are just the cases I have seen over the years.

The TEMS environment variable KGLCB_FSYNC_ENABLED defaults to 1 and that decreases the chances of problems. Review the environment variables in <installdir>/logs/ms.env [or Windows <installdir>\logs\ms.env]. If it is set to zero [0], you should change that setting.
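The check can be scripted. Here is a minimal Python sketch, not an IBM-supplied tool. It assumes ms.env holds plain NAME=VALUE lines and that comment lines start with # or *. Both are assumptions about the file layout, and the function names are made up for illustration.

```python
from pathlib import Path
from typing import Optional


def read_env_setting(env_file: Path, name: str) -> Optional[str]:
    """Return the last value assigned to `name` in a NAME=VALUE file, or None."""
    value = None
    for raw in env_file.read_text(errors="replace").splitlines():
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' or '*' carry no settings.
        if not line or line[0] in "#*":
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == name:
            # Drop optional surrounding quotes around the value.
            value = val.strip().strip("'\"")
    return value


def fsync_is_enabled(env_file: Path) -> bool:
    """True when KGLCB_FSYNC_ENABLED is absent (default 1) or not set to 0."""
    value = read_env_setting(env_file, "KGLCB_FSYNC_ENABLED")
    return value is None or value != "0"
```

Call fsync_is_enabled with the path to the ms.env file. A False result means the variable is explicitly set to 0 and should be reviewed.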

These are very rare cases.  When and if the problem ever hits, a recovery plan will ensure a prompt return to normal processing.

A Poor Backup Plan

While the hub TEMS is running, make a copy of the QA1* files in

Linux/Unix:  <installdir>/tables/<temsname>

Windows: <installdir>\cms

That is better than nothing but it might result in an inconsistent set of tables because tables are constantly changing. With files captured on the fly, the TEMS might not even start up.

Solution 1 – No Secondary hub TEMS

The simplest and easiest backup plan is to stop the hub TEMS before copying the QA1* files into a compressed tar or zip file. That ensures capturing a consistent state.

If you do that once a week during a maintenance period, you can always restore those files and have a consistent state. There is certainly a cost in doing that but the cost for an outage is much higher.
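The copy step could look something like the following Python sketch. It is only an illustration: the function name and the archive naming are my own choices, and it assumes the hub TEMS has already been stopped by whatever means you normally use.

```python
import tarfile
import time
from pathlib import Path


def archive_qa1_files(table_dir: Path, backup_dir: Path) -> Path:
    """Write every QA1* file in table_dir to a timestamped .tar.gz archive."""
    files = sorted(p for p in table_dir.glob("QA1*") if p.is_file())
    if not files:
        raise FileNotFoundError(f"no QA1* files found in {table_dir}")
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: tems_qa1_<date>_<time>.tar.gz
    stamp = time.strftime("%Y%m%d_%H%M%S")
    archive = backup_dir / f"tems_qa1_{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in files:
            # Store bare file names so the archive can be restored into any directory.
            tar.add(path, arcname=path.name)
    return archive
```

Here table_dir is the directory named earlier, <installdir>/tables/<temsname> on Linux/Unix or <installdir>\cms on Windows. The script does not stop or start the TEMS itself.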

Solution 1 – Recovery

  1. Stop the hub TEMS.
  2. Run pdcollect to capture the current state.
  3. Restore the QA1* files from a backup.
  4. Start the hub TEMS.

At this point all objects will be restored to the time of backup.
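Step 3 can be sketched in Python as below. This is a hedged example and not a supported procedure: the function name and the set-aside directory are hypothetical, and it assumes the backup is a tar archive holding bare QA1* file names. The current files are moved aside rather than deleted so they remain available for diagnosis.

```python
import shutil
import tarfile
from pathlib import Path


def restore_qa1_files(archive: Path, table_dir: Path, set_aside_dir: Path) -> list:
    """Move current QA1* files aside, then extract the archived QA1* files."""
    set_aside_dir.mkdir(parents=True, exist_ok=True)
    for current in table_dir.glob("QA1*"):
        if current.is_file():
            shutil.move(str(current), str(set_aside_dir / current.name))
    restored = []
    with tarfile.open(archive, "r:*") as tar:
        for member in tar.getmembers():
            name = Path(member.name).name
            # Only extract plain QA1* files, and ignore any directory part in the name.
            if not member.isfile() or not name.startswith("QA1"):
                continue
            source = tar.extractfile(member)
            with source, open(table_dir / name, "wb") as target:
                shutil.copyfileobj(source, target)
            restored.append(name)
    return sorted(restored)
```

The sketch writes file contents only. On Linux/Unix the owner, group and permissions of the restored files still have to match the installed files before the TEMS is started.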

Prepare empty table files

The next solutions require a maintenance level specific copy of all the TEMS database files representing empty files. See Note 2 for emptytable file references.

Solution 2 – Hot Backup – Valid from ITM 622 FP5

In this configuration you have two hub TEMS but only one is ever used as the primary hub TEMS.  The other hub TEMS is started for backup purposes only. This is not an actual FTO configuration but it uses FTO logic to get the job done.

The hot backup hub TEMS is configured with FTO pointed to the running hub TEMS. It also requires one additional control in the KBBENV file, which is in

Windows:        <installdir>\cms

Linux/Unix:     <installdir>/tables/<temsname>

Add this line manually

            MHM:HOTBACKUP=1
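If you would rather script the edit than do it by hand, a small Python sketch like this one would do. It treats KBBENV as a plain text file with one setting per line, which is an assumption, and both function names are invented for this example. The second function covers the later recovery step where the line is removed again.

```python
from pathlib import Path

HOTBACKUP_LINE = "MHM:HOTBACKUP=1"


def ensure_hotbackup(kbbenv: Path) -> bool:
    """Append MHM:HOTBACKUP=1 to the KBBENV file unless it is already there.

    Returns True when the file was changed.
    """
    text = kbbenv.read_text()
    if any(line.strip() == HOTBACKUP_LINE for line in text.splitlines()):
        return False
    # Make sure the new setting starts on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    kbbenv.write_text(text + HOTBACKUP_LINE + "\n")
    return True


def remove_hotbackup(kbbenv: Path) -> bool:
    """Delete the MHM:HOTBACKUP=1 line. Returns True when the file was changed."""
    lines = kbbenv.read_text().splitlines()
    kept = [line for line in lines if line.strip() != HOTBACKUP_LINE]
    if len(kept) == len(lines):
        return False
    kbbenv.write_text("".join(line + "\n" for line in kept))
    return True
```

Both functions are safe to run more than once. Keep a copy of the original KBBENV file before changing it.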

When the hot backup hub TEMS starts, it works to make sure that its own synchronized database files match the other hub TEMS. The first hub TEMS and all the remote TEMS are totally unaware of this usage. When the synchronization is complete [See Note 1] stop the Hot Backup hub TEMS and archive the QA1* files.

Solution 2 – Recovery

When a problem is found with TEMS database files, a recovery action is required. This case does require some hub TEMS down time.

1) Stop the usual hub TEMS if still running.

2) Configure the usual hub TEMS with FTO with the partner being the Hot Backup hub TEMS. Manually add the line MHM:HOTBACKUP=1 to the usual hub TEMS KBBENV file.

3) Replace the problem hub TEMS QA1* files with the saved “empty table” files.

4) Configure the Hot Backup hub TEMS to NOT use FTO.

5) Restore the backup QA1* files to the Hot Backup hub TEMS.

6) Start the Hot Backup hub TEMS.

7) Start the problem hub TEMS and wait for it to synchronize with the Hot Backup hub TEMS. [Note 1]

8) Stop the Hot Backup hub TEMS

9) Stop the usual hub TEMS.

10) Configure the usual hub TEMS so it is not using FTO and remove the line MHM:HOTBACKUP=1

11) Start the usual hub TEMS.

12) Configure the Hot Backup hub TEMS to use FTO and make sure the MHM:HOTBACKUP=1 line is present.

13) Verify normal operation.

Solution 3 – Fault Tolerant Option [FTO]

In this configuration you have two hub TEMS and one has the primary role and one has the backup role. Both hub TEMS tasks have equal user objects in the tables. Once a week or so stop the current backup hub TEMS and copy the QA1* files into a compressed tar or zip file.

Solution 3a – Recovery When one hub TEMS is OK

After a problem is found with TEMS database files, a recovery action is required. Usually the problem is on the primary hub TEMS.

1) Stop the hub TEMS with the problem [if required] and the remote TEMS tasks will switch over to the backup hub TEMS which takes on the primary role.

2) Replace the problem hub TEMS QA1* files with the saved “empty table” files.

3) Start the problem hub TEMS and wait until synchronization is complete. [Note 1]

4) Stop the usual backup hub TEMS.

5) After processing switches back to the usual primary hub TEMS start the usual backup hub TEMS again.

6) Verify normal operation.

Solution 3b – Recovery When Neither hub TEMS is OK

1) Stop both hub TEMS tasks.

2) For the hub TEMS the backup was taken on, replace the QA1* files with the saved backup files.

3) Replace the other hub TEMS QA1* files with the saved “empty table” files.

4) Start the hub TEMS with the backup QA1* files.

5) After 20 minutes start the hub TEMS with the empty files. Wait until synchronization is complete. [Note 1].

6) If needed, stop the usual backup hub TEMS. After processing switches back to the usual primary hub TEMS start the usual backup hub TEMS again.

7) Verify normal operation.

Solution 4 – FTO and Hot Backup – Valid from ITM 622 FP5

In this configuration you have two hub TEMS and one has the primary role and one has the backup role. Both hub TEMS tasks have equal user objects in the tables. Create a third hub TEMS used only for backup purposes. The two FTO hub TEMS will be totally unaware of the backup process so normal operations are unaffected.

Use the procedure documented in Solution 2. The hub TEMS used only for backup is configured in the “Hot Backup” mode with the usual primary hub TEMS as its FTO partner. Before the backup process, this new TEMS is started and the TEMS database files are synchronized. See Note 1 for determining when the synchronization is complete. This will normally complete in 10-20 minutes but you could also scan the operations log file for the named messages. At that time stop the TEMS. The QA1* files are in a stable synchronized state and are sufficient to be used for a recovery.

If a recovery is needed and one hub TEMS is OK, Solution 3a recovery is sufficient.

If a recovery is needed and both the usual primary hub TEMS and the usual backup hub TEMS are damaged, use this Solution 4 backup with the Solution 3b recovery process.

Credits

Many kudos to Richard Bennett, IBM Support L3 TEMS team lead for his extensive knowledge and his wise editing suggestions.

Summary

This document is a best practice procedure for creating a reliable backup for TEMS Database files and how to use those files in a recovery action.

Sitworld: Table of Contents

History

1.2 – Added Emptytable file references.

1.1 – Added Solution 4

1.0 – Initial publication

Note 1

The TEMS operations log is located in

Windows: <installdir>\cms\kdsmain.msg

Linux/Unix: <installdir>/logs/hostname_ms_decimaltime.log

During a recovery like this there will be a long series of messages about individual objects being updated. Look for one of the following messages in the TEMS which is being recovered:

KQM0009 FTO promoted <temsname> as the acting HUB.

KQM0013 The <temsname> is now the acting HUB.

KQM0014 The <temsname> is now the standby HUB.

These messages occur when synchronization between the primary monitoring server and the secondary monitoring server has been completed.
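Scanning the operations log for those message IDs is easy to automate. The following Python sketch is one possible way to do it. The function name is hypothetical, and it matches on the message ID text only, so it makes no assumption about the rest of the line layout.

```python
import re
from pathlib import Path

# Message IDs listed above as marking the end of synchronization.
SYNC_DONE = re.compile(r"KQM00(09|13|14)")


def find_sync_messages(log_file: Path) -> list:
    """Return every log line that contains one of the completion message IDs."""
    matches = []
    with open(log_file, errors="replace") as handle:
        for line in handle:
            if SYNC_DONE.search(line):
                matches.append(line.rstrip("\n"))
    return matches
```

An empty list means none of the messages has been logged yet. If the log file covers more than one TEMS start, check that the matching line belongs to the current recovery and not to an earlier one.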

Note 2: Emptytable Zip Files

The best practice backup/recovery process requires emptytable files. Following are references to 5 zip files for 5 ITM maintenance levels. Each zip contains three files:

1) .zip file for Windows

2) littleendian.tar – for Linux/Intel

3) bigendian.tar – for Unix [AIX/Solaris/HPUX] and Linux on Z

ITM630_emptytables

ITM623_emptytables

ITM622_emptytables

ITM621_emptytables

ITM620_emptytables

Or – access via github: https://github.com/jalvo2014/emptytables

Treat these emptytable files with great care. If some hub TEMS tables were replaced, you would lose all of the custom objects that were created manually over the years. That could mean an extensive outage and a lot of manual work to recover. On the other hand you can replace remote TEMS database files and the FTO backup hub TEMS files [when the TEMS is stopped] with full confidence.

The Linux/Unix files need some preparation and unpacking. They must have the same attributes/owner/group as the files currently installed. That is accomplished with the chmod/chown/chgrp commands.
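As an alternative to running those commands by hand, the attributes can be compared and copied from an installed file with a short Python sketch. This is an illustration only. The function names are made up, it is for Linux/Unix, and changing the owner to a different user normally requires root.

```python
import os
import shutil
import stat
from pathlib import Path


def attribute_mismatches(reference: Path, target: Path) -> list:
    """List which of mode, owner and group differ between the two files."""
    ref, tgt = reference.stat(), target.stat()
    differences = []
    if stat.S_IMODE(ref.st_mode) != stat.S_IMODE(tgt.st_mode):
        differences.append("mode")
    if ref.st_uid != tgt.st_uid:
        differences.append("owner")
    if ref.st_gid != tgt.st_gid:
        differences.append("group")
    return differences


def copy_attributes(reference: Path, target: Path) -> None:
    """Give target the owner, group and permission bits of reference."""
    info = reference.stat()
    # Set ownership first because chown can clear some permission bits.
    shutil.chown(target, user=info.st_uid, group=info.st_gid)
    os.chmod(target, stat.S_IMODE(info.st_mode))
```

Use a currently installed QA1* file as the reference and the unpacked emptytable file as the target. Run attribute_mismatches again afterwards to confirm that nothing differs.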

If you are using an emptytable for the first time or have any doubts about usage, involve IBM Support before any action.

Photo Note: Five Condors warning what might happen without a TEMS database backup process.

 
