Tuesday, January 11, 2011

DB2 for z/OS: CATMAINT and Concurrency

In the years since mainframe DB2 data sharing was introduced with DB2 for z/OS Version 4 (mid-1990s), I've done a lot of presenting and writing about the technology (e.g., a blog entry from a couple of years ago that provided an overview of the topic). From the get-go, one of the primary benefits of DB2 data sharing was the opportunity to achieve ultra-high availability. A lot of people have long understood that when you have DB2 operating in data sharing mode, planned outages for purposes such as software maintenance upgrades can be virtually eliminated: you apply fixes to your DB2 load library, then quiesce work running on member X of the data sharing group (allowing work to continue flowing to other members), then stop and restart member X to activate the maintenance, then resume the flow of application work to member X, then quiesce work on member Y, and repeat the preceding steps until all members are running the updated DB2 code (as this "round-robin" process progresses, the group runs fine with some members at maintenance level "n" and some at level "n+1").
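For readers who like to see the commands, here is a minimal sketch of one pass of that round-robin process for a single member. The command prefix -DB1A is a hypothetical one for "member X" (substitute your own), and the steps that route application work away from the member and back again are site-specific and not shown.

```
-DB1A STOP DB2 MODE(QUIESCE)
-DB1A START DB2
-DB1A DISPLAY GROUP
```

The DISPLAY GROUP output lists each member of the group along with its status and DB2 level, so it is a handy way to confirm that the restarted member is active before you move on to member Y.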

Still, there was this widely accepted notion that DB2 data sharing wouldn't deliver 24X365 availability every year, because during some years (generally speaking, once every two or three years) you'd migrate the data sharing group to a new release of DB2, and you'd need a group-wide outage to do that -- right? One of the steps involved in migrating a DB2 system to a new release of the code is the running of an IBM-supplied job, after initially starting DB2 at the new release level, that executes the CATMAINT utility. CATMAINT effects some structural changes to the DB2 catalog and directory (some new tables, some new columns in existing tables, some new and/or altered indexes on tables), and you can't have DB2-accessing application work running while THAT'S going, can you? These concerns linger in the minds of plenty of mainframe DB2 people to this day, but I'm here to tell you that they shouldn't. You CAN keep your application workload running in a DB2 data sharing group, even through an upgrade to a new release of DB2.

Here's the deal: a long time ago (and I'm talking like early 1990s), it WAS recommended that you not run application work while running CATMAINT to update the DB2 catalog and directory to a new-release structure. That was OK with most folks. After all, you had to stop the application workload anyway during a DB2 migration, because you'd of course stop your DB2 Version N subsystem and then start DB2 at the Version N+1 release level. With the flow of DB2-accessing work temporarily stopped anyway, why not leave it stopped just a little longer while CATMAINT did its thing (typically well under an hour -- and CATMAINT elapsed time went down significantly starting with the DB2 V8 to V9 migration process)?

Along came data sharing, and implementers of this technology by and large stayed with the old practice of not having application work running during CATMAINT execution (and throughout this entry, I'm referring to CATMAINT being used to effect catalog and directory changes as part of a DB2 release migration, as opposed to the other CATMAINT options introduced with DB2 9 to facilitate large-scale changes of VCAT or OWNER or SCHEMA name for objects in a DB2 database). Many people just didn't think about doing otherwise, but as the need for super-high availability became increasingly prevalent, more and more DB2 administrators started to explore the possibility of running DB2-accessing application programs during CATMAINT execution, and found that it was indeed technically possible. As time goes by, continuing the flow of application work during CATMAINT execution is becoming more common at DB2 sites, especially at sites running DB2 in data sharing mode. 24X365Xn (with "n" being greater than 1 and including years during which DB2 is migrated to a new release) really is possible with a DB2 data sharing group.

Now, is this deal completely catch-free? Not entirely. Because CATMAINT does change some objects in the DB2 catalog and directory, it can make these objects temporarily unavailable (the particular changes made by CATMAINT will vary, depending on the DB2 release to which you're migrating). The duration of that unavailability is likely to be pretty brief, particularly so since the big CATMAINT speed-up delivered with DB2 9, but if an application program happens to require access to one of these objects while it's being changed by CATMAINT, the program could fail with a timeout or "resource unavailable" error code. Similarly, the CATMAINT utility itself could fail due to contention with application work. If that happens, it's NOT a disaster: you'd terminate the job with the TERM UTILITY command and then re-execute CATMAINT from the beginning (you'd actually resubmit migration job DSNTIJTC, which executes CATMAINT).
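As a rough sketch of that terminate-and-rerun sequence, the commands might look like the following. Here utility-id is a placeholder, not a real value: take the actual utility ID of the stopped CATMAINT job from the DISPLAY UTILITY output. The two DISPLAY DATABASE commands are an optional check I'd suggest, on the assumption that a terminated run could leave catalog or directory objects in a restricted state.

```
-DISPLAY UTILITY(*)
-TERM UTILITY(utility-id)
-DISPLAY DATABASE(DSNDB06) SPACENAM(*) RESTRICT
-DISPLAY DATABASE(DSNDB01) SPACENAM(*) RESTRICT
```

Once the utility has been terminated and any restricted objects have been dealt with, resubmit DSNTIJTC so that CATMAINT runs again from the beginning.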

So, some disruption is possible. To minimize conflict, consider the following:
  1. Run CATMAINT during a period of relatively low application workload volume. At some sites, batch activity is halted during the time of CATMAINT execution, so that only online programs are accessing DB2 (this can be quite do-able, as the batch suspension may have to be in effect for only a few minutes).
  2. Avoid executing DDL statements (e.g., CREATE TABLE) while CATMAINT is running.
  3. Avoid package bind and rebind actions while CATMAINT is running.
These same guidelines apply with respect to running application work during the execution of the CATENFM utility, which makes catalog and directory changes necessary for the migration of a DB2 for z/OS subsystem from Conversion Mode to New Function Mode within a release level.

The bottom line: if you have a DB2 data sharing group and you're thinking that you'll have to stop application access to DB2 as part of the migration to a new release of the DBMS, think again. You CAN keep an application workload going while CATMAINT is running on one of the members of the data sharing group (and that includes running application work on the member on which CATMAINT is executing), and you can do the same with regard to CATENFM execution. If you want to take a brief application outage (referring to DB2-accessing programs) while CATMAINT is running, you can of course do that. Just know that you don't HAVE to. Figure out what's appropriate for your organization, and proceed accordingly. 

6 comments:

  1. Robert,

    Well, that's "New and Different" -- I'm not sure I like allowing access while I'm updating the Catalog. It sort of sounds like trying to change the Oil while someone's driving the car. Anyway, what happens if the "unthinkable" occurs where user access trashes CATMAINT and you end up with a very unhappy CATALOG (not to mention the Client)? While I have a great deal of faith in the folks from the LAB, this one sounds a bit too much like "Magic" to me -- I've got to see this one...

    .....Ken Hynes

  2. I hear what you're saying, Ken, but I stand by what I wrote. CATMAINT (and CATENFM) is designed for concurrency and restartability. You say you've got to see it, and I'll tell you that there are sites that do this in the real world (referring to the execution of CATMAINT while the application workload is running). Backing up the catalog and directory beforehand is a wise precaution, but I've yet to hear of anyone who has ended up with a corrupted catalog as a result of running CATMAINT and application work concurrently using a recent version of DB2 (Version 8 and up). Note that concurrency of CATMAINT and application work tends to be most important to data sharing users, because they actually can keep DB2 available to application programs through a version upgrade (in a standalone DB2 environment you have to stop and start the one DB2 subsystem anyway to go from version X to version Y). Note also that people who run CATMAINT (and CATENFM) while application work is executing generally try to do that during a period of relatively low application activity -- this to reduce the odds of an application time-out (or a CATMAINT time-out that will require restarting the utility). I'm using the term "restart" loosely here: if CATMAINT doesn't complete successfully, you terminate it and rerun the utility from the beginning. If a terminated execution of CATMAINT leaves some indexes in REBUILD-pending status, we have (in DB2 9 and up) online index rebuild.

  3. Stand by what you wrote as much as you like (and I respect your blogging immensely) - in the real world there are real sites out there that have experienced total outages thanks to catalog contention killing catmaint jobs and then leaving the catalogs in an unknown state - particularly skip level migration to V10 - there's a number of HIPERS in that area. I'd take an outage rather than risk being sacked after a job that should have worked didn't - this is the real world and things go wrong.

    Replies
    1. I hear you. I won't deny that in the real world, there are organizations that have run into problems with online migration of a DB2 for z/OS system (and we're talking about a DB2 data sharing group on a Parallel Sysplex -- an outage is necessary in migrating a standalone subsystem to a new version of DB2, because you have to stop the subsystem at the "n" release level in order to restart it as a release "n+1" subsystem). I'll also say this: I work in the real world, with real DB2 for z/OS-using companies, and I know for a fact that organizations have successfully migrated DB2 systems from one version to another without ever stopping the DB2-accessing application workload. Is there some risk in taking that approach, which involves running CATMAINT while applications are accessing data in the system? Yes. What's the surest way to eliminate that risk? Do as you suggest: take a group-wide outage, from a data access perspective (probably don't need more than an hour or two, depending on the size of your catalog), and have CATMAINT be the only thing executing on the system when it runs.

      If that approach (take the group-wide outage) virtually eliminates the risk of CATMAINT-related problems, why doesn't everyone go this route? Why do some organizations decide to go for online migration? They do that because truly 100% application uptime is so important to these companies that they will take a migration approach that involves more risk in order to preserve continuous application access to data. Obviously, if you decide that this more-risky approach is the one to take, you'll take steps to mitigate that risk: you'll plan the scheduling of the CATMAINT run very carefully (perhaps at a time at which the DB2 application workload volume is at low ebb), you'll ensure that the subsystems in your DB2 data sharing group are current on maintenance (including application of HIPER PTFs), and you'll take multiple steps to ensure that access to the catalog (by people or processes other than CATMAINT) is kept to a minimum (and this would be a factor in the scheduling decision).

      Bottom line: yes, there is risk involved in online migration to a new release of DB2 (DB2 11 delivers some enhancements in this area). That being the case, take a group-wide application data-access outage (again, it shouldn't be too long) to run CATMAINT, unless even this relatively brief outage would be a serious problem for the business. If it would, you can do an online migration (other organizations have). If you decide to do this, put in the time to carefully plan and get all your ducks in a row. And, just in case something goes wrong, have a contingency plan.

      Robert

  4. Great!!!! Could you please provide the backup plan if an online migration (N to N+1) fails?

    Replies
    1. I assume you're asking about what one would do if one were executing an online Db2 for z/OS migration - meaning, running CATMAINT on one member of a Db2 data sharing group while application work runs on other members - and CATMAINT failed. In that case, one would terminate the CATMAINT process via a TERM UTILITY command, and rerun CATMAINT from the beginning (see CATMAINT information online, particularly the part under the heading, "Termination or restart of CATMAINT": https://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/ugref/src/tpc/db2z_utl_catmaint.html).

      Robert
