Friday, January 20, 2023

Db2 for z/OS: What is "Wait for Other Read" Time, and What Can You Do About It?

A recent conversation I had with some folks who support a large Db2 for z/OS system reminded me of the importance of something called "wait for other read time." In this blog entry I want to make clear to people what Db2 wait-for-other-read time is, why it's important, how to monitor it and what to do about it if it becomes an issue.


What is Db2 for z/OS "wait for other read" time?

In Db2 performance monitoring parlance, time associated with SQL statement execution is known as "in-Db2" time. It's also called "class 2" time, because it is recorded, for monitoring purposes, in records that are generated when Db2 accounting trace class 2 is active. Class 2 elapsed time (elapsed time pertaining to SQL statement execution) has two main components: CPU time (some of which is consumed on so-called general-purpose processors - aka "engines" - of a mainframe server, and some of which might be consumed on what are known as zIIP engines) and suspend time (on a busy system there can be another component of in-Db2 time, called "not accounted for" time, that generally reflects wait-for-dispatch time). In-Db2 suspend time is also known as "class 3" time, because it is recorded in Db2 accounting trace records when accounting trace class 3 is active. Class 3 time is broken out in a number of categories, and these show up in an accounting long report that might be generated by your Db2 monitor, or by a Db2 monitor's online display of thread detail information.

In a Db2 monitor-generated accounting long report, class 3 suspend times are shown as "average" values. Average per what? Well, if you're looking at information for a Db2-accessing batch workload (referring to jobs that run in z/OS JES initiator address spaces and access Db2 by way of Db2's call attachment facility or TSO attachment facility), it'll be average per batch job (generally speaking, activity for one batch job will be recorded in one Db2 accounting trace record). If you're looking at a transactional workload (e.g., a CICS-Db2 workload, or a Db2-accessing IMS transactional workload, or a DDF client-server workload), the "average" values seen in a Db2 monitor-generated accounting long report will typically be average per transaction.

In many cases, the majority of in-Db2 time for a batch or a transactional workload will be class 3 suspend time (it is a little unusual, but certainly not unheard of, for a Db2 workload's in-Db2 time to be mostly CPU time). More often than not, the largest component of in-Db2 class 3 suspension time will be wait-for-synchronous-database-read time. Another wait-for-read time is labeled "wait for other read." What's that? Well, if it's "other than" synchronous read wait time, it must be asynchronous read time, right? Right, indeed. And what are asynchronous reads? Those are prefetch reads: read I/Os driven by Db2 in anticipation that the pages read into memory in bulk in this way will be requested by the process (such as an application process) that prompted Db2 to issue the prefetch read requests. Well, if a prefetch read I/O operation is executed because Db2 is aiming to get pages into a buffer pool in memory before they are requested by (for example) an application process, why would there be such a thing as a process having to wait for a prefetch read to complete?

Wait-for-prefetch read (reported as "wait for other read") happens because there are usually lots of Db2-accessing processes active in a system at one time. Let's call two of these processes process A and process B, and let's say that Db2 is driving prefetch reads (these could be sequential, list or dynamic prefetch reads - more on that in a moment) for process A. We'll further suppose that Db2 needs to access page 123 of table space TS1 on behalf of process B (i.e., Db2 issues a GETPAGE request for page 123 of table space TS1). If page 123 of table space TS1 is not already in the buffer pool to which TS1 is assigned, Db2 will drive a synchronous read request to get that page into memory, right? Not necessarily. It could be that page 123 of TS1 is already scheduled to be brought into memory via a prefetch read that is being executed on behalf of process A. If that is the case then process B will wait for that in-flight prefetch read to complete, and that wait time will be recorded as "wait for other read time" for process B. [It is also possible that process A has gotten to the point that it needs to access page 123 of TS1, and the prefetch read that will bring that page into memory is currently in-flight, and that would end up causing wait-for-other-read time for process A related to the prefetch request being driven on behalf of process A, but I think it's more likely that wait-for-other-read time will be associated with one process waiting on completion of a prefetch read operation that is being executed on behalf of another process.]


Why is wait-for-other-read time important?

Usually, wait-for-other-read time is a relatively small percentage of total class 3 suspend time for a process (it's typically much smaller than wait-for-synchronous-read time), but that's not always the case. In some situations, wait-for-other-read time is a major component of overall in-Db2 suspend time. The performance impact of elevated wait-for-other-read time can be especially significant for batch applications, as these Db2 processes are often particularly reliant on prefetch to achieve elapsed time objectives. If wait-for-other-read time gets too large then service levels could degrade, leading to user dissatisfaction.


How can wait-for-other-read time be monitored?

As mentioned previously, wait-for-other read time is recorded in accounting long (i.e., accounting detail) reports that can be generated by a Db2 performance monitor; so, you can track that for a process or a workload over time and note trends. Besides wait-for-other-read time itself, are there any other related fields in Db2 monitor-generated reports that you should keep your eye on to help ensure that a wait-for-other-read time problem does not sneak up on you? Yes, as explained below.

The "other related fields" that I'd recommend checking out are found in a Db2 monitor-generated statistics long report (i.e., statistics detail report). In such a report you would see, for each buffer pool, a set of fields like those shown below (this is a snippet of a statistics long report generated by the IBM OMEGAMON for Db2 for z/OS performance monitor - I've added some A, B, C labels that I'll subsequently use in referencing various of these fields):


BP1 READ OPERATIONS          QUANTITY  /SECOND
---------------------------  --------  -------

SEQUENTIAL PREFETCH REQUEST   5622.00     3.23   A
SEQUENTIAL PREFETCH READS     5587.00     3.21   B
PAGES READ VIA SEQ.PREFETCH  52950.00    30.43   C
S.PRF.PAGES READ/S.PRF.READ      9.48            D
LIST PREFETCH REQUESTS       47394.00    27.24   E
LIST PREFETCH READS           5876.00     3.38   F
PAGES READ VIA LIST PREFTCH    154.9K    89.03   G
L.PRF.PAGES READ/L.PRF.READ     26.36            H
DYNAMIC PREFETCH REQUESTED     378.3K   217.42   I
DYNAMIC PREFETCH READS         157.6K    90.59   J
PAGES READ VIA DYN.PREFETCH   3110.6K  1787.68   K
D.PRF.PAGES READ/D.PRF.READ     19.73            L

By way of explanation, I'll first point out that what you see above are three repeating sets of fields (4 fields in each set) that pertain to sequential, list and dynamic prefetch activity. Here are thumbnail definitions of these prefetch types:

  • Sequential - Generally speaking, this is the prefetch mode used for table space scans or for non-matching index scans. In other words, if Db2 determines that a front-to-back scan of a table space or index will be required, sequential prefetch will be used (assuming that the table or index in question is not super-small, in which case prefetch of any kind would usually not make sense).
  • List - This is the prefetch type used when Db2 is retrieving table rows based on a list of row IDs (RIDs) that have been retrieved from an index (or from more than one index, if index ANDing or index ORing is part of the access plan for the query). List prefetch can be efficient if the clustering sequence of rows in the target table is substantially uncorrelated with respect to the order of entries in the index in question (the list of RIDs obtained from the index is sorted in ascending RID sequence and then the sorted RID list is used to prefetch pages of associated rows from the target table). The hybrid method of joining tables is another driver of list prefetch activity.
  • Dynamic - This prefetch method is dynamically initiated at statement execution time when Db2 recognizes a sequential pattern of data access as it retrieves rows. Matching index scans are often drivers of dynamic prefetch activity.
OK, so here are a couple of things to keep an eye on, if you want to avoid a surprise situation involving elevated levels of wait-for-other-read time for processes that use prefetch to access pages of objects assigned to a given buffer pool:

  • Prefetch reads relative to prefetch requests - This tends to be more important for list and dynamic prefetch (less so for sequential prefetch, owing to locality of reference being less of a factor in that case). For list and dynamic prefetch, then, compare the number of prefetch reads to the number of prefetch requests (i.e., compare F to E, and J to I, using the letter-labels I added to the statistics report snippet shown above). What's this about? Well, a prefetch request is just that - a request to read a certain chunk of pages from a table space or an index into the assigned buffer pool. Suppose the prefetch request is for 32 pages (the most common quantity), and suppose that all 32 of those pages are already in the buffer pool. In that case, the prefetch request will not drive a prefetch read I/O operation. The larger the number of buffers allocated for a pool, the greater the likelihood that all pages associated with a prefetch request will already be in memory, thereby reducing prefetch reads as a percentage of prefetch requests. If you see the percentage of prefetch reads relative to prefetch requests going up over time for a pool, especially for list and/or dynamic prefetch, that's an indication that elevated levels of wait-for-other-read time could be in the offing. Why? Because more prefetch reads will generally mean more waiting for prefetch reads to complete.
  • The number of pages read per prefetch read - These are the fields labeled D, H and L in the example statistics report snippet. If you see that number going up for one or more prefetch types (sequential, list, dynamic), it could be an early-warning sign of higher wait-for-other-read times. Why? Because a prefetch read that will bring 25 pages into memory is likely to take longer than a prefetch read that will bring 5 pages into memory (recall that a prefetch read I/O is driven to bring into memory the pages, associated with a prefetch request, that are not already in the buffer pool). When prefetch reads take longer to complete, it is likely that application processes will see higher levels of wait-for-other-read time.
At this point you may have put two and two together, and are thinking, "Hmm. It seems to me that a growing number of prefetch reads relative to prefetch requests combined with an increase in the number of pages read into memory per prefetch read would really be a flashing yellow light with respect to wait-for-other-read time." Right you are. In that case, two things are happening, and both have negative implications for wait-for-other-read time: there are more prefetch reads (because there are fewer cases in which all pages associated with a prefetch request are already in memory) and each prefetch read, on average, is taking longer to complete (because each read, on average, is brining more pages into memory). If too much of that goes on, you could "hit the curve of the hockey stick" and see a sharp and sudden increase in applications' wait-for-other-read times. Better to take a corrective action before that happens. But what?

Glad you asked...


What can you do to reduce (or head off an increase in) wait-for-other-read time?

If wait-for-other-read time has become problematic, or if you see the warning signs and want to take a preemptive action, what can you do? Here are some possibilities:

  • Increase the size of the buffer pool in question - Simple: more buffers in a pool leads to increased page residency time, and that leads to 1) more prefetch requests NOT leading to prefetch reads (because all pages associated with a request are already in memory) and 2) fewer pages, on average, per prefetch read. Fewer prefetch reads + quicker execution of prefetch reads that do occur = less wait-for-other-read time. Obvious related question: "Can I make a buffer pool bigger? I don't want to put too much pressure on the z/OS LPAR's real storage resource." My response: check the LPAR's demand paging rate (available via an RMF Summary Report for the LPAR). If the demand paging rate is zero or a very small non-zero value (i.e., less than 1 per second), there is little to no pressure on the real storage resource, and you have a green light for making the buffer pool bigger. If the demand paging rate is 2-3 per second or more, and you don't want it to go higher than that (I wouldn't want it to go higher than that), consider reducing the size of a buffer pool that has low GETPAGE-per-second and read-I/O-per-second values, and increase the size of the buffer pool of concern by a like amount (so, the overall size of the buffer pool configuration remains the same). In my experience, plenty of Db2 for z/OS-using organization under-utilize the real storage resources of production z/OS LPARs.
  • Change some query access plans - If it looks as though sequential prefetch reads are the primary contributor to higher wait-for-other read times, you can consider taking actions that would reduce table space scan and/or non-matching index scan activity. For that, you could potentially use a query monitor to identify longer-running queries that access objects assigned to the buffer pool in question, and examine EXPLAIN output for those queries to see if any of them have table space scans and/or non-matching index scans in their access plans that involve objects assigned to the buffer pool. Then, consider whether it would be worth it to create a new index or indexes to eliminate such scans (there are cost factors associated with new indexes - you want the benefit to outweigh the cost), or whether simply adding a column to an existing index might reduce scan activity (there is a cost associated with that, too, but it's not as high as the cost of a new index). For dynamic prefetch, keep in mind that this is often related to matching index scans. You can sometimes reduce that activity by enabling Db2 to do more result set row filtering at the index level, and that often involves trying to increase MATCHOLS values for one or more predicates of longer-running and/or more-frequently-executed queries (referring to the name of a column in the PLAN_TABLE in which EXPLAIN output is found). Boosting MATCHCOLS can involve things such as changing an index (add a column, or change the order of columns in a key - keeping in mind that the latter change could benefit some queries and negatively impact others), or maybe re-coding some non-indexable predicates to make them index-able, or maybe adding a predicate that does not change a query's result set. For list prefetch, keep in mind that this often has to do with rows in a table being clustered in a sequence that is very different from the sequence of the index used in the list prefetch operation. You might consider whether a table's clustering key is what it should be - the clustering key of a table can always be changed, and sometimes it makes sense to do that. Also, when index ANDing is driving a lot of list prefetch activity, increasing index-level filtering can help (maybe by adding a column to an index involved in the ANDing, to increase the column-match number, or adding an index that would take the number of indexes AND-ed from n to n+1).
  • Take a look at RID pool activity - List prefetch operations involve use of the Db2 subsystem's RID pool. If the RID pool can't be used for a RID processing operation, Db2 will fall back to a table space scan for the target table, and that can drive sequential prefetch numbers up. RID pools these days are MUCH larger than they used to be (the default RID pool size in a Db2 12 or Db2 13 system is 1 GB), so incidences of RID processing "failed (or not used) due to lack of storage" - something that is indicated in Db2 monitor-generated accounting long as well as statistics long reports - are now quite rare. What you could potentially see, however, in the RID processing block of a statistics long report is a relatively large number of occurrences of "failed due to RDS limit exceeded." What this RID-pool-not-used counter means: If Db2 is executing a query, and a RID list processing action commences, and Db2 determines that more than 25% of the RIDs in the index being accessed will be qualified by the predicate in question, Db2 will abandon the RID list processing action in favor of a table space scan. Can you do anything about this? Maybe. In the case of a static SQL statement, it is my recollection that this RDS limit value is embedded in the associated Db2 package; so, at bind time, Db2 notes the number of RIDs that would exceed 25% of the RIDs in an index that is to be used as part of a RID-list-utilizing access plan action. Why is this potentially important? Because an index could grow substantially in the months (or years) following the most recent bind or rebind of a package. What this means: if you have an application process for which there are many occurrences of "RID list processing failed - RDS limit exceeded," check how long it has been since the package (or packages) associated with the process were last bound or rebound. If it's been a long time, and if you think that relevant indexes have grown substantially since then, consider rebinding the packages - if that rebind results in new and larger "this is the RDS limit threshold for this index" values being embedded in the package, that value increase might be enough to reduce incidences of "RID list processing failed - RDS limit exceeded" for the package.
OK, that's what I've got on this topic. As I mentioned up front: in-Db2 wait-for-other-read time is usually not a matter of concern for Db2 application performance. In some cases, it can be an issue. This blog entry is aimed at helping you should such a case arise at your site (or even better, to help ensure that it doesn't become an issue for you).

Wednesday, December 21, 2022

Db2 13 for z/OS: Setting Lock Timeout Limit and Deadlock Priority at the Application Level

Db2 13 for z/OS, which became generally available about seven months ago, introduced two interesting features that are similar in some ways but differ in one important aspect (about which I'll comment momentarily). These new features allow an application (or, more broadly, a "process") to set its own lock timeout limit and/or its own deadlock priority. With this blog entry I aim to provide related information that will be useful for you.


Application-level lock timeout

First, let's establish the need for this Db2 13 enhancement. Historically, there has been one lock timeout limit - specified via the IRLMRWT parameter in ZPARM - that applies to all processes interacting with a Db2 subsystem. While IRLMRWT is still there in a Db2 13 environment, it became apparent some time ago that "one size fits all" will often NOT be ideal when it comes to lock timeout in a Db2 system. Think about it. Suppose the value of IRLMRWT is at the default of 30 seconds for a production Db2 system at your site. You might have a developer of a Db2-accessing online application say, "What? NO! This app has a mobile front-end and users can get VERY frustrated if they have to wait more than a few seconds for a transaction to complete. It would be TERRIBLE to have a transaction sit and wait for 30 seconds to get a Db2 lock. We need the lock timeout value to be WAY lower than 30 seconds." At the same time, a developer of a long-running batch application might say, "What? NO! This job HAS to complete once it gets started or we miss SLAs and have angry customers. The job typically runs for five hours, and maybe it's been running for four hours and you want to time it out because it's been waiting for a lock for 30 seconds? 30 seconds is NOTHING as far as this job's concerned. The Db2 lock timeout value should be SUBSTANTIALLY greater than 30 seconds." Both of the developers are expressing legit concerns. How can those disparate concerns be addressed?

They can be addressed via the new (with Db2 13) special register named CURRENT LOCK TIMEOUT (available for use when Db2 13 function level 500 has been activated). Here are some things to know about CURRENT LOCK TIMEOUT:

  • The value of the special register - expressed in seconds - can be anything between -1 and 32767 (or a site-specified upper bound - see the next item in this list). A value if -1 means that the process will not be timed out if it ends up waiting for a lock - it will wait until it gets the requested lock or becomes deadlocked with some other process. A value of 0 means that the process does not want to wait at all for a lock - it wants to get an error message if a requested lock can't be obtained immediately (this basically makes available for application use a formerly Db2-internal mechanism known as a conditional lock request).
  • If the default upper-limit value of 32767 seconds is deemed by a Db2-using organization to be too high, a different max value can be provided via the new (with Db2 13) ZPARM parameter SPREG_LOCK_TIMEOUT_MAX. If you set that value to (for example) 1800, no process will be able to set the CURRENT LOCK TIMEOUT special register to a value greater than 1800 seconds.
  • If a lock timeout occurs and an application-level timeout limit was in effect for the lock requester and/or for the lock holder, that will be reflected in the information provided via the DSNT376I lock timeout message generated by Db2.
  • The value of the CURRENT LOCK TIMEOUT special register can be set automatically for an application by way of the Db2 profile tables, and not just for DDF-using applications (more information on this is provided below).

Application-level deadlock priority

A deadlock, of course, happens when process A holds a lock that process B needs in order to proceed, and process B holds a lock that process A needs in order to proceed. With both processes in a not-able-to-proceed state, Db2 detects the deadlock and chooses a "winner" and a "loser." The "loser" process is rolled back, causing it to release locks it had held, and that enables the other process (the "winner") to acquire the lock for which it had been waiting.

All well and good, except for the fact that one traditionally has been able to do little to nothing to influence Db2's choice of winner and loser in deadlock situations. That changes starting with function level 501 of Db2 13, thanks to a new built-in global variable named DEADLOCK_RESOLUTION_PRIORITY.

Before providing some helpful (I hope) items of information about DEADLOCK_RESOLUTION_PRIORITY, let me point out a very important difference between this new feature and the previously-described CURRENT LOCK TIMEOUT: the latter is a special register, while the former is a global variable. Why is that notable? One simple reason: any process can set the value of a special register, but a process must have permission to set the value of a global variable. The rationale for making DEADLOCK_RESOLUTION_PRIORITY a global variable may already be clear to you: if a value for DEADLOCK_RESOLUTION_PRIORITY could be set by any process, one could imagine everyone setting the global variable to its maximum value ("I always want to be the winner in a deadlock situation"), and that would defeat the purpose of the new capability (as the bad guy, Syndrome, in the movie The Incredibles put it, "When everyone is super, no one will be"). The permission-only nature of DEADLOCK_RESOLUTION_PRIORITY means that (for example) a Db2 DBA can assign the max-priority value to a "must-complete" database administration process, and know that under almost any circumstances (exceptions noted below) the process will be the winner in case of a deadlock. The same could be done - with permission received from a DBA - for a high-priority application process.

OK, items of related information that might be good for you to know:
  • The maximum value for DEADLOCK_RESOLUTION_PRIORITY is 255 (the range of acceptable values for the global variable is 0-255).
  • If you're a Db2 DBA, you might think, "I have a process that I think of as 'should-complete,' versus 'must-complete.' I want that process to generally be the winner in a deadlock situation, but I don't want it to get in the way of a 'must-complete' process. If 255 is a good DEADLOCK_RESOLUTION_PRIORITY value for a 'must-complete' process, what would be a reasonable priority value for a 'should-complete' process?" There is not a totally straightforward answer to that question. What you could do is this: start with some value for the 'should complete' process (maybe 150, or maybe 200, for example), and see if it ends up becoming the loser in a deadlock situation. If that happens, you can see how the priority of the "winner" process compared to the priority that you assigned to your 'should-complete' process, and potentially adjust your process's priority accordingly. How could you see the deadlock priority of a process that "beat" your process? That information is available via the IFCID 172 Db2 trace record. Activating IFCID 172 should involve very little overhead, as the trace record captures information about deadlocks, and deadlocks tend to be unusual in most Db2 systems I've seen. By the way, you should be able to use your Db2 monitor to generate a report with formatted information from IFCID 172 trace records (if you use IBM's OMEGAMON for Db2 monitor, the report to use for this purpose is called the Record Trace Report - that report can format the information in most any Db2 trace record).
  • I mentioned previously that there are exceptions to the "255 always wins" rule. Even if DEADLOCK_RESOLUTION_PRIORITY has been set to 255 for a process, that process could be the loser if it gets deadlocked with a process that is changing data in a table space defined with NOT LOGGED (hard for Db2 to roll back a unit of work when there are no associated undo records in the log), or if it gets deadlocked with a rollback or an abort or a backout process.

Setting the lock timeout limit or deadlock priority automatically for an application

What if you want an application to have a certain lock timeout limit or a certain deadlock priority, but you don't want the application to have to issue a SET CURRENT LOCK TIMEOUT or a SET SYSIBMADM.DEADLOCK_RESOLUTION_PRIORITY statement in order to accomplish the objective (SYSIBMADM is the schema for user-set-able built-in Db2 global variables - SYSIBM is the schema for built-in global variables that are set by Db2)? No problem: you can get that done using the Db2 profile tables. "Yeah," you might say, "but I want to do this for a local-to-Db2 application, and the profile tables can be used to set special register or built-in global variable values only for DDF-using applications." Actually, with Db2 13 that statement is no longer entirely true. Db2 13 allows the setting of CURRENT LOCK TIMEOUT (starting with function level 500) and DEADLOCK_RESOLUTION_PRIORITY (starting with function level 501) via profile table entries for local-to-Db2 as well as for DDF-using applications (for other special registers and built-in global variables, value-setting by way of the profile tables remains do-able only for DDF-using applications).

For a DDF-using application, the profile-defining specification (what you put in SYSIBM.DSN_PROFILE_TABLE) can be in a Db2 13 environment what it could be in a Db2 12 environment (the auth ID an application uses when connecting to the Db2 system is one example; the IP address of an application server is another example). For a local-to-Db2 application, the profile-defining specification can be auth ID and/or role, or collection name and/or package name, or the value of CLIENT_APPLNAME or CLIENT_USERID or CLIENT_WRKSTNNAME. The online Db2 13 for z/OS documentation provides additional details on using the profile tables to set values for special registers and for built-in global variables.

And there you have it. Db2 13 provides more control - and more-granular control - over two important aspects of application execution. I hope that these new capabilities will be useful at your site.

Friday, November 11, 2022

A Case Study: Using Db2 for z/OS Monitor Reports to Zero In on a Performance Problem

I had an interesting exchange recently with a Db2 for z/OS systems programmer. This individual had been asked to assist in determining the cause of a performance problem impacting a Db2-accessing application. The sysprog shared with me a Db2 monitor-generated accounting report showing activity for the application process, and a Db2 monitor statistics report covering the same time period for the same Db2 subsystem. In those two reports were the clues that pointed to the source of the application performance problem, and with the probable problem source identified the associated mitigating actions were readily determined. In this blog entry, I'll take you through my analysis of the Db2 monitor-provided accounting and statistics information, and the rationale behind my recommended steps for resolving the problem. My aim is not only to shed light on a particular performance-affecting issue and related remediating moves, but also to illustrate a methodical approach for analyzing Db2 application performance issues in general.


Input for analysis

I find Db2 monitor generated accounting and statistics reports to be extremely useful for analysis of application performance problems. In both cases, what you want is the detailed form of the report. In the case of IBM's OMEGAMON for Db2 performance monitor (the one with which I'm most familiar), you're talking about the ACCOUNTING REPORT - LONG and the STATISTICS REPORT - LONG (for other Db2 monitor products, these reports might have titles like, "detailed summary of accounting information" or "statistics detail report"). For the accounting report, your preference is to see activity pertaining exclusively to the application process for which the performance issue has arisen. This is usually done by using the monitor's data-filtering capabilities to include, for report-generation purposes, only the Db2 accounting records of interest (an accounting report is basically information from Db2 accounting trace records, formatted for readability). Db2 accounting trace records have all kinds of identifier fields, so you can get pretty specific. In the case about which I'm writing here, the relevant filtering criteria were the authorization ID of the application process of interest, and the FROM and TO times that bracketed the period during which the performance problem occurred.

As for the statistics report, what you want is one that covers the same time period as the accounting report (same FROM and TO times), for the same Db2 subsystem.


Looking for clues

What the Db2 sysprog had been told by the development team is that the application in question started out performing well, and then slowed down a lot (important input). The sysprog and I looked first at the accounting report, and in doing that we focused initially on the "class 3 suspension" information (this information, obtained from Db2 accounting trace class 3 records, has to do with "known" wait events, as opposed to "not accounted for" time, about which I'll comment momentarily). Why did we look there first? Because (in my experience), when a Db2-accessing process starts out performing well and then slows way down, it's often due to a substantial increase in one or more of the "wait times" captured by accounting trace class 3 (yes, an application slowdown could be related to a significant CPU time increase, but I've seen that less often than I've seen large wait time increases).

Looking at the class 3 times for the application process, what jumped out was a very large value for the average DB2 LATCH time ("average" is average per accounting trace record, which typically equates to average per transaction or average per batch job, depending on the workload type). Here, I mean "large" in terms of average DB2 LATCH time being a large percentage of average TOTAL CLASS 3 time. Usually, DB2 LATCH time is a very small percentage of TOTAL CLASS 3 time, with "wait time" categories such as SYNCHRON DATABASE I/O and OTHER READ I/O accounting for the bulk of TOTAL CLASS 3 time. A (proportionately) really large DB2 LATCH time is usually an indicator that something's not right.

The first thing I look at when I see unusually large DB2 LATCH time for a Db2-accessing process is the "in-Db2 not-accounted-for time" for that process. Several Db2 monitor products calculate that for you - in an IBM OMEGAMON for Db2 accounting report, the field is labeled NOTACC, and it's shown, on a sideways bar chart at the top of an accounting report, as a percentage of average in-Db2 elapsed time. If you need to calculate this figure for yourself, the denominator is average in-Db2 elapsed time (aka "class 2" elapsed time), and the numerator is average in-Db2 elapsed time minus in-Db2 CPU time (general-purpose plus zIIP, or "specialty engine," CPU time) minus TOTAL CLASS 3 time. In other words, it's the percentage of in-Db2 elapsed time that is not CPU time and not "identifiable" wait time. For a transactional application process (as was the case for the situation about which I'm writing), as a general rule you want in-Db2 not-accounted-for time to be less than 10%. If that figure is substantially greater than 10%, it's indicative of a CPU-constrained environment, and if the environment is highly CPU-constrained then DB2 LATCH time can get really large (as pointed out in an entry I posted to this blog a few years ago).

Well, in this particular case the average in-Db2 not-accounted-for time for the application process was 11% of in-Db2 elapsed time - a little on the high side for a transactional process, but not high enough to explain a really large DB2 LATCH time. With that cause of elevated DB2 LATCH time pretty much eliminated, it was time to turn to the breakdown of latch wait events for different latch categories, and that's where the statistics report comes in. In a Db2 monitor-generated statistics long report, the latch suspend count information will likely look something like this (what you see below is from an OMEGAMON for Db2 statistics long report, but it is NOT from the report I reviewed with the Db2 systems programmer - we jointly viewed that report in a web meeting, and I do not have a copy of the report):

LATCH CNT   /SECOND   /SECOND   /SECOND   /SECOND
---------  --------  --------  --------  --------
LC01-LC04      0.00      0.00      0.00      0.00
LC05-LC08      0.00      0.74      0.00      0.41
LC09-LC12      0.00      0.02      0.00      0.32
LC13-LC16      0.00     12.89      0.00      0.00
LC17-LC20      0.00      0.00      0.01      0.00
LC21-LC24      0.04      0.00      1.96      2.84
LC25-LC28      0.12      0.02      0.01      0.00
LC29-LC32      0.06      0.04      0.00      0.28
LC254          0.00

What I saw in the report I reviewed in the web meeting with the Db2 sysprog (and again, that's NOT what you see above - the snippet above is provided so that you can see what the latch suspend count information looks like in a statistics report) was a particularly high value for latch class 6 suspend events (that would be in the position highlighted in green in the sample report snippet above). What is latch class 6? It has to do with index page split activity in a Db2 data sharing environment (by the way, a handy page for seeing the activities associated with various Db2 latch classes is this one from the OMEGAMON for Db2 documentation).

Let's unpack that. An index page split occurs when Db2 has to insert an entry in an index page because of an insert (or an update of an indexed column) and that page is full. In that situation, a portion of the entries in the page will be moved to what had been an empty page in the index, so that there will be room in the formerly-full page for the new entry. What does Db2 data sharing have to do with this (and in the environment about which I'm writing, Db2 is running in data sharing mode)? In a data sharing system (versus a standalone Db2 subsystem), an index page split action has a greater impact on throughput because it forces a log-write operation.

Seeing the high level of index page split activity suggested by the numerous latch class 6 wait events, we turned again to the accounting report to see the average number of insert operations executed by the performance-impacted application. Sure enough, we saw that this was an insert-intensive process - more by the average number of rows inserted, versus the number of INSERT statements executed (the average number of rows inserted, per the accounting report, was about 100 times larger than the average number of INSERT statements executed, indicating use of block-level inserts by the application).

The elevated count of latch class 6 suspend events (related to index page splits) and the insert-intensive nature of the process also dove-tailed with the observed "started out fine, then slowed down" behavior of the application: in all probability, when the process started there was a pretty good amount of free space in leaf pages of indexes on the table(s) into which rows were being inserted. After a while these "holes" in index leaf pages filled up, and that resulted in a large number of index page split actions to make space for new index entries, and THAT - partly due to the fact that this was a Db2 data sharing system - had a majorly negative impact on application performance (the keys of affected indexes were clearly not of the continuously-ascending variety, because index page split actions are not required for an index defined on a continuously-ascending key).

With the source of the performance problem identified, the next matter to consider was...


What to do about it?

The Db2 sysprog and I discussed two problem-mitigating actions - one an "absolutely do" and the other a "maybe do." The "absolutely do" step was to increase the amount of free space in indexes to accommodate new entries. That step, in turn, was comprised of two sub-steps, each of which is applicable to indexes defined on non-continuously-ascending keys. Sub-step one: increase the index's PCTFREE value. Whereas the default PCTFREE value for an index is 10, a value of 20 or 25 might make more sense for an index defined on a non-continuously-ascending key for an insert-heavy table. Sub-step two: increase the index's FREEPAGE value. The default FREEPAGE value is 0. Here's why boosting the FREEPAGE value - for example, to 5 (in which case there would be an empty index page after every 5 pages containing index entries) - can be helpful for an index defined on a non-continuously-ascending key for an insert-heavy table: as previously mentioned, when an index page is split a portion of that page's entries are moved to a previously empty page in the index. If FREEPAGE 0 (the default) is in effect, the only empty index pages will be at the very end of the index - potentially a long way from the page that was split. That situation creates a drag on performance through degradation of the index's organization (reflected in the LEAFDIST value for the index - or index partition, in the case of a partitioned index - in the SYSIBM.SYSINDEXPART catalog table). With a non-zero and relatively low value for FREEPAGE (meaning, an empty page following each relatively-low-number of populated pages), when an index split does occur then there should be a "nearby" empty page into which entries from the full page can be moved.

Both PCTFREE and FREEPAGE can be changed for an index via an ALTER INDEX statement, and both take effect when the index is subsequently reorganized (or loaded). The larger PCTFREE value will reduce index page split activity between REORGs, and the non-zero FREEPAGE value will reduce the impact of page splits if they do occur.

And what about the "maybe do" step? That would be an increase in the size of the index's pages, from the default of 4 KB to maybe the maximum of 32 KB. How can that help? Here's how: because a 32 KB index page (for example) can hold 8 times as many entries as a 4 KB page, going to 32 KB-sized pages for an index (via an ALTER INDEX statement that assigns the index to a 32K buffer pool) can potentially result in an 87.5% reduction (seven eighths) in page split activity for an index, other things being equal (e.g., same rate of inserts for the underlying table). Why is this a "maybe do" thing versus an "absolutely do" thing? Because if access to table rows through the index (e.g., for queries) is truly random in nature with respect to key values, 32 KB-sized pages could mean somewhat less-effective use of buffer pool resources versus 4 KB-sized pages. It's a matter, then, of what's more important in a particular situation: is it minimizing index page split activity, or maximizing the effectiveness of buffer pool resources for random patterns of row access by applications?


In conclusion...

I hope that this entry has provided you with some "news you can use." Utilize Db2 monitor accounting and statistics reports to determine the source of an application performance problem, then take appropriate remedial action.

One more thing: Db2 13 for z/OS made it a lot easier to verify that index page splits are an issue, through enhanced instrumentation (new IFCID 396, associated with statistics trace class 3, which is on by default, indicates when an index page split operation takes more than 1 second, which would be unusually long) and through the new REORGTOTALSPLITS, REORGSPLITTIME and REORGEXCSPLITS columns of the SYSIBM.SYSINDEXSPACESTATS real-time statistics table in the Db2 catalog.


Friday, October 28, 2022

Getting Ready to Migrate to Db2 13 for z/OS from Db2 12

It could be said that preparation for migration of a Db2 12 for z/OS system to Db2 13 comes down to one thing: activate Db2 12 function level 510. That's a pretty simple-looking migration plan, but there's more to it than meets the eye - as I'll explain in this blog entry.

First, let's consider function level 510 itself. Unique among Db2 12 function levels, 510 provides no new functionality in the traditional sense - there's nothing new there that a DBA or a developer would use, or that could make an application execute with greater efficiency. The purpose of function level 510 is simply this: to validate that a Db2 12 system is technically ready for migration to Db2 13 (here, "system" refers to a standalone Db2 subsystem or a Db2 data sharing group).

If activating function level 510 gets a Db2 12 system ready for migration to Db2 13, what makes a Db2 12 system ready for activation of function level 510? Three things:

  1. The system's code level has to be 510. That can be easily verified by issuing the Db2 command -DISPLAY GROUP: in the command output, look for the value 121510 in a column labeled DB2 LVL (if your Db2 system's maintenance level is relatively current, the code level is likely to be 510 already - the PTF that takes a Db2 system's code to the 510 level came out in April of 2021).
  2. The catalog level has to be V12R1M509 - again, -DISPLAY GROUP output tells the tale.
  3. There can't be any packages in the system, used within the past 18 months, that were last bound or rebound prior to Db2 11.
Let me expand on items 2 and 3 in that list.


Getting to the right catalog level

First, you might be wondering, "How is it that I need to get to catalog level V12R1M509 before I can activate function level V12R1M510? Wouldn't the catalog level have to be V12R1M510?" No. There is no catalog level V12R1M510. Why is that? Because function level 510 has no catalog dependencies (i.e., no changes to a catalog at the V12R1M509 level are necessary to support function level 510). This is not at all unprecedented. Several of the Db2 12 function levels had no catalog dependencies. For example, function level 504 can be activated when the catalog level is V12R1M503 - there is no V12R1M504 catalog level because function level 504 did not require any changes to a catalog at the 503 level.

Second, suppose your Db2 catalog level is, say, V12R1M500. Can you take that catalog right to the 509 level? YES. Assuming your code level is at least 121509, you can execute the CATMAINT utility with a specification of UPDATE LEVEL(V12R1M509). That one execution of CATMAINT will make the changes to the catalog associated with the 509 catalog level, and it will make all the changes associated with all the other catalog levels between 500 and 509.

About those old packages...

"What's with that?" you might wonder. "Why would the existence of one or more packages, that we've used in the past 18 months and that were last bound or rebound prior to Db2 11, keep me from being able to activate function level 510?" Short answer: it's for your own good. Longer answer: packages are executable code. The code generated for a package by a Db2 version prior to V11 cannot be executed in a Db2 13 system. If there is a request to execute a pre-Db2 11 package in a Db2 13 environment, that package will be auto-rebound so that Db2 13 can generate for that package code that can be executed in the Db2 13 system. "OK," you say, "I get that auto-rebinds of packages can be a little disruptive with respect to my application workload, but it's not THAT big of a deal - the auto-rebind gets done and we go trucking on." Not so fast, I'd say. What if, as a result of the auto-rebind of a package, there's an access path change that negatively impacts - perhaps majorly - the performance of the associated program? Your possible response: "Again, not a huge deal. We run with PLANMGMT=EXTENDED in ZPARM, and so we'd have the prior copy of that package available, and we can just execute a REBIND with SWITCH(PREVIOUS) to restore the previous better-performing copy of the package." WRONG! You CAN'T restore the previous copy of that package, because the previous copy was generated prior to Db2 11, and that means the previous copy can't be executed in a Db2 13 system. You're STUCK with that poor-performing package. Sure, you can take steps to try to correct the performance regression - maybe update catalog statistics or take an index action and then rebind and hope for a better performance outcome - but do you want to do that while some critically important production program is performing in an unacceptable way and you're phone is going off non-stop because upper management wants to know WHEN YOU'RE GOING TO GET THIS FIXED? Probably not; so, we're going to prevent that scenario from happening by not letting you go to Db2 13 if you still have any pre-Db2 11 packages that have been used within the past 18 months (the thinking of the Db2 for z/OS development team: if a package was last used more than 18 months ago, it's highly likely that it's a package that's just not used anymore in your environment - it's still in SYSPACKAGE simply because no one has FREE-ed the old package).

This "keep you out of trouble" action taken by the Db2 for z/OS development team is based on the painful experience some organizations had when they migrated to Db2 12 from Db2 11. In that situation, we made it clear in the documentation that pre-Db2 10 packages would need to be rebound prior to going to Db2 12 because pre-Db2 10 packages could not be executed in a Db2 12 environment. Well, some Db2 for z/OS people either didn't see that warning, or saw it and decided to ignore it and take their chances, and in a few cases the problem described in the preceding paragraph was encountered. At some sites, the problem's impact was severe enough to warrant falling back to Db2 11, at which point people would rebind the pre-Db2 10 packages (as had been strongly encouraged by us) and then re-migrate to Db2 12. Not wanting to see reoccurrences of those difficulties, with Db2 13 we're basically saying, "We are not going to let you get into the potentially ugly situation you could see if a package that cannot be executed in a Db2 13 system is requested for execution in that environment - you cannot go to Db2 13 if you have pre-Db2 11 packages that might still be in use at your site.

By the way, if you want to see if you have any packages that would prevent successful activation of function level V12R1M510, you can execute this query on a Db2 12 system (and note that this query is also provided in the Db2 12 documentation):

SELECT * FROM SYSIBM.SYSPACKAGE 
  WHERE LASTUSED >= DATE(DAYS(CURRENT DATE) - 548) 
  AND RELBOUND NOT IN ('P','Q',’R’)
  AND VALID <> ‘N’ 
  AND OPERATIVE <> ‘N’;

One more thing: as previously mentioned, it's highly likely that a package that has not been executed in a Db2 system within the past 18 months will not be executed at some future time in that Db2 environment. That said, maybe you're concerned that, for some reason, a package in your environment is executed every two years (24 months). The chances of that being true are almost certainly very small, but perhaps non-zero. If that's bugging you, disregard the "18 months" window and rebind ANY pre-Db2 11 package in your system prior to going to Db2 13.


If you're nervous about a "big jump" to Db2 12 function level 510...

Consider this hypothetical situation: you're running with Db2 12 function level 500 activated and you're contemplating going to function level 510 to prepare for migration to Db2 13. That's actually not so hypothetical - a good number of Db2 12 systems are running with an activated function level of V12R1M500. If that looks familiar to you, there might be a couple of thoughts running through your mind:
  • Function level 500 to 510 looks like a really big jump to me. How do I get that done with a minimized risk of complications? The key here is the APPLCOMPAT specification for your packages. Maybe you're concerned that making a big jump up in the activated function level for your Db2 12 systems could lead to programs being impacted by a "SQL incompatibility" (basically, that's a SQL behavioral change: same SQL, same data, different result - these incompatibilities are pretty few and far between, and they often affect either few or none of your programs, but they can indeed arise on occasion). If you're indeed worried about that, you can guard against that by leaving the APPLCOMPAT values for your packages where they are when you activate a higher function level of Db2 12 for z/OS. If you have a package bound with, for example, APPLCOMPAT(V12R1M500), and you activate function level 510, SQL issued through the package bound with APPLCOMPAT(V12R1M500) will still get the SQL behavior associated with function level 500. You can find lots more information about APPLCOMPAT in the part 1 and part 2 posts of the 2-part entry on APPLCOMPAT that I added to this blog in 2019.
  • If function level 510 is a good ways beyond where my Db2 12 system is at present, could I maybe go from where we are to some intermediate function level, and later to level 510? Of course you can. "If I decide to do that," you might be thinking, "what would a good intermediate function level be for us?" That's really up to you. My advice: go to the Db2 12 function levels "main page" in the Db2 12 online documentation, and check out the features and functions introduced with each function level between 510 and where you are now. If there's a function level that provides an enhancement that would be particularly helpful in your environment, go to that one, and later, at a time of your choosing, go from that level to level 510 (I'll tell you that a pretty popular "intermediate" Db2 12 function level is 505 - this because a lot of Db2 DBAs really like the "rebind phase-in" functionality that was introduced via function level 505).
OK, that about wraps it up for this blog entry. I hope that this information will be helpful for you as you plan for your site's migration from Db2 12 to Db2 13.

P.S. In addition to activating Db2 12 function level 510, don't forget to apply the "fallback SPE" to your Db2 12 systems prior to migrating to Db2 13 - that PTF allows you to fallback from Db2 13 to Db2 12, in the unlikely event that that should be deemed necessary for you. The APAR associated with this fallback SPE is PH37108.

Wednesday, September 28, 2022

Db2 for z/OS: GETPAGEs, Predicates, SELECTs and DELETEs

About 15 years ago, I posted to the blog I maintained while working as an independent consultant (prior to rejoining IBM in 2010) an entry in which I named GETPAGEs the most important factor with regard to the CPU consumption of a Db2 for z/OS application workload. GETPAGEs, which are requests by Db2 (typically on behalf of application or user processes) to access pages of table spaces or indexes, are indeed a prime driver of SQL statement execution cost. That being the case, people who work on tuning Db2 query performance often aim to reduce GETPAGE activity associated with executing the query. That work, in turn, tends to focus on a query's predicates (the clauses of the query that determine which rows will qualify for the query's result set). Can an index be added or modified to enable index-level (versus table-level) row filtering for a given predicate? Can a non-index-able predicate be rewritten to be index-able? Can a predicate be added to further refine the query's result set? And so on.

The thing is, the predicate focus of query tuning could lead one to believe that the same GETPAGE-reducing actions could yield similar results for any SQL statement containing predicates, regardless of whether the statement is a SELECT or, say, a DELETE. That is not the case, especially for DELETEs versus queries, and the difference basically boils down to one thing: indexes. The same indexes that reduce GETPAGE activity for a query can make GETPAGE counts stubbornly high for a DELETE statement, in spite of tuning actions related to the DELETE statement's predicates. A Db2 SQL programmer recently ran up against this reality. He asked me about it, and I think his situation could be instructive for others.

The SQL programmer (I'll refer to him as R - the first letter of his first name) was analyzing the performance of a row-deleting process that would remove a few hundred thousand rows from a table in a given execution. Thinking that reducing GETPAGE activity would lower the CPU cost of the process, and approaching the tuning effort for the process's DELETE statement as one would approach tuning the performance of a query (thinking about an analogous query that would have the same predicates as the DELETE statement of interest), R had a DBA create on the target table a new index with a key comprised of the five columns referenced in a series of ANDed "equals" predicates in the DELETE statement (in other words, the DELETE had the series of predicates WHERE C1 = ? AND C2 =? AND C3 = ? AND C4 = ? AND C5 = ?, and the new index had the key C1 | C2 | C3 | C4 | C5). That would make for a MATCHCOLS value of 5 for the DELETE statement, right (referring to a column in the Db2 access path-explaining PLAN_TABLE)? And, that should reduce GETPAGE activity for the DELETE by enabling a greater degree of row filtering at the index level, right? Well, not exactly, as things turned out: R was surprised to see that the new index had very little impact on the GETPAGE count for the DELETE statement. What was going on here?

The index-impact result that surprised R comes down to essentially one thing, that being the fundamentally different nature of DELETE versus SELECT statements. Yes, both can have predicates, but that doesn't mean that a DELETE with predicates can be thought of as a "query" - DELETEs and SELECTs are like apples and oranges when it comes to their execution characteristics. A SELECT returns values (specifically, rows of values), and that's it. A certain index might sharply reduce GETPAGE activity for a query by reducing the need for Db2 to examine table space pages in order to find qualifying rows (in fact, an index could virtually eliminate table space GETPAGEs for a query, if all columns referenced in the query - in the query's select-list and in its predicates - are part of the index's key). A DELETE, on the other hand, changes pages, many of which - and this is of critical importance - will be pages in index spaces (excepting the unusual, but not unheard-of, situation in which a table has no indexes).

Given this aspect of DELETE processing, not only will adding an index to a table potentially have little impact on GETPAGE activity for a DELETE statement - it might even increase GETPAGE activity for the DELETE. Think about it: for every row removed from a table by a DELETE statement, an entry has to be removed from each and every index defined on the table (yes, it's actually a "pseudo-delete" of index entries, with the relevant entries just marked for later physical deletion, but this still involves index GETPAGEs). Not only that, but Db2 very often can't just march through an index's leaf pages, deleting index entries as corresponding rows are deleted from the underlying table - not when a given index's keys have a way-different sequence relative to the sequence in which table rows are being deleted. Maybe, because of matches on predicate columns, a DELETE statement is guided to rows qualifying for deletion by index X, but index Y, on the same table, may have a key whose ordering is very different from that of index X's (i.e., the two indexes' key values correlate little, if at all, on a row-by-row basis). In that case, finding the entry in index Y to delete as part of deleting a table row could well require an index probe operation (i.e., a top-down traversal of index Y, from root page to leaf page). If that kind of thing is happening for several indexes on the table, the number of GETPAGEs for a DELETE statement could be several times larger than the number of rows deleted; and, that's not because the DELETE has a "bad" access path - it's because the statement is a DELETE and not a SELECT.

Bottom line: comparing GETPAGE counts between SELECT and DELETE statements is not useful or meaningful, even if the statements have identical predicates. If you have a row-delete process that is consuming more CPU than you'd like, what can you do about it? Here are a couple of possibilities:

  • Reduce - don't increase - the number of indexes on the target table. As I've mentioned, it's a general truism that while indexes can boost query performance, they tend to increase CPU and elapsed time for DELETE statements. That being the case, a good way to boost DELETE performance is often to find and remove indexes on the target table that are not doing any good. I call this "index trimming," and I provided some information on that topic in an entry I posted to this blog some years ago (I'd ignore the part of that blog entry that deals with hash-organized tables - that Db2 feature was deprecated with function level 504 of Db2 12 for z/OS).
  • Consider using the DISCARD option for online REORG. Particularly when the row-delete criterion (or criteria) is not particularly complex, and can be expressed in the form of a predicate or predicates referencing the target table, executing an online REORG of the table's table space with the DISCARD option can be an attractive way to efficiently remove a large number of rows from the table with minimal disruption of application access to the data (there will always be at least a brief - sometimes just a few seconds - period of no data access, when a REORG job enters the SWITCH phase near end-of-execution).
I hope that this information will be useful for you. Don't confuse DELETE-statement tuning with SELECT-statement tuning. Two different animals, as we say.

Tuesday, August 23, 2022

What Db2 for z/OS People Should Know About Data Fabric

"Data fabric" is an increasingly hot topic in IT circles, and with good reason - an effectively implemented data fabric can deliver significant dividends by enabling an organization to get more value from its data assets. Db2 for z/OS people should have some familiarity with the data fabric concept and associated technology, not only as preparation for participating in data fabric-related discussions but also because data fabric is of major strategic importance for Db2 for z/OS (and for other z/OS-based data sources). In this blog entry I'll provide information on data fabric that I hope will be helpful to Db2 for z/OS people.


What is "data fabric," anyway?

Essentially, data fabric is an architecture that brings uniformity and consistency to data originating in a disparate collection of sources - sources which could be (likely would be) housed in a mix of on-premise and in-cloud systems (and, especially for larger enterprises, "in cloud" would involve several different public cloud providers and perhaps some private cloud environments). That uniformity and consistency is manifest in multiple aspects of data interaction via the data fabric, including data access, discovery, utilization, cataloging, protection and governance; further, a data fabric is likely to have a "smart" dimension, with AI and machine learning technology leveraged to provide intelligent automation of data management tasks.

I mentioned that the data fabric payoff is increased value gained from an organization's data assets. How does data fabric deliver that payoff? Basically, by eliminating friction that would otherwise impede data access, discovery, utilization and integration - and doing that without compromising data security. The promise of a data fabric can be largely summed up in this way: it provides an environment in which the right data (i.e., data that is current, trusted, understood and complete) is available to the right people (people who know the data, people who know what data they need, people who know what they want to do with data) at the right time (i.e., when the data is needed).

In thinking about the value of the consistency and uniformity that a data fabric brings to what would otherwise be a disjointed data landscape, it can be helpful to consider a cake-baking analogy. Suppose you are tasked with baking a cake, and suppose further that the ingredients must be ordered from different countries, and you have to communicate with suppliers using the primary language of each source country and you have to remunerate the suppliers using source-specific modes of payment. Here's how that might go (and in your mind, substitute any countries you want for the ones I mention - I'm not picking on anyone):

  • The eggs for the cake are to come from Japan, but there is a delay in procurement because you don't speak Japanese.
  • The butter is to come from Australia, but the supplier will only send the butter after having received payment in coins that were sent via sailboat.
  • The flour will come from a supplier in Germany. Your German is a little rusty, but pretty good so there's not much of a delay there.
  • The sugar is to be sourced from Brazil, but your lack of familiarity with the ingredient-ordering user interface results in your being unable to locate a supplier.
  • This all leads to your getting a late start in baking the cake, and on top of that the eggs went bad while you were waiting for the butter, and you never got the sugar. The people who were looking forward to consuming your confection had to wait a frustratingly long time to get a very un-tasty cake. Not good.
Now imagine a different scenario, in which a cake-ingredient-ordering front end abstracts the particulars of the ingredient suppliers (such as native language) and provides uniformity for payment and shipment. Using that front end, you get the ingredients you need - and all the ingredients you need - in a timely manner, and your cake consumers are delighted with the product of your kitchen, which satisfied their sweet-tooth needs and arrived at the right time.

So it is with a data fabric: different data elements from different data sources are the “ingredients” that provide a complete (sometimes called a “360”) view of a subject of interest - be that customers, processes, supply chains, products, whatever. And here's the upshot: when the right (and all the right) data ingredients get to the right people at the right time, the result is better: better decisions, better and more timely applications, better outcomes.

There is technology that can make the promise of data fabric a reality, but before getting into that I want to emphasize that data fabric is NOT just a matter of leveraging technology. I'd go so far as to say...


Data fabric is culture

There were people who said the same thing a few years ago about DevOps, and for the same reason: full and effective implementation of a data fabric can require new organizational roles and new ways of thinking about and managing data. To appreciate this assertion, consider the "personas" (i.e., the people-roles) associated with individuals who would work with, and in relation to, a data fabric. That exercise is facilitated if you think of a data fabric as something that enables a “data store,” in which people “shop for data.” For a traditional retail store, relevant personas include the following:

  • Consumers acquire products from the store.
  • Suppliers provide products for the store.
  • A store manager decides which products should go on which shelves.
  • A sales associate puts the right products on the right shelves.
OK, so what are the personas that have a relationship with the "data store" enabled by a data fabric? Some are listed below.

  • data consumer might be a developer working on a new application, or a business analyst researching the viability of a proposed new product.
  • database administrator oversees a data source that supplies the data store.
  • data curator might make decisions on what data will be available through the data store, and to whom.
  • data steward might “stock the shelves” of the data store, based on decisions made by a data curator.
Look again at those last two personas in the list above - data curator and data steward. I can tell you for a fact that those roles exist today in multiple organizations - are they present in your workplace? And note: a data fabric's impact goes beyond new organizational roles - it involves new ways of thinking about data management. Here's what I mean: historically, data was often thought of in relation to where it was stored. That manner of thinking led to “silo” situations, and the difficulty of working with data in a “cross-silo” way interfered with organizations’ extracting maximum value from their data assets. By contrast, a data fabric will deliver the greatest benefit when it supports a data management approach that focuses more on data itself, and less on where data is stored. One implication of a data-centric (versus a data-source-centric) approach to data management is that data access decisions (i.e., who can access what data, and in what form) are made by data professionals (e.g., data curators), as opposed to being made by database professionals (e.g., DBAs). In such an environment, data source administrators are implementers of data access decisions made by data curators.

If a data fabric puts data administration (versus database administration) responsibility on data professionals (e.g., data curators), does that diminish the role of a Db2 for z/OS DBA? I would say it does not. I see this is being part of an ongoing evolution of the Db2 for z/OS DBA role to be more engaged in application development (for distributed systems DBAs, this role shift became widespread some years ago). This is a good thing. I am convinced (and more importantly, so are a lot of IT leaders at Db2 for z/OS sites) that the value a mainframe Db2 DBA delivers to an organization goes up when that DBA's work has more of an application-enabling focus.

Let me shift now from organizational impact to enabling technology.


IBM's foundational data fabric-enabling technology

Multiple IBM offerings have a connection with data fabric, but the most foundationally important of these is called Cloud Pak for Data. Cloud Pak for Data's importance has a lot to do with IBM's point of view regarding data fabric implementation. We believe that a data fabric is most effectively implemented as an abstraction layer extended over an existing data landscape. Such an implementation approach acknowledges the significance of “data gravity” - the idea that data usage actions should flow to the data, rather than vice versa. A data fabric enabled via Cloud Pak for Data is characterized by "in-place” access to data on systems of origin. This approach delivers multiple benefits, including:
  • Minimization of data replication costs.
  • Protection of data security and consistency.
  • Optimized performance.
Cloud Pak for Data itself can be thought of as a set of software-powered services that relate to access, governance and usage of data. Cloud Pak for Data can be deployed anywhere Red Hat OpenShift (a Kubernetes container platform) can be deployed: on-premise, in a private cloud or in a variety of public cloud environments (it is also available in a fully managed, as-a-service form). Cloud Pak for Data can be used with a wide range of data sources on Linux, UNIX, Windows and z/OS systems, and those data sources can be on-premise and/or in-cloud.

How would Cloud Pak for Data be used by people in an organization? Here's one scenario: let's say that Val leads a development team that will soon begin work on a new application. To support this work, Val’s team will need access to some data (which happens to be in a Db2 for z/OS database) and associated metadata (data about the data). Val sends a request to this effect to Steve, a data curator. Steve is very familiar with the data that the new application will process. He logs in to Cloud Pak for Data's user interface and creates a project that will provide Val’s team with the data and metadata they need. Db2 for z/OS is one of many data sources supported by Cloud Pak for Data, and Steve creates a connection to the relevant Db2 system. Steve selects the particular tables holding the data that the new application will process and associates them with the project he created for Val's team. Steve also imports metadata for the selected tables, and enriches that metadata with statistical values, data quality scores and business terms. Finally, Steve creates a masking rule for sensitive data in a column of one of the selected Db2 tables - Val's team will be able to reference the column in their program code, but they will only see masked values when they view the column's contents. With the project created and the associated data assets published to a catalog to which Val and her teammates have access, the developers will be able to easily view the data and the related metadata, and this will enable them to move ahead quickly and productively with coding and testing.

The point I really want to make here is not so much, "Look what the data curator can do for the application development team." Even more important to me is the fact that, had Val's team needed access to data (and with it, associated metadata) in a Db2 for Linux/UNIX/Windows database, or a SQL Server database, or an Oracle database, or Apache Cassandra, or Amazon S3, or MariaDB or one of the myriad other data sources supported by Cloud Pak for Data, the actions of the data curator would have been largely the same. And, that would be the case for all kinds of other Cloud Pak for Data usage scenarios - a data scientist needing to develop and train a predictive model, a business person wanting to create a report with accompanying data visualizations, a data curator implementing new rules and policies concerning access to certain data assets, a data administrator virtualizing non-relational data to make it more easily accessible and consumable, whatever. That, as much as anything, is the "secret sauce" of a Cloud Pak for Data-enabled data fabric: it makes all kinds of data sources more easily accessible and effectively consumable by all kinds of people, without sacrificing data governance and security. And when more of an organization’s data assets are used more easily and effectively by more people, the organization works better.


Data fabric is strategically really important for z/OS as a data-serving platform

The uniformity brought to a data landscape by a data fabric is of outsized importance in the context of z/OS as a data-serving platform. How so? Think about it. What gets in the way of z/OS-based data being more effectively - and more widely - used by people in an organization? Often, it's the perceived “other-ness” of the mainframe – the sense non-mainframe people have that z/OS-based data is inherently harder to access, understand and use than data on other platforms. Truth be told, that perception has, historically, been at least partly fact-based – it has been harder for many people to access and use z/OS-based data versus off-mainframe data. The great value, then, of an effectively implemented data fabric, from a z/OS perspective, is not so much that it makes z/OS-based data easier to access and use versus off-mainframe data; rather, it’s the fact that the data fabric makes z/OS-based data as easy to access and use as off-mainframe data. Why that's so powerful: while mainframe systems have been recognized for decades as being unmatched in terms of reliability, security, scalability, efficiency and performance, there have been plenty of people who would say, "Yeah, but mainframe-based data is hard to access and use." An effective data fabric eliminates that "yeah, but..."

Let that sink in: by making discovery, understanding, consumption and usage of data in z/OS systems as easy it is for data on other platforms, a data fabric makes IBM zSystems an even higher-value platform for an organization's most valuable data assets.

If your organization has not yet looked at implementing an enterprise data fabric, now could be a good time to start down that path. And, the "in-place access to data on systems of origin" that characterizes a data fabric implemented with IBM's Cloud Pak for Data could well be the approach that will deliver maximum benefits in your environment. Give it some thought, and get engaged.

Tuesday, July 26, 2022

What Should a Modern Db2 for z/OS Client-Server Application Environment Look Like?

The distributed data facility (aka DDF) is the component of Db2 for z/OS that enables data access by applications that connect to Db2 via TCP/IP communication links. DDF has been around for over 30 years, but especially during the past 10 years or so DDF workloads have become very large at many Db2 for z/OS sites, with individual Db2 subsystems processing DDF transactions at sustained rates of over 4000 per second (and many more than that for Db2 data sharing groups running on Parallel Sysplex clusters of mainframe servers). For an ever-larger number of Db2-using organizations, the DDF workload is the largest - and the fastest-growing - component of the overall Db2 workload. Given the importance of DDF in the Db2 workload mix, it's worthwhile to consider what a modern Db2 client-server application environment should look like. In looking over the Db2 DDF scene in recent years I've seen a lot of things that I like. In this blog entry I'll share Db2 DDF application environment characteristics that get a thumbs-up from me.


Straight from the app servers to Db2 (for DRDA requesters)

For a number of years, organizations have been migrating away from the use of Db2 Connect "gateway servers" (application server to Db2 Connect gateway server, Db2 Connect gateway server to Db2 for z/OS system) in favor of direct connections from application servers to Db2 for z/OS systems by way of the IBM Data Server Driver (at many sites this transition is already complete). When access to Db2 for z/OS from DRDA requester applications is accomplished through the IBM Data Server Driver, "Db2 Connect" becomes, essentially, just a product licensing term, versus an actual product used - this because entitlement to use the IBM Data Server Driver is provided through an organization's Db2 Connect license (so, if an organization is licensed for Db2 Connect Unlimited Edition for System z, that organization can deploy the IBM Data Server Driver in an unlimited way for applications that access the mainframe system(s) associated with the Db2 Connect license).

There are several advantages to going with the direct connection to Db2 for z/OS versus going through a Db2 Connect gateway server. One is performance: with the "hop" to a Db2 Connect gateway server eliminated, better response times and throughput can be achieved. Another direct-connection benefit is improved problem diagnosis capabilities - error messages have more-specific meaning when the network-connected server that is "adjacent" to Db2 for z/OS is an application server, versus a Db2 Connect gateway server. The direct connection approach also tends to make Db2 client configuration and upgrade work more straightforward.

Note my mention that this Db2 Connect gateway versus direct Db2 for z/OS connection matter is relevant for DRDA requester applications. It is not pertinent to clients that utilize the REST interface to Db2 for z/OS, as such interactions do not involve the DRDA protocol. See below for more information about Db2 REST clients.


Greater use of the Db2 for z/OS REST interface (which is part of DDF functionality)

By "greater use" I don't mean to suggest that the REST interface to Db2 for z/OS is somehow "better" than the DRDA interface (which I like to call the SQL interface to Db2, as a DRDA requester application issues Db2-targeting SQL statements). The REST interface is different versus the SQL interface, and sometimes that difference makes it a good choice for a Db2-accessing client-server application. I wrote a blog entry a few years ago with a lot of compare-and-contrast information about the REST and SQL interfaces to Db2, and I won't repeat all that here. To Db2 for z/OS DBAs, I'd say this: 1) make sure your application developers know that Db2 has a built-in REST interface, and 2) be ready to help support use of the REST interface when that is the choice of a development team. Sometimes, developers - even those who have strong SQL skills - have a preference for the REST architectural style, often because it so fully abstracts the particulars of service-providing systems.

If you do make use of Db2's REST interface, and think you might expand on that in the future, consider what IBM z/OS Connect could do for your organization. When Db2 for z/OS is accessed through z/OS Connect, it's still Db2's REST interface that's being used (Db2 in that case is a REST provider to z/OS Connect), but z/OS Connect provides some important benefits: it makes creation of z/OS-based REST services easier, it provides richer "service discovery" information to client application developers, it adds flexibility to the formatting of service-output JSON documents, and it provides a single access point through which all manner of z/OS-based programmatic assets can be invoked through REST requests - not only Db2 SQL statements and stored procedures, but also CICS and IMS transactions (which might or might not involve access to Db2) and JES batch jobs.


Leveraging SQL procedure language (SQL PL)

SQL PL is for Db2 (for z/OS and for Linux/UNIX/Windows) what T-SQL is for SQL Server and what PLSQL is for Oracle - a way to write data-processing programs using only SQL statements. SQL PL makes that do-able via a set of SQL statements called control statements - "control" being short for "logic flow control." Among these statements are ITERATE, WHILE, GOTO, IF and LOOP - you get the idea. There are all kinds of reasons for using SQL PL, one being related to cost-of-computing: when a SQL PL routine (such as a native SQL procedure) is invoked through Db2's distributed data facility - whether through a SQL CALL issued by a DRDA requester or via a REST request - its execution is up to 60% offload-able to zIIP engines (mainframe processors that cost less than general-purpose processors and do not factor into the determination of z/OS software charges).

Besides the economic advantage of SQL PL for DDF-using applications, there are functional advantages. For example, a native SQL procedure (a stored procedure written in SQL PL) - and only a native SQL procedure - can be created (or altered) with the AUTONOMOUS option, which means that if the calling transaction fails and is rolled back by Db2, the data-changing actions (e.g., INSERT/UPDATE/DELETE) performed by the autonomous procedure will not be rolled back (this can make autonomous procedures very useful for "transaction initiation audit trail" purposes - you can use an autonomous procedure to record the fact that a transaction got started, and that information will be preserved even if the transaction ends up failing). SQL PL routines can also accept Db2-defined arrays as input, whereas external Db2 routines (written in languages such as COBOL) cannot.

Something else to consider: if you're using SQL PL only for stored procedure programs, you're missing out. SQL PL can also be used to write user-defined functions, and a SQL PL routine can be included in the body of an advanced trigger (advanced triggers were introduced with Db2 12 for z/OS).

And, you should take note of how the CREATE OR REPLACE PROCEDURE syntax introduced with Db2 12 function level 507 can enable greater agility when it comes to deploying Db2 stored procedure programs, especially those written in SQL PL.

And, you should be managing SQL PL source code (that would be, in the case of native SQL procedures, the associated CREATE PROCEDURE statements) with a source code management (SCM) tool - the SYSROUTINES table in the Db2 catalog is not intended to be a SQL PL SCM. Which SCM? Whichever one(s) your organization's developers use to manage their source code - that could be a vendor-supplied SCM or an open-source tool such as Git.


Using the right (no-charge) Db2 SQL development tool

If you (or others in your organization) are using IBM Data Studio for Db2 for z/OS SQL testing and for SQL PL routine development and debugging, it's time for a change. IBM's strategic replacement for Data Studio is the (also no-charge) Db2 for z/OS Developer Extension for Visual Studio Code (also available for the Eclipse Theia IDE).


Leveraging the Db2 profile tables

The Db2 for z/OS profile tables - SYSIBM.DSN_PROFILE_TABLE and SYSIBM.DSN_PROFILE_ATTRIBUTES - can be very helpful for a DRDA requester application workload. For one thing, they can be used to specify application-specific limits on concurrent DBAT (DDF thread) usage and/or connections established with a Db2 system and/or idle thread time - handy when the system-wide DBAT, connection and idle thread limits established via the ZPARM parameters MAXDBAT, CONDBAT and IDTHTOIN are not as granular as you need them to be. The Db2 profile tables can also be used to set the value of a number of Db2 special registers and/or built-in global variables, automatically when an application connects to the Db2 system. One example of this kind of profile table usage is setting the value of the CURRENT PACKAGE PATH special register to point a DRDA requester application to a collection in which the IBM Data Server Driver packages are bound with RELEASE(DEALLOCATE), as a means of getting high-performance DBAT functionality for the application.


Db2 for z/OS DBAs being heavily involved in application development

In way too many cases, DRDA requester or Db2 REST interface-using applications are developed with little Db2 DBA involvement, until late in the game when a production deadline is looming and physical implementation of tables is done in a rushed and sub-optimal way. Logical database design may also have happened with little DBA input, with negative consequences down the road. This situation is typically not a result of application developers giving Db2 DBAs the cold shoulder. Rather, my observation has been that some Db2 for z/OS DBAs view developers as a nuisance or an irritant - as "them." Wrong mindset. Way wrong. Db2 for z/OS DBAs maximize the value they deliver to an organization when they team with developers at the very early stages of an application development project. Not only can that help to ensure a logical and physical database design that will deliver optimal benefits for the application (and for application users), it also provides an opportunity for DBAs to ensure that developers are aware of Db2 features - temporal data support, transparent archiving, the REST interface, autonomous native SQL procedures, non-traditional data types (e.g., XML), global variables, newer built-in functions (e.g., LISTAGG), advanced triggers, whatever - that could enable and accelerate development of functionality of importance for an application. My advice is for Db2 for z/OS DBAs to think of themselves as part of the extended development team for Db2-accessing applications. That approach can be especially effective for modern Db2 client-server applications.

I hope that the information in this blog entry will be useful for you. As always, thanks for stopping by.