Friday, May 31, 2019

Db2 for z/OS: IIPHONORPRIORITY and zIIP offload

There is something I have encountered a couple of times (most recently, just a month ago) that can have a negative performance impact on a Db2 for z/OS system, and it's something of which quite a few Db2 people are, I think, unaware. I'm referring here to the setting of a z/OS parameter called IIPHONORPRIORITY, and I want to use this blog entry to provide information on how the value of that parameter can affect the processing of a Db2 workload.

IIPHONORPRIORITY is a parameter in the IEAOPTxx member of the z/OS data set called SYS1.PARMLIB. The value of the parameter can be set to NO or YES. Essentially, those two settings have the following meanings:

  • YES - Honor the priority of zIIP-eligible tasks (per the z/OS LPAR's workload manager policy), so that in the event that such a task is ready for dispatch and the LPAR's zIIP engines are busy, allow the task to be dispatched to a general-purpose engine so that wait-for-dispatch time for the task will not be too high.
  • NO - Regardless of the priority of zIIP-eligible tasks, limit the processing of such tasks to zIIP engines. If a zIIP-eligible task is ready for dispatch and the LPAR's zIIP engines are busy, require that the task wait until a zIIP engine becomes available.
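
In IEAOPTxx terms, the specification is a single line. Here is a minimal sketch, assuming a member named SYS1.PARMLIB(IEAOPT01) - the member suffix at your site will likely differ:

IIPHONORPRIORITY=YES

A changed IEAOPTxx member can be put into effect without an IPL via the z/OS operator command SET OPT=01 (substitute your member's suffix), and on relatively current z/OS releases the OPT parameter values in effect can be checked with the D OPT command.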

Now, at first consideration a setting of IIPHONORPRIORITY=NO might appear to be an attractive alternative. After all, the raison d'être for zIIP engines is reduced cost of computing on a z/OS platform. Why not maximize that cost benefit by forcing all zIIP-eligible work to be processed by zIIP engines? In fact, however, if Db2 for z/OS is in the picture then specifying IIPHONORPRIORITY=NO can lead to significant performance degradation for an application workload (especially a DDF workload) and can even reduce offloading of work to zIIP engines. I have seen these effects with my own eyes, and I will describe them in more detail below.

Negative impact on application performance. Just about a month ago, as part of a review of an organization's production Db2 for z/OS environment, I looked through a Db2 monitor-generated accounting long report for the subsystem of interest. Data in the report, showing activity for a busy hour of the day, was ordered by connection type (meaning that the data was aggregated by connection type used by Db2-accessing programs: DDF, CICS, call attachment facility, etc.). In the DDF section of the report (connection type: DRDA), I saw that not-accounted-for time was 65% of in-Db2 elapsed time for DDF-using applications. In-Db2 not-accounted-for time is time associated with SQL statement execution that is not CPU time and not "known" wait time (the latter term referring to Db2 accounting trace class 3 wait times such as wait for synchronous read, wait for lock, wait for latch, etc.). In my experience, in-Db2 not-accounted-for time is primarily reflective of wait-for-dispatch time, and for a transactional workload (like the typical DDF workload) it should be a relatively small percentage of average in-Db2 elapsed time - under 10% is good, between 10 and 20% is marginally acceptable. 65% is an extremely elevated level of not-accounted-for time as a percentage of in-Db2 elapsed time.
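
To make the arithmetic concrete (the figures below are made up, chosen to mirror the 65% situation), in-Db2 not-accounted-for time is derived from accounting report fields like this:

avg not-accounted-for time = (avg class 2 elapsed time)
                           - (avg class 2 CPU time, general-purpose + zIIP)
                           - (avg total class 3 suspend time)

For example: 0.020 - 0.004 - 0.003 = 0.013 seconds, and 0.013 / 0.020 = 65% of average in-Db2 elapsed time.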

I was initially perplexed by the very high level of in-Db2 not-accounted-for time seen for the DDF workload. The level of utilization of the system's general-purpose and zIIP engines, as seen in a z/OS monitor-generated CPU activity report, was not high enough to make that a prime culprit. Later, seeing unusual numbers in a Db2 monitor-generated statistics long report (more on that to come) prompted me to ask the question, "Do you folks by chance have IIPHONORPRIORITY set to NO?" It turned out that that was indeed the case, and was very likely the root cause of the very high percentage of in-Db2 not-accounted-for time observed for DDF-using applications: zIIP-eligible tasks servicing requests from Db2 client programs were spending a considerable amount of time queued up waiting for a zIIP engine to become available, with redirection to a general-purpose engine having been removed as an option via the IIPHONORPRIORITY=NO specification. A much better approach would be to have IIPHONORPRIORITY set to YES, and to have zIIP engine capacity sufficient to keep "spill-over" of zIIP-eligible work to general-purpose engines at an acceptably low level (information on monitoring spill-over of zIIP-eligible work to general-purpose engines can be found in a blog entry I posted on that topic); and, keep in mind that running zIIPs in SMT2 mode can help to minimize the zIIP spill-over rate.

Allowing some zIIP-eligible work to be directed to general-purpose engines when an LPAR's zIIP engines are busy (which should not often be the case when the LPAR is configured with an adequate amount of zIIP capacity), made possible via IIPHONORPRIORITY=YES, is a safety valve that enables consistently good performance for a zIIP-heavy workload such as that associated with DDF-using applications.

Potential non-maximization of zIIP offload. You might wonder, "How could setting IIPHONORPRIORITY to NO possibly have a reductive effect on offloading of work to zIIP engines?" I'll tell you how: when IIPHONORPRIORITY=NO is in effect, "Db2 does not allow system tasks to become eligible for zIIP processing." I have that in quotes because the words are taken directly from a page in the Db2 for z/OS Knowledge Center on IBM's Web site (on that page, scroll down to the part under the heading, "IIPHONORPRIORITY parameter"). To understand what the quoted phrase means, consider that zIIP-eligible Db2 tasks can be broadly divided into two categories: user tasks and system tasks. zIIP-eligible user tasks (which can also be thought of as application tasks) include those under which SQL statements issued by DRDA requesters - and SQL statements issued by native SQL procedures called through DDF - execute. Those zIIP-eligible tasks must be processed by zIIP engines when IIPHONORPRIORITY is set to NO. zIIP-eligible system tasks include those under which operations such as prefetch reads and database writes execute, and those tasks cannot run on zIIP engines when IIPHONORPRIORITY is set to NO (because in that case Db2 makes those tasks non-zIIP-eligible).

So, think about it: if you have sufficient zIIP capacity to keep spill-over of zIIP-eligible work to general-purpose engines at a low level, which should be the case in a production Db2 for z/OS environment (and I consider "low level" to be less than 1%), and you have IIPHONORPRIORITY set to NO, you might actually be causing total zIIP offload of Db2-related work to be lower than it otherwise would be, because prefetch reads and database writes, which are 100% zIIP-eligible when IIPHONORPRIORITY is set to YES, are 0% zIIP-eligible when IIPHONORPRIORITY=NO is in effect. That's no small thing - in some Db2 systems, there is a great deal of prefetch read activity. I mentioned previously that a telltale sign that IIPHONORPRIORITY is set to NO in a z/OS system in which Db2 runs can be found in a Db2 monitor-generated statistics long report. In such a report, scroll down to the section in which CPU consumption attributed to the various Db2 address spaces is shown. In that report section, check the CPU times for the Db2 database services address space. If IIPHONORPRIORITY is set to YES, it is highly likely that a majority - often, a very large majority - of that address space's CPU time is in the zIIP column (the column in the report labeled PREEMPT IIP SRB, or something similar to that) versus the general-purpose engine column (labeled CP CPU TIME, or something similar). This is a reflection of the fact that prefetch reads and database writes very often constitute most of the work done by the Db2 database services address space, in terms of CPU resources consumed. If, on the other hand, IIPHONORPRIORITY is set to NO, it is quite likely that a majority - and maybe a large majority - of the Db2 database services address space's CPU time will be in the general-purpose engine column versus the zIIP column.
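
To illustrate (with made-up figures, and with the caveat that field labels vary somewhat from one monitor product to another), this is the general pattern you would hope to see for the database services address space in a statistics long report when IIPHONORPRIORITY is set to YES:

                                    CP CPU TIME    PREEMPT IIP SRB
DATABASE SERVICES ADDRESS SPACE         52.31           487.90

With IIPHONORPRIORITY=NO in effect, the relationship between those two columns would quite likely be reversed.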

Bottom line: if you want to actually maximize offloading of Db2-related work to zIIP engines, do two things: 1) ensure that IIPHONORPRIORITY is set to YES, and 2) have enough zIIP capacity to keep the zIIP spill-over rate (the percentage of zIIP-eligible work that ends up being executed on general-purpose engines) below 1%. Refer to the blog entry to which I provided a link, above, to see how to calculate the zIIP spill-over rate using numbers from a Db2 monitor-generated accounting long report.
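
If that blog entry is not close at hand, the essence of the calculation is a simple ratio:

zIIP spill-over rate = 100 x (zIIP-eligible CPU time consumed on general-purpose engines) /
                             (zIIP-eligible CPU time on general-purpose engines + CPU time on zIIP engines)

In an OMEGAMON accounting long report, the spill-over figure in the numerator is, as I recall, the field labeled SECP CPU, and the zIIP time in the denominator is the field labeled SE CPU TIME - other monitor products label these fields differently, so check your monitor's documentation.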

Is there any situation in which going with IIPHONORPRIORITY=NO might be reasonable? I'd say, "Maybe," if you're talking about a test or development environment in which optimal DDF application performance is not very important. If the z/OS system in question is one in which a production Db2 subsystem runs, I'd be hard pressed to come up with a justification for IIPHONORPRIORITY=NO. For a production Db2 system, you want IIPHONORPRIORITY=YES. You might want to check on this at your site.

Friday, April 26, 2019

Db2 12 for z/OS Statistics Profiles: Just What the Optimizer Ordered

Here is an interesting story for you: not long ago, an analytics-type query was executed on a Db2 for z/OS system that had recently been migrated to Db2 12. The query was cancelled after it had run for 23 hours. A DBA noticed that there was a row for a table targeted by the query in SYSIBM.SYSTABLES_PROFILES in the Db2 catalog. He ran RUNSTATS for the table, specifying USE PROFILE so that the utility would pick up the options specified in the aforementioned statistics profile. The DBA observed that the RUNSTATS job "ran a little longer than usual," an indication that the utility was doing something it hadn't previously done in generating statistics for the target table. The formerly very-long-running query was subsequently re-submitted, and it successfully ran to completion in 5 minutes. Do I have your attention? Good - read on.

First, the dramatic improvement in query response time described above is not about magic. It's about statistics. Db2 for z/OS has a very sophisticated query optimizer (IBM invented cost-based SQL statement optimization more than 35 years ago), but to do its job well that optimizer needs the right statistics related to tables accessed by queries. How important are those statistics? This important: the team at the IBM Support Center that works on problems involving query performance has said for years that over 75% of these problems are resolved not by rewriting the query, not by creating or modifying an index, but by providing the Db2 optimizer with the statistics it needs to generate a well-performing access plan.

Knowing that, you might think, "OK, so how do I know what statistics the optimizer needs in order to generate the access plan that will deliver the best-possible performance for a query?" And there, traditionally, has been the rub: it can be tricky to get this right, especially for a complex query. If you have the right skills and experience, you can pore over a query access plan as rendered by a visual EXPLAIN tool, and make a good determination of the statistics, currently absent from the catalog, that would appear to be useful to the optimizer for the case at hand. Alternatively, you could "go big" on statistics-gathering for your tables, and instruct RUNSTATS (or LOAD or REORG when inline statistics-gathering is requested) to generate every possible item of statistical data for your tables, but that would substantially increase the CPU cost of utility execution. There's also the option of using a tool that has a statistics recommendation feature (an example of a product in that space is IBM's Db2 Query Workload Tuner for z/OS). Or - and this is new with Db2 12 - you can let Db2 take care of this.

Let me expand on what Db2 12 has introduced with regard to getting catalog statistics right. Db2 11 for z/OS introduced a new catalog table, called SYSIBM.SYSSTATFEEDBACK, into which Db2 places information about statistics seen by the optimizer, in the course of generating query access plans, as being absent or inconsistent (regarding the latter, consider that one can insert statistical information into catalog tables on one's own, perhaps causing an inconsistency - a value in one column that should reflect the value in another column, but doesn't). That's fine, but sometimes that statistics feedback data would just sit there - you could look at it and use it (if you correctly interpreted what you saw), or a tool could read and act on it, but Db2 itself wouldn't do anything with SYSSTATFEEDBACK data. That changed with Db2 12: in a Db2 12 system (assuming that the new STATFDBK_PROFILE parameter in ZPARM is set to the default value of YES), when data is written to SYSIBM.SYSSTATFEEDBACK, data is also written to SYSIBM.SYSTABLES_PROFILES - either a statistics profile row will be added for a table, or an existing statistics profile row will be updated. That new or updated statistics profile row provides the RUNSTATS specifications that will gather the statistics that the Db2 query optimizer itself has indicated that it needs to do what it does to best possible effect.

AND, when I say, "RUNSTATS specifications," don't think that a Db2 12-generated statistics profile is usable only with RUNSTATS. In a Db2 12 environment you can also specify USE PROFILE when executing LOAD or REORG - something you couldn't do in a Db2 11 system.
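
By way of illustration, a profile-driven RUNSTATS control statement is about as simple as they come - a sketch, with hypothetical database, table space, and table names:

RUNSTATS TABLESPACE DB1.TS1
  TABLE(PRODSCHM.T1)
  USE PROFILE

If a profile row exists for table PRODSCHM.T1 in SYSIBM.SYSTABLES_PROFILES, the options recorded in that row determine the statistics that the utility gathers.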

What if you want to see what kind of statistics profiles Db2 12 is generating? That information is in the PROFILE_TEXT column of SYSIBM.SYSTABLES_PROFILES. The data type for the PROFILE_TEXT column is CLOB. You can view data in a CLOB column via a query issued through IBM Data Studio or IBM Data Server Manager (two no-charge, downloadable tools); alternatively, you could view PROFILE_TEXT data by issuing a query through SPUFI, if you cast the CLOB column as VARCHAR. In the example below (a query I've executed using SPUFI), I've gone with VARCHAR(500) - you can specify whatever length you need in your environment.

SELECT CAST(SCHEMA AS CHAR(30)) AS SCHEMA,    
  CAST(TBNAME AS CHAR(30)) AS TBNAME,         
  CAST(PROFILE_TEXT AS VARCHAR(500)) AS PROFILE_OPTIONS
  FROM SYSIBM.SYSTABLES_PROFILES
  WHERE...

Given the enhancement described in this blog entry, I think it would be a very good idea to utilize the USE PROFILE option of RUNSTATS (and of LOAD and REORG, as appropriate) in a Db2 12 system. What if you do that when executing RUNSTATS (or LOAD or REORG) for a table, and there is no row for the table in SYSIBM.SYSTABLES_PROFILES? Not to worry: in that case, COLUMN(ALL) INDEX(ALL) will be used for the RUNSTATS (or LOAD or REORG) job (if a RUNSTATS or LOAD or REORG job is executed without a table being named, and if no row(s) for the table(s) in the target table space exist in SYSIBM.SYSTABLES_PROFILES, the job will be executed using options TABLE(ALL) and INDEX(ALL)). What if you have already populated SYSIBM.SYSTABLES_PROFILES via the use of SET PROFILE (or UPDATE PROFILE) in RUNSTATS jobs? Again, not a problem. Those user-developed statistics profiles can still be used in a Db2 12 environment. Keep in mind that Db2 12 will create or update a statistics profile row in SYSTABLES_PROFILES only when the Db2 optimizer determines that statistics needed to properly optimize a query are missing and/or inconsistent. If a statistics profile you generated yourself gathers the statistics that the optimizer needs, Db2 will not change that profile in the catalog.
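
For reference, a user-built profile of the kind just mentioned is recorded by adding SET PROFILE to a RUNSTATS control statement that names the desired options - a sketch, with hypothetical names and an arbitrary set of options:

RUNSTATS TABLESPACE DB1.TS1
  TABLE(PRODSCHM.T1)
  COLGROUP(C1, C2) FREQVAL COUNT 10
  SET PROFILE

A subsequent RUNSTATS for the table with USE PROFILE will pick up those recorded options.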

For a long time, when confronted with a poorly-performing query, Db2 DBAs have slogged through query analysis and tuning, often a time-consuming task. If your Db2 for z/OS system is at Version 12 (and STATFDBK_PROFILE is set to YES in ZPARM), I'd recommend a different approach in the face of a query performance problem: first, see if a row (or rows) for the table (or tables) accessed by the query exist in SYSIBM.SYSTABLES_PROFILES. If there is such a row (or rows) in that catalog table, execute RUNSTATS for the table(s) with a USE PROFILE specification. Then re-submit the query. If performance is still not what it needs to be, engage in the usual query analysis process. If RUNSTATS with USE PROFILE does the trick for you, you've saved yourself a lot of time and effort.

Oh, and one more thing: when the query I referenced at the beginning of this blog entry was initially resubmitted following execution of RUNSTATS with USE PROFILE for the target table, its performance did not improve at all. Why? Because the previously-prepared form of the query was still in the Db2 subsystem's dynamic statement cache. The DBA I mentioned ran RUNSTATS with INVALIDATECACHE YES to invalidate that prepared form of the statement, and when the query was subsequently submitted it was reoptimized using the statistics generated through the Db2-generated profile, and that's when runtime went from 23-hours-then-cancel to 5 minutes.
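
In control statement terms, the step that finally delivered the 5-minute runtime might be sketched like this (object names again hypothetical):

RUNSTATS TABLESPACE DB1.TS1
  TABLE(PRODSCHM.T1)
  USE PROFILE
  INVALIDATECACHE YES

INVALIDATECACHE YES invalidates the cached, prepared form of dynamic SQL statements that reference the target objects, so that the next PREPARE benefits from the newly gathered statistics.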

Gather catalog statistics using profiles that Db2 generates based on optimizer feedback, and you, too, could realize some nice performance results.

Sunday, March 31, 2019

A Case Study: Implementing a Db2 for z/OS Implicit Trusted Connection

I've been interested in the role and trusted context features of Db2 for z/OS for a long time (I posted a two-part blog entry on the topic back in 2011). Theory is nice, but what about practice? Recently, I had an opportunity to assist a DBA in implementing implicit trusted connection functionality in his Db2 for z/OS environment (more on the implicit part in a moment). Today I'll share some of what the DBA and I learned through that endeavor.

First, the backstory: the DBA's organization wanted to tighten Db2 for z/OS data security by preventing unauthorized utilization of the authentication credentials used by an application to connect to a Db2 system. The kind of application we're talking about here is one that accesses a Db2 for z/OS server through a TCP/IP link via the IBM Data Server Driver (or its predecessor, Db2 Connect). In requesting a connection to a Db2 system, the application provides an authorization ID and an associated password (assuming authentication using an ID and a password, versus an ID and a certificate). Often, the ID and password in question are known by several of the application's developers. What is to keep someone from using the application's authorization credentials to connect to the Db2 system from, say, a laptop PC, thereby gaining access to data that should be accessible only through the application?

Enter roles and trusted contexts. These Db2 features work together to shut down the potential security exposure just described. How does this work? First of all, any Db2 privileges needed by the application (e.g., the SELECT privilege on table T1) are NOT granted to the application's Db2 authorization ID. Those privileges are instead granted to a Db2 role, which I'll call APP1ROLE. Next, a Db2 trusted context is created to specify the circumstances under which the Db2 privileges granted to the role can be used by an application. If the trusted context is named TRCON1, and if the application authenticates to the Db2 system using authorization ID APP1ID and the application runs on the two application servers at IP addresses 1.2.3.4 and 1.2.3.5, the Db2 DDL statement defining the trusted context might look something like this:

CREATE TRUSTED CONTEXT TRCON1
BASED UPON CONNECTION USING SYSTEM AUTHID APP1ID
DEFAULT ROLE APP1ROLE
ATTRIBUTES (ADDRESS '1.2.3.4', ADDRESS '1.2.3.5')
ENABLE;

Here's what the above DDL statement is essentially saying: "The Db2 privileges granted to role APP1ROLE can be used ONLY by the application that connects to Db2 using authorization ID APP1ID, and ONLY when that application connects to this Db2 system from either IP address 1.2.3.4 or IP address 1.2.3.5." Following creation of the role and trusted context, suppose that someone who knows the application's authentication credentials uses them to connect to the Db2 system from a laptop PC. Assuming a successful connection from the laptop PC, is that an "Uh-oh" situation? No. Why not? Because no Db2 privileges have been granted to the application's authorization ID (connecting to a Db2 for z/OS system doesn't get you anywhere if you can't do anything after connecting). But what about the privileges granted to role APP1ROLE, which we've associated with the application's authorization ID? Those privileges are not available to the person who used the application's credentials to connect to the Db2 system from a laptop PC because the laptop's IP address is not one of the two specified in the definition of the trusted context (those being the addresses of the two application servers on which the application runs). What has been accomplished through the creation of the role and trusted context is a vast reduction in what security people sometimes call the "threat area": instead of worrying about the application's credentials being misused from any IP address, we just need to secure the two IP addresses referenced in the trusted context definition (in other words, we need to ensure that an individual cannot make an unauthorized connection to the Db2 system from the two application servers on which the application of interest runs).
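
For completeness, the privileges side of the arrangement described above would be set up with DDL along these lines (using the hypothetical names from the example - note that the GRANT names the role, not authorization ID APP1ID, which is the crux of the whole approach):

CREATE ROLE APP1ROLE;

GRANT SELECT ON TABLE T1 TO ROLE APP1ROLE;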

Back to my DBA friend. His first attempt at successful use of a Db2 role and trusted context didn't work: the SELECT privilege on table T1 was granted to a role but not to authorization ID APP1ID (using the example ID previously referenced), and an attempt to execute SELECT COL1 FROM T1 by way of a trusted context, using ID APP1ID, failed with a Db2 "ID not authorized to perform this action" error code. We decided to check to make sure that the IP address used in defining the trusted context was the right one. To help with that verification, we 1) turned auto-commit off on the tool being used by the developer who was helping us with the test, and 2) granted SELECT ON T1 to APP1ID. Why did we do these two things? So that we could see the IP address of the developer's Db2 thread in Db2 -DISPLAY THREAD command output (granting SELECT ON T1 to APP1ID ensured that the thread information wouldn't disappear due to the aforementioned authorization error, and making sure that a commit wasn't issued after the successful SELECT similarly preserved the thread's information, which would have disappeared when the thread went into a disconnected state following a commit).

After we'd taken those two actions to keep the developer's thread in a connected state following the SELECT FROM T1, we had the developer try again. Sure enough, we saw information about the developer's still-connected Db2 thread in -DISPLAY THREAD(*) DETAIL output, but the client IP address information was missing. That information was absent because -DISPLAY THREAD(*) DETAIL was an overly-generic form of the command -DISPLAY THREAD. When we switched to the form -DISPLAY THREAD(*) LOCATION(*) DETAIL, which tells Db2 that you only want to see information about DDF threads, we got additional DDF-related information in the output, including the client IP address associated with the developer's transaction (that address is in the V445 section of the command's output). The client address we saw was the one used in the definition of the trusted context, so why didn't the trusted context work for us? Because in this case the developer's connection to the Db2 system was through a Db2 Connect "gateway" server (versus a direct connection from the client via the IBM Data Server Driver), and in that case the client IP address that Db2 "sees" is the address of the Db2 Connect gateway server, not the address of the client that is "upstream" from the Db2 Connect gateway server.

OK, so we addressed that problem by re-defining the trusted context, this time using the IP address of the Db2 Connect gateway server. Well, not exactly. We actually re-defined the trusted context by specifying the domain name of the Db2 Connect gateway server, instead of its IP address. That can certainly be done in a CREATE (or ALTER) TRUSTED CONTEXT statement, and such a specification can be useful when (as was the case for my DBA friend's organization) the IP address of the Db2 Connect gateway server used by an application can change from time to time (if the application server references the Db2 Connect gateway server's domain name, a change in the gateway server's IP address does not require a client-side change - the domain name is simply mapped to the new gateway server's IP address versus the address of the old gateway server). We tried again with the redefined trusted context AND IT STILL DIDN'T WORK. ARRGGGHHH. What was going on? Further investigation revealed the problem: the domain name we were given by a network administration colleague of the DBA was actually associated with a pair of IP addresses (corresponding to a pair of Db2 Connect gateway servers), and for a domain name specification in the ATTRIBUTES part of a CREATE (or ALTER) TRUSTED CONTEXT statement to work properly, it has to be associated with a single IP address.

Once again, the DBA re-defined the trusted context, this time using two domain names as address attributes, with each domain name mapping to the single IP address of one of the two Db2 Connect gateway servers in the cluster. He revoked SELECT ON T1 from the APP1ID authorization ID, and had the developer try again with a SELECT from T1 using APP1ID, AND IT WORKED!! The successful execution of the SELECT from T1 showed that the privilege granted to the role APP1ROLE was being used, and indeed we saw in the output of a -DISPLAY THREAD(*) LOCATION(*) DETAIL command that the developer's thread was using the established trusted context (that information is in the V485 section of the command's output).
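
Expressed as DDL, the definition that finally worked was conceptually like the one below (the domain names shown here are placeholders - each of the real ones resolves to the single IP address of one of the two gateway servers):

CREATE TRUSTED CONTEXT TRCON1
BASED UPON CONNECTION USING SYSTEM AUTHID APP1ID
DEFAULT ROLE APP1ROLE
ATTRIBUTES (ADDRESS 'dcgw1.example.com', ADDRESS 'dcgw2.example.com')
ENABLE;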

Now, about the "implicit" nature of the trusted connection, referenced way up at the top of this blog entry (and in the entry's title): I wanted to make sure to emphasize the implicit nature of the trusted connection used in this example, because so much of the documentation on Db2 for z/OS trusted contexts pertains to explicit trusted connections. What's the difference? Well, in the latter case a client application explicitly requests a trusted connection to the Db2 server, whereas in the former case the connection is just a "regular" connection from the client application's perspective, and it becomes a trusted connection because a trusted context matching the application's authorization ID and IP address (or domain name or whatever other attribute is specified) has been defined on the Db2 side. If you can get a trusted connection to a Db2 server (one that enables use of the Db2 privileges granted to a role associated with the trusted connection) without having to explicitly request such a connection, why would you go the explicit route? You'd go that route if you needed functionality that is ONLY available for explicit trusted connections. An example of that functionality: an application, having requested and obtained an explicit trusted connection to a Db2 for z/OS server, can then tell Db2 that it is "switching" use of the explicit trusted connection from its ID (the ID the application used to connect to the Db2 server) to the ID of an end user of the application. That switching from one ID to another (and the associated flowing by the application of an actual end-user's ID to the Db2 server system) is one of the things that has to happen if you want to make use of (for example) the enterprise identity mapping capability provided by RACF (whereby an individual's "network" ID can be mapped to a RACF ID). Note that a trusted context defined for the purpose of supporting explicit trusted connection functionality would likely include the WITH USE FOR clause in the CREATE TRUSTED CONTEXT statement - that clause can be used to reference end-user IDs passed to Db2 by an application. In the situation I've described in this blog entry, the DBA did not require explicit trusted connection functionality, so setting up implicit trusted connection capability was appropriate.
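
To give a feel for the difference, a trusted context intended to support explicit trusted connections with ID switching might be defined along these lines - a sketch only, with hypothetical end-user IDs, and the authentication requirements appropriate for your environment could well differ:

CREATE TRUSTED CONTEXT TRCON2
BASED UPON CONNECTION USING SYSTEM AUTHID APP1ID
DEFAULT ROLE APP1ROLE
ATTRIBUTES (ADDRESS 'dcgw1.example.com')
WITH USE FOR USER0001, USER0002 WITHOUT AUTHENTICATION
ENABLE;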

I hope that the information I've provided here will be useful to you if you want to utilize implicit trusted connection functionality in your Db2 for z/OS environment.

Monday, February 25, 2019

Db2 for z/OS: Isolating REST Client Activity in Db2 Monitor Accounting Reports

In an entry posted to this blog some years ago, I described the form of a Db2 monitor-generated accounting report that I most like to use to get an overall view of application activity in a Db2 for z/OS environment: an accounting long report with data ordered by connection type (here, I am using terminology associated with the Db2 monitor with which I'm most familiar: IBM's Tivoli OMEGAMON XE for Db2 Performance Expert on z/OS). I like to see data in an accounting report aggregated at the connection type level because within such a report there is a sub-report with detailed accounting information for each connection type through which the target Db2 subsystem was accessed: a sub-report showing all CICS-related work for the subsystem, one showing all TSO attach-related work, one showing all DDF-related work, etc.

Now, the thing about the DDF-related activity in this form of an accounting report is that it is presently - thanks to a recent and important Db2 for z/OS enhancement - kind of mis-labeled. The DDF-related work in an accounting long report with data ordered by connection type is distinguished by the label CONNTYPE: DRDA. How is that a bit misleading? Well, the native REST interface to Db2 for z/OS (the enhancement to which I referred previously - designed into Db2 12 and retrofitted to Db2 11) is an extension of the Db2 distributed data facility (DDF's long name). What that means - as pointed out in a blog entry I posted a few months ago - is that there are now two paths into DDF: the DRDA path (which application developers might think of as the SQL path) and the REST path; thus, CONNTYPE: DRDA is a little off because what you see in that part of an order-by-connection-type Db2 monitor accounting report is all DDF-related activity on the Db2 subsystem of interest - activity associated with DRDA requesters and activity associated with REST clients.

What should the designator of DDF-related activity in an order-by-connection-type Db2 monitor accounting report be? CONNTYPE: DRDA+REST? CONNTYPE: DDF? Or, should there be a CONNTYPE: DRDA for work that is strictly related to actual DRDA requesters and a CONNTYPE: REST for activity related to REST clients?

I'll let the Db2 monitor developers figure that out. Meanwhile, if you have a Db2 subsystem on which there is a good bit of activity related to both DRDA requesters and REST clients, and you want to see, in one Db2 monitor accounting report, just the REST-related activity, how can you do that? If all the REST clients authenticated to the Db2 subsystem using only two distinct authorization IDs - for example, ABCID and XYZID - then you could get your REST-only accounting report by telling the Db2 monitor to include, in the accounting report, only activity related to those primary authorization IDs. The way you'd convey that request to OMEGAMON for Db2 would be through this specification in the report control statement:

INCLUDE (PRIMAUTH(ABCID XYZID))

Obviously, that approach becomes unwieldy if there are, say, a hundred or more IDs used by Db2 REST clients.

Is there a better way? Yes, there is. It so happens that a Db2 accounting trace record associated with a REST client interaction will include, in the correlation ID field (the field name is QWHCCV), the value 'DB2_REST'. Thus, one can easily get a Db2 monitor accounting report showing all REST-related activity for a subsystem, and only that activity, by informing the monitor that the report should include data only for DDF-related requests for which the correlation ID value is DB2_REST. Again using OMEGAMON for Db2 report command syntax (which is what I know), that would be done with this specification:

INCLUDE(CONNTYPE(DRDA)
        CORRNAME(DB2_REST))

And there you have it. I hope that this information will be useful to you, should your site start seeing a significant amount of Db2 for z/OS access via the REST interface (and I think that a substantial number of sites will see just that in the near future - the REST interface has generated a lot of interest among application developers).

Friday, January 4, 2019

A Case Study: Measuring the Impact of Db2 for z/OS Buffer Pool Changes

Not long ago, a Db2 for z/OS administrator contacted me with a request. His team had made some changes to the buffer pool configuration in their production Db2 environment (the overall size of the buffer pool configuration was increased to reduce read I/O activity, and greater use was made of large real storage page frames), and they were not sure as to how the performance impact of these changes could be measured. Could I assist with that analysis?

Sure, I said. I asked the Db2 admin to use his Db2 performance monitor tool to generate detailed accounting reports (depending on the monitor used, such a report might be titled ACCOUNTING REPORT - LONG or ACCOUNTING SUMMARY - LONG) capturing activity during periods preceding and following the buffer pool changes - same FROM and TO times, same day of the week for both reports (to make the comparison as "apples to apples" as possible). I also requested that the reports be generated so as to have data aggregated at the connection type level (again, terminology can vary by monitor product - your instruction to the monitor might be to "order" the data by connection type, or to "group" the data by connection type). Why do I like for data in a detailed accounting report to be aggregated in this way? Well, I find the default aggregation specification for Db2 monitor-generated accounting reports - authorization ID within plan name - to be too fine-grained for effective analysis of a Db2 system in an overall sense; conversely, aggregation at the subsystem ID level is generally too coarse-grained for my liking. When data in a detailed accounting report is aggregated at the connection type level, you get within the larger report a sub-report for each connection type used in the Db2 for z/OS system of interest; for example: a sub-report showing all CICS-Db2 work in the system, one showing all batch work that uses the call attachment facility, one showing all batch work using the TSO attachment facility, one showing all DDF-related work (connection type: DRDA), etc. With such a report in hand (or on your screen), you can quickly and easily identify the largest Db2 workload components (with "component" here referring to work associated with a given connection type) and focus your analysis accordingly.

I showed the Db2 admin and one of his colleagues how to put the aforementioned "before" and "after" Db2 monitor accounting reports to use in measuring the impact of the effected buffer pool configuration changes, and I'll provide that knowledge transfer here in the remainder of this blog entry.

First, I look over the sub-reports (by connection type) within the larger "before" and "after" detailed accounting reports to find the workload components (connection types) that dominate in terms of size. "Size" refers in this case to aggregate in-Db2 CPU cost of SQL statement execution, and that's easy to determine: for a given workload component (a given connection type), first find the average in-Db2 (i.e., class 2) CPU time (done by summing two values: average class 2 CP CPU time, which is general-purpose CPU time, and average class 2 SE CPU time, which is "specialty engine," or zIIP, CPU time). Next, take that average class 2 CPU time for the connection type, and multiply it by the number of occurrences (typically, number of accounting records), because "average" is "average per occurrence." Here is an example of what you might see in one of the sub-reports (this one for a DRDA workload), with the numbers I've referenced shown in the excerpt:

CONNTYPE: DRDA

AVERAGE      DB2 (CL.2)                   HIGHLIGHTS
-----------  ----------                   -----------------------
                                          #OCCURRENCES  : 3620821
CP CPU TIME    0.000169
SE CPU TIME    0.000223

So in this case, aggregate CPU cost of SQL statement execution (in-Db2, or class 2, accounting time is time associated with SQL statement execution) for the DRDA workload component is (0.000169 + 0.000223) * 3620821 = 1419 CPU seconds.

Performing this simple calculation showed that for the overall Db2 workload I was analyzing, the two biggest components by far were the DRDA and CICS-Db2 connection types; so, I didn't bother with the other much-smaller components (call attachment facility, TSO attachment facility, etc.).

For the two major workload components (DRDA and CICS-Db2), I compared "before" and "after" figures for average class 2 CPU time. That is the key metric for evaluating the impact of a Db2 system or application performance tuning action. Too often, Db2 for z/OS people effect a change aimed at improving workload performance and look, for evidence of improvement, at the total class 2 CPU time for the workload of interest (and by the way, I focus on class 2 CPU time because that is what a Db2 person can influence - class 1 CPU time includes time outside of Db2, over which a Db2 person may not have much control). "Before" and "after" figures for total class 2 CPU time for a workload may show little or nothing in the way of difference, leading a Db2 person to think that a performance tuning action accomplished nothing. In fact, it may be that the tuning change was indeed successful. How could that be, if total in-Db2 CPU time did not decrease? Simple: look for a change in throughput (the number of commits is a pretty good indicator of "pieces of Db2 work" completed on a system). A Db2 tuning change might reduce average in-Db2 CPU time per transaction (or, more broadly, per unit of work), and that might result in greater throughput (units of work completed in a given time period), and that (more transactions completed, at less in-Db2 CPU time per transaction) might result in a non-decrease in the total in-Db2 CPU time for the workload. In my experience, greater throughput with no increase in total in-Db2 CPU time for a workload is a GOOD thing. If you see increased throughput following a Db2 performance tuning change and your organization actually wants to keep throughput at the lower "before" level, that can be accomplished by limiting the availability of Db2 threads for the workload in question - essentially, you keep a lid on throughput by introducing some queuing for Db2 threads.
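
A quick hypothetical illustrates the point: suppose that before a tuning change a workload completed 1,000,000 units of work in an hour, at an average of 0.002 seconds of class 2 CPU time apiece - 2,000 seconds of in-Db2 CPU time in total. If the change drops the average to 0.00125 seconds and throughput rises to 1,600,000 units of work in the hour, total in-Db2 CPU time is still 2,000 seconds - no decrease at all - even though per-unit-of-work CPU efficiency improved by 37.5%.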

OK, so your focus as a Db2 person should be on reducing average class 2 CPU time for a workload, as indicated in a Db2 monitor-generated accounting report (or an online monitor display of thread activity on the system - but I prefer to look at accounting reports); and, in tracking this, make sure that you look at "average total" class 2 CPU time, meaning the sum of average class 2 general-purpose CPU time (usually labeled CP CPU time) and average class 2 zIIP CPU time (usually labeled SE CPU time) - always check the average class 2 zIIP CPU time, even for non-DDF-related applications that you might not think of as drivers of zIIP utilization, because that zIIP time figure could be non-zero (for example, if some queries in a batch job are parallelized by Db2, some of the batch job's class 2 CPU time will be consumed on zIIP engines because the "pieces" of queries parallelized by Db2 are zIIP-eligible). As indicated by the average-class-2-CPU-time yardstick, how did the Db2 team with which I worked do, with regard to improving application workload performance?

They did quite well:

                      Avg cl 2 CPU (GP + zIIP) - seconds
Workload component      Before     After        Change
------------------    --------  --------  ------------
CICS-Db2              0.001322  0.001245   6% decrease
DRDA                  0.000392  0.000259  34% decrease

That 6% improvement in average class 2 CPU time for CICS-Db2 transactions is close to what I am accustomed to seeing as a result of buffer pool actions such as those taken by the Db2 team in the case of which I'm writing: larger buffer pools, and greater use of page-fixed pools backed by large real storage page frames, tend to shave a few percentage points off of average class 2 CPU times reported by a Db2 monitor. A modest improvement of that nature is generally welcome, as it requires no application code changes and positively impacts a large number of programs (it's the kind of change that I put in the "rising tide lifts all boats" category).

What about the 34% improvement in class 2 CPU efficiency seen for the DRDA workload as a result of the buffer pool changes? Why such a large improvement? Here's my theory: as is true in many cases, the SQL statements associated with the DRDA workload at this site are overwhelmingly dynamic in nature (that is commonly seen when DDF transactions do not make heavy use of static SQL-issuing stored procedures). Dynamic SQL statements can be automatically re-optimized on a regular basis, particularly when a "bounce" (stop and restart) of a Db2 subsystem requires total re-population of the Db2 dynamic statement cache in memory (where the prepared-for-execution form of a dynamic SQL statement is stored for re-use). The Db2 subsystem on which this blog entry is focused was indeed bounced on the weekend during which the buffer pool changes were put into effect, so ALL of the dynamic SQL statements coming into the system following that Db2 stop/restart were optimized at first execution. And here's the thing: it's not just indexes and catalog statistics that influence SQL statement access path selection. Db2 memory resources - particularly buffer pool resources - are another factor. I have a feeling that when the dynamic SQL statements (especially queries) associated with the DRDA workload were re-optimized following the significant changes made to the Db2 subsystem's buffer pool configuration, some new and better-performing access plans were generated. Further evidence of positive access path changes: average GETPAGEs per transaction for the DDF workload decreased by 24% following the buffer pool changes. GETPAGE activity is a major determinant of the CPU cost of SQL statement execution, and a substantial decrease in such activity is often indicative of access path changes.

This kind of begs the question: would a rebind of the packages associated with the CICS-Db2 workload (a workload for which, at this site and many other sites, almost 100% of the SQL statements are static) following the buffer pool changes have led to per-transaction CPU efficiency gains going beyond the 6% improvement that we saw? Quite possibly. That's something to think about. If you do decide to rebind static SQL packages following some significant buffer pool changes, keep a couple of things in mind. First, rebinding with APCOMPARE(WARN) will cause Db2 to let you know (via message DSNT285I) when access paths change for one or more statements associated with a package (information about access path changes is provided in the REMARKS column of the PLAN_TABLE when a package is rebound with EXPLAIN(YES)). Second, if access paths change and package performance gets worse as a result (not likely, but possible), you can quickly and easily put the previous instance of the package back into use through a REBIND with SWITCH(PREVIOUS), assuming that plan management is in effect for your Db2 subsystem (and it is "on" by default - check the value of the PLANMGMT parameter in the ZPARM module). Together, these Db2 features are aimed at "taking the fear out of rebinding," as a former colleague of mine so aptly put it.
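
In command terms, those rebinds might look like this (the collection and package names are hypothetical):

REBIND PACKAGE(COLL1.PKG1) APCOMPARE(WARN) EXPLAIN(YES)

And, in the unlikely event of a performance regression following re-optimization:

REBIND PACKAGE(COLL1.PKG1) SWITCH(PREVIOUS)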

So, there you go. If you make Db2 changes aimed at boosting performance for a workload, measure the result using the average class 2 CPU time figures provided in Db2 monitor-generated accounting detail reports that show activity from "before" and "after" time periods. Buffer pool changes (larger pools, greater use of page-fixed pools backed by large real storage page frames) usually reduce average class 2 CPU time by a few percentage points. Greater performance gains could be seen if the buffer pool changes enable Db2 to choose new and better-performing access paths for SQL statements. Those statement re-optimization actions tend to be automatic for dynamic SQL statements (especially if a Db2 subsystem's dynamic statement cache has to be re-populated). For static SQL, package rebind is needed for statement re-optimization. Rebinding with APCOMPARE(WARN) and EXPLAIN(YES) will provide information about access path changes, and in the unlikely event of a negative performance result following re-optimization, rebinding with SWITCH(PREVIOUS) puts the previous instance of the package back into use.