Thursday, October 29, 2015

DB2 for z/OS: DDF and Accounting Trace Records

I recently encountered a situation that served to illuminate a number of important aspects of application processing in the context of the DB2 for z/OS distributed data facility (DDF). I'll go through the scenario here, in hopes that you will find the information to be useful.

A systems programmer contacted me about a problem his organization had run into with a new application that accesses DB2 for z/OS by way of a network connection (and so uses DDF). The application drives large numbers of row-insert operations, and was generating huge numbers of DB2 accounting trace records (written to SMF data sets). In one test, executed in a pre-production environment, the application issued 140 million INSERT statements, and that resulted in DB2 cutting 140 million accounting trace records. The trace records filled the SMF data sets faster than they could be offloaded, and as a consequence some of the records were lost.

To reduce the torrent of trace records flooding the SMF data sets, the systems programmer changed the value of the ZPARM parameter ACCUMACC in the target DB2 subsystem to 100. In doing that, he made it possible for DB2 to "roll up" activity for as many as 100 DDF units of work into a single accounting trace record. That action addressed the immediate problem, but only for the test system: in the production environment, the organization had a requirement for one accounting trace record per DDF unit of work (ACCUMACC on the production DB2 subsystem was set to NO).

With roll-up of activity for multiple DDF transactions into a single accounting record not an option in production, the systems programmer looked into multi-row INSERTs (also referred to as "bulk inserts") as a means of reducing the volume of trace records that would be generated when the new application executed. He put this question to me: Could multi-row INSERT be utilized for a Java application accessing DB2 for z/OS-managed data via DDF? I responded affirmatively, and pointed out that information on multi-row INSERT for Java programs can be found in the DB2 for z/OS Application Programming Guide and Reference for Java (the DB2 10 manual can be downloaded from http://www-01.ibm.com/support/docview.wss?uid=swg27019288#manuals, and the DB2 11 manual from http://www-01.ibm.com/support/docview.wss?uid=swg27039165#manuals).

The systems programmer got together with one of the developers of the new application, and they ran a test in which 1000 rows were bulk-inserted into a DB2 table, in chunks of 100 rows. Instead of 1000 accounting trace records, execution of the test program generated 10 trace records. Here's the interesting part: the program was not issuing any explicit commits. That led the systems programmer to surmise that perhaps accounting record volume was being driven by the volume of INSERT statements issued by the application program. After all, when 140 million single-row INSERTs had been issued by the program, 140 million accounting records had been written to SMF. When 1000 rows were inserted by way of 10 bulk inserts of 1000 rows apiece, 10 trace records had been cut. I explained that accounting records are generated not by issuance of data-changing SQL DML statements, but by commits issued following execution of one or more SQL DML statements. [That's the way it works for application processes, such as DDF-using programs, that are transactional in nature -- a z/OS batch program is different, in that one accounting record will be generated when the program completes, even if the program issued hundreds of commits while executing.]. As the Java application developer was not issuing explicit commits from his program, I speculated that AutoCommit was enabled on the client side of this client-server application. Not so. The systems programmer verified that AutoCommit was set to FALSE for the new application. This information left us scratching our heads.

The sysprog next suggested that perhaps the DBAT used by the application was going inactive after each INSERT, and the accounting records were being generated as a result of DBATs going inactive (DBAT is short for database access thread -- the kind of thread used by applications that connect to DB2 via DDF). Not the case, I told him. DBATs don't go inactive in the usual sense. A connection (from a client application to DB2) will go inactive when a DDF transaction completes. Yes, the DBAT that had been used to service the transaction will go back to the DBAT pool at that time, but I'd say that "disconnected" is a better term that "inactive" for the pooled DBATs (I'm talking here about "regular" DBATs -- a high-performance DBAT will remain associated with the connection through which it was instantiated, and that connection will not go inactive until the high performance DBAT is terminated after having been reused 200 times). In any case, I said, the key is transactions completing, not connections going inactive. Connections going inactive, like accounting records getting cut (with ACCUMACC set to NO), are a response to DDF transactions completing, and transaction completion involves commit processing.

Asked the sysprog: Could packages bound with RELEASE(COMMIT) be the cause of the large volume of accounting trace records associated with execution of the new application? My response: No. Accounting trace records are generated at DDF transaction boundaries, and RELEASE(COMMIT) and RELEASE(DEALLOCATE) do not affect transactional boundaries. Commits establish transaction boundaries. Could DB2 itself be driving the commit activity that was in turn driving trace record generation? No. DB2 does not cause commits. DB2 responds to commits (an exception to that rule is the commit driven upon completion of a DB2 stored procedure defined with COMMIT ON RETURN YES).

Finally, in searching for the elusive commits that we knew were coming from somewhere, the systems programmer and his application development colleagues found the answer: the new application was using the open-source Spring Framework, and in response to the application indicating completion of a transaction, the Spring Framework would drive a commit that in turn would drive generation of a DB2 accounting trace record. With this now clarified, the application team can vary commit frequency as desired to balance volume of trace records generated (don't want too many commits) against accumulation of DB2 child X-locks associated with INSERT processing (don't want too few commits).

The key take-away here: for a lot of reasons, issuance of commits is really important for DB2-accessing application processes. Sometimes, developers of DDF-using applications have direct visibility of commit-issuing logic. Other times, use of something like a framework between a client program and a DB2 for z/OS data server (and the Spring Framework is just one example of this kind of thing) abstracts issuance of commits in a way that removes commit visibility from an application developer. When that is the case, commit frequency can still be controlled and changed as needed, but that requires an understanding of how interaction by a program with the framework drives the commits that flow to DB2. If your DB2-accessing client programs interface with an application framework, take the time to understand how commits are made to flow to DB2. In doing that, you'll avoid surprises and you'll be less likely to spend time scratching your head and spinning your wheels.

Tuesday, October 20, 2015

DB2 for z/OS: Getting a Handle on a CPU-Constrained Environment

z/OS systems are famous for their reliability, and part of that story is the ability of z/OS to accommodate surges of application activity: you can pile more and more work onto a z/OS system, and it will keep on trucking, slicing the "pie" of available processing capacity into smaller and smaller slices to keep a workload moving along. That's highly preferable to failing as a result of being overloaded, but there is a point at which applications running on a z/OS system will perform in a sub-optimal way if the system's processing capacity is undersized relative to the workload that the system is being asked to execute. If you work with DB2 for z/OS, you want to be able to spot a situation in which DB2 performance is being negatively impacted by a shortage of CPU capacity, so that your organization can take corrective action. Through this blog entry, I want to help you in that endeavor.

Recognizing indications of CPU overload in DB2 monitor reports and displays

First, know the signs of strained CPU capacity that are reflected in DB2 monitor data. With respect to overloaded general-purpose engines, the key indicator is in-DB2 not-accounted-for time. You can get that metric from a DB2 monitor-generated accounting long report (also known as an accounting detail report) or an online display of DB2 application workload activity. My preference is to use a report, and what I particularly like to use is an accounting long report with information ordered by (or the term may be "grouped by," depending on the DB2 monitor product in use at your site) connection type. That kind of report will contain a sub-report for each connection type used with the target DB2 subsystem: one for the CICS-DB2 workload (if you use CICS with DB2), one for the IMS-DB2 workload (if that's present in your environment), one for the DRDA workload (i.e., the DDF workload -- again, if you have such a workload), one for batch jobs that connect to DB2 via the call attach facility, etc. Check the in-DB2 not-accounted-for times for your higher-priority transactional workloads, such as CICS-DB2, IMS-DB2, and DRDA. Average in-DB2 not-accounted-for time should be reported by your DB2 monitor, but if you don't see a field called not-accounted-for time among the in-DB2 (also known as class 2) times, you can easily derive the value yourself: just subtract average in-DB2 CPU time (that's general-purpose engine CPU time plus zIIP, or "specialty engine," CPU time) from average in-DB2 elapsed time, and then subtract from that figure the total class 3 suspend time (class 3 times include wait for synchronous read time, wait for other read time, wait for lock/latch time, etc.). What's left is average in-DB2 not-accounted-for time. If that time is more than 10% of in-DB2 elapsed time for a higher-priority transactional workload (CICS-DB2, IMS-DB2, DRDA), you have an indication that overloaded general-purpose engines are having a detrimental effect on application throughput. [An in-DB2 not-accounted-for time that is more than 10% of a batch workload's in-DB2 elapsed time is less of an issue in my eyes, owing to the fact that 1) batch programs often run at a lower priority than transactional programs, and 2) organizations sometimes intentionally push general-purpose engine utilization close to 100% during periods of heavy batch processing.]

Don't stop with an assessment of CPU constraint related to general-purpose engines. You need to check out the situation regarding a mainframe's zIIP engines (if any), as well. Here, the most useful indicator is a field in the "class 1" column of a DB2 monitor-generated accounting report (this field might be available via online displays as well, depending on the DB2 monitor product in use at your site). The field to which I'm referring might have a label (depending on the monitor) like SECP CPU, with the "SE" being short for "specialty engine" (that would be zIIP engine) and the "CP" standing for "central processor," which is a reference to a general-purpose engine. The value of the SECP CPU field shows the average class 1 CPU time that is associated with zIIP-eligible work that ended up being dispatched to a general-purpose engine. If you're looking at a DB2 monitor-generated accounting long report with data ordered by connection type, look at the sub-report for the DRDA connection type (SQL from DRDA requesters tends to be the main DB2-related driver of zIIP engine utilization). Referring to class 1 times (versus class 2), call the value in the SECP CPU field A, and the value in the SE CPU TIME field B (this is CPU time for zIIP-eligible work that was dispatched to a zIIP engine). Perform the simple calculation A / (A + B), and what you get is the percentage of zIIP-eligible work that ran on a general-purpose engine. I call that the "zIIP spill-over percentage," and if it is higher than 5% for a workload that drives a lot of zIIP usage (and again, the prime example there, from a DB2 perspective, is the DRDA workload), that's an indicator that your zIIP engines are over-used to the point of reducing overall throughput for DB2-accessing applications.

Why does zIIP-eligible work get dispatched to general-purpose engines, leading to a non-zero value in the SECP CPU field for your DRDA workload? That happens when a zIIP-eligible piece of work is ready for dispatch but no zIIP engine is available to process that work. Why can more than a small amount of zIIP-eligible-on-general-purpose time be a performance-impacting issue? Because the system will wait a few milliseconds before dispatching a piece of zIIP-eligible work to a general-purpose engine (this to, presumably, give a zIIP engine a chance to become available in that few-millisecond window). Why is that a problem? Because prefetch processing is 100% zIIP-eligible starting with DB2 10 for z/OS, and if zIIP engines are busy enough to cause more than a small amount of zIIP-eligible work to be redirected to general purpose engines, that can slow prefetch processing. That, in turn, can reduce throughput for applications that drive a lot of prefetch read activity (and don't think that's just batch -- plenty of online transactional applications are characterized by significant prefetch activity in addition to synchronous database reads).

Check your z/OS monitor for engine-busy information

If your DB2 monitor is pointing to a shortage of general-purpose and/or zIIP engine capacity, use your z/OS monitor to check on the utilization of the engines assigned to the LPAR in which the DB2 subsystem is running. In doing this, use the right monitor-generated report for the task at hand. Take IBM's RMF (Resource Measurement Facility) monitor, for example. RMF can be used to generate a number of different reports pertaining to processor utilization. All are useful, but for different purposes. If you want to confirm a general purpose engine-utilization problem suggested by elevated in-DB2 not-accounted-for times as reported by your DB2 monitor, look at an RMF CPU Activity Report (information about that report -- and others -- can be found in the IBM Knowledge Center on the Web). In that report (generated for the LPAR of interest), if the TOTAL/AVERAGE value for the general-purpose engines (identified as CPs, short for central processors) in the MVS BUSY column is greater than 85% (for a transactional workload) AND the value in the % column of the IN READY row for "number of address spaces" <= N in the SYSTEM ADDRESS SPACE ANALYSIS section of the report is less than 80%, you likely have work delays that are caused by CPU contention (an average CP utilization greater than 85% can be OK during periods of heavy batch processing).

You'll also see in an RMF CPU Activity Report a TOTAL/AVERAGE value in the MVS BUSY column for the specialty engines (zIIP engines, identified as IIPs) assigned to the LPAR. How high can this value be before performance problems arise? That depends very much on the number of zIIP engines available. As I mentioned previously in this entry, you can get from your DB2 monitor (or should be able to get) the numbers (from an accounting detail report or online display of accounting information) that you need to calculate the "zIIP spill-over percentage" for a zIIP-driving workload, such as the DRDA workload (that ratio indicates the percentage of zIIP-eligible work that ends up being executed on general-purpose processors). I noted that you want the "zIIP spill-over ratio" to be not more than 5%. If the z/OS LPAR in which the DB2 subsystem of interest is running has a single zIIP engine, you could see an undesirable zIIP spill-over percentage if that zIIP engine's utilization is in the 30-40% range. Contrast that with a real-world situation I encountered recently in which average utilization of an LPAR's zIIP engines was 78%, while the zIIP spill-over percentage was only 1.6%. How could that be? The LPAR has 9 zIIP engines, that's how. It's queuing theory, folks. Even though these zIIPs are running at a relatively high level of utilization, because there are so many of them a zIIP engine is almost always available on this system when a zIIP-eligible piece of work is ready for dispatch. Note that starting with the EC12 z Systems servers (and continuing with the z13 servers), a mainframe can be configured with up to two zIIP engines per general-purpose engine (that ratio previously couldn't go above 1:1). The more zIIPs you have, the hotter you can run them while still avoiding an overly large degree of zIIP spill-over.

Summing it up

The fact that z Systems remain reliable servers even when running at very high levels of utilization is a very good thing, but even mainframes can be loaded with work to an extent that application performance will be sub-optimal. Using DB2 and z/OS monitor data, keep an eye on your system to make sure that a heavily-loaded server doesn't become an overloaded server. With effective capacity planning, you can keep the supply of MIPS sufficiently ahead of demand to deliver topflight performance for your mainframe-based applications.