Tuesday, October 20, 2015

DB2 for z/OS: Getting a Handle on a CPU-Constrained Environment

z/OS systems are famous for their reliability, and part of that story is the ability of z/OS to accommodate surges of application activity: you can pile more and more work onto a z/OS system, and it will keep on trucking, slicing the "pie" of available processing capacity into smaller and smaller slices to keep a workload moving along. That's highly preferable to failing as a result of being overloaded, but there is a point at which applications running on a z/OS system will perform in a sub-optimal way if the system's processing capacity is undersized relative to the workload that the system is being asked to execute. If you work with DB2 for z/OS, you want to be able to spot a situation in which DB2 performance is being negatively impacted by a shortage of CPU capacity, so that your organization can take corrective action. Through this blog entry, I want to help you in that endeavor.

Recognizing indications of CPU overload in DB2 monitor reports and displays

First, know the signs of strained CPU capacity that are reflected in DB2 monitor data. With respect to overloaded general-purpose engines, the key indicator is in-DB2 not-accounted-for time. You can get that metric from a DB2 monitor-generated accounting long report (also known as an accounting detail report) or an online display of DB2 application workload activity. My preference is to use a report, and what I particularly like to use is an accounting long report with information ordered by (or the term may be "grouped by," depending on the DB2 monitor product in use at your site) connection type. That kind of report will contain a sub-report for each connection type used with the target DB2 subsystem: one for the CICS-DB2 workload (if you use CICS with DB2), one for the IMS-DB2 workload (if that's present in your environment), one for the DRDA workload (i.e., the DDF workload -- again, if you have such a workload), one for batch jobs that connect to DB2 via the call attach facility, etc. Check the in-DB2 not-accounted-for times for your higher-priority transactional workloads, such as CICS-DB2, IMS-DB2, and DRDA. Average in-DB2 not-accounted-for time should be reported by your DB2 monitor, but if you don't see a field called not-accounted-for time among the in-DB2 (also known as class 2) times, you can easily derive the value yourself: just subtract average in-DB2 CPU time (that's general-purpose engine CPU time plus zIIP, or "specialty engine," CPU time) from average in-DB2 elapsed time, and then subtract from that figure the total class 3 suspend time (class 3 times include wait for synchronous read time, wait for other read time, wait for lock/latch time, etc.). What's left is average in-DB2 not-accounted-for time. If that time is more than 10% of in-DB2 elapsed time for a higher-priority transactional workload (CICS-DB2, IMS-DB2, DRDA), you have an indication that overloaded general-purpose engines are having a detrimental effect on application throughput. [An in-DB2 not-accounted-for time that is more than 10% of a batch workload's in-DB2 elapsed time is less of an issue in my eyes, owing to the fact that 1) batch programs often run at a lower priority than transactional programs, and 2) organizations sometimes intentionally push general-purpose engine utilization close to 100% during periods of heavy batch processing.]

Don't stop with an assessment of CPU constraint related to general-purpose engines. You need to check out the situation regarding a mainframe's zIIP engines (if any), as well. Here, the most useful indicator is a field in the "class 1" column of a DB2 monitor-generated accounting report (this field might be available via online displays as well, depending on the DB2 monitor product in use at your site). The field to which I'm referring might have a label (depending on the monitor) like SECP CPU, with the "SE" being short for "specialty engine" (that would be zIIP engine) and the "CP" standing for "central processor," which is a reference to a general-purpose engine. The value of the SECP CPU field shows the average class 1 CPU time that is associated with zIIP-eligible work that ended up being dispatched to a general-purpose engine. If you're looking at a DB2 monitor-generated accounting long report with data ordered by connection type, look at the sub-report for the DRDA connection type (SQL from DRDA requesters tends to be the main DB2-related driver of zIIP engine utilization). Referring to class 1 times (versus class 2), call the value in the SECP CPU field A, and the value in the SE CPU TIME field B (this is CPU time for zIIP-eligible work that was dispatched to a zIIP engine). Perform the simple calculation A / (A + B), and what you get is the percentage of zIIP-eligible work that ran on a general-purpose engine. I call that the "zIIP spill-over percentage," and if it is higher than 5% for a workload that drives a lot of zIIP usage (and again, the prime example there, from a DB2 perspective, is the DRDA workload), that's an indicator that your zIIP engines are over-used to the point of reducing overall throughput for DB2-accessing applications.

Why does zIIP-eligible work get dispatched to general-purpose engines, leading to a non-zero value in the SECP CPU field for your DRDA workload? That happens when a zIIP-eligible piece of work is ready for dispatch but no zIIP engine is available to process that work. Why can more than a small amount of zIIP-eligible-on-general-purpose time be a performance-impacting issue? Because the system will wait a few milliseconds before dispatching a piece of zIIP-eligible work to a general-purpose engine (this to, presumably, give a zIIP engine a chance to become available in that few-millisecond window). Why is that a problem? Because prefetch processing is 100% zIIP-eligible starting with DB2 10 for z/OS, and if zIIP engines are busy enough to cause more than a small amount of zIIP-eligible work to be redirected to general purpose engines, that can slow prefetch processing. That, in turn, can reduce throughput for applications that drive a lot of prefetch read activity (and don't think that's just batch -- plenty of online transactional applications are characterized by significant prefetch activity in addition to synchronous database reads).

Check your z/OS monitor for engine-busy information

If your DB2 monitor is pointing to a shortage of general-purpose and/or zIIP engine capacity, use your z/OS monitor to check on the utilization of the engines assigned to the LPAR in which the DB2 subsystem is running. In doing this, use the right monitor-generated report for the task at hand. Take IBM's RMF (Resource Measurement Facility) monitor, for example. RMF can be used to generate a number of different reports pertaining to processor utilization. All are useful, but for different purposes. If you want to confirm a general purpose engine-utilization problem suggested by elevated in-DB2 not-accounted-for times as reported by your DB2 monitor, look at an RMF CPU Activity Report (information about that report -- and others -- can be found in the IBM Knowledge Center on the Web). In that report (generated for the LPAR of interest), if the TOTAL/AVERAGE value for the general-purpose engines (identified as CPs, short for central processors) in the MVS BUSY column is greater than 85% (for a transactional workload) AND the value in the % column of the IN READY row for "number of address spaces" <= N in the SYSTEM ADDRESS SPACE ANALYSIS section of the report is less than 80%, you likely have work delays that are caused by CPU contention (an average CP utilization greater than 85% can be OK during periods of heavy batch processing).

You'll also see in an RMF CPU Activity Report a TOTAL/AVERAGE value in the MVS BUSY column for the specialty engines (zIIP engines, identified as IIPs) assigned to the LPAR. How high can this value be before performance problems arise? That depends very much on the number of zIIP engines available. As I mentioned previously in this entry, you can get from your DB2 monitor (or should be able to get) the numbers (from an accounting detail report or online display of accounting information) that you need to calculate the "zIIP spill-over percentage" for a zIIP-driving workload, such as the DRDA workload (that ratio indicates the percentage of zIIP-eligible work that ends up being executed on general-purpose processors). I noted that you want the "zIIP spill-over ratio" to be not more than 5%. If the z/OS LPAR in which the DB2 subsystem of interest is running has a single zIIP engine, you could see an undesirable zIIP spill-over percentage if that zIIP engine's utilization is in the 30-40% range. Contrast that with a real-world situation I encountered recently in which average utilization of an LPAR's zIIP engines was 78%, while the zIIP spill-over percentage was only 1.6%. How could that be? The LPAR has 9 zIIP engines, that's how. It's queuing theory, folks. Even though these zIIPs are running at a relatively high level of utilization, because there are so many of them a zIIP engine is almost always available on this system when a zIIP-eligible piece of work is ready for dispatch. Note that starting with the EC12 z Systems servers (and continuing with the z13 servers), a mainframe can be configured with up to two zIIP engines per general-purpose engine (that ratio previously couldn't go above 1:1). The more zIIPs you have, the hotter you can run them while still avoiding an overly large degree of zIIP spill-over.

Summing it up

The fact that z Systems remain reliable servers even when running at very high levels of utilization is a very good thing, but even mainframes can be loaded with work to an extent that application performance will be sub-optimal. Using DB2 and z/OS monitor data, keep an eye on your system to make sure that a heavily-loaded server doesn't become an overloaded server. With effective capacity planning, you can keep the supply of MIPS sufficiently ahead of demand to deliver topflight performance for your mainframe-based applications.

No comments:

Post a Comment