Friday, September 26, 2014

DB2 for z/OS: Avoiding zIIP Engine Contention Issues

In an entry posted a few months ago to this blog, I touched on several matters pertaining to mainframe zIIP engines and the use of these processors by DB2 for z/OS. Based on recent experience, I feel that it's important to further highlight a particular zIIP-related issue: zIIP engine contention. Through this post, I want to explain what zIIP contention is, why you want to avoid it, how you can spot it, and what you can do about it.

As, probably, most mainframers know by now, zIIPs (System z Integrated Information Processors) are "specialty engines" aimed at reducing the cost of mainframe computing. They do this in two ways: 1) they cost less than general-purpose engines, and 2) they do not factor into the pricing of mainframe software. Over time, as IBM has made more of the work done on System z servers zIIP-eligible, and as existing zIIP-eligible workloads have grown (particularly those associated with applications that access DB2 for z/OS data through network connections -- commonly referred to as DDF or DRDA workloads), zIIP engine utilization rates have climbed at many sites. That's a good thing, but only up to a certain point. If zIIP engines become too busy, DB2 performance can be negatively impacted. I'll first expand on that point.

What zIIP contention is, and why you want to avoid it

Consider a z/OS LPAR configured with one or more zIIPs. If a zIIP engine is not available when an item of zIIP-eligible work in the system is ready for dispatch (because the zIIPs are busy at that time with other work), that piece of work will be dispatched to a general-purpose engine. There is a slight delay involved in dispatching zIIP-eligible work to a general-purpose engine. That doesn't matter much if only a small amount of zIIP-eligible work ends up being redirected to general-purpose engines. If, however, the degree of what I call "zIIP spill-over" reaches a too-high level, application performance degradation can result.

Indicators of performance degradation caused by zIIP contention

zIIPs have been around for quite a few years now (since the early 2000s, I believe). Why is the issue of zIIP contention only now coming to the fore? Two reasons: 1) it took a while for zIIP utilization at many sites to reach a level at which contention can occur, and 2) the type of work that can utilize zIIP resources recently changed in an important way. This latter point refers to the introduction, with DB2 10 for z/OS, of zIIP eligibility for prefetch read and database write I/Os. Here's why that's important: it can be a pretty good chunk of work (often accounting for the large majority of CPU time charged to the DB2 database services address space, aka DBM1), and it's a type of system task that is very important for the performance of some DB2 applications. If zIIP contention causes delays in the dispatching of prefetch read tasks, run times for prefetch-intensive workloads (such as batch jobs and data warehousing/business intelligence applications) can become elongated. If that were to happen, you might see the effect as a significant increase in a class 3 wait time (so called because the information comes from records generated when DB2 accounting trace class 3 is active) that is labeled as suspension due to "other read I/O" in an IBM OMEGAMON for DB2 accounting long report (the field might be somewhat differently labeled in accounting reports generated by other vendors' DB2 monitor products).

Prefetch slowdown isn't the only thing to look out for. Lately, I've seen data that suggests a link between higher levels of "zIIP spill-over" and higher-than-desired in-DB2 not-accounted-for time for DB2-accessing applications. Not-accounted-for time is a DB2 class 2 time (it's captured in DB2 accounting trace class 2 records) that is probably calculated for you in an accounting long report generated by your DB2 monitor. If your monitor does not calculate this number for you, you can easily figure it yourself: just take average in-DB2 (i.e., class 2) elapsed time for an application, and subtract from that the average in-DB2 CPU time (keeping in mind that this is two numbers: in-DB2 general-purpose CPU time and in-DB2 specialty engine CPU time). Next, subtract out total class 3 suspend time. What's left is in-DB2 not-accounted-for time. Generally speaking, a not-accounted-for time that is less than 10% of total in-DB2 elapsed time for a higher-priority transactional workload (such as many DDF and CICS workloads) does not concern me (a higher level of not-accounted-for time for a batch workload is usually not such a big deal). In times past, a higher-than-desired in-DB2 not-accounted-for time (i.e., more than 10% of in-DB2 elapsed time for a higher-priority transactional workload) was usually an indication that a z/OS LPAR's general-purpose engines were over-committed (as might be the case if they are routinely running at utilization levels of 95% or more). Nowadays, it appears that higher-than-desired in-DB2 not-accounted-for time might also be caused by elevated levels of zIIP contention.

An early-warning indicator

Obviously, you'd like to spot and head off a potential zIIP contention-related performance problem. How could you do that? First and foremost, I'd keep an eye on what I call the zIIP spill-over rate. That's the percentage of zIIP-eligible work that ends up running on general-purpose engines (as mentioned previously, this spill-over occurs when zIIP engines are busy when zIIP-eligible work is ready for dispatch). Here's how to calculate the zIIP spill-over rate: first, use your Db2 monitor to generate an "accounting report - long" (that's what IBM's OMEGAMON for Db2 calls it - other monitors call the report something like "accounting summary, long"), and make sure that the data in the report is ordered by connection type. With data ordered by connection type, there will be, within the larger accounting report, several sub-reports - one for each connection type used to access the Db2 subsystem (for example, one sub-report will show all CICS-Db2 work, another will show all work associated with applications that access the subsystem using the Db2 call attachment facility, etc.). In the accounting report, go the to sub-report for the DRDA connection type (this sub-report shows all work associated with applications that access the Db2 subsystem through the distributed data facility). In this sub-report, look for the following fields (I'm using labels and headings found in an OMEGAMON for Db2 report - some monitors put this information under the heading, "average appl(class 1)"):

MEASURED/ELIG TIMES  APPL (CL1)
-------------------  ---------- 
CP CPU TIME            0.001935  (C)

 ELIGIBLE FOR SECP     0.000002  (A)

SE CPU TIME            0.000781  (B)

The number labeled (C) is the average general-purpose CPU time for the workload (average per accounting record, which is usually close to average per transaction). The number labeled (B) is the average zIIP engine CPU time. The number labeled (A) is the average amount of general-purpose CPU time consumed in doing zIIP-eligible work.  The zIIP spill-over rate is calculated as follows:

A / (A + B)

The figures above show a zIIP spill-over rate of less than 1%. Such a small non-zero value would not cause me concern. What would cause me some concern is a rate that is over 5%. With that much zIIP-eligible work running on general-purpose engines, you could see the beginnings of a negative impact on application performance.

So, at what level of zIIP engine utilization would you see a potentially performance-impacting zIIP spill-over rate? That depends on the number of zIIP engines configured in a z/OS LPAR. If an LPAR has only one zIIP engine, you could see a higher-than-desired rate of zIIP spill-over with that one engine running somewhere in the 30s, with respect to percentage of utilization. With four zIIP engines, utilization could probably go to around 50% with little spill-over of zIIP-eligible work to general-purpose engines. I saw data for one system in which average zIIP utilization (during a reporting interval, referring to an IBM RMF CPU Activity Report) was 70%, and zIIP spill-over was around 1% - that z/OS LPAR was configured with 9 zIIP engines. The more zIIP engines you have, the hotter you can run them while still keeping zIIP spill-over to general-purpose engines at a very low rate.

Doing something about zIIP contention

If you observe a higher-than-desired level of zIIP spill-over in one of your z/OS LPARs, what can you do about it? One obvious response would be to add zIIP capacity to the system (keep in mind that EC12 and BC12 servers can have up to two zIIPs for every general-purpose engine). If that is not feasible in the near term, see if you can take some steps to reduce zIIP utilization. An example of such a step would be a significant enlargement of one or more DB2 buffer pools (assuming that you have sufficient memory to back larger pools). This could reduce zIIP utilization by reducing synchronous reads associated with a DDF workload, and by reducing prefetch read activity (something that became zIIP-relevant starting with DB2 10). Using high-performance DBATs could reduce the overall CPU cost (and thus the zIIP consumption) of DDF-connected applications.

Summing it up

The moral of this story is pretty simple: do NOT think that you can run zIIPs as "hot" as you can general-purpose mainframe engines without encountering application performance issues. In doing capacity planning, anticipate growth in zIIP-eligible work on your z/OS systems (not only because of overall workload growth, but also because new types of work become zIIP-eligible through new releases -- and sometimes new maintenance levels -- of subsystem and operating system software), and aim to keep zIIP capacity ahead of zIIP demand so as to keep the rate of zIIP spill-over low. Calculate the zIIP spill-over rate for your production DB2 for z/OS systems, and track that number -- make it one of your key metrics for monitoring the performance of mainframe DB2 data servers.

Driving up zIIP utilization can be an effective way of enhancing the price performance of a z/OS system -- just don't take it so far as to be counterproductive.

No comments:

Post a Comment