When it comes to analyzing the performance of a Db2 for z/OS system and associated applications, I think of a set of concentric circles. The outermost circle represents the operational environment in which Db2 is processing work - that would be the z/OS LPAR (logical partition) in which the Db2 subsystem runs. The next circle within that outer one represents the Db2 subsystem itself - its buffer pool configuration, EDM pool, RID pool, lock manager, recovery log, etc. The innermost of these concentric circles relates to the applications that access Db2-managed data. With the three circles in mind, I take an "outside-in" approach to Db2 system and application tuning. In other words, I begin with a look at the z/OS system within which Db2 is running, then I turn to the Db2 subsystem itself and lastly I analyze application-centric information. The reason for this approach? If the z/OS system in which Db2 is running is constrained in some way, there's a good chance that Db2 subsystem and application tuning actions will yield little positive impact. Similarly, if the Db2 subsystem is operating in a constrained fashion then application tuning actions may not do much good.
So, if assessing the operation of a z/OS system is important prior to turning to Db2 subsystem and/or application performance analysis, how do you determine whether the z/OS LPAR in question is running in a constrained or an unconstrained way? I do this based on examination of information in two RMF reports: the Summary report and the CPU Activity report. If you support a Db2 for z/OS system, you should be familiar with the content of these reports - in particular, some key metrics provided by the reports. In this blog entry I'll point out those key metrics and explain how I use them.
[Note: I am referring to reports generated by IBM's RMF z/OS monitor because RMF is the z/OS monitor with which I am most familiar. If your organization uses another vendor's z/OS monitor, that monitor might be able to generate reports similar to the RMF reports that are the subject of this blog entry - if need be, check with the vendor on that.]
The RMF CPU Activity report
RMF, by default (you can change this), carves the time period covered by a CPU Activity report into 15-minute intervals (so, if you requested a report for a one-hour time period you will see within that report four sub-reports, each providing information for a 15-minute part of the overall one-hour time period). Within a given 15-minute interval you will see, for the z/OS LPAR for which the report was requested, information that looks something like this (I highlighted two important values in red):
---CPU--- ---------------- TIME % ----------------
NUM TYPE ONLINE LPAR BUSY MVS BUSY PARKED
0 CP 100.00 87.03 86.85 0.00
1 CP 100.00 77.76 77.68 0.00
2 CP 100.00 83.88 83.78 0.00
3 CP 100.00 87.07 86.91 0.00
4 CP 100.00 76.23 76.14 0.00
5 CP 100.00 76.79 76.71 0.00
6 CP 100.00 80.45 80.35 0.00
7 CP 100.00 73.29 73.24 0.00
8 CP 100.00 63.83 69.22 0.00
9 CP 100.00 57.78 62.95 0.00
A CP 100.00 35.28 48.33 17.01
TOTAL/AVERAGE 72.67 75.16
12 IIP 100.00 66.63 58.68 0.00
13 IIP 100.00 26.70 23.42 0.00
14 IIP 100.00 9.21 8.07 0.00
3E IIP 100.00 0.00 ----- 100.00
TOTAL/AVERAGE 25.64 26.86
- NUM - This is the ID of a given "engine" (processor core) configured for the LPAR.
- TYPE - CP is short for central processor (typically referred to as a "general-purpose engine"); IIP is short for integrated information processor (typically referred to as a "zIIP engine").
- LPAR BUSY - Engine utilization from the LPAR perspective.
- MVS BUSY - I think of this as the busy-ness of the physical engine - if the engine is used exclusively (or almost exclusively) by the LPAR in question, the LPAR busy and MVS busy numbers should be very similar.
- PARKED - This is the extent to which an engine's capacity is NOT available to the LPAR during the 15-minute interval (so, if the engine is seen to be 75% parked then the LPAR has access to 25% of that engine's processing capacity). When engines in a mainframe "box" (sometimes called a CEC - short for central electronic complex) are shared between several LPAR's it's not unusual to see a non-zero parked value for at least some of an LPAR's engines).
- TOTAL/AVERAGE MVS BUSY for the general-purpose engines (75.16 in the report snippet) - As a general rule, application performance - especially for transactional applications (e.g., CICS-Db2, IMS TM-Db2, Db2 DDF) - will be optimal if average MVS busy for an LPAR's general-purpose engines does not exceed 80%. When average MVS busy for the general-purpose engines exceeds 90%, you can see a negative impact on the performance of Db2-accessing applications in the form of what is labeled "not accounted for" time in a Db2 monitor-generated accounting long report or an online monitor display of Db2 thread detail information. Not-accounted-for time is in-Db2 elapsed time that is not CPU time and not one of the "known" Db2 wait times (those are so-called class 3 wait times, such as wait for database synchronous read, wait for lock/latch, wait for update/commit processing, etc.). It's literally elapsed time, related to SQL statement execution, for which Db2 cannot account. In my experience, in-Db2 not-accounted-for time is most often a reflection of wait-for-dispatch time, which itself is indicative of CPU contention. I'm generally not too concerned about not-accounted-for time as long as it's less than 10% of in-Db2 elapsed time for an application workload - particularly when it's a higher-priority transactional workload (you might tolerate a higher percentage of not-accounted-for time for a lower-priority batch workload). If not-accounted-for time exceeds 10% of in-Db2 elapsed time (again, especially for a higher-priority transactional workload), that would be a matter of concern for me, indicating that CPU contention is negatively impacting application throughput.
- TOTAL/AVERAGE MVS BUSY for the zIIP engines (26.86 in the report snippet) - How "hot" can you run zIIP engines before zIIP engine contention becomes a concern? That depends on how many zIIP engines the LPAR has (and, to a lesser extent, whether or not the zIIPs are running in SMT2 mode). The more zIIP engines an LPAR has, the higher the average MVS busy figure for those engines can go before zIIP contention becomes an issue (in the example shown above, the LPAR has three zIIP engines that are running in SMT2 mode - in such a situation average MVS busy for the zIIP engines could probably go to 40-50% without zIIP contention becoming an issue). And when does zIIP contention become an issue? When the zIIP spill-over percentage gets too high, as explained in an entry I posted a few years ago to this blog. [Note: the report snippet shows four zIIP engines, but the fourth - the one identified as processor number 3E - is 100% parked from the LPAR's perspective. That means the LPAR had no access to zIIP processor 3E's capacity, so in effect the LPAR had three zIIP engines during the time interval.]