Monday, February 26, 2018

Db2 for z/OS: DDF, zIIP Engines, and SMT2

SQL statements executed by way of the Db2 for z/OS distributed data facility (DDF) are not, of course, the only workload that uses zIIP MIPS, but in my experience DDF work does tend to be the main driver of zIIP utilization in a z/OS system on which Db2 runs. I posted an entry to this blog, a few years ago, on the importance of avoiding zIIP engine contention. In that blog entry, I described how to calculate the "zIIP spill-over ratio" for a Db2 DDF workload (the percentage of zIIP-eligible work that ends up being processed by general-purpose engines), and pointed out that this ratio should be less than 5% for optimal performance and CPU cost-efficiency (less than 1% is outstanding). That rule of thumb raises the question, "How 'hot' can I run my zIIP engine(s) while avoiding a higher-than-desired zIIP spill-over ratio?" The answer to that question, as pointed out in the blog entry, depends on the number of zIIP engines available to the z/OS LPAR in which the Db2 subsystem of interest is running: the more zIIP engines an LPAR has, the higher the rate at which they can be utilized without overly high zIIP spill-over becoming an issue. In the entry I'm writing now, I want to supplement that earlier blog entry with information pertaining to an IBM Z server processor technology called simultaneous multithreading, or SMT (or SMT2, as explained below).
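To make the spill-over ratio concrete, here is a minimal Python sketch based only on the definition given above (the percentage of zIIP-eligible work that ends up on general-purpose engines). The function and parameter names are my own, for illustration; the specific Db2 monitor fields that supply the two CPU times are covered in the earlier blog entry and are not reproduced here.

```python
def ziip_spillover_ratio(eligible_cpu_on_cp: float, cpu_on_ziip: float) -> float:
    """Return the percentage of zIIP-eligible CPU time that ran on
    general-purpose engines.

    Both arguments are CPU times in the same unit (e.g., seconds) for the
    DDF workload over the same interval. The names are hypothetical.
    """
    total_eligible = eligible_cpu_on_cp + cpu_on_ziip
    if total_eligible <= 0:
        raise ValueError("no zIIP-eligible CPU time recorded")
    return 100.0 * eligible_cpu_on_cp / total_eligible


def rate_spillover(ratio_pct: float) -> str:
    """Apply the rule of thumb from the text: under 1% is outstanding,
    under 5% is acceptable, anything else deserves a closer look."""
    if ratio_pct < 1.0:
        return "outstanding"
    if ratio_pct < 5.0:
        return "good"
    return "too high"


# Made-up numbers: 2 CPU seconds spilled over, 98 ran on zIIP engines.
example_ratio = ziip_spillover_ratio(2.0, 98.0)
print(f"{example_ratio:.1f}% -> {rate_spillover(example_ratio)}")
```

The input values in the example are invented purely to exercise the arithmetic.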

SMT2, introduced with the IBM z13 line of mainframe servers and available as well with the newer z14, refers to a technology whereby two processes (thus the 2 in SMT2) can be active at the same time on one "engine" (i.e., one core). You can find more information about SMT2 in an entry on the z13 that I posted to this blog a few years ago.

One thing that SMT2 affects is the look of an IBM RMF CPU Activity Report (that report, by the way, is very useful when it comes to understanding the z/OS environment in which a Db2 subsystem operates). One section of an RMF CPU Activity Report shows the average utilization (usually over a 15-minute interval of time) of the engines available to a z/OS LPAR. For a system running on an IBM Z server with SMT2 enabled, that section of the report might look something like this (with color highlighting added by me):


---CPU---    ---------------- TIME % ----------------
NUM  TYPE    ONLINE    LPAR BUSY    MVS BUSY   PARKED
 0    CP     100.00     4.28         4.15        0.00
 1    CP     100.00     1.41         1.42        0.00
TOTAL/AVERAGE           2.84         2.79           
 2    IIP    100.00    76.08        75.23        0.00
                                    71.13        0.00
 6    IIP    100.00    67.62        66.22        0.00
                                    64.94        0.00

TOTAL/AVERAGE          71.85        69.38

Here are some things to notice about the above report snippet: first, see those numbers in the "MVS BUSY" column (the ones I highlighted in red in the original report)? See how there are two such numbers for each physical zIIP engine (labeled as type "IIP"), with the second number on a line of its own beneath the first? Know what that means? It means that SMT2 is active for those zIIP engines. Could you see two MVS BUSY numbers for one general-purpose engine (labeled as type "CP")? No. Why? Because SMT2 can be activated for zIIP engines and also for IFL engines (engines dedicated to Linux systems running on an IBM Z server), but not for general-purpose engines (more on "can be activated" in a moment).
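If you wanted to spot that two-numbers-per-zIIP pattern programmatically, something like the following Python sketch could do it. It is a rough illustration, not a general RMF parser: it assumes the whitespace-separated layout of the snippet above (six fields on an engine line, two fields on the extra line for a core's second thread), and the helper name `parse_engines` is my own.

```python
REPORT = """\
 0    CP     100.00     4.28         4.15        0.00
 1    CP     100.00     1.41         1.42        0.00
 2    IIP    100.00    76.08        75.23        0.00
                                    71.13        0.00
 6    IIP    100.00    67.62        66.22        0.00
                                    64.94        0.00
"""


def parse_engines(text):
    """Collect per-engine figures; a two-field line is treated as the
    second thread of the engine on the preceding line (assumed layout)."""
    engines = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[1] in ("CP", "IIP"):
            engines.append({
                "num": int(fields[0]),
                "type": fields[1],
                "lpar_busy": float(fields[3]),
                "mvs_busy": [float(fields[4])],
            })
        elif len(fields) == 2 and engines:
            engines[-1]["mvs_busy"].append(float(fields[0]))
    return engines


engines = parse_engines(REPORT)
ziips = [e for e in engines if e["type"] == "IIP"]

# Two MVS BUSY values per zIIP core means SMT2 is active for the zIIPs.
smt2_active = all(len(e["mvs_busy"]) == 2 for e in ziips)

# Average the thread-level MVS BUSY values and the core-level LPAR BUSY values.
thread_busy = [b for e in ziips for b in e["mvs_busy"]]
avg_mvs_busy = sum(thread_busy) / len(thread_busy)
avg_lpar_busy = sum(e["lpar_busy"] for e in ziips) / len(ziips)

print(smt2_active, round(avg_mvs_busy, 2), round(avg_lpar_busy, 2))
```

For this particular snippet, the average of the four thread-level MVS BUSY values and the average of the two core-level LPAR BUSY values line up with the zIIP TOTAL/AVERAGE line of the report (69.38 and 71.85, respectively).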

A second thing that can look different in an RMF CPU Activity Report, when SMT2 is active for a system's zIIP engines, is the section, just below the above-referenced CPU utilization section, that shows information about "in-and-ready" tasks and how often these tasks had to wait for dispatch because all of the LPAR's engines were busy. That section of the report might look like this (again, color highlighting was added by me -- and this snippet and the CPU utilization snippet shown above are from the same RMF report):


-----------------------DISTRIBUTION OF IN-READY WORK UNIT QUEUE
 NUMBER OF              0    10   20   30   40   50   60   70
 WORK UNITS     (%)     |....|....|....|....|....|....|....|...
                                            
<=  N          40.3     >>>>>>>>>>>>>>>>>>>>>
 =  N +   1     5.5     >>>
 =  N +   2    52.3     >>>>>>>>>>>>>>>>>>>>>>>>>>>
 =  N +   3     0.9     >
<=  N +   5     0.4     >
<=  N +  10     0.0
<=  N +  15     0.0
<=  N +  20     0.0
<=  N +  30     0.1     >
<=  N +  40     0.0
<=  N +  60     0.0
<=  N +  80     0.0
<=  N + 100     0.0
<=  N + 120     0.0
<=  N + 150     0.0
>   N + 150     0.0

N = NUMBER OF PROCESSORS ONLINE (6.0 ON AVG)

See that line at the bottom of the snippet (highlighted in red in the original report) that tells you the value of "N"? Know what "N" is? It's the number of engines available to the LPAR for which the report was run (if you see a non-integer value for N, such as 8.5, it means that the LPAR has part of an engine, or parts of several engines, available to it -- that's indicated by non-zero values in the PARKED column of the CPU utilization section of the report referenced previously). Note that in this case, N = 6, even though the LPAR has, as we saw earlier, four physical engines -- two general-purpose engines and two zIIP engines. N = 6 because two tasks can be dispatched to each of the two zIIP engines at one time (so, a total of 6 tasks can be executing at one time: two on the two general-purpose engines and four on the two zIIP engines, each of which can handle two tasks simultaneously).
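The arithmetic behind N = 6 can be written out as a small Python sketch. The function name and parameters are mine, chosen for illustration, and the sketch assumes whole (unparked) engines, so it will not reproduce a non-integer N.

```python
def processors_online(cp_engines: int, ziip_engines: int,
                      ziip_threads_per_engine: int = 2) -> int:
    """Number of tasks that can execute at one time: one per
    general-purpose engine, plus one per zIIP thread
    (2 threads per zIIP with SMT2, 1 in uni-thread mode)."""
    return cp_engines + ziip_engines * ziip_threads_per_engine


# The LPAR in the report: 2 general-purpose engines, 2 zIIP engines.
n_smt2 = processors_online(2, 2)             # SMT2 active for the zIIPs
n_uni = processors_online(2, 2, 1)           # same engines, uni-thread mode
print(n_smt2, n_uni)
```

With SMT2 active the result matches the N shown in the report; with the same engines in uni-thread mode it would simply be the physical engine count.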

Know what else can look different with SMT2 activated for your zIIPs, versus SMT2 not being activated for those engines? The zIIP spill-over ratio. Here's what I mean by that: suppose you had a Db2 for z/OS subsystem, with a significant DDF workload, running on a z13 or z14 mainframe for which SMT2 had not been activated for the zIIP engines. Suppose that on that z/OS system, with X number of zIIP engines, the zIIP spill-over ratio was Y. If you subsequently activated SMT2 for the zIIP engines, what would be the effect on the zIIP spill-over ratio for the Db2 DDF workload? I think it's pretty safe to say that the zIIP spill-over ratio would be less than Y (i.e., less than what it had been with SMT2 not activated for the zIIP engines).

How substantial is the effect of activating SMT2 for zIIP engines? Is it like turning two zIIP engines into four zIIP engines? No. Why? Because the speed at which a given task is executed is somewhat reduced when that task is dispatched to an engine for which SMT2 has been activated, versus the same engine with SMT2 not activated. Generally speaking, when SMT2 has been activated for a zIIP engine, that engine is capable of processing about 40% more work than before; so, enabling SMT2 for two zIIP engines is roughly akin to turning them into 2 X 1.4 = 2.8 engines, capacity-wise.
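Here is that capacity estimate as a tiny Python sketch. The 1.4 factor is just the rough "about 40% more work" figure cited above, not a measured or guaranteed value, and the function name is my own.

```python
SMT2_CAPACITY_FACTOR = 1.4  # rough rule of thumb from the text; actual gain varies


def effective_ziip_capacity(ziip_engines: int, smt2: bool = True,
                            factor: float = SMT2_CAPACITY_FACTOR) -> float:
    """Approximate zIIP capacity in 'uni-thread engine equivalents'."""
    return ziip_engines * (factor if smt2 else 1.0)


with_smt2 = effective_ziip_capacity(2)                # roughly 2.8
without_smt2 = effective_ziip_capacity(2, smt2=False)  # 2.0
print(f"{without_smt2:.1f} -> {with_smt2:.1f} engine equivalents")
```

In other words, two zIIPs with SMT2 active behave, capacity-wise, more like a bit under three uni-thread zIIPs than like four.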

Now, I keep referring to SMT2 "being activated" for a zIIP engine. That tells you that a zIIP (on a z13 or z14 server) can run in either "uni-thread" or SMT2 mode. Is there any reason NOT to activate SMT2 for a system's zIIP engines? About the only scenario possibly favoring "uni-thread" mode for zIIPs that comes to my mind is one involving a zIIP-eligible process that is single-threaded and batch-like in nature, and which requires the fastest possible single-thread zIIP performance. For a workload that is primarily transactional in nature (true for most DDF workloads I've seen), I would lean strongly towards activating SMT2 for the zIIP engines of the associated z/OS LPAR. The expected result should be greater throughput and less in the way of zIIP-eligible work spilling over to general-purpose engines.