Friday, November 9, 2012

DB2 for z/OS Data Sharing: Why CF LPARs need DEDICATED Engines

I was recently involved in an interesting DB2 for z/OS performance analysis effort. An organization had moved a DB2 data sharing group from one Parallel Sysplex mainframe cluster to another (if you aren't familiar with DB2 data sharing, you can check out an introductory blog entry that I posted a few years ago while working as an independent consultant). Right after that move, the CPU consumption associated with the DB2 workload on the system jumped by about 30%. DBAs diligently searched for the cause of this DB2 CPU cost spike. Was it related to an increase in the volume of DB2-accessing transactions executing in the data sharing group? No -- the size of the workload was about the same after the relocation of the DB2 subsystems as it had been before. Had SQL access paths changed, causing an increase in GETPAGE activity (something else I blogged about in my independent consultant days)? No -- GETPAGEs per transaction had stayed steady. A friend of mine supporting this system asked me: what could it be? How is it that the same workload is chewing so many more CPU cycles?

I had a hunch, and asked to see coupling facility activity reports (generated by z/OS monitor software) from before and after the relocation of the DB2 data sharing group. Sure enough, service times for synchronous requests to the DB2 lock structure and group buffer pools increased dramatically from the "before" to the "after" period -- I'm talking 300% - 500% increases. Here's why that matters: DB2 data sharing technology delivers major benefits in the areas of availability and scalability, but those benefits are not free. One cost factor is the increase in CPU consumption related to SQL statement execution in a data sharing system versus a standalone DB2 environment. That cost, which can be thought of as the CPU overhead of data sharing, is generally rather low -- typically about 10% in a well-tuned system. A very important characteristic of a well-tuned DB2 data sharing system is extremely rapid servicing of synchronous requests to the DB2 lock structure and the group buffer pools (GBPs) that are housed in coupling facility (CF) LPARs (another CF structure that supports DB2 data sharing, called the shared communications area, is accessed much less frequently than the lock structure and the group buffer pools, and so is usually not discussed in a performance context). The vast majority of requests to the lock structure and the GBPs are synchronous, and for performance purposes it is imperative that these requests be serviced with great speed, because when it's synchronous CF requests we're talking about, elapsed time directly impacts CPU time.

"How is that?" you might ask. "Why would the elapsed time of a CF request affect the CPU cost of that request?" I'll tell you. A synchronous CF request is not like a disk I/O request (or an asynchronous CF request), in that a mainframe engine driving an I/O request can go and do other things while waiting for the I/O request to complete (an interrupt will let the system know when an I/O operation has been completed, and any task that was suspended awaiting the completion of the I/O will then be eligible for re-dispatching). In the case of a synchronous CF request (such as a request for a global X-lock on a table page that is to be updated), the mainframe engine driving the CF will do nothing until the request has completed. The waiting mainframe engine is conceptually similar to a runner in a relay race awaiting the hand-off of the baton from a teammate. This special type of wait situation is known as processor "dwell" time.  The cost of "dwell" time -- the waiting engine is, for all practical purposes, busy while it's dwelling -- is charged to the application process for which the synchronous CF request is executed. The result is higher CPU consumption for SQL statement execution. Fortunately, as previously mentioned, this added CPU cost is typically quite modest. Data sharing CPU overhead tends to be low in large part because synchronous requests to the DB2 lock structure and GBPs are usually serviced VERY quickly -- it's common to see service times for synchronous requests to GBPs that average in the low double digits of microseconds, and synchronous lock structure requests are sometimes serviced in the single digits of microseconds, on average. That's a good two orders of magnitude better than the service time for a DB2 disk read -- even a read from disk controller cache.  

So, you can imagine what would happen if synchronous CF request service times for a DB2 data sharing group were to jump by several hundred percent: processor dwell time would shoot up, and so would the CPU consumption of the DB2 workload. This is what happened in the situation to which I've referred in this blog entry. Having established that, I then looked to identify the cause of the elongated synchronous CF request service times. Again on a hunch, I looked in the CF activity reports that I'd received for the section with information about the configuration of the CF LPARs on the Parallel Sysplex. There it was. In the "before" report, for each of the two CF LPARs in the Sysplex, I saw the following:

LOGICAL PROCESSORS:   DEFINED   1    EFFECTIVE    1.0
                      SHARED    0    AVG WEIGHT   0.0

And this is what the same part of the "after" report looked like:

LOGICAL PROCESSORS:   DEFINED   1    EFFECTIVE    0.9
                      SHARED    1    AVG WEIGHT  90.0

Bingo. Now, why is that slight change such a big deal? The CF LPARs were both running at a low average rate of utilization (under 10%), so wouldn't it be OK to "split" an engine between production and test CF LPARs, with the production CF LPAR getting 90% of that engine's processing capacity? NO. z/OS LPARs (and z/VM LPARs) can share mainframe engines very effectively with minimal overhead. Not so coupling facility control code (CFCC), the operating system of a CF LPAR. To get good performance in a high-volume environment (and with production DB2 for z/OS data sharing groups you're likely to see thousands of synchronous CF requests per second), it is VERY important that the CF LPARs have dedicated engines. I communicated this message to the DB2 and z/OS teams at the organization that had been grappling with the problem of elevated DB2-related CPU utilization, and they implemented a change that gave the production CF LPARs one DEDICATED engine apiece. The result? CPU utilization associated with the DB2 application workload dropped dramatically -- to a level that was even lower than it had been prior to the relocation of the DB2 data sharing group from the one Parallel Sysplex to the other (lower rather than equal, because with the relocation of the data sharing group there was also a move from external to internal CF LPARs, so that a large percentage of the synchronous CF requests were between a z/OS LPAR and a CF LPAR on the same mainframe "box" -- these requests flowed over virtual CF links that deliver tremendous performance).
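
As a rough sanity check on the numbers in this story, here is another hedged back-of-envelope sketch in Python. The only inputs taken from the discussion are the roughly 10% data sharing overhead, the several-hundred-percent elongation of synchronous service times, and the approximately 30% CPU jump; the base CPU-per-transaction figure is an assumption, and treating the data sharing overhead as consisting mostly of dwell time is a simplification.

# Rough sanity check: can a 4X elongation of synchronous CF service times produce
# a roughly 30% jump in DB2-related CPU consumption? The base figure is an assumed
# value; the ~10% overhead, the 4X multiplier, and the ~30% jump are the only
# numbers taken from the blog entry. Treating the overhead as mostly dwell time
# is a simplification.

base_cpu_ms = 10.0                       # assumed in-DB2 CPU per transaction, standalone
overhead = 0.10                          # ~10% data sharing overhead, well-tuned
dwell_ms = base_cpu_ms * overhead        # ~1.0 ms of that overhead treated as dwell
tuned_cpu_ms = base_cpu_ms + dwell_ms    # ~11.0 ms with fast CF service times

multiplier = 4.0                         # sync service times up 300% (i.e., 4X)
degraded_cpu_ms = base_cpu_ms + dwell_ms * multiplier   # ~14.0 ms

increase = (degraded_cpu_ms - tuned_cpu_ms) / tuned_cpu_ms
print(f"Estimated CPU increase per transaction: {increase:.0%}")   # about 27%

With a 5X multiplier the estimate rises to about 36%, so a CPU increase on the order of the observed 30% is right in that range.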

The moral to this story: for your production CF LPARs, make sure that engines are DEDICATED, not shared.

With all this said, I'll offer one caveat, of which Gary King, one of IBM's top System z performance experts, recently reminded me. IF a test Parallel Sysplex has a REALLY LOW level of CF request activity, it MIGHT be possible to let production CF LPARs share engines with the test CF LPARs and still get good synchronous CF request service times in the production environment (though still not as good as you'd see with dedicated CF LPAR engines) IF you turn dynamic CF dispatching (DCFD) ON for the test CF LPARs and leave DCFD OFF (the default) for the production CF LPARs (DCFD is a CF configuration option). In that case, the test CF LPARs would generally use around 1% of the CF engines (even if you had the engine weightings split 90% for production and 10% for test). The test CF LPARs would probably get lousy service times for synchronous CF requests, but that could be OK in a low-volume test environment. There is a risk associated with this approach: if the test Parallel Sysplex got to be fairly busy, so that the test CF LPARs more fully utilized their 10% slice of the shared CF engines, the production CF LPARs would be negatively impacted, performance-wise. The safest route, then, for a production Parallel Sysplex is to go with dedicated engines for the CF LPARs (by the way, internal coupling facility engines, or ICFs, do not impact the cost of your z/OS software running on the same mainframe).

Lots of us were taught as kids that sharing is a good thing. As I've pointed out just now, there is at least one exception to that rule.

2 comments:

  1. Thank you for another excellent article.
    In the statement "That cost, which can be thought of as the CPU overhead of data sharing, is generally rather low -- typically about 10% in a well-tuned system"

    10% of what? Do you mean 10% of all the CPU I see from a package (SMF 101), or 10% of something less, such as the MSTR or DBM1 regions?

    1. Hello, Paul. Sorry about the delay in responding.

      What I was talking about there is the CPU time directly associated with SQL statement execution. That's called in-Db2 CPU time. It's also known as "class 2" CPU time, referring to Db2 accounting trace class 2. Let's say that in a standalone (i.e., non-data sharing) Db2 for z/OS environment, the in-Db2 CPU time for an application process is an average of 10 milliseconds per transaction (that's general-purpose plus zIIP CPU time). If 2-way (or more) data sharing is implemented for that Db2 system, my expectation would be that the average-per-transaction class 2 CPU time for the application process will be around 11 milliseconds (could be a little more or a little less, depending on various factors pertaining to workload, certain physical database design characteristics, etc.).

      Hope this clarifies my statement in the blog entry.

      Robert
