Thursday, May 24, 2018

Db2 for z/OS Buffer Pools: Index Root Pages and the Data Management Threshold

Sometimes, a change in the way Db2 for z/OS manages its resources is introduced, and the change "flies under the radar" (i.e, it escapes the notice of most Db2 users) because it is relatively small in terms of scope and impact and is not widely highlighted in the product documentation, and it delivers the intended benefit (e.g., enhances CPU efficiency for data access) and no one thinks about it -- BUT, it may have a consequence that pops up in a surprising way, causing, initially, some confusion on the part of a Db2 administrator before the "what" and "why" of the perplexing situation is understood. In this blog entry, I'll describe a scenario of this nature. It has to do with index root pages and something called the buffer pool data management threshold.

A Db2 for z/OS consultant working in Europe brought this matter to my attention. This consultant's client had, in a test environment, a Db2 for z/OS buffer pool that kept hitting the data management threshold, and the consultant could NOT figure out why this was happening.

Some background: the data management threshold relates to Db2 buffer pools -- it's hit when 95% of the buffers in a pool are non-stealable, with "non-stealable" meaning "in-use or updated-and-not-externalized (i.e., externalized to disk or to a group buffer pool in a Db2 data sharing system)." You don't want the data management threshold to be reached for a buffer pool. Why? Because when DMTH (the acronym that refers to the threshold) is hit for a buffer pool, Db2 will issue a GETPAGE for every row retrieved from a given page in the pool; so, if 20 rows are to be retrieved from a page in a buffer pool that has hit DMTH, Db2 will issue 20 GETPAGE requests instead of one. That, in turn, will drive up overhead for access to data cached in the pool, as GETPAGEs are a major factor with regard to the CPU cost of SQL statement execution.

So, why did the pool in question regularly hitting DMTH perplex the consultant? Because the pool, though not real big, was not tiny (10,000 4K buffers), and the deferred write thresholds were at their default values (30 for DWQT, aka the horizontal deferred write threshold, and 5 for VDWQT, aka the vertical deferred write threshold), and the pool wasn't all that active (about 200 GETPAGEs per second, and just a few read I/O operations per second). With the pool's deferred write thresholds being at their default values and the pool not being particularly busy and not being particularly small, no way should DMTH be getting hit due to a large percentage of the pool's buffers being in an updated-but-not-externalized state. THAT would have to mean that DMTH was being hit because a very large percentage of the pool's buffers were in an "in-use" state. But how could that be, given the relatively non-busy state of the pool (200 GETPAGEs per second is nothing -- I've seen plenty of pools with thousands, or tens of thousands, of GETPAGEs per second).

Digging a little further for answers, the consultant used a Db2 monitor to get a more detailed view of the usage of the buffer pool for objects assigned to it. What he saw left him even more puzzled: the pool had about 10,000 indexes assigned to it (keep in mind, this is a test environment we're talking about), and a little over 9000 of those indexes each had a single "currently in-use" page in the pool. That number of in-use pages explained why DMTH was often being hit for the buffer pool (as I mentioned, buffers holding in-use pages are non-stealable, just as are buffers holding updated-but-not-externalized pages), but how could so many pages be currently in-use for a fairly low-activity pool, and why would almost all of the currently in-use pages be the ONLY in-use page for a given index? The consultant scoured the Db2 documentation in search of an explanation, and that's where he found the key that unlocked the mystery -- not in the Administration Guide, not in Managing Performance, but in the What's New manual for Db2 10 for z/OS. There, in an unexpected location (under the heading, "Reduction in need for explicit REORG"), he saw these words (color highlighting added by me):
DB2 10 provides several performance enhancements that reduce the need to reorganize indexes frequently, resulting in a reduction in CPU time and synchronous I/O waits. Compared to DB2 9, in DB2 10 you might see increased activity in deferred writes because more buffers are in use by buffer pools to improve performance. For example, activity might increase for index root pages that are pinned in a buffer pool when the index page set or index partition is opened.
(You can check this out for yourself here.)

And there you have it. This is one example -- there are plenty more -- of the Db2 for z/OS development team effecting changes that aim to leverage larger memory resources (increasingly common for z/OS systems) for enhanced performance. Index root pages are accessed a lot, and memory is becoming more plentiful on IBM Z servers, so why not "pin" those pages in the associated buffer pools at index open time, to guarantee that they'll be there when we need to read them?

OK, for a buffer pool analyzed by the aforementioned Db2 consultant, the pinned index root pages caused a DMTH problem. Does that mean that this is a "Watch out!" kind of situation in a general sense? I would say, "No." The problem described herein was encountered in a test environment, with a buffer pool that is on the small size (as mentioned, 10,000 4K buffers) and which has assigned to it a very large number of indexes (as mentioned, about 10,000). In my experience, I have NEVER encountered a situation in which a buffer pool in a production Db2 system hit DMTH due to index root pages being fixed in the pool. In production environments, buffer pools are often quite large (sometimes hundreds of thousands -- or even several million -- buffers in a single pool), and they often have assigned to them something less than 10,000 objects (and keep in mind that fixed-in-the-pool root pages are only relevant to indexes -- not table spaces). No, the reason I've written this blog entry isn't to warn you about a problem situation that is quite likely to occur in a production Db2 system. I've written this piece because you, like the consultant to whom I have referred, could conceivably encounter this buffer pool DMTH issue (probably, if at all, in a test or development environment), and if you do then having the information I've provided could save you some problem identification time and associated worry and consternation.

If in fact you encounter the index-related buffer pool DMTH situation described in this blog entry (not likely), and if you want to make it go away, the best move would probably be to add buffers to the pool in question. If the system is memory-constrained so that you can't do that, consider taking one or more of these steps:
  • Reassign some of the objects assigned to the buffer pool to a different buffer pool.
  • Lower the DSMAX value for the subsystem, so that some index data sets will periodically be closed before the buffer pool fills up with "pinned" index root pages.
  • In a data sharing environment, have the indexes defined with CLOSE YES so that the data sets will be closed when they have gone a while without being accessed from a given Db2 member.