Sunday, September 27, 2015

I Don't Worry About DB2 for z/OS Buffer Pool Hit Ratios

And neither should you.

It's true that for years, mine was among a chorus of DB2 specialists' voices that placed a lot of emphasis on monitoring DB2 buffer pool hit ratios (there are variations with respect to calculating this ratio, but the basic formula is (GETPAGEs - synchronous reads) / GETPAGEs). Some years ago, it dawned on me (and on others -- and for some folks earlier than for me) that I'd been focusing on the wrong buffer pool performance metric. Nowadays, I don't look at buffer pool hit ratios at all.

What I look at instead is the total read I/O rate for each and every buffer pool. Before getting more into the details of that performance indicator, I'll give you some reasons for my not giving consideration to buffer pool hit ratios:
  • They can give you a false sense of being "done" with respect to leveraging buffer pool resources to boost system performance. People can see a value of greater than 99% for a buffer pool hit ratio and conclude, "The performance impact of this pool is as good as it's going to get. I'll turn my attention to other pools." That conclusion can be very much incorrect, as I'll explain momentarily.
  • They can cause you to disregard asynchronous buffer pool read activity. Because a commonly used formula for calculating a buffer pool's hit ratio ignores asynchronous read activity, it can lead to, at best, a benign neglect of prefetch read activity, or worse, actions that overly diminish asynchronous read space in buffer pools for the purpose of "goosing" a synchronous read-based hit ratio. That can work against your overall DB2 performance goals. As I pointed out in an entry posted to this blog a couple of years ago, prefetch reads matter.
  • Depending on how they're calculated, they can have weird values that confuse people. To avoid leaving prefetch reads out of the picture, some performance monitoring products will (or at least have, in the past) use buffer pool hit ratio formulas that take asynchronous read activity into account. On occasion, given a particular level and type of prefetch activity, such formulas will yield odd-looking results, like negative buffer pool hit ratios. Values of this nature can be hard for users to interpret.

[Note that I do care about a particular kind of hit ratio that is relevant to group buffer pools in a DB2 data sharing environment -- something I call the XI GBP read hit ratio. I described this ratio in a blog entry I posted about two months ago.]

I've mentioned that the buffer pool performance metric that matters most to me is the total read I/O rate. For a given buffer pool, that value is calculated as follows:

Total read I/O rate = (synchronous reads/second) + (asynchronous reads/second)

Getting the numbers to plug into this formula is a pretty easy thing. If you can generate a statistics long report (sometimes called a statistics detail report) for a DB2 subsystem by way of your DB2 monitor, look in such a report for the section containing buffer pool activity information. For each active pool in the environment, you'll see the synchronous read rate, and you'll also see the three asynchronous read rates, for sequential prefetch, list prefetch, and dynamic prefetch. Your monitor might already have these rates in per-second form for you, and that makes it really easy: just sum those four numbers (synchronous reads per second, sequential prefetch reads per second, list prefetch reads per second, and dynamic prefetch reads per second). If your monitor reports activity in per-minute form, just divide the four relevant values by 60 to get per-second numbers, and do the previously mentioned addition. You could also get these synchronous and asynchronous read numbers from an online display of buffer pool activity provided by your DB2 monitor, but for ongoing performance tuning, I prefer DB2 monitor-generated reports over online displays.

An alternative means of getting buffer pool activity information is to use the output of the DB2 command -DISPLAY BUFFERPOOL. If you go the command route, my suggestion is to do the following:
  1. Issue the command -DISPLAY BUFFERPOOL(ACTIVE) DETAIL on the DB2 subsystem of interest. You can disregard the output of this issuance of the command.
  2. One hour after issuing the command the first time, issue -DISPLAY BUFFERPOOL(ACTIVE) DETAIL again. The output of this second issuance of the command will capture, for each active pool in the environment, activity for the one-hour period since the first issuance of the command (the command captures activity since the buffer pools were last allocated, or since the command was last issued -- whichever occurred most recently). You can then divide the activity numbers by 3600 to get per-second figures.

In the output of the second issuance of the -DISPLAY BUFFERPOOL command you'll see a timestamp value in a DSNB409I message (for example, "INCREMENTAL STATISTICS SINCE 10:00:32 SEP 1, 2015"). Check that timestamp to verify that the value is approximately one hour since the command was issued the first time. If you see a timestamp that is, say, 25 minutes prior to the time at which the command was issued the second time, it means that someone else got in there 25 minutes ago and issued the command.

Instead of adding together four numbers, as you'd do when using information from a DB2 monitor statistics long report (or from an online monitor display of buffer pool activity), you sum five numbers found in -DISPLAY BUFFERPOOL(ACTIVE) DETAIL command output. Why five instead of four? Because in the command output, synchronous read activity is reported in two "buckets": random synchronous reads and sequential synchronous reads. Total up, then, the random synchronous reads, the sequential synchronous reads, the sequential prefetch reads, the list prefetch reads, and the dynamic prefetch reads, and divide that sum by 3600 to get the total read I/O rate per second (assuming that you issued the command once and then again an hour later, and you're using the output of the second issuance of the command). At some sites, folks have created jobs that issue the -DISPLAY BUFFERPOOL(ACTIVE) DETAIL command at a certain time of day, then issue it again an hour later, and via REXX code parse the output of the second issuance of the command, pulling out the read activity values for each pool and totaling them and dividing the total by 3600 to get the total read I/O rate per second.

What I like to see, for each buffer pool, is a total read I/O rate of less than 1000 per second. A rate of less than 100 per second for a pool is really good, unless that pool is used to "pin" objects in memory, in which case you'd like to see a read I/O rate of zero (and "pinning" pools should be defined with PGSTEAL(NONE), as described in a blog entry I wrote a few years ago). Driving read I/Os down benefits system performance in two important ways: 1) it reduces elapsed time for DB2-accessing programs, and 2) it improves the CPU efficiency of a DB2 workload (every I/O operation consumes some CPU time, so reducing that activity lowers the CPU cost of DB2 data access). The best way to lower DB2 read I/O rates is to increase the size of buffer pools that have a high level of read I/O operations. Thus, monitoring buffer pool read I/O rates points you to the pools for which the potential for performance gains through read I/O reduction is the greatest.

Less than a week ago, I was reviewing buffer pool activity statistics for a DB2 subsystem that is a key component of an organization's IT infrastructure. One of the busier pools showed a hit ratio of 99.3%. That looks pretty good, doesn't it? Seeing that, might you turn your attention to another of the subsystem's buffer pools? You could, but that would be a mistake -- this same buffer pool had a total read I/O rate of over 2800 per second -- higher than that of any other pool belonging to the subsystem. THIS was seen to be the pool most in need of attention, per the marker provided by the read I/O rate. The pool, with approximately 600,000 4K buffers (about 2.4 GB of space), could stand to be at least doubled in size so as to substantially lower the high rate of read I/O activity. Could the organization make that move? Yes, because the LPAR housing the DB2 subsystem has over 100 GB of real storage (an increasingly common configuration these days), and much of that memory is unused. With another 2.4 GB of memory utilized for buffering DB2 pages (in the form of a twice-as-large buffer pool), the z/OS LPAR's demand paging rate (available via a z/OS monitor), currently zero, would likely remain at that level. If doubling the size of the high-I/O pool took the associated read I/O rate from 2800 per second to, say, 1200 per second, I'd make the pool significantly larger again to try to get the read I/O rate below 1000 per second -- while keeping an eye on the LPAR's demand paging rate (a demand paging rate in the low single digits per second or less during busy processing periods is an indicator that a system's real storage resource is not under too much pressure).

Here's the really important point that I want to make through this blog entry: at more and more sites, you find production DB2 for z/OS subsystems running in LPARs with lots of real storage (more than 100 GB -- sometimes several hundred gigabytes). That's a good thing, if you put that big memory resource to productive use. A great way to do that is to grow DB2 buffer pools in a performance-positive way, and the total read I/O rate for the various buffer pools -- NOT the hit ratio -- is your guide for doing this in an effective manner. So, look at the right buffer pool performance numbers, and act on them in the right way.

No comments:

Post a Comment