A DB2 for z/OS DBA recently asked me about the pros and cons of hash-organized tables -- a physical database design option introduced with DB2 10 for z/OS running in new-function mode (NFM). Assuming that others in the mainframe DB2 community are also pondering this issue, I've decided to provide what I hope will be some useful information in this blog entry.
For as long as I can remember, there have been dichotomies with respect to DB2 for z/OS table spaces: partitioned and non-partitioned, single-table and multi-table, segmented and non-segmented, compressed and non-compressed, etc. In my mind, we really didn't have such dichotomies in the realm of tables: DB2 tables were just DB2 tables (with allowances for some specific and distinguishing usage scenarios such as clone tables and materialized query tables). Now, with DB2 10 NFM, we have what I see as a bona fide dichotomy in DB2 table-world: there are hash-organized tables and there are... Are what? "Non-hash-organized tables?" That seems to me to be a pretty clunky term. I prefer to describe a table that is not hash-organized as "cluster-organized," because that speaks to the gist of the difference between the two table types.
Whereas rows in a cluster-organized table are arranged within the table according to a designated clustering key (with a requisite clustering index), rows in a hash-organized table are physically placed according to the value generated when a row's hash key is run through a hashing algorithm. By way of that algorithm, rows are mapped to the table's pages in such a way that the physical location of rows within a hash-organized table is essentially random with respect to the unhashed key values. So, if an ORDER table has a unique key called ORDERNO (and a unique key can, of course, consist of more than one column), and if that table is clustered by the ORDERNO column, then the rows for ORDERNO 1122 and ORDERNO 1123 are likely to be on the same page (assuming a well-organized table). If, on the other hand, the same table is defined as hash-organized with ORDERNO as the hash key (a hash key must be unique), it is very unlikely that the rows for ORDERNO 1122 and ORDERNO 1123 will be located on the same page.
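To make this concrete, here is a minimal DDL sketch of what a hash-organized version of that table could look like. It's illustrative only: the column definitions and the HASH SPACE value are my assumptions, the table is named ORDERS here because ORDER is a reserved word, and keep in mind that a hash-organized table must reside in a universal table space (one is created implicitly if you omit the IN clause):
CREATE TABLE ORDERS
  (ORDERNO    INTEGER     NOT NULL,
   CUSTNAME   VARCHAR(40) NOT NULL,
   ORDER_DATE DATE        NOT NULL)
ORGANIZE BY HASH UNIQUE (ORDERNO)  -- the hash key, which must be unique
HASH SPACE 64 M;                   -- size of the fixed hash area (illustrative value)
An existing table can likewise be converted with ALTER TABLE ... ADD ORGANIZE BY HASH, with a subsequent REORG putting the new organization into effect.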
Now, this idea of rows being purposely (and very) non-clustered might seem to fly in the face of traditional DB2 for z/OS physical database design wisdom, and indeed it does. Lots of us have known for years about the importance (in terms of query performance) of "locality of reference" -- the notion that rows in a set to be retrieved by a query should be located near each other in the target table. What, then, is the upside to hash-organization of rows in a table? Well, if the set of rows to be retrieved by a query in fact consists of only one row, locality of reference doesn't matter. What does matter? Getting to that one row as efficiently as possible. And how do we measure efficiency here? GETPAGEs, my friend. In mainframe DB2 parlance, a GETPAGE is a request by DB2 to examine a table or an index page. GETPAGEs are very important when it comes to the CPU cost of query execution (as I noted in a blog entry on the topic that I posted a few years ago while working as an independent DB2 consultant): enable generation of a query's result set with a smaller number of GETPAGEs, and it's a pretty safe bet that you've reduced the CPU cost of that query's execution. So, how low can you go when it comes to GETPAGE minimization? How about one GETPAGE? And how do you get to one GETPAGE for a query? With a hash-organized table, WHEN the query retrieves a single row qualified by an "equals" predicate that references the table's hash key. Using the ORDER/ORDERNO example from above, if you had a query like this one:
SELECT CUSTNAME, DATE, ...
FROM ORDER
WHERE ORDERNO = 1005;
You'd be likely to see one GETPAGE if the query targeted a hash-organized table (assuming that ORDERNO is the table's hash key), and maybe four GETPAGEs if the query targeted a cluster-organized table (3 index GETPAGEs if the table had a three-level index on ORDERNO, and 1 table space GETPAGE). How does DB2 get the row requested by the query with only one GETPAGE when the target table is hash-organized? Simple: it takes the hash key value referenced in the query's predicate, runs it through the hashing algorithm, and thereby obtains the row's location in the table. Due largely to the difference in GETPAGE activity, the query targeting the hash-organized table would consume less CPU time when executed. How much less will depend on several factors, but you can expect the CPU efficiency of a query with a 1-row result set with an "equals" predicate referencing a hash-organized table's hash key to be even better than retrieval of the same row from a cluster-organized table via index-only access (heretofore thought of as the cheapest way to get data from a table, and sometimes pursued in the extreme as an access path, with virtually every column in a table being included in an index key for this purpose). And, speaking of indexes, you don't even need to have an index on a hash-organized table's hash key, as I pointed out in an entry posted to this blog a couple of months ago.
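By the way, if you want to verify that the optimizer is in fact using hash access for such a query, EXPLAIN it and look at the ACCESSTYPE value in PLAN_TABLE. A quick sketch, reusing my hypothetical ORDERS table (the QUERYNO value is arbitrary; my understanding is that direct hash access is indicated by a hash-related ACCESSTYPE value such as 'H'):
EXPLAIN PLAN SET QUERYNO = 10 FOR
SELECT CUSTNAME, ORDER_DATE
FROM ORDERS
WHERE ORDERNO = 1005;
SELECT QUERYNO, ACCESSTYPE, ACCESSNAME
FROM PLAN_TABLE
WHERE QUERYNO = 10;  -- expect a hash-related ACCESSTYPE rather than 'I' (index access)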
Now, I mentioned above that you're likely to see one GETPAGE for the query targeting the hash-organized table. That suggests that in some cases you could see more than one GETPAGE. When would that happen? When the target row is in fact not located in the page mapped to via the hash of the hash key value. That would be the case if, at the time the row was inserted, the hash-mapped page turned out to be full. In such a situation the row is placed in the table's hash overflow area, and it is located when needed via a special hash overflow index that contains entries only for rows located in the hash overflow area. Note that even retrieval of overflow rows should be quite efficient, as the overflow index should have a very small number of levels (since it should be pretty sparse).
This overflow business gets at one of the downsides of hash-organized tables: they'll generally take up 20-50% more disk space versus equivalent cluster-organized tables. This generosity of space is necessary to allow the large majority of rows to be located in pages identified via the hashing algorithm, and to provide the overflow area that accommodates rows that can't be so located. The DB2 REORG utility offers you an option (AUTOESTSPACE(YES)) -- a good one to take, by the way -- whereby DB2 will calculate, based on information in the real-time statistics tables in the catalog, the size of a hash-organized table's hash area (this is the fixed-size area into which row locations are mapped via the hashing algorithm). When REORG does this, it goes for a hash space size that will result in 5-10% of the table's rows going into the overflow area. If you want even fewer rows in the overflow area, you can increase the size of the table's hash space via ALTER TABLE (and a REORG to put the change into effect), but this will of course consume more disk space. You can use the real-time stats tables yourself to keep an eye on overflow space usage, and schedule REORGs accordingly (TOTALROWS in SYSTABLESPACESTATS tells you how many rows are in the hash-organized table, and TOTALENTRIES in SYSINDEXSPACESTATS tells you how many entries there are in the hash-organized table's overflow index).
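Here's a minimal sketch of what that monitoring could look like in SQL, followed by the hash space ALTER. The database, table space, creator, and index names are placeholders for your own objects (identifying the overflow index by name assumes you've looked it up in the catalog):
-- Total rows in the hash-organized table, from real-time stats:
SELECT TOTALROWS
FROM SYSIBM.SYSTABLESPACESTATS
WHERE DBNAME = 'DB1'        -- placeholder database name
AND NAME = 'TS1';           -- placeholder table space name
-- Rows in the overflow area = entries in the hash overflow index:
SELECT TOTALENTRIES
FROM SYSIBM.SYSINDEXSPACESTATS
WHERE CREATOR = 'MYID'      -- placeholder index creator
AND NAME = 'IOVFL';         -- placeholder overflow index name
-- If overflow entries climb well past the 5-10% zone, enlarge the hash
-- space (128 M is an illustrative value) and REORG to materialize it:
ALTER TABLE ORDERS
ALTER ORGANIZATION SET HASH SPACE 128 M;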
The other downside to hash-organized tables is degraded performance for sequential data access. This is to be expected, since rows in the table are unclustered in the extreme if you think of them in terms of some clustering key. As efficient sequential access tends to be important for the performance of queries that have relatively large result sets, hash-organization of data is clearly not the best choice for all tables. In fact, my thinking at this early stage of the game is that cluster-organization of data will continue to be the right choice for most of your tables, with hash organization being appropriate for selected tables for which optimal performance for queries with single-row result sets qualified by "equals" predicates referencing a particular unique key is of prime importance (and this performance benefit extends to searched updates and deletes, as well). Hash organization of data in a table is not a DB2 physical database design panacea, but it sure is great to have in your tool box. Check it out and decide where (or if) this capability could be beneficially leveraged in your environment.
Friday, November 9, 2012
DB2 for z/OS Data Sharing: Why CF LPARs need DEDICATED Engines
I was recently involved in an interesting DB2 for z/OS performance analysis effort. An organization had moved a DB2 data sharing group from one Parallel Sysplex mainframe cluster to another (if you aren't familiar with DB2 data sharing, you can check out an introductory blog entry that I posted a few years ago while working as an independent consultant). Right after that move, the CPU consumption associated with the DB2 workload on the system jumped by about 30%. DBAs diligently searched for the cause of this DB2 CPU cost spike. Was it related to an increase in the volume of DB2-accessing transactions executing in the data sharing group? No -- the size of the workload was about the same after the relocation of the DB2 subsystems as it had been before. Had SQL access paths changed, causing an increase in GETPAGE activity (something else I blogged about in my independent consultant days)? No -- GETPAGEs per transaction had stayed steady. A friend of mine supporting this system asked me: what could it be? How is it that the same workload is chewing so many more CPU cycles?
I had a hunch, and asked to see coupling facility activity reports (generated by z/OS monitor software) from before and after the relocation of the DB2 data sharing group. Sure enough, service times for synchronous requests to the DB2 lock structure and group buffer pools increased dramatically from the "before" to the "after" period -- I'm talking 300% - 500% increases. Here's why that matters: DB2 data sharing technology delivers major benefits in the areas of availability and scalability, but those benefits are not free. One cost factor is the increase in CPU consumption related to SQL statement execution in a data sharing system versus a standalone DB2 environment. That cost, which can be thought of as the CPU overhead of data sharing, is generally rather low -- typically about 10% in a well-tuned system. A very important characteristic of a well-tuned DB2 data sharing system is extremely rapid servicing of synchronous requests to the DB2 lock structure and the group buffer pools (GBPs) that are housed in coupling facility (CF) LPARs (another CF structure that supports DB2 data sharing, called the shared communications area, is accessed much less frequently than the lock structure and the group buffer pools, and so is usually not discussed in a performance context). The vast majority of requests to the lock structure and the GBPs are synchronous, and for performance purposes it is imperative that these requests be serviced with great speed, because when it's synchronous CF requests we're talking about, elapsed time directly impacts CPU time.
"How is that?" you might ask. "Why would the elapsed time of a CF request affect the CPU cost of that request?" I'll tell you. A synchronous CF request is not like a disk I/O request (or an asynchronous CF request), in that a mainframe engine driving an I/O request can go and do other things while waiting for the I/O request to complete (an interrupt will let the system know when an I/O operation has been completed, and any task that was suspended awaiting the completion of the I/O will then be eligible for re-dispatching). In the case of a synchronous CF request (such as a request for a global X-lock on a table page that is to be updated), the mainframe engine driving the CF will do nothing until the request has completed. The waiting mainframe engine is conceptually similar to a runner in a relay race awaiting the hand-off of the baton from a teammate. This special type of wait situation is known as processor "dwell" time. The cost of "dwell" time -- the waiting engine is, for all practical purposes, busy while it's dwelling -- is charged to the application process for which the synchronous CF request is executed. The result is higher CPU consumption for SQL statement execution. Fortunately, as previously mentioned, this added CPU cost is typically quite modest. Data sharing CPU overhead tends to be low in large part because synchronous requests to the DB2 lock structure and GBPs are usually serviced VERY quickly -- it's common to see service times for synchronous requests to GBPs that average in the low double digits of microseconds, and synchronous lock structure requests are sometimes serviced in the single digits of microseconds, on average. That's a good two orders of magnitude better than the service time for a DB2 disk read -- even a read from disk controller cache.
So, you can imagine what would happen if synchronous CF request service times for a DB2 data sharing group were to jump by several hundred percent: processor dwell time would shoot up, and so would the CPU consumption of the DB2 workload. This is what happened in the situation to which I've referred in this blog entry. Having established that, I then looked to identify the cause of the elongated synchronous CF request service times. Again on a hunch, I looked in the CF activity reports that I'd received for the section with information about the configuration of the CF LPARs on the Parallel Sysplex. There it was. In the "before" report, for each of the two CF LPARs in the Sysplex, I saw the following:
LOGICAL PROCESSORS: DEFINED 1 EFFECTIVE 1.0
SHARED 0 AVG WEIGHT 0.0
And this is what the same part of the "after" report looked like:
LOGICAL PROCESSORS: DEFINED 1 EFFECTIVE 0.9
SHARED 1 AVG WEIGHT 90.0
Bingo. Now, why is that slight change such a big deal? The CF LPARs were both running at a low average rate of utilization (under 10%), so wouldn't it be OK to "split" an engine between production and test CF LPARs, with the production CF LPAR getting 90% of that engine's processing capacity? NO. z/OS LPARs (and z/VM LPARs) can share mainframe engines very effectively with minimal overhead. Not so coupling facility control code (CFCC), the operating system of a CF LPAR. To get good performance in a high-volume environment (and with production DB2 for z/OS data sharing groups you're likely to see thousands of synchronous CF requests per second), it is VERY important that the CF LPARs have dedicated engines. I communicated this message to the DB2 and z/OS teams at the organization that had been grappling with the problem of elevated DB2-related CPU utilization, and they implemented a change that effectively gave the production CF LPARs one DEDICATED engine apiece. The result? CPU utilization associated with the DB2 application workload dropped dramatically -- to a level that was even lower than it had been prior to the relocation of the DB2 data sharing group from the one Parallel Sysplex to the other (lower rather than equal, because the relocation of the data sharing group also involved a move from external to internal CF LPARs, meaning that a large percentage of the synchronous CF requests were between a z/OS LPAR and a CF LPAR on the same mainframe "box" -- these requests flowed over virtual CF links that deliver tremendous performance).
The moral to this story: for your production CF LPARs, make sure that engines are DEDICATED, not shared.
With all this said, I'll offer one caveat, of which Gary King, one of IBM's top System z performance experts, recently reminded me. IF a test Parallel Sysplex has a REALLY LOW level of CF request activity, it MIGHT be possible to let production CF LPARs share engines with the test CF LPARs and still get good synchronous CF request service times in the production environment (though still not as good as you'd see with dedicated CF LPAR engines) IF you turn dynamic CF dispatching (DCFD) ON for the test CF LPARs and leave DCFD OFF (the default) for the production CF LPARs (DCFD is a CF configuration option). In that case, the test CF LPARs would generally use around 1% of the CF engines (even if you had the engine weightings split 90% for production and 10% for test). The test CF LPARs would probably get lousy service times for synchronous CF requests, but that could be OK in a low-volume test environment. There is a risk associated with this approach: if the test Parallel Sysplex got to be fairly busy, so that the test CF LPARs more fully utilized their 10% slice of the shared CF engines, the production CF LPARs would be negatively impacted, performance-wise. The safest route, then, for a production Parallel Sysplex is to go with dedicated engines for the CF LPARs (by the way, internal coupling facility engines, or ICFs, do not impact the cost of your z/OS software running on the same mainframe).
Lots of us were taught as kids that sharing is a good thing. As I've pointed out just now, there is at least one exception to that rule.