Monday, September 30, 2013

DB2 for z/OS Hash-Organized Data: an Interesting Performance Story

A lot of DB2 for z/OS people know about the hash mode of organizing data in a table -- something introduced with DB2 10 (I've blogged about this topic multiple times, most recently in an entry I posted a few months ago). The basics of hash-organization of data (this as opposed to traditional cluster-based data organization) are pretty straightforward: you select for a table a hash key (which could be comprised of a single column or a concatenation of multiple columns, as long as each key value is unique), and you tell DB2 to hash-organize the data in the table using the designated key (this can be done with a CREATE TABLE or an ALTER TABLE statement -- in the latter case a subsequent online REORG of the associated table space changes the mode of data organization from cluster-based to hash-based). Thereafter, when a row is inserted into the table, DB2 will determine the target page for the row by running the value of the hash key for the row through a hashing algorithm. The really good part, performance-wise, comes when a row is retrieved from the table. If a SELECT statement includes an "equals" predicate that references the table's unique hash key, DB2 can run that hash key through the aforementioned hashing algorithm to identify the page into which the row was assigned on insert, and voila -- the row is returned to the requesting application process with a single GETPAGE (possibly some rows in a hash-organized table will be placed in the overflow area of the table space at insert if the hash-identified target page is full, but the percentage of such rows in the table should be small if the table space was properly sized). In contrast, if DB2 were using a unique index to access a row then multiple GETPAGE operations would be necessary (e.g., three GETPAGEs for a three-level index, followed by one more GETPAGE for the table space page identified via the row ID found in the index leaf page containing the key value referenced in an "equals" predicate in a query). Because GETPAGEs are a key determinant of the CPU cost of executing a SQL statement, the cost of accessing a row in a hash-organized table will, in some cases, be less than the cost of accessing the same row in a cluster-organized table.

When will a hash-organized table be tops in terms of efficiency of data access? Your initial response to that question might be, "A hash-organized table wins the CPU efficiency game when a query retrieves a single row qualified by an 'equals' predicate that references the table's hash key." That answer will generally be right, but it won't always be right, as illustrated by an interesting situation recently brought to my attention by a DB2 for z/OS DBA. This DBA informed me of a test he ran in a DB2 10 new-function mode environment. The test involved a program that issued a query that targeted a hash-organized table and contained an "equals" predicate referencing the table's hash key. The DBA ran the same program using the same data, with the only difference being cluster-based organization of the data in test B instead of the hash organization used in test A. Guess what? The program's elapsed and CPU times were lower when the target table was cluster-organized. Huh? How could that be?

The key to what at first appears to be a strange result is this: the program executed by the DBA issued the singleton SELECT with the hash key-referencing "equals" predicate in a loop, with the value plugged into the "equals" predicate picked up, for each execution of the query, from a file -- a very common batch scenario in the mainframe DB2 world. On top of that, the file of key values was sorted in a way that pretty much lined up with the clustering sequence of the target table (referring to the case in which the program was executed with a cluster-organized table). This is also very common in DB2 for z/OS systems. With the input file of key values sorted in this way, execution of the program with a cluster-organized table resulted in a sequential page access pattern: each successive table space page accessed was "ahead of" (with respect to physical order) and "nearby" (generally speaking, within 16 pages) of the previously accessed page. Through the mechanism known as sequential detection, DB2 recognized this sequential data access pattern and activated dynamic sequential prefetch for the program, significantly reducing elapsed time, and saving CPU time, as well, relative to the random synchronous read activity attending the execution of the program with the hash-organized table. Additionally, repeated execution of the singleton SELECT with the cluster-organized table allowed DB2 to utilize index look-aside to dramatically reduce index-related GETPAGE activity, thereby largely negating the GETPAGE minimization advantage that would otherwise be seen with access to a hash-organized table (note that, for a batch program that issues many COMMITs, the effectiveness of sequential detection and index look-aside is maximized when the associated package(s) is bound with RELEASE(DEALLOCATE)). Thus, in this case, two old DB2 performance features (I believe that both sequential detection and index look-aside were introduced with DB2 V2.3, in the early 1990s) trumped one very new one (hash-organized data).

Now, this isn't the end of the story. The DBA ran the test again, this time with the input file of key values sorted in a random fashion that didn't line up at all with the sequencing of rows in the cluster-organized table. The results of this second test were the reverse of what was seen the first time around: the performance of the program was better when it executed with the hash-organized table. No surprise there. With sequential detection and index look-aside out of the picture, the one-GETPAGE-per-row aspect of hash-organized data access beat out the many top-to-bottom index probes and random reads of table and index pages that came with access to the cluster-organized table.

So, in weighing whether or not a particular table should be hash- or cluster-organized, do not check to see only that the table is primarily accessed by singleton SELECTs that qualify rows with an "equals" predicate referencing what would be the table's hash key. Consider, as well, where those singleton SELECTs come from. Are they mostly issued by online transactions that retrieve just one row (or a very small number of rows) from the table with each execution, or are they chiefly issued in do-loop fashion by batch programs that pull lots of rows from the table, using files of input keys that can be sorted to match (or highly correlate with) what is (or would be) the table's clustering key? If the latter holds true, the oldie-but-goodie batch performance optimization features, sequential detection and index look-aside, might provide even greater CPU efficiency (and lower elapsed times) that you could get with a hash-organized table. If the singleton SELECTs are, successively speaking, more random in terms of data access, hash-organizing the data could be your best choice for optimal performance. What's true now has been true as long as I've been working with DB2 for z/OS (a long time): good physical database design decisions proceed from knowledge of your data and of how that data is accessed. 'Nuff said.

No comments:

Post a Comment