Thursday, December 30, 2010

DB2 for z/OS: KEYCARD Gets its Due

By now, you've probably seen and/or heard a good deal of information about DB2 10 for z/OS, which was announced and became generally available this past October. There is indeed a lot of big news associated with this latest release of IBM's mainframe relational database management system: reduced CPU costs, support for temporal data (tables with system and/or business time dimensions), a huge increase in the number of threads that can be concurrently active, a migration path to universal tablespaces, SQL Procedure Language enhancements (better performance, plus the ability to write user-defined functions using SQLPL), table-based data access control policies, and lots more (and I do mean "lots" -- check out the DB2 10 "What's New" manual). Amongst all the "wow" features of DB2 10 are some nice little nuggets of functionality that will positively impact system performance and administration. One of these -- and the subject of this blog entry -- concerns the KEYCARD option of the RUNSTATS utility.

Here's the story in a nutshell: with DB2 10, the KEYCARD option is no longer optional. That is to say, if you run the RUNSTATS utility with an INDEX specification in a DB2 10 environment, and you have indexes with more than two key columns defined on the table (or tables) in the target tablespace, you will get the catalog statistics associated with KEYCARD, regardless of what you specify on the utility control statement. You can still put KEYCARD in this statement, but the keyword will be ignored because it "is now built into the normal execution of the RUNSTATS INDEX utility and cannot be disabled." Trust me: this is a good thing. It's a change that leaders on the optimizer team in IBM's DB2 development organization have lobbied for for quite some time. I'll explain why, and I'll tell you what this means for you if you're not yet running DB2 10 (and that's most of you, as, again, DB2 10 has only been generally available for a couple of months).

First, a little background. KEYCARD (which is only valid in the context of an INDEX specification on a RUNSTATS utility control statement) goes way back. I'm not sure when it was introduced as a RUNSTATS option, but I think that it might have been delivered with DB2 for z/OS Version 4 (mid-1990s). What does it do? Pretty simple: it causes RUNSTATS to determine the number of distinct values of combinations of an index's first "n" key columns, where "n" is greater than 1 and less than the total number of the index's key columns (I say this because the cardinality -- the number of distinct values -- of the index's first key column and of the full key are gathered anyway and placed in the FIRSTKEYCARDF and FULLKEYCARDF columns, respectively, of the SYSIBM.SYSINDEXES catalog table). The values obtained via the KEYCARD specification are placed in the SYSIBM.SYSCOLDIST table.

An example might be helpful here. Suppose you have a 10,000-row table containing data about customers in a particular country (one row per customer), and on that table you have an index with a key comprised of the columns STATE, CITY, and POSTAL_CODE. Suppose further that you have customers in fifty cities, with each city being in a different state (i.e., one city in each of 50 different states -- admittedly, this is a rather contrived example). Finally, assume that there are 200 different zip code values in the table, and that the duplicate values of STATE, CITY, and POSTAL_CODE are evenly spread across the table's rows (for information on non-uniform distribution of non-unique column values, check out an entry I posted last year on that topic in my old Catterall Consulting blog). Without KEYCARD specified, an execution of RUNSTATS will generate (as previously mentioned) FIRSTKEYCARDF (50) and FULLKEYCARDF (200) statistics for the index. With KEYCARD specified, RUNSTATS will also determine the number of distinct values of the index's first two columns: in this case, that's 50 -- the same as the FIRSTKEYCARDF value (if the index had 4 columns, KEYCARD would result in a determination of the number of distinct values of the combination of key columns 1, 2, and 3, as well as the number of distinct values of key columns 1 and 2).

Why is getting this statistical data important? Two words: column correlation. See, if you use what is a fairly common "base" RUNSTATS control statement, namely, TABLE(ALL) INDEX(ALL), DB2 will gather cardinality stats for every column of every table in the target tablespace. It will know, then, that there are 50 distinct values in the STATE column and 50 distinct values in the CITY column. What DB2 doesn't know is that the STATE and CITY values are highly correlated (CITY value 'Los Angeles' is paired with 'California', but not with any other value in the STATE column). Not knowing this, DB2 assumes that the values in the CITY column are independent of STATE values. Consider a query with the compound predicate below:

WHERE STATE = 'California'
AND CITY = 'Los Angeles'

Absent column correlation data, DB2 will estimate the number of qualifying rows by multiplying the cardinality of the STATE column by the cardinality of the CITY column; so, the estimate will be (1/50) * (1/50) * 10,000 rows = 4 rows. In fact, because STATE and CITY values are very highly correlated, the number of qualifying rows will be (1/50) * 10,000 = 200 (again, I'm assuming a uniform distribution of duplicate STATE and CITY values in the table's rows). Lack of correlation stats for the STATE and CITY columns results in a filtering estimate that is off by a factor of 50. When a filtering estimate is way off from reality, the access path chosen by the DB2 optimizer for a query could be sub-optimal, and that path may cause the query to run a lot longer than it needs to. KEYCARD gives DB2 more column correlation information to use in estimating the number of rows qualified by a query's predicates, and better information leads to better access path choices (and better query performance, which of course is the bottom line).

So, why not have RUNSTATS gather correlation stats for all combinations of all columns in a table? That could be done, using COLGROUP specifications for groups of non-indexed columns, but the CPU and run-time costs of doing this would be very high unless the table had very few columns and not many rows (plus, coding all the COLGROUP specifications for a table with a lot of columns would be extremely tedious). In most cases, it just isn't a practical option. Making KEYCARD automatic in DB2 10 is a good idea because it generates data that has significant query optimization value (additional column correlation information for indexes with keys comprised of more than two columns) and does so at a low cost in terms of RUNSTATS CPU and elapsed time (this because the column groups for which correlation data is obtained are "leading and contiguous" subsets of multi-column index keys, making the correlation data cheaply available by way of a table's indexes, in which key values are always strictly ordered).

I'll conclude with a word to the wise: don't wait for DB2 10 to make KEYCARD automatic in your environment. Go ahead and add this option to your RUNSTATS INDEX specifications. If, for example, your basic RUNSTATS control statement includes

TABLE(ALL) INDEX(ALL)

I'd recommend changing that to

TABLE(ALL) INDEX(ALL) KEYCARD

You'll enrich your catalog stats at a low incremental cost, and you'll have taken a step towards smoothing the path to DB2 10, as you'll have in your catalog the correlation stats that will be automatically generated in the DB2 10 environment.  

1 comment:

  1. Excellent article. I bumped into this value in the results of explain and even ventured a little into DB2 (LUW) knowledge center before seeing this.

    Thank you for putting time into it!

    ReplyDelete