Friday, October 1, 2010

Better DB2 for z/OS Query Performance through Statistics

It's been longer than I like since my last post -- a reflection of a particularly busy schedule of late. Today I'd like to focus on catalog statistics as a means of improving query performance in a DB2 for z/OS environment.

When a DB2 query doesn't perform as you want it to (response time and/or CPU time is too high), your first move is often, I'm sure, a review of the query's access path information -- something you'd get either from EXPLAIN data in the PLAN_TABLE, or through a graphical display of the query's access plan (among the IBM tools that can generate such a display are Optimization Service Center, Data Studio, and Optim Query Tuner). Analysis of this information could lead you to think of two common remedial actions: add a new index (or add a column or columns to an existing index) on a table targeted by the query; or, modify the query text in order to (for example) make a stage 2, non-indexable predicate stage 1 and indexable. These steps do often yield very positive results, but sometimes neither one is feasible.

Take the add (or alter) index option. Indexes are never free -- they take up disk space (though less with the index compression capability of DB2 9), and they make row insert and row delete operations more CPU-costly (they also increase the cost of updates when those updates change the value of indexed columns). Sometimes, the query performance benefits offered by a new index are seen as outweighing the index's costs. In other cases, creation of a new index on a table (or even adding a column to an existing index) has little chance of approval. This could be the case if there are already a lot of indexes defined on the table in question (I've seen upwards of 20 on a single table). Even if a table has only a few indexes, the idea of adding another might get a lot of push-back if performance-critical, data-changing batch jobs or online transactions that access the table can't be made any more expensive, for fear of missing SLA targets (sometimes an issue for batch workloads) or bottlenecking throughput (a potential concern for a high-volume transactional application).

If adding an index to a table is not do-able, what about modifying the query statement text to get a better-performing access path? That, too, can be a choice that's not available to you. Why? Two words: query tool. If the poorly performing query gets generated by a query tool on a user's PC, you may have to deal with the SQL statement as-is. Even if the query is in an application program that was developed in-house at your organization, changing the statement text may be more easily said than done. The developer who could make the change might be swamped with other work, and/or you might need to get a change request approved and that could take a good bit of time (depending on what else is on development's plate).

When a new index is unlikely and statement modification is not possible, are you at a dead end in terms of improving a query's performance? NO. Instead of giving up, try the catalog statistics route. A little clarification here: I'm not talking so much about making sure that statistics for the tables referenced by the query (and indexes on these tables) are accurate -- keeping those current through regular execution of the RUNSTATS utility is DB2 101. What I'm referring to is the richness of the statistics in your DB2 catalog. See, those statistics are the primary input to the DB2 optimizer when it is generating an access plan for an SQL statement. With more comprehensive information, the optimizer is more likely to choose the access path for a query that actually IS the best-performing path, versus one it THOUGHT was lowest-cost but which turned out to be high-cost.

So, what constitutes "more comprehensive" statistics? What I have in mind are value distribution statistics for columns that have skewed duplicate values. Here's what I mean by that: with just basic catalog statistics in place for a table (more on this momentarily), DB2 has two items of statistical information pertaining to the values stored in a column of the table: the column cardinality (i.e., the number of distinct data values in the column) and the number of rows in the table. For a unique column, cardinality will be equal to the number of rows in the table. For a non-unique column, cardinality will be less than the number of rows in the table, because duplicate values of the column will appear in some rows. Absent additional statistical information about the column, DB2 will assume that duplicate column values are evenly spread across the table's rows. For example, suppose that table ABC has 1 million rows, and column XYZ of that table has a cardinality of 1000. If that's all the statistical information that DB2 has for the column, it will assume that the 1000 distinct values of column XYZ are evenly spread across the 1 million rows of table ABC. In other words DB2 will assume that each of the 1000 values of column XYZ appears in 1000 of table ABC's rows. What if the spread of values in column XYZ is very much skewed? What if one value of the column appears in 900,000 of the 1 million rows in table ABC? And suppose that the next-most-frequently occurring value of column XYZ appears in 90,000 of the remaining 100,000 rows in table ABC? That disconnect between DB2's assumption (even spread of duplicate column values) and the reality of the situation (highly skewed column value distribution) can lead to poor-performing access path choices.

Fortunately, that "reality gap" can be filled via a couple of options that can be specified on a DB2 RUNSTATS utility control statement: FREQVAL and HISTOGRAM (the latter was introduced for mainframe DB2 via DB2 9 for z/OS -- I blogged about it last year when I was working as an independent DB2 consultant). With FREQVAL, you can tell the RUNSTATS utility to gather information about the top "n" most frequently occurring (or least frequently occurring, or both) values in a column (10 is a popular choice for "n" here, but you can go with another integer value if you'd like). This information (for each of the top "n" values, the value itself and the percentage of table rows containing that value in the specified column) is placed in the catalog table SYSIBM.SYSCOLDIST (or SYSCOLDISTSTATS in the case of a partitioned table). When the HISTOGRAM option is used, DB2 divides the columns values into "quantiles" (each a range of column values) and determines the number of unique column values within each quantile (the different quantiles cover about the same number of table rows). Note that while I've referred to individual columns, you can optionally have RUNSTATS generate FREQVAL or HISTOGRAM stats for a group of columns. In any case, with the value-frequency-distribution information gathered via FREQVAL, the DB2 query optimizer can much more accurately estimate the cost of candidate access paths for queries that have "equal" or range-type predicates (>, <, >=, <=) that reference non-unique columns. HISTOGRAM statistics are particularly useful for queries that have BETWEEN predicates that reference non-unique columns. More accurate access path cost estimation is the key to choosing the path that will actually deliver the best performance for a query.

OK, so armed with this information, what do you do? Should you have RUNSTATS generate FREQVAL and/or HISTOGRAM statistics for every non-unique column of every table in your database (and all the combinations thereof, while you're at it)? Sure, if your company gets mainframe cycles for free. A better approach is to get a good baseline of statistics in the catalog, and then to enrich them further on an as-needed basis. In my opinion, a good "base" RUNSTATS control statement for a RUNSTATS TABLESPACE utility would include these options:

TABLE(ALL) INDEX(ALL) KEYCARD

Note that this will get you, by default, "top 10" FREQVAL stats for the first key column of each index on every table in the tablespace.

So, you have your base of statistics (which you keep up-to-date, right?) in the catalog, and you have a query for which DB2 seems to have chosen the wrong access path. Now what? Well, if you're interested in productivity, take that query, plug it into the Optimization Service Center for DB2 for z/OS (free and downloadable) or Data Studio or Optim Query Tuner, and run the Statistics Advisor function. You'll get recommended RUNSTATS control statement specifications for the query, and these could well include the FREQVAL or HISTOGRAM options. Run RUNSTATS as recommended by the Statistics Advisor (hint: the recommendations labeled as "high priority" tend to give you the biggest bang for the buck), re-run the query (rebind the query-issuing program first, if it's a static SQL statement), and see if you get the better-performing access path that you think should have been chosen by DB2 in the first place. Armed with richer statistics, the optimizer can be expected to make better access path choices.

I took the approach described in the preceding paragraph a few months ago when I was helping an organization with some DB2 query tuning work: we had a query that was running long and chewing up a lot of CPU time, and our analysis of the access plan chosen for the query (we used Optimization Service Center for this purpose) showed that DB2 was significantly under-estimating the number of result set rows that would be qualified by a certain predicate (OSC and the other tools I've mentioned are great for this purpose -- they show you DB2's estimates of result set row filtering at each step of a query's access path). This, in turn, appeared to cause the optimizer to choose what turned out to be a high-cost access path for the query. We ran the Statistics Advisor for the query, and saw that the recommended RUNSTATS control statement included the FREQVAL option for some of the columns referenced in the query's predicates. We executed RUNSTATS as recommended, reran the query (it was a dynamic SQL statement), and saw that elapsed time decreased by about 50% (we did in fact get a different access path). That was great: a major query performance enhancement, without a new index and without query statement text modification. We just gave DB2 more information about the data in the target table, and the optimizer took it from there. Give this approach a try sometime, and write your own success story.

5 comments:

  1. This information is very good "food for thought" as I am quite certain I don't have a sufficient understanding of DB2 statistics. We schedule RUNSTATS to run on a regular basis and I have definitely seen an improvement in SQL performance after the RUNSTATS completes, but I’m often left wondering whether the performance could be even better. Not having to make structural or code changes certainly has its advantages and an area in which I’d like to further investigate.

    ReplyDelete
  2. I'm glad that you found this information to be useful. With regard to your comment that you're "often left wondering whether the performance could be even better," that's precisely the mind set that would motivate you to use the Statistics Advisor functionality that is built into several of IBM's query analysis tools. In the blog entry I mentioned the IBM Optimization Service Center tool, but I'd recommend using Data Studio instead, as it's the more strategic go-forward solution. You can download Data Studio for free: go to http://www-01.ibm.com/software/data/optim/data-studio/ and click on the "Free download" link.

    ReplyDelete
  3. Hi Robert - Thanks for the details but I would like to know which data studio client I should download.The link is taking to below navigation
    http://www.ibm.com/developerworks/downloads/im/data/
    There is option like below
    Data Studio client and
    3rd-party product extensions*
    Red Hat Linux®**, SUSE Linux**, Windows™
    Shall I download above for Z/OS mainframe system ...I am confused as it is mentioned for window,linux.
    Also how do i simulate the same with my production subsystem
    Thanks a lot for your help

    ReplyDelete
    Replies
    1. I would download the "Data Studio client and 3rd-party extensions for Red Hat Linux, SUSE Linux, and Windows."

      Windows and Linux are mentioned because Data Studio runs under one of these operating systems, generally on one's PC. You can use Data Studio to access DB2 for z/OS data (and to do a number of other things, such as tune individual queries and develop DB2 SQL stored procedures), but it doesn't run under z/OS. Data Studio comes with built-in DB2 Connect code that enables one to establish a connection from Data Studio to a DB2 for z/OS data server.

      Robert

      Delete