Friday, August 30, 2013

DB2 for z/OS: GETPAGEs are not ALWAYS the Key Determinant of SQL Statement CPU Time

In my DB2 for z/OS-related work over the past 25 years, I've done a lot in the way of performance analysis, looking at individual queries and at DB2-accessing application programs (both transactional and batch in nature). Time and time again, sometimes in reviewing trend data and sometimes in examining DB2 monitor output to gauge the impact of performance tuning actions, I've seen that the CPU time consumed in SQL statement execution rises or falls with the associated level of GETPAGE activity (a GETPAGE is essentially a request by DB2 to examine a page from a table or an index). The correlation I observed between CPU cost and GETPAGEs for SQL statements was so high that I came to see GETPAGE activity as the number one determinant of a query's CPU consumption in a mainframe DB2 system (and I declared as much in a blog entry that I wrote a few years ago while working as an independent DB2 consultant).

Well, a lot of rules have exceptions, and that turned out to be the case for this rule of thumb. I got some information from a DB2 for z/OS DBA that turned my notion of the relationship between GETPAGEs and CPU cost for SQL statements on its head. The data I saw seemed at first to be very strange, but as I dug deeper it started to make sense. I still see GETPAGEs as being, more often than not, the most important factor with regard to a query's CPU cost, but it's not the only factor, and sometimes other aspects of SQL statement execution exert a greater influence on CPU consumption. In this blog entry I'll show you why GETPAGEs are not always the strongest flavor in the mix of determinants that influence CPU consumption for DB2 for z/OS-accessing queries.

Interestingly, the case about which I'm writing involved data compression. That's a DB2 for z/OS feature that's been around a LONG time (around 20 years or so), and which is VERY widely used. Data compression is super-popular at mainframe DB2 sites because it saves a ton of disk space and it's virtually free, overhead-wise. Why is DB2 for z/OS data compression so CPU-efficient? Two reasons: first, it gets a hardware assist from System z servers (big difference versus a software-only implementation of data compression). The second reason is GETPAGE-related: because compression often allows you to get three to four times as many rows in a page versus a non-compressed table space, the number of GETPAGEs required to access data frequently drops considerably when a table space is compressed. Yes, even hardware-assisted data compression has a cost, but many DB2 for z/OS-using organizations have found that the reduction in GETPAGE activity that goes along with compression largely offsets the CPU cost of compressing and decompressing data rows; thus, the near-zero overhead effect of DB2 compression observed in many systems.
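
For reference, enabling compression is a simple DDL change (the database and table space names below are hypothetical); the compression dictionary is built, and the rows already in the table space are compressed, when the table space is subsequently reorganized:

   -- enable compression for an existing table space
   ALTER TABLESPACE MYDB.MYTS COMPRESS YES;

   -- REORG utility control statement: builds the compression
   -- dictionary and compresses the existing rows
   REORG TABLESPACE MYDB.MYTS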

Imagine my surprise, then, when the aforementioned DBA presented me with DB2 monitor data showing the effect of data compression on a DB2 for z/OS-accessing query. I saw that the number of GETPAGEs associated with execution of the query dropped by 65% after the target table space was compressed. No surprise there. I also saw that the query's elapsed time dropped by a whopping 92%. Not a huge surprise there, either. The lion's share of elapsed time for many a query is wait-for-I/O time, and with many fewer pages having to be read from disk to obtain the query's result set, a much-reduced run time would be expected. Then came the not-expected result: the query's CPU time went up by 31% following compression of the target table space.

What?!? How could that be? How could you have a 65% reduction in GETPAGEs and a 31% increase in CPU time? Sure, I wouldn't expect CPU time to go down by the percentage of GETPAGE reduction, because some CPU cost does get added when COMPRESS YES is put into effect for a table space. Still, though, I would have expected either a modest decrease in the query's CPU cost, or at worst a break-even situation, with the lower level of GETPAGE activity cancelling out the CPU expense of decompressing result set rows. This just didn't make sense to me.

My confusion started to dissipate as I learned more about the query in question, and thought more about what DB2 was doing in executing the query. The query targeted a table with about 3 million rows, and 600,000 of those rows had to be examined in order to generate the result set. The large number of rows needing evaluation was due to the fact that the primary search argument -- the one that really filtered out a lot of rows -- was a non-indexable LIKE predicate used to find a particular character string that could occur anywhere within a certain column of the table (the % at the front of the pattern string made the predicate non-indexable; an index on the column could have been used in executing the query had there been only a trailing % -- a pattern of the form 'string%' versus the '%string%' form that enabled location of the character string anywhere in the referenced column). Another predicate in the query -- an indexable one -- filtered out 80% of the table's rows, but the remaining 600,000 rows had to be individually examined for the existence of the required character string in the referenced column, as noted.
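
A quick illustration of the indexability point (the table and column names here are hypothetical):

   -- indexable: the leading characters of the pattern are known, so
   -- DB2 can use a matching scan of an index on DESC_COL
   SELECT ITEM_ID
     FROM MY_TABLE
    WHERE DESC_COL LIKE 'WIDGET%';

   -- non-indexable: the leading % means the string can start anywhere
   -- in the column value, so each candidate row must be examined
   SELECT ITEM_ID
     FROM MY_TABLE
    WHERE DESC_COL LIKE '%WIDGET%';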

Here's where the compression effect kicked in. See, when rows in a compressed page are evaluated against a query search argument, they are individually decompressed, one at a time. Sometimes (very often, in fact), an index can be used to identify the small number of qualifying rows that must be examined in a given table page (or simply retrieved from the page, if result set row qualification can be fully determined at the index level), and decompressing those few rows in the page will be a relatively small component of the SQL statement's overall CPU cost. Other times (and this was the case for the query that the DBA discussed with me), an index can be used to reduce the number of table pages requiring access in query execution, but in a given page there might be 30 or 40 or more rows that have to be decompressed and evaluated (and maybe retrieved). When decompression cost is driven by the expansion of 30 or 40 rows in a page, versus 3 or 4 rows, that cost can rise to the point that it exceeds the cost reduction associated with diminished GETPAGE activity for a compressed table space. That's exactly what happened in the situation I've described. Was the compression performance result reported by the DBA good or bad? That depends on whether the performance objective for the query was improved response time (in which case the big reduction seen for the query's elapsed time with the compressed table space would be a positive outcome) or lower CPU consumption (this particular query consumed less CPU time when the target table space was not compressed).
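
To put rough -- and purely illustrative -- numbers on the row-count effect: if an index narrows a query to 100 table pages, and compression means that 40 rows per page must be decompressed and evaluated instead of 4, that is 4,000 row decompressions instead of 400. A tenfold increase in decompression work can easily swamp the CPU savings delivered by a 65% reduction in GETPAGEs.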

It all goes to show that rules of thumb are best thought of as guidelines, as opposed to absolutes. You still have to use your brain in analyzing query performance, and you have to be ready to dig below the surface of a situation. GETPAGE activity is indeed a very important factor in determining a query's CPU cost in a DB2 for z/OS system, and tuning actions that reduce GETPAGEs are generally beneficial with regard to SQL statement CPU efficiency. What you have to remember is that GETPAGE activity is one of several determinants of query performance. When DB2 has to do something to a lot of rows in each page accessed in the execution of a query -- things like decompression, value transformation via scalar functions, casting of one data type to another, arithmetic operations, etc. -- the CPU cost of GETPAGEs can become a relatively smaller part of the query performance big picture. Consider what DB2 has to do with the rows on a page -- and with how many of the rows on that page -- once it gets the page. There will be times when improving the CPU efficiency of a query is not just a matter of reducing the query's GETPAGE count. Also, have a good understanding of your query tuning objective: is it reduced elapsed time, or reduced CPU consumption, or both?

Rules of thumb can be useful in guiding your query tuning efforts. Don't let them become blinders that prevent you from thinking outside of the box.

23 comments:

  1. Thanks Robert. This is useful as always.

    1. Thanks for the positive feedback. Glad you found the content to be of use.

      Robert

  2. Good day,

    Thank you for the article. One question, if I may. What occurs if work file pages swap? Does the CPU take a 'hit' due to compression for each paged row? Please advise. Thank you.

    1. Work file table spaces cannot be compressed.

      Robert

  3. Hi Robert,
    Very interesting article.
    I have a question.
    I noticed that for some jobs (performing DELETEs/UPDATEs)
    the number of buffer page updates is much higher than the overall number of GETPAGE requests, like this:
    Buffer Pool: BP8K1

    Getpage Requests       = 1750683     Failed Getpage Requests   = 0
    Synchronous Read I/O   = 9775        Getpage/Read I/O          = 179.09
    Page Updates           = 17645K      Seq Prefetch Requests     = 0
    List Prefetch Requests = 0           Dynamic Prefetch Requests = 227446
    Prefetch Pages Read    = 1925267
    Immediate Writes       = 0           BP Hit Percentage         = 86.4%

    Do you know why this happens? (The DELETE is done sequentially in clustering key order.)

    Regards Ilya

    1. Based on my understanding, it is no surprise, in some cases, to see buffer updates in excess of GETPAGEs. If a page contains 20 rows that are all updated by an application process, that would be 1 GETPAGE and 20 buffer updates. Note that "update" here does not necessarily mean SQL UPDATE statements - INSERT and DELETE statements drive buffer update activity, as well. Think of "buffer update" as "buffer change."
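
      A simple, hypothetical illustration of that ratio: if table T holds 20 rows per page and the predicate below qualifies all 20 rows on a single page, the statement drives 1 GETPAGE and 20 buffer updates:

         -- one GETPAGE for the data page; each row changed on the
         -- page counts as a separate buffer update
         UPDATE T
            SET STATUS = 'CLOSED'
          WHERE ACCT_GROUP = 7;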

      Robert

  4. Hi Robert,

    Could you please take a look at my post https://www.idug.org/discussion/db2-help-needed-to-understand-getpage-activity-in-a-delete-query
    on IDUG and provide your comments?

    Thanks,
    Ravikumar

    1. I will provide my comments here, Ravikumar.

      You indicate that over 200,000 rows are deleted by the application process. How many rows are affected by each execution of the DELETE statement? Is it a single row per DELETE, so that the DELETE is executed 200,000+ times in a loop, or are a great many rows removed from the table with a single execution of the DELETE statement? If the DELETE is executed a great many times (i.e., if one execution of the DELETE removes only one row, or very few rows), what is the commit frequency? Are any commits issued as rows are being deleted by the application process? If commits are issued by the deleting application process, what is the RELEASE specification for the program's package? RELEASE(COMMIT)? RELEASE(DEALLOCATE)?

      Robert

    2. Robert,

      Thank you so much for your reply!!
      The DELETE query was issued only "81" times, and it deleted 238,035 rows across all those "81" DELETE SQLs. So, it is deleting many rows for each DELETE. There is no COMMIT in my test program. The program was bound with RELEASE(COMMIT). Could you please let me know if you need any other information?

      Thanks,
      Ravikumar

    3. The fact that the deleting process issues no commits makes RELEASE(DEALLOCATE) versus RELEASE(COMMIT) irrelevant.

      For your deleting process, there are almost 6 times as many GETPAGEs as there are rows deleted. That suggests a large number of probes (i.e., from top to bottom) of a 6-level index. That there are (apparently) many index probes suggests further that there may be little to no use by Db2 of index look-aside functionality while the deleting process is executing. Why might that be? At least a couple of possibilities:

      Possibility A: the optimizer did not choose to utilize index look-aside because the catalog statistics gave the optimizer a misleading indication of the scope of an individual execution of the DELETE statement (the optimizer, of course, does not know how many times a DELETE statement in a program loop will be executed). You might want to make sure that catalog statistics for the table, table space and indexes are current and accurate. You might also check the SYSTABLES_PROFILES table in the catalog, to see if Db2 inserted a profile for the table targeted by the DELETE statements (this is applicable if you are using Db2 12 at function level 500 or higher, or Db2 13 - see http://robertsdb2blog.blogspot.com/2019/04/db2-12-for-zos-statistics-profiles-just.html). A sample catalog query appears after these possibilities.

      Possibility B: a code bug is preventing index look-aside from being used as it could be. One relatively recent example of this is described by APAR PH28182 (see https://www.ibm.com/support/pages/apar/PH28182).

      Note that with Db2 13 (function level 100), use of index look-aside becomes a run-time decision - that removes a concern related to stale statistics.
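
      If it would help, a catalog query along these lines will show whether a statistics profile exists for the table (column names are as I recall them - verify against your catalog level; the schema and table names are placeholders):

         SELECT SCHEMA, TBNAME, PROFILE_TEXT
           FROM SYSIBM.SYSTABLES_PROFILES
          WHERE SCHEMA = 'YOURSCHEMA'
            AND TBNAME = 'YOURTABLE';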

      That's about all I can think of at this time.

      Robert

    4. Robert,

      Thank you so much for your reply!!

      The customer is using Db2 12.

      Below is the information from the output of the -DIS GROUP command:

      DB2 LVL = 121505
      CATALOG LEVEL(V12R1M500)
      CURRENT FUNCTION LEVEL(V12R1M100)
      HIGHEST ACTIVATED FUNCTION LEVEL(V12R1M100)
      HIGHEST POSSIBLE FUNCTION LEVEL(V12R1M501)
      PROTOCOL LEVEL(2)

      I do not have access to SYSTABLES_PROFILES table. I will check with my DBA.

      I have loaded both tables (with & without the new index) before doing the testing. Statistics were collected when the tables were deleted. After that I ran tests for two SELECT SQLs. After that I ran "81" DELETE queries. After that I stopped running any tests. I am the only person using these tables for my testing purposes.

      Could you please take a look at the attachment "Index details from RC Query.txt" which I posted in
      https://www.idug.org/discussion/db2-help-needed-to-understand-getpage-activity-in-a-delete-query ?
      It has more information about the indexes from SYSIBM. Could you also check the PLAN_TABLE info in that post, and see if it helps in any way?

      It looks like the indexes of the tables I tested are not eligible for FTB, as the key length is more than 64 bytes.

      I took the query from https://www.segus.com/2019-08-ftb-fast-traversal-block-for-fit-fast-index-traversal/
      and ran it after removing "A.UNIQUERULE <> 'D'". Neither the primary index nor the new index appeared in the output of this query.

      The two SELECT SQLs I tested produced very good CPU savings, and there was a drastic reduction in GETPAGEs.
      For example, for 100 invocations of one SELECT SQL, GETPAGEs were 603 with the new index and 545,086 without it.

      Only the DELETE seems to be unhappy with the new index!!

      APAR PH28182 seems to be related to FTB. Do you think we should open a ticket with IBM?

      Thanks,
      Ravikumar

    5. I'm afraid I have too much on my plate to be able to review the detailed information you included in your post to the IDUG forum. You could open a case with IBM Support, but I think it is very unlikely that you have encountered a Db2 for z/OS code defect.

      Comparing GETPAGE counts between SELECT and DELETE processes is not useful or meaningful. They are very different processes, owing to the fact that the latter removes not only data rows but associated index entries. In a situation in which a table has several indexes, each of which has multiple levels, you can certainly have a situation in which a DELETE statement will generate substantially more GETPAGE activity than a SELECT with the same predicates, and the big difference in GETPAGE counts will be related to deletion of index entries. There are all kinds of scenarios in which deletion of index entries will require a lot of top-to-bottom traversals of an index tree structure, and with a 6-level index that can mean a large number of GETPAGEs.
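
      To attach purely illustrative numbers to that: if the target table had, say, three indexes of five or six levels each, and deletion of index entries frequently required fresh probes, each deleted row could drive 15 or more index GETPAGEs on top of the data page GETPAGE. That is how a DELETE's ratio of GETPAGEs to rows deleted can climb well past what a SELECT with the same predicates would show.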

      Robert

    6. Robert,

      No problem. Thank you so much for your help!!

      Thanks,
      Ravikumar

    7. Robert,
      In future, when you find time, please consider a post on GETPAGE activity of DELETE/UPDATE queries for indexes.
      Thanks,
      Ravikumar

    8. That's a good suggestion, Ravikumar. I'll see if I can get an entry on that topic posted within the next month or so.

      Robert

    9. Robert,
      Thank you so much for your consideration!!
      Thanks,
      Ravikumar

  5. Hello Robert,
    I encounter SELECTs where the number of executions of the SELECT statement is significantly higher than the number of GETPAGEs. It's a SELECT with a 3-column matching index, accessing the TS page as well (not index-only access). The average ratio of the number of GETPAGEs to the number of SELECTs is 0.4.
    So my question is: how does DB2 provide the selected row to the application without executing a GETPAGE?
    I would expect 4 GETPAGEs for 1 SELECT (the index has 3 levels) - 3 for the index pages + 1 for the TS page.
    Does FTB eliminate the need for GETPAGEs? (I am not sure FTB is implemented in that case.)
    The monitor we are using is Detector. Maybe Detector shows incorrect data?
    Thank you, Michael

  6. Just trying to understand: let's say we have a program with two loops of SELECTs from two different tables, with highly clustered matching indexes, and we have multi-row result sets from the same TS page in both SELECTs. After the first SELECT returns the first row to the program, the next SELECT will need to locate the correct TS page in the buffer pool. So how will DB2 know how to find the correct page without a GETPAGE? There is no additional memory buffer where the resulting rows would be kept for each SELECT. DB2 must know which page in the buffer pool "belongs" to which SELECT. Where is the information about the connection between a particular SELECT and the corresponding page kept? Is it in the package loaded into the EDM pool?
    Even so, I would expect that at least one GETPAGE would be issued for the following SELECT (and shown in the Detector report). Thanks, Michael.

    1. Are you talking about singleton SELECTs (i.e., SELECT statements that each qualify a single row) or multi-row result sets (in which case a cursor - the SELECT statement - would be declared and opened, and FETCH statements would retrieve the result set rows)?

      Robert

    2. The question is about loops of singleton SELECTs, where each SELECT receives a different set of values for the WHERE predicates. And the access is well "clustered".

    3. OK, common scenario. What's happening "under the covers" in this case is not, as I recall, described in the Db2 for z/OS documentation, but based on what I know (or think I know), two things are going on here:
      1) As part of the first execution of the singleton SELECT, Db2 ended up executing GETPAGE requests for an index leaf page (after having executed GETPAGE requests for the relevant pages "above" that one in the index tree) and for the table space page containing the qualifying index key value. The question is, when does Db2 release those pages? I believe it's when Db2 moves on to the "next" page that needs to be accessed for the process in question (of course, if the process ends because it just wanted the one row then there won't be a "next" page - at end-of-transaction any in-use pages will be released). When the process continues with the next execution of the singleton SELECT in the looping operation, I believe a piece of performance-optimizing code will have Db2 check to see if pages it already had in-use for the process (e.g., the aforementioned index leaf page and table space page) contain the data needed to satisfy that next singleton SELECT - why release those pages before going to the "next" index and table space pages without first checking to see if the "next" pages are the ones you already have in-use for the process? If the searched-for key value is in the index leaf page Db2 already has, that can be ascertained without having to do another GETPAGE, and the same is true for the table space page if the index leaf page indicates that the qualifying row is in the table space page that Db2 already has in-use for the process.
      2) Eventually, a singleton SELECT will be executed in the loop and Db2 will see that the searched-for key value is not in the leaf page it already has. Does that mean a fresh top-to-bottom index "probe" will be required? It shouldn't, because of Db2's index look-aside feature. Index look-aside will have Db2 retain not just the leaf page it's "on," but also the non-leaf page that pointed to the leaf page. When Db2 sees that the key searched for by a singleton SELECT in the loop is not in the leaf page it's "on," it can check the "next level up" non-leaf page because (thanks to index look-aside) it still has that page. There's a good chance that the leaf page now required is pointed to by that retained "next-level-up" non-leaf page, and Db2 will just need to execute a GETPAGE request for that leaf page, versus having to do the "n" GETPAGEs required for a full probe of the n-level index.
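
      To make that concrete with purely illustrative numbers: if ten consecutive singleton SELECTs in the loop qualify rows on the same leaf page/data page pair, only the first costs GETPAGEs - roughly 4 for a 3-level index plus the data page - and the other nine cost none. That works out to 4 GETPAGEs for 10 SELECTs - a ratio of 0.4, right in the neighborhood of what you reported.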

      Again, this is my understanding of what's going on, and it would explain the low ratio of GETPAGEs to SELECTs that you're seeing.

      Robert

  7. Thank you Robert, that is probably what happens.
    Knowing that, we as DBAs dealing with performance problems should not prefer index-only access in all cases. That's dramatic! In my case the average CPU per SELECT with index-only access is 8 times(!) heavier than matching index access with TS page retrieval. The way DB2 works is changing (and it's hard to run after a moving target).
    Thanks again. Michael Raz.
