Wednesday, May 13, 2015

For a Large DB2 for z/OS Table, Should You Go With Partition-by-Range or Partition-by-Growth?

There are several aspects to this question: What do I mean by "large?" Is the table in question new, or does it exist already? What is the nature of the data in the table, and how will that data be accessed and maintained? I'll try to cover these various angles in this blog entry, and I hope that you will find the information provided to be useful.

What is a "large" table?

Why even ask this question? Because the relevance of the partition-by-range (PBR) versus partition-by-growth (PBG) question is largely dependent on table size. If a table is relatively small, the question is probably moot because it is unlikely that range-partitioning a smaller table will deliver much value. Partitioning by growth would, in that case, be the logical choice (for many smaller tables, given the default DSSIZE of 4G, a PBG table space will never grow beyond a single partition).

OK, so what is "smaller" and what is "larger" when you're talking about a DB2 for z/OS table? There is, of course, no hard and fast rule here. In my mind, a larger DB2 for z/OS table is one that has 1 million or more rows. That's not to say that a table with fewer than 1 million rows would never be range-partitioned -- it's just that the benefits of range-partitioning are likely to be more appealing for a table that holds (or will hold) millions of rows (or more).

When the table in question is a new one

This, to me, is the most interesting scenario, because it is the one in which the options are really wide open. I'll start be saying that you definitely want to go with a universal table space here, primarily because a number of recently delivered DB2 features and functions require the use of universal table spaces. But should the table space be PBR or PBG? A partition-by-growth table space can be as large as a partition-by-range table space, so that's not a differentiator. What, then, would be your criteria?

To me, the appeal of a PBG table space is mostly a factor of it being a labor-saving device for DB2 for z/OS DBAs. PBG table spaces have an almost "set it and forget it" quality. There is no need to identify a partitioning key, no need to determine partition limit key values, no worries about one partition getting to be much larger than others in a table space. You just choose reasonable DSSIZE and MAXPARTITION values, and you're pretty much done -- you might check back on the table space once in a while, to see if the MAXPARTITION value should be bumped up, but that's about it. Pretty sweet deal if you're a DBA.

On the other hand, PBR can deliver some unique benefits, and these should not be dismissed out of hand. Specifically:
  1. A PBR table space provides maximum partition independence from a utility perspective. You can even run the LOAD utility at the partition level for PBR table space -- something you can't do with a PBG table space. You can also create data-partitioned secondary indexes (DPSIs) on a PBR table space (not do-able for a PBG table space), and that REALLY maximizes utility-related partition independence (though it should be noted that DPSIs can negatively impact the performance of queries that do not reference a PBR table space's partitioning key).
  2. PBR table spaces enable the use of page-range screening, a technique whereby the DB2 for z/OS optimizer can limit the partitions that have to be scanned to generate a result set when a query has a predicate that references a range-partitioned table space's partitioning key (or at least the lead column or columns thereof). Page-range screening doesn't apply to PBG table spaces, because a particular row in such a table space could be in any of the table space's partitions.
  3. A PBR table space can be a great choice for a table that would be effectively partitioned on a time-period basis. Suppose, for example, that the rows most recently inserted into a table are those most likely to be retrieved from the table. In that case, date-based partitioning (e.g., having each partition hold data for a particular week) would have the effect of concentrating a table's most "popular" rows in the pages of the most current partition(s), thereby reducing GETPAGE activity associated with retaining sets of these rows. Date-based partitioning also enables very efficient purging of a partition's data (when the purge criterion is age-of-data) via a partition-level LOAD REPLACE operation with a dummy input data set (the partition's data could be first unloaded and archived, if desired).
  4. A PBR table space tends to maximize the effectiveness of parallel processing, whether of the DB2-driven query parallelization variety or in the form of user-managed parallel batch jobs. This optimization of parallel processing can be particularly pronounced for joins of tables that are partitioned on the same key and by the same limit key values.
Those are some attractive benefits, I'd say. Still, the previously mentioned DBA labor-saving advantages of PBG table spaces are not unimportant. That being the case, this is my recommendation when it comes to evaluating PBR versus PBG for a large, new table: consider first whether the advantages of PBR, listed above, are of significant value for the table in question. If they are, lean towards the PBR option. If they are not, PBG could be the right choice for the table's table space. In particular, PBG can make sense for a large table for which access will be mostly through transactions (as opposed to batch jobs), especially if those transactions will retrieve small result sets via queries for which most row filtering will occur at the index level. In that case, the advantages of range-partitioning could be of limited value.

When the table space in question is an existing one

Here, the assumption is that the table space is not currently of the universal type. When that is true, and the aim is (as it should be) to convert the table space from non-universal to universal, the PBR-or-PBG decision will usually be pretty straightforward and will be based on the easiest path to universal: you'll go with universal PBR for an existing non-universal range-partitioned table space (if it is a table-controlled, versus an index-controlled, partitioned table space), because that change can be accomplished non-disruptively with an ALTER TABLESPACE (to provide a SEGSIZE for the table space) followed by an online REORG (if you are have DB2 10 running in new-function mode, or DB2 11). Similarly, for an existing non-partitioned table space (segmented or simple, as long as it contains only one table), you'll go with universal PBG because that change can be accomplished non-disruptively with an ALTER TABLESPACE (to provide a MAXPARTITIONS value for the table space) followed by an online REORG (again, if your DB2 environment is Version 10 in new-function mode, or DB2 11).

I recently encountered an exception to this rule: if you have a non-universal, range-partitioned table space, with almost all of the data in the last of the table space's partitions (something that could happen, depending on how partition limit keys were initially set), you might decide not to go for the non-disruptive change to universal PBR, because then you'd have a PBR table space with almost all of the data in the last of the table space's partitions. Yes, with enough ALTER TABLE ALTER PARTITION actions, you could get the table's rows to be spread across many partitions (and with DB2 11, alteration of partition limit key values is a non-disruptive change), but that would involve a lot of work. You might in that case just opt to go to a PBG table space through an unload/drop/re-create/re-load process.

To sum things up: PBR and PBG have their respective advantages and disadvantages. In choosing between these flavors of universal table space, the most important thing is to put some thought into your decision. Give careful consideration to what PBR might deliver for a table, and think also of how useful PBG might be for the same table. If you weigh your options, the decision at which you ultimately arrive will likely be the right one.

6 comments:

  1. With regard to partitioning smaller tables -- this might also be done (PGR here nand not PBG) to promote parallell processing in batch, running multiple jobs in batch in parallell. More of an operational consideration.

    ReplyDelete
  2. The first question I ask when determining PBG vs. PBR is whether there is a good partitioning key. Sadly where I work the data modellers are surrogate key happy with that key based upon some monotonic value such as a sequence. These do not make good partitioning keys.

    Michael Harper/TD Bank

    ReplyDelete
    Replies
    1. That's one of the great things about partition-by-growth table spaces (and apologies for the late response): they extend many of the benefits of partitioned table spaces (data capacity, portion-level utility execution, etc.) to tables that do not have a good partitioning key.

      Robert

      Delete
  3. So lets say you have a table that is PBG (no good partitioning key for PBR) and you anticipate the data will occupy 50 - 60 Gig. Would you recommend allocating the tablespace with DSSIZE of 64G, so that all the data fits in one dataset? Or would you allocate something smaller, say DSSIZE of 4G and 16 or more parts to try and get some parallelism benefits (or maybe some other reason I haven't thought of)? Of course I'm anticipating a "depends" answer :-)

    ReplyDelete
    Replies
    1. Apologies for the delay in responding, Jim.

      That's a good question. I suppose it would depend (as you anticipated) on a number of factors. Some thoughts that come to my mind:
      * One factor, as you mentioned, could be the desire (or not) for parallelism (of the query variety or the utility variety). More partitions would up the parallelism factor, but that may or may not be of value to your organization.
      * Another factor could be the particulars of the disk subsystem in which the table would be located. At some sites, not all of the mainframe-attached volumes are set up to be able to hold extra-large (greater than 4 GB) data sets. At such a site, 4G data sets might make more of the disk volumes available for holding the table space's partitions.
      * Another disk-related factor would be average utilization of mainframe-attached volumes. At some organizations there is a desire to drive disk volume utilization to very high levels. That can mean small chunks of available space on volumes, and THAT could conceivably lead to a situation in which extension of a really big (e.g., 64G) data set fails due to hitting the maximum number of volumes across which a data set can be spread (I believe that this limit is 57 or something like that). It would be easier to go with a larger data set size at a site at which disk volume utilization is not driven to a really high level (moderately high disk space utilization, versus very high utilization, might mean spending a little more on storage, but it provides greater operational flexibility).
      * MAYBE the DB2 for z/OS DSMAX limit could be a consideration. You'd think that this wouldn't normally be the case, as DSMAX can be up to 200,000, but of course that's a theoretical limit. Below-the-bar virtual storage can be in shorter supply at some sites than at others, and that might require a not-so-big DSMAX value, and if that value is already being bumped up against with some frequency (resulting in physical data set close activity, which can be a drag on performance when these data sets later have to be re-opened), you might opt to have fewer, larger data sets.

      That's about all I can think of at this time. Hope this helps.

      Robert

      Delete