Sunday, February 21, 2016

Statement-Level Control of DB2 for z/OS Query Parallelization

I'm writing about this now because of the increased interest I've seen, at various DB2 for z/OS sites, in using DB2 query parallelization as a means of boosting zIIP engine utilization when -- and this often happens during nighttime hours -- workloads that typically drive consumption of zIIP MIPS (e.g., DB2 DDF transactions, or Java programs running in WebSphere Application Server for z/OS) diminish. During periods of heavy overnight batch processing, people might observe very high utilization levels for general-purpose engines in a z/OS LPAR, while zIIP engines in the system are not at all busy. It might be nice, in such cases, if one could shift some batch work from general-purpose to zIIP engines, but how might that be accomplished (and here I'm talking about batch programs written in a language such as COBOL, versus Java -- Java batch programs, like any Java programs in a z/OS environment, would be zIIP-eligible)? Getting some DB2 query parallelization going for batch programs would be a means of moving a portion of their processing from general-purpose to zIIP engines, as the "child tasks" associated with a parallelized query can have up to 80% of their processing handled by zIIP engines. Problem was, getting this done required, until somewhat recently, that a batch program's DB2 package be bound with DEGREE(ANY). Binding a package with that option makes ALL of the program's queries candidates for parallelization. What if you only want a particular statement in the package to be parallelize-able?

People contemplating utilization of query parallelization for dynamic queries faced a similar conundrum. They could set the value of the DB2 ZPARM parameter CDSSRDEF, which establishes the default value of the CURRENT DEGREE special register, to ANY, and that would enable parallelism for dynamic queries, but it would make ALL dynamic queries candidates for parallelization by default. That might be OK for a DB2 subsystem dedicated to an analytics workload, but it probably isn't the situation you want for a DB2 subsystem that supports high-volume transaction and/or batch applications. True, setting the value of CURRENT DEGREE to ANY prior to issuance of a dynamic query by a program, and then changing CURRENT DEGREE back to 1, would be a way to limit parallelization to a particular query, but changing the CURRENT DEGREE special register value is not always going to be feasible. Suppose a query is generated by an analytics tool? How would CURRENT DEGREE be set in that case? What if the dynamic query of interest is issued by an application purchased from a vendor? That's not a scenario that would lend itself to changing a special register like CURRENT DEGREE.

And I've just been talking here about making a query parallelize-able by DB2. What about the degree of parallelization? The PARAMDEG parameter in ZPARM allows you to control the degree to which a parallelized query can be split, but that's at the DB2 subsystem level. A low degree of parallelization (e.g., 4) might be appropriate if the aim of parallelization is getting some work shifted from general-purpose to zIIP engines, and there is a desire to keep the number of parallel tasks for split queries from getting very high, but would that be a good degree of parallelization if the objective is to get a query generated and submitted via an analytics tool to complete as quickly as possible? Would not a higher degree of parallelization better support that aim?

These formerly sticky problems -- being selective about parallelization for both static and dynamic queries, and having a right-for-the-moment degree of parallelization -- were very nicely addressed via a new catalog table, SYSQUERYOPTS, and related enhancements that were delivered with DB2 10 for z/OS. Although DB2 10 has been out there for more than five years, I've found that SYSQUERYOPTS escaped the notice of many a DB2 DBA. I'm telling you now: notice this. It gives you statement-level control over both parallelization and degree of parallelization.

How is this done? It's pretty easy: you create, for your user ID, a DSN_USERQUERY_TABLE (the DDL for this table, and descriptions of the table's columns, can be found online in the Knowledge Center for DB2 10 and DB2 11). Then you insert into that table a row for the query for which you want to provide values for parallelization (i.e., whether the query is a candidate for parallelization, and the degree of parallelization). This inserted row will include the text of the query that is to be parallelized. You could include the query text in the insert statement that populates DSN_USERQUERY_TABLE, but it's recommended that, for static SQL, you instead pull the statement text from the SYSPACKSTMT catalog table (an example of an INSERT statement that targets DSN_USERQUERY_TABLE and gets a query's text from SYSPACKSTMT can be found in the Knowledge Center for DB2 10 and DB2 11). Similarly, for a dynamic query for which you want to provide parallelism parameters, it's best to get the statement text from the DB2 dynamic statement cache (this text, for each statement in the cache, is placed in the DSN_STATEMENT_CACHE_TABLE when you issue EXPLAIN STMTCACHE ALL). In addition to the query text, the row you insert into DSN_USERQUERY_TABLE will contain the value 'ANY' in the DEF_CURR_DEGREE column (indicating that the query is a candidate for parallelization by DB2) and some integer value in the MAX_PAR_DEGREE column (this is your specification of the maximum degree of parallelism for the query). You can also indicate that these parallelism specifications (i.e., that this query is a candidate for parallelism, and, if parallelized, can be split into as many as X number of child queries) apply to the statement whenever it is issued, or only when it is issued in association with a certain package in a certain collection. After inserting this row into your DSN_USERQUERY_TABLE, you would issue the BIND QUERY command, and that would cause rows to be inserted into the SYSQUERY and SYSQUERYOPTS catalog tables (another catalog table introduced with DB2 10, SYSQUERYPLAN, can also be populated via execution of the BIND QUERY command, but that's done when you want to provide an access path for a query, and that's not the subject of this blog entry).

After you've executed the BIND QUERY command (and you might issue the command a second time, with the LOOKUP(YES) option specified, to verify that the first issuance of the command populated SYSQUERY and SYSQUERYOPTS as expected), the parallelism options you selected for the target query will go into effect when the query is next prepared for execution. That will happen, for a static query, when the package of which the statement is a part is rebound, and for a dynamic query when the statement next goes into the dynamic statement cache.

A couple more comments. First, for all this to work, the value of the OPTHINTS parameter in ZPARM has to be YES (the default value is NO). Second, in addition to providing a means of activating and controlling query parallelism at the statement level, the SYSIBM.SYSQUERYOPTS table (and SYSIBM.SYSQUERY and DSN_USERQUERY_TABLE and the BIND QUERY command) enables specification of the REOPT bind option and activation of star join functionality at the statement level.

That's it in a nutshell. Through the mechanism I've described herein, a static SELECT issued by a nighttime batch program might be parallelized and split 4 ways, while on the same DB2 system a particular dynamic query generated by an analytics tool is parallelized and split 20 ways when it is executed in the middle of the online day. Such fine-grained control of query parallelism -- determining the individual statements for which it can be used, and the associated degree of parallel processing -- could open up new use cases for parallelism in your environment. Give it some thought.

No comments:

Post a Comment