Sunday, June 26, 2011

A Note on DB2 for z/OS "Conditional GETPAGE Failures"

Recently, a DB2 for z/OS professional asked me about something he'd occasionally seen in a DB2 monitor display: a non-zero value for CONDITIONAL GETPAGE FAILURES for a buffer pool. Now, his DB2 subsystem appeared to be running just fine, but we are rather conditioned (so to speak) to associate the word FAILURE with the meaning NOT GOOD, and he wanted to know if these FAILURES indicated something with which he should be concerned. The short answer to this question is, "No -- that's just an FYI number." In the remainder of this entry I'll provide some background on conditional GETPAGEs and explain why associated "failures" are nothing to lose sleep over.

I expect that you all know what a DB2 GETPAGE is: a request by DB2 for z/OS to access a particular page in a table or an index. For a "regular" GETPAGE (the vast majority are of this type), a "miss" (i.e., a situation in which the requested page is not found in the DB2 buffer pool) will result in DB2 issuing a synchronous read I/O to get the page into the buffer pool from disk storage. In essence, a DB2 synchronous read I/O is the result of a "regular" GETPAGE "failure." Are you concerned about those "failures?" Of course not -- that's just DB2 business as usual: "I (DB2) need to access page 123 of tablespace XYZ. If it's not in the buffer pool, suspend the application process on behalf of which I'm requesting this page, read the page into memory from disk right away, and resume the application process when the read I/O is completed."

So it is with conditional GETPAGE "failures" -- no worries (as the Aussies say -- I love that phrase). What's different about a conditional GETPAGE versus a "regular" GETPAGE? For the former, a "miss" (reported by a DB2 monitor as a "failure") will result in DB2 driving an asynchronous read I/O to get the page into memory and subsequently issuing a "regular" GETPAGE for the page. "Huh?" you might be thinking, "An asynchronous I/O for one page? I thought that asynchronous I/Os (generally associated with prefetch reads) were issued for multiple pages (typically, up to 32 at a time)." This all has to do with WHY DB2 issues conditional GETPAGE requests: it's for parallel I/O operations under a single task. Before DB2 10, conditional GETPAGEs were issued when query I/O parallelism was used by DB2 to access table or index pages (since DB2 for z/OS V4, most folks equate "query parallelism" with query CPU parallelism -- I/O parallelism is what we had before that). Since there's just one task, only one synchronous I/O can be in flight at a time. If DB2 needs a page and it's not in the buffer pool, it can issue an asynchronous I/O (performed under a DB2 task, versus the application process's task) and move on to the next page that it wants (from another tablespace partition, for example). When it's ready to work on that page for which the asynchronous I/O was driven, DB2 will issue a regular GETPAGE for it, and there's a good chance that it'll be in the buffer pool.

So, as I mentioned, I/O parallelism hasn't exactly been front and center in the minds of mainframe DB2 people since query CPU parallelism made the scene with DB2 V4. With DB2 10, there's something new: index I/O parallelism for inserts into a table in a non-segmented tablespace on which multiple indexes are defined. When there's an insert into such a table (generally, one having three or more indexes), DB2 will of course issue "regular" GETPAGEs (for root, non-root, and leaf pages) for the table's clustering index, because that's how it identifies the target page in the table for the insert. When that's done and the row is inserted (into the target page of the table, or into a page near that one, if the target page is full or locked and space is available in a nearby page), it's time to update the table's indexes with the row's physical location. The clustering index leaf page that needs updating is already in memory. DB2 goes to the next index and issues a conditional GETPAGE for the page it needs to access. If there's a "miss" (aka a "failure"), it moves on and issues a conditional GETPAGE for the next index page it needs. Thus you can have the inserting task stay busy while asynchronous read I/O requests are processed in the background. DB2 will issue "regular" GETPAGEs for the "last" of the table's indexes that it's updating, as it did for the clustering index, but again the objective is for DB2 to work in the background to bring pages from the other indexes on the table into memory asynchronously. When the inserting task is ready to update those index pages, the hope is that those pages will already be in memory, thanks to the asynchronous I/Os. Index I/O parallelism for inserts can result in significant reductions in elapsed time for insert operations into tables in non-segmented tablespaces in a DB2 10 environment.
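To picture the scenario, here's a minimal DDL sketch (all object names are hypothetical, and this is just one way to get a non-segmented tablespace in a DB2 10 system): a table in a classic partitioned tablespace, with a clustering index plus two more indexes -- the kind of table for which DB2 10 can use index I/O parallelism on insert.

  -- Hypothetical names throughout. SEGSIZE 0 with NUMPARTS yields a
  -- classic partitioned (i.e., non-segmented) tablespace in DB2 10.
  CREATE TABLESPACE ORDERTS IN ORDERDB
    NUMPARTS 2 SEGSIZE 0;

  CREATE TABLE ORDER_TBL
    (ORDER_ID   INTEGER NOT NULL,
     CUST_ID    INTEGER NOT NULL,
     ORDER_DATE DATE    NOT NULL)
    IN ORDERDB.ORDERTS
    PARTITION BY (ORDER_ID)
      (PARTITION 1 ENDING AT (499999),
       PARTITION 2 ENDING AT (999999));

  -- Three indexes: the clustering index is accessed via "regular"
  -- GETPAGEs; the other two are candidates for conditional GETPAGEs.
  CREATE INDEX ORDER_IX1 ON ORDER_TBL (ORDER_ID) CLUSTER;
  CREATE INDEX ORDER_IX2 ON ORDER_TBL (CUST_ID);
  CREATE INDEX ORDER_IX3 ON ORDER_TBL (ORDER_DATE);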

Besides boosting performance, DB2 10 index I/O parallelism for inserts can result in some conditional GETPAGE "failures," as these are what trigger the asynchronous read I/Os for pages not found in the buffer pool. That, of course, is no sweat whatsoever, whether we're talking about DB2 10 or a prior release (in which case, as noted, conditional GETPAGEs are associated with query I/O parallelism). These "failures" are just buffer pool misses, and those happen all the time in a DB2 environment (unless you have a really big buffer pool configuration and a really small database). So when you're looking at a DB2 monitor display of activity for a buffer pool, and you see a non-zero value for conditional GETPAGE failures, think, "OK," and move on to other fields in the display. Hang loose, dude.

Monday, June 13, 2011

Application Programming Tip: Let DB2 do the Work for You

Back in the 1980s, when DB2 was relatively new, you had, of course, a lot of application developers who were learning how to program using SQL (IBM invented the relational database and Structured Query Language). Though SQL is pretty intuitive, mastery takes time, and so it was understandable that some programmers early on did things in their code that would more effectively be done by DB2. Examples of what I'm talking about include joining tables in application code (open a cursor on table A, fetch a qualifying row, and look for a matching row in table B), retrieving unordered result sets and sorting the rows programmatically, and failing to leverage the set-oriented nature of SQL (too often retrieving result set rows via a series of singleton SELECTs versus DECLARE CURSOR / OPEN CURSOR / FETCH, FETCH, FETCH, FETCH...).
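To make that first example concrete, here's a sketch (table and column names are hypothetical) of the one SQL statement that replaces the cursor-plus-lookup logic:

  -- Instead of opening a cursor on ORDERS and issuing a singleton
  -- SELECT against CUSTOMER for each fetched row, let DB2 do the join:
  SELECT O.ORDER_ID, O.ORDER_DATE, C.CUST_NAME
    FROM ORDERS O
         INNER JOIN CUSTOMER C
            ON O.CUST_ID = C.CUST_ID
   WHERE O.ORDER_DATE >= CURRENT DATE - 30 DAYS
   ORDER BY O.ORDER_DATE;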

Interestingly, here we are, 27 years after DB2's introduction, and there are STILL programmers who are doing in their applications what DB2 should be doing by way of properly coded SQL statements. In this blog entry I'll try to make the case for allowing DB2 to do what it can do in terms of retrieving and updating data. There are several motivating factors when it comes to exploiting the power of SQL as fully as possible. The ones that are most important to me include the following:

Enhance CPU efficiency. When your program issues an SQL statement, obviously that statement has to get from your program to DB2, and after statement execution has completed, control has to return to your application code. This program-to-DB2 round trip is not free. The CPU cost of DB2-and-back is (usually) readily grasped in a client-server setting, what with network send and receive processing and all, but there is overhead associated with getting to and from DB2 even when the SQL-issuing program is running on the same server as DB2 (as is often the case in a mainframe system, where CICS and batch programs access a local DB2 database). Get what you want with fewer SQL statements, and you'll reduce the cumulative CPU cost of trips across the application program-DB2 boundary. The effect of getting the job done with fewer trips to DB2 can be quite significant, and here it's important to focus not on a single unit of work but on an overall workload. A developer might run a comparison test between a more-SQL-statements and a fewer-SQL-statements approach to retrieving a particular result set or accomplishing a data change operation, and he might conclude that it's a wash because the elapsed time is about the same either way. Elapsed time is not the issue here -- CPU time is. That's the money metric. And a seemingly small difference in CPU time per transaction (something that a DB2 monitor would show) can end up being a big deal when hundreds of transactions execute per second, or when a batch job has to work through hundreds of thousands or even millions of records in an input file.

Lest you think that this is just a matter of having DB2 and not your application code join tables, I'll tell you: it goes beyond that. To really get things tight from a CPU efficiency perspective, you need to keep up with, and be prepared to take advantage of, new DB2 features and functions that can enable you to reduce "chattiness" between your program and DB2. The MERGE statement (sometimes referred to as "upsert"), with which you can make changes to a table based on a set of input records, updating target table data when there is a match with an input record, and inserting a new row into the table when there isn't a match? That's a CPU saver compared to the old methodology of doing a SELECT against the target table to see if there is a row that matches an input record and then driving an UPDATE (if there is a match) or an INSERT (if there's no match). Multi-row FETCH and multi-row INSERT (sometimes called block FETCH and block INSERT), means of, respectively, getting several rows from DB2 with one FETCH and placing several new rows in a table with one INSERT? Those are CPU savers versus the one-at-a-time way of doing things. Look for ways to do more with less SQL, and you'll reduce the load on your DB2 server.
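Here's a sketch of both of those, in embedded-SQL style (table, column, and host variable names are all hypothetical):

  -- MERGE: with host-variable arrays as the source, one statement
  -- updates matching ACCOUNT rows and inserts the rest.
  EXEC SQL
    MERGE INTO ACCOUNT AS T
    USING (VALUES (:HV-ACCT-ID, :HV-AMOUNT)
           FOR :HV-NROWS ROWS) AS S (ACCT_ID, AMOUNT)
    ON T.ACCT_ID = S.ACCT_ID
    WHEN MATCHED THEN
      UPDATE SET BALANCE = T.BALANCE + S.AMOUNT
    WHEN NOT MATCHED THEN
      INSERT (ACCT_ID, BALANCE) VALUES (S.ACCT_ID, S.AMOUNT)
    NOT ATOMIC CONTINUE ON SQLEXCEPTION;

  -- Multi-row FETCH: declare the cursor with rowset positioning,
  -- then pull up to 100 rows per FETCH into host-variable arrays.
  EXEC SQL DECLARE C1 CURSOR WITH ROWSET POSITIONING FOR
    SELECT ACCT_ID, BALANCE FROM ACCOUNT;
  EXEC SQL OPEN C1;
  EXEC SQL FETCH NEXT ROWSET FROM C1
    FOR 100 ROWS INTO :HVA-ACCT-ID, :HVA-BALANCE;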

Application performance consistency. OK, back to joining tables. I like this example because when you do this in application code YOU'RE making the access path decision. Opening a cursor on table A, fetching a qualifying row, and looking for a match in table B is, in essence, a nested loop join. How do you know that nested loop is the right way to go? What about a merge join or a hybrid join (DB2 for z/OS) or a hash join (DB2 for Linux/UNIX/Windows)? When DB2 performs the table join, the SQL optimizer determines the low-cost means of accomplishing the join, based on statistical information stored in DB2 catalog tables. The optimizer is very good at what it does, having been continually enhanced for upwards of thirty years (in addition to inventing the relational database and SQL, IBM invented cost-based SQL statement optimization). Do you really think that you'd do a better job of choosing the most efficient join method for a query? And where does my point about consistency come in? Well, suppose that you do a programmatic join and choose a nested loop method to get the job done. That may in fact be the right choice early on in the life of the application, when target tables -- and the query result set -- are relatively small. What about later, when the database has grown substantially and the query result set is perhaps much larger than before? Would a merge join be the better choice under those circumstances? If it is, and if you let DB2 do the join, the optimizer will take care of that access path change automatically (in the case of dynamic SQL) or with a simple REBIND PACKAGE command (for static SQL) -- no program code changes are needed. Do the join in application code, and you either live with deteriorating performance (if the join method you chose is no longer appropriate), or you change the join logic in your program.
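As an aside, if you're curious about which join method the optimizer picked, EXPLAIN output will show you, and a rebind is all it takes to let DB2 reconsider (collection, package, and program names below are hypothetical):

  -- Rebind the package so the optimizer can re-select access paths
  -- (a DSN subcommand -- no program code changes needed):
  REBIND PACKAGE (MYCOLL.MYPROG) EXPLAIN(YES)

  -- With EXPLAIN(YES), the join method is recorded in PLAN_TABLE:
  -- METHOD = 1 is nested loop join, 2 is merge scan join, 4 is
  -- hybrid join.
  SELECT TNAME, METHOD
    FROM PLAN_TABLE
   WHERE PROGNAME = 'MYPROG'
   ORDER BY QUERYNO, PLANNO;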

And what about the effect of physical database changes? Suppose that the nested loop join you accomplish programmatically depends, for good performance, on the existence of a certain index on a target table. That same table may be over-indexed to the point that insert and delete operations are costing too much. It may be that if the index your programmatic join needs were removed, DB2 -- were it handling the join -- could switch to a merge join and deliver acceptable performance for your query. You can't switch from nested loop to merge join in your program (not without rewrite effort), and so your programmatic join may stand in the way of DBAs making a physical database change that could reduce run times for critical data-change operations.

Oh, and things get way more complex when the number of tables joined in a given SELECT statement increases. In what order should the tables be joined? Should the same join method be used all the way through, or would it be best to join A and B via nested loop, and then join that composite table to C via a merge join? And what about tables D, and E, and so on? Do you really want to sort this out yourself?

Bottom line: when you let DB2 do as much as it can, you leave the access path selection process up to DB2. That's a very good move on your part.

Application program simplification. As with reducing program-to-DB2 "chattiness" for improved CPU efficiency, leveraging SQL for application program simplification goes beyond the obvious, like having DB2 handle table joins and result set row grouping and aggregation -- you do best when you learn about and exploit the new features and functions that are delivered with every release of DB2. Do you need to assign a number -- perhaps a ranking number -- to each row in a result set based on some criterion of your choice? You could do that with your own program logic, or you could let DB2 do the work via its built-in RANK, DENSE_RANK, or ROW_NUMBER functions. Want to concatenate one character string value to another? Figure out the day of the week for a given date value? Get the number of seconds between midnight and a timestamp value? Overlay some portion of a character string with another string? Find a character string value with a pronunciation that is the same as, or close to, that of another string? Serialize an XML value into a character string? All these capabilities, and MANY MORE, are built into DB2 (on all platforms -- mainframe and LUW). Have you checked out the built-in DB2 functions (scalar and aggregate) lately? If not, give 'em a look. Your best source for DB2 function information is the SQL Reference. You can find that manual, for DB2 for z/OS and for DB2 for LUW, at IBM's Web site (the built-in functions are documented in Volume 1 of the DB2 for LUW SQL Reference).
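A quick illustration of a few of those capabilities in a single query (table and column names are hypothetical):

  -- Concatenation, ranking, day-of-week, seconds-after-midnight, and
  -- SOUNDEX, all handled by DB2 instead of application code:
  SELECT FIRST_NAME || ' ' || LAST_NAME      AS EMP_NAME,
         RANK() OVER (ORDER BY SALARY DESC)  AS SALARY_RANK,
         DAYOFWEEK(HIRE_DATE)                AS HIRE_DOW,
         MIDNIGHT_SECONDS(LAST_LOGIN)        AS SECS_AFTER_MIDNIGHT,
         SOUNDEX(LAST_NAME)                  AS NAME_SOUND
    FROM EMP
   ORDER BY SALARY_RANK;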

Beyond the built-in functions, you can move work into DB2 (and out of your application code) by way of CASE expressions, which provide for on-the-fly transformation of values returned by, or otherwise processed by, an SQL statement (e.g., "return X when the value in column COL1 is Y"), and with CAST specifications, which tell DB2 to change the data type of column values (e.g., from decimal to integer). Again, the SQL Reference is the place to go for more information.
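Both in one short example (hypothetical names again):

  -- CASE translates a status code on the fly; CAST changes the data
  -- type of a column value (decimal to integer here):
  SELECT ORDER_ID,
         CASE STATUS_CODE
           WHEN 'S' THEN 'SHIPPED'
           WHEN 'P' THEN 'PENDING'
           ELSE 'UNKNOWN'
         END AS STATUS_DESC,
         CAST(TOTAL_AMT AS INTEGER) AS WHOLE_DOLLARS
    FROM ORDERS;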

I haven't even mentioned things like the INTERSECT and EXCEPT result-set-comparison operators, date/time arithmetic, and moving aggregates. DB2 SQL is very powerful and is becoming more so as time goes by. You can leverage that power to remove complexity that would otherwise be in your application program code.
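For instance, EXCEPT and a bit of date arithmetic make short work of a "who hasn't done X" question (hypothetical tables once more):

  -- Customers with no orders in the past year:
  SELECT CUST_ID FROM CUSTOMER
  EXCEPT
  SELECT CUST_ID FROM ORDERS
   WHERE ORDER_DATE > CURRENT DATE - 1 YEAR;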

Code re-use. This is a hot topic these days, and with good reason: to the extent that application logic can be encapsulated in a readily reusable form, programmer productivity increases (less reinvent-the-wheel work) and so does organizational agility (application creation and extension is accelerated). DB2 can be a big help here. There are all kinds of ways to push logic into the DB2 level of an application, and when that's done said logic is made available to any kind of program that accesses the database. A simple example of logic-in-DB2 is DB2-defined referential integrity. Why burden programmers with the task of ensuring that data values in a column of a "child" table always have a corresponding value in a "parent" table column, when DB2 can do that? And when that's implemented at the DB2 level, it's there for the benefit of ALL DB2-accessing programs. Ditto table check constraints, which can ensure, among other things, that values inserted into a column fall within a certain range.
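In DDL terms (hypothetical tables, and assuming CUST_ID is the CUSTOMER table's primary key), both of those look like this:

  -- Referential integrity: every ORDERS row must reference an
  -- existing CUSTOMER row.
  ALTER TABLE ORDERS
    ADD CONSTRAINT FK_CUST FOREIGN KEY (CUST_ID)
        REFERENCES CUSTOMER (CUST_ID);

  -- Check constraint: order quantity must be in an allowed range.
  ALTER TABLE ORDERS
    ADD CONSTRAINT CK_QTY CHECK (QUANTITY BETWEEN 1 AND 999);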

Triggers are another way to push logic into the DB2 level of an application infrastructure. These can be used to enforce business rules for insert operations (rules that might be too complex to implement with table check constraints), to automatically maintain values in denormalized database tables ("if column X in table A is updated by a program, perform the same update for column Y in the matching row of table B"), to make otherwise read-only views updateable (this with INSTEAD OF triggers), and more. User-defined functions (UDFs) make certain routines (e.g., to perform certain data transformation operations) available to any program that can issue an SQL statement, and the same is true of DB2 stored procedures (these can be more sophisticated than user-defined functions, and they are invoked via the SQL statement CALL versus being referenced in SELECT statements as are UDFs).
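Here's a sketch of that denormalization-maintaining trigger idea (object names hypothetical):

  -- When a customer's name changes, propagate the new value to a
  -- denormalized reporting table automatically.
  CREATE TRIGGER CUSTNAME_SYNC
    AFTER UPDATE OF CUST_NAME ON CUSTOMER
    REFERENCING NEW AS N
    FOR EACH ROW MODE DB2SQL
    UPDATE CUST_REPORT
       SET CUST_NAME = N.CUST_NAME
     WHERE CUST_ID = N.CUST_ID;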

The more data-centric logic is implemented in the DB2 database, the more you as a programmer can focus on what delivers the greatest value to your employer: writing code that directly addresses the business functionality needs of the organization. If you have some data-access logic that you think could be broadly applicable in your enterprise, talk to a DB2 DBA and see about getting it deployed in the database layer of the application system. A lot of your colleagues could end up benefiting from that move, as their programs will also be able to utilize the DB2-implemented capability.

Now, you've probably heard it said that there's an exception to every rule, and in closing I'll mention one such exception to my "let DB2 do what it can" rule. This exception has to do with sorting result set rows. Sometimes, in a decision support application environment, a user will want the capability to sort data rows returned from DB2 by whatever field he chooses. Having DB2 do that initial sort is the right move, but if the user wants to sort the same result set by a different field, you might consider doing that at the user workstation level (or maybe at the application server level). This might be good for efficiency if the result set is not particularly large (maybe a few hundred rows or less) but a lot of host resources are needed to generate the result set (sometimes a lot of data may be scanned and aggregated to build a small result set). In that case, you might not want to send the query back to DB2 with just a different ORDER BY specification -- maybe you just want to re-sort the 20 or 50 or 100 (or whatever) rows at the client end of things. The real work here was generating the result set. A simple re-sort might best be done locally with respect to the end user (on the other hand, if the query in question is very fast-running and the result set is rather small, having DB2 re-order the rows at the user's request is probably no big deal).

So, with the occasional exception here and there, you really are doing the right thing by letting DB2 do all that it can do with SQL (and remember that some of this logic- and functionality-implementing SQL -- examples include the creation of triggers and UDFs, and the altering of tables to include check constraints or referential integrity rules -- is in a DBA's domain, so get help there when you need it). Your programs are likely to be more efficient, performance will probably be more consistent over time, the code you have to write will be simplified, and your organization overall will benefit from the accessibility and re-usability of capabilities built into your DB2 database. A lot to like, there.