Back in the 1980s, when DB2 was relatively new, you had, of course, a lot of application developers who were learning how to program using SQL (IBM invented the relational database and Structured Query Language). Though SQL is pretty intuitive, mastery takes time and so it was understandable that some programmers early on did in their code things that would more effectively be done by DB2. Examples of what I'm talking about include joining tables in application code (open a cursor on table A, fetch a qualifying row, and look for a matching row in table A), retrieving unordered result sets and sorting the rows programmatically, and failing to leverage the set-oriented nature of SQL (too often retrieving result set rows via a series of singleton SELECTs versus DECLARE CURSOR / OPEN CURSOR / FETCH, FETCH, FETCH, FETCH...).
Interestingly, we're here now at 27 years after DB2's introduction, and there are STILL programmers who are doing in their applications what DB2 should be doing by way of properly coded SQL statements. In this blog entry I'll try to make the case for allowing DB2 to do what it can do in terms of retrieving and updating data. There are several motivating factors when it comes to exploiting the power of SQL as fully as possible. The ones that are most important to me include the following:
Enhance CPU efficiency. When your program issues an SQL statement, obviously that statement has to get from your program to DB2, and after statement execution has completed control has to return to your application code. This program-to-DB2 round trip is not free. The CPU cost of DB2-and-back is (usually) readily grasped in a client-server setting, what with network send and receive processing and all, but there is overhead associated with getting to and from DB2 even when the SQL-issuing program is running on the same server as DB2 (as is often the case in a mainframe system, where CICS and batch programs access a local DB2 database). Get what you want with fewer SQL statements, and you'll reduce the cumulative CPU cost of trips across the application program-DB2 boundary. The effect of getting the job done with fewer trips to DB2 can be quite significant, and here it's important to focus not on a single unit of work but on an overall workload. A developer might run a comparison test between a more-SQL-statements and a fewer-SQL-statements approach to retrieving a particular result set or accomplishing a data change operation, and he might conclude that it's a wash because the elapsed time is about the same either way. Elapsed time is not the issue here -- CPU time is. That's the money metric. And a seemingly small difference in CPU time per transaction (something that a DB2 monitor would show) can end up being a big deal when hundreds of transactions execute per second, or when a batch job has to work through hundreds of thousands or even millions of records in an input file.
Lest you think that this is just a matter of having DB2 and not your application code join tables, I'll tell you: it goes beyond that. To really get things tight from a CPU efficiency perspective, you need to keep up with, and be prepared to take advantage of, new DB2 features and functions that can enable you to reduce "chattiness" between your program and DB2. The MERGE statement (sometimes referred to as "upsert"), with which you can make changes to a table based on a set of input records, updating target table data when there is a match with an input record, and inserting a new row into the table when there isn't a match? That's a CPU saver compared to the old methodology of doing a SELECT against the target table to see if there is a row that matches an input record and then driving an UPDATE (if there is a match) or an INSERT (if there's no match). Multi-row FETCH and multi-row INSERT (sometimes called block FETCH and block INSERT), means of, respectively, getting several rows from DB2 with one FETCH and placing several new rows in a table with one INSERT? Those are CPU savers versus the one-at-a-time way of doing things. Look for ways to do more with less SQL, and you'll reduce the load on your DB2 server.
Application performance consistency. OK, back to joining tables. I like this example because when you do this in application code YOU'RE making the access path decision. Opening a cursor on table A, fetching a qualifying row, and looking for a match in table B is, in essence, a nested loop join. How do you know that nested loop is the right way to go? What about a merge join or a hybrid join (DB2 for z/OS) or a hash join (DB2 for Linux/UNIX/Windows)? When DB2 performs the table join, the SQL optimizer determines the low-cost means of accomplishing the join, based on statistical information stored in DB2 catalog tables. The optimizer is very good at what it does, having been continually enhanced for upwards of thirty years (in addition to inventing the relational database and SQL, IBM invented cost-based SQL statement optimization). Do you really think that you'd do a better job of choosing the most efficient join method for a query? And where does my point about consistency come in? Well, suppose that you do a programmatic join and choose a nested loop method to get the job done. That may in fact be the right choice early on in the life of the application, when target tables -- and the query result set -- are relatively small. What about later, when the database has grown substantially and the query result set is perhaps much larger than before? Would a merge join be the better choice under those circumstances? If it is, and if you let DB2 do the join, the optimizer will take care of that access path change automatically (in the case of dynamic SQL) or with a simple REBIND PACKAGE command (for static SQL) -- no program code changes are needed. Do the join in application code, and you either live with deteriorating performance (if the join method you chose is no longer appropriate), or you change the join logic in your program.
And what about the effect of physical database changes? Suppose that the nested loop join that you accomplish programmatically depends for good performance on the existence of a certain index on a target table? That same table may be over-indexed to the point that insert and delete operations are costing too much. It may be that if the index your programmatic join needs is removed, DB2 -- were it handling the join -- could switch to a merge join and deliver acceptable performance for your query. You can't switch from nested loop to merge join in your program (not without re-write effort), and so your programmatic join may stand in the way of DBAs making a physical database change that could reduce run times for critical data-change operations.
Oh, and things get way more complex when the number of tables joined in a given SELECT statement increases. In what order should the tables be joined? Should the same join method be used all the way through, or would it be best to join A and B via nested loop, and then join that composite table to C via a merge join? And what about tables D, and E, and so on? Do you really want to sort this out yourself?
Bottom line: when you let DB2 do as much as it can, you leave the access path selection process up to DB2. That's a very good move on your part.
Application program simplification. As with reducing program-to-DB2 "chattiness" for improved CPU efficiency, leveraging SQL for application program simplification goes beyond obvious things like having DB2 do things like table joins and result set row grouping and aggregation -- you do best when you learn about and exploit new features and functions that are delivered with every release of DB2. Do you need to assign a number -- perhaps a ranking number -- to each row in a result set based on some criterion of your choice? You could do that with your own program logic, or you could let DB2 do the work via its built-in RANK, DENSE_RANK, or ROW-NUMBER functions. Want to concatenate one character string value to another? Figure out the day of the week for a given date value? Get the number of seconds between midnight and a timestamp value? Overlay some portion of a character string with another string? Find a character string value with a pronunciation that is the same, or close to, that of another string? Serialize an XML value into a character string? All these capabilities, and MANY MORE, are built into DB2 (on all platforms -- mainframe and LUW). Have you checked out the built-in DB2 functions (scalar and aggregate) lately? If not, give 'em a look. Your best source for DB2 function information is the SQL Reference. You can find that manual, for DB2 for z/OS and for DB2 for LUW, at IBM's Web site (the built-in functions are documented in Volume 1 of the DB2 for LUW SQL Reference).
Beyond the built-in functions, you can move work into DB2 (and out of your application code) by way of CASE expressions, which provide for on-the-fly transformation of values returned by, or otherwise processed by, an SQL statements (e.g., "return X when the value in column COL1 is Y"), and with CAST specifications, which tell DB2 to change the data type of column values (e.g., from decimal to integer). Again, the SQL Reference is the place to go for more information.
I haven't even mentioned things like the INTERSECT and EXCEPT result-set-comparison operators, date/time arithmetic, and moving aggregates. DB2 SQL is very powerful and is becoming more so as time goes by. You can leverage that power to remove complexity that would otherwise be in your application program code.
Code re-use. This is a hot topic these days, and with good reason: to the extent that application logic can be encapsulated in a readily reusable form, programmer productivity increases (less reinvent-the-wheel work) and so does organizational agility (application creation and extension is accelerated). DB2 can be a big help here. There are all kinds of ways to push logic into the DB2 level of an application, and when that's done said logic is made available to any kind of program that accesses the database. A simple example of logic-in-DB2 is DB2-defined referential integrity. Why burden programmers with the task of ensuring that data values in a column of a "child" table always have a corresponding value in a "parent" table column, when DB2 can do that? And, when that's implemented at the DB2 level then it's there for the benefit of ALL DB2-accessing programs. Ditto table check constraints, which can ensure, among other things, that values to be inserted into a column must exist within a certain range of values.
Triggers are another way to push logic into the DB2 level of an application infrastructure. These can be used to enforce business rules for insert operations (rules that might be too complex to implement with table check constraints), to automatically maintain values in denormalized database tables ("if column X in table A is updated by a program, perform the same update for column Y in the matching row of table B"), to make otherwise read-only views updateable (this with INSTEAD OF triggers), and more. User-defined functions (UDFs) make certain routines (e.g., to perform certain data transformation operations) available to any program that can issue an SQL statement, and the same is true of DB2 stored procedures (these can be more sophisticated than user-defined functions, and they are invoked via the SQL statement CALL versus being referenced in SELECT statements as are UDFs).
The more data-centric logic is implemented in the DB2 database, the more you as a programmer can do what delivers the greatest value to your employer: writing code that directly addresses the business functionality needs of the organization. If you have some data-access logic that you think could be broadly applicable in your enterprise, talk to a DB2 DBA and see about getting that deployed in the database layer of the application system. A lot of your colleagues could end up benefiting from that move, as their programs will also be able to utilize the DB2-implemented capability.
Now, you've probably heard it said that there's an exception to every rule, and in closing I'll mention one such exception to my "let DB2 do what it can" rule. This exception has to do with sorting result set rows. Sometimes, in a decision support application environment, a user will want the capability to sort data rows returned from DB2 by whatever field he chooses. Having DB2 do that initial sort is the right move, but if the user wants to sort the same result set by a different field, you might consider doing that at the user workstation level (or maybe at the application server level). This might be good for efficiency if the result set is not particularly large (maybe a few hundred rows or less) but a lot of host resources are needed to generate the result set (sometimes a lot of data may be scanned and aggregated to build a small result set). In that case, you might not want to send the query back to DB2 with just a different ORDER BY specification -- maybe you just want to re-sort the 20 or 50 or 100 (or whatever) rows at the client end of things. The real work here was generating the result set. A simple re-sort might best be done locally with respect to the end user (on the other hand, if the query in question is very fast-running and the result set is rather small, having DB2 re-order the rows at the user's request is probably no big deal).
So, with the occasional exception here and there, you really are doing the right thing by letting DB2 do all that it can do with SQL (and remember that some of this logic- and functionality-implementing SQL -- examples include the creation of triggers and UDFs, and the altering of tables to include check constraints or referential integrity rules -- is in a DBA's domain, so get help there when you need it). Your programs are likely to be more efficient, performance will probably be more consistent over time, the code you have to write will be simplified, and your organization overall will benefit from the accessibility and re-usability of capabilities built into your DB2 database. A lot to like, there.
No comments:
Post a Comment