Wednesday, September 30, 2020

Db2 for z/OS Character Columns: Thoughts on CHAR versus VARCHAR

Recently, a Db2 for z/OS DBA asked me a question: for a table column that will hold character string values, are there any recommendations concerning the use of CHAR (fixed-length) versus VARCHAR (variable-length) for the column's data type?

That question has been asked, I think, many times by many different people over the 35+ years since Db2 made the scene. Even so, I find it to be an interesting query to this day. Why? Because various Db2 developments - some relatively recent and some from way back - have caused the answer to change over time. You can read on and see if you agree with me.

Now, it is of course true that in at least one case the question has a very simple answer: if a value to be stored in the column can exceed 255 bytes in length, CHAR cannot be the data type, because 255 bytes is the maximum length you can specify for a fixed-length Db2 character string column. Longer than that and it has to be VARCHAR, right? Probably right, but maybe not - you could go with CLOB. Do you think that CLOB is the right choice only when the length of a character string value that will be stored in a column might exceed 32,704 bytes - the maximum length of a VARCHAR column? I'd say that that is not always true - there could be a situation in which you'd choose CLOB over VARCHAR even if no value longer than 32,704 bytes will go in the column. Suppose that the maximum length of a character string value that will be stored in a column is "only" 20,000 bytes, but the column will very rarely be referenced in a query's select-list and maybe never (or almost never) referenced in a query predicate. In that case, CLOB could be a good choice for the column's data type. Why? Because CLOB values (unless in-lined, and you wouldn't inline CLOB values that are rarely retrieved and rarely referenced in predicates) are physically stored in an auxiliary table in a LOB table space. The LOB table space could (should) be assigned to a buffer pool other than the one used for the base table's table space, and THAT would allow a lot more of the base table's pages (sans the LOB values) to be cached in a given amount of buffer pool space. You'd also have more rows in each base table page. More pages in memory and more rows per page means fewer GETPAGEs, fewer read I/Os, better CPU efficiency and better throughput. CLOB can be the right choice for a character string column, even when VARCHAR is technically usable.
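To make that a little more concrete, here is a minimal DDL sketch - all of the database, table space, table, column, index and buffer pool names are hypothetical, made up purely for illustration - showing a rarely referenced CLOB column whose values are physically stored in a LOB table space assigned to a buffer pool other than the one used for the base table's table space:

   -- Base table's table space, assigned to one buffer pool
   CREATE TABLESPACE NOTETS IN NOTEDB
     MAXPARTITIONS 1
     BUFFERPOOL BP2;

   -- Base table with a rarely referenced, up-to-20,000-byte character column
   -- typed as CLOB (Db2 adds an implicit hidden ROWID column for a table with
   -- a LOB column if you do not define one explicitly)
   CREATE TABLE NOTES
     (NOTE_ID   INTEGER NOT NULL,
      NOTE_TEXT CLOB(20K))
     IN NOTEDB.NOTETS;

   -- LOB table space in a DIFFERENT buffer pool, so that base table pages
   -- (sans the CLOB values) get the base buffer pool to themselves
   CREATE LOB TABLESPACE NOTELOBS IN NOTEDB
     BUFFERPOOL BP3;

   -- Auxiliary table (and its required index) that physically holds the CLOB values
   -- (Db2 can create the LOB objects implicitly in some cases; explicit DDL shown here for clarity)
   CREATE AUXILIARY TABLE NOTES_TEXT_AUX IN NOTEDB.NOTELOBS
     STORES NOTES COLUMN NOTE_TEXT;

   CREATE UNIQUE INDEX XNOTESAUX ON NOTES_TEXT_AUX;

With that arrangement, queries that do not reference NOTE_TEXT never touch the LOB table space's pages at all.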

How about the other end of the column-length spectrum: super-short character strings? Here again, the best choice is often pretty clear, owing to the two bytes of "baggage" that come along with a variable-length character column. For many years, you knew what that two bytes of baggage was: it was the length indicator for the VARCHAR column. These days, it could be a length indicator, or it could be an offset indicator. It will be one or the other depending on whether basic row format is used for the table's table space or partition (in which case it will be a length indicator) or reordered row format is in effect (in which case it is an offset indicator). I'll have more to say about basic versus reordered row format in a moment, but either way you're looking at two extra bytes for a varying-length value. Because of those two extra bytes, going with VARCHAR for a column that will hold values with a maximum length of one or two bytes does not make the best use of disk or buffer pool space - for optimal efficiency, use CHAR(1) for a column whose values will always be one byte in length, and CHAR(2) for a column whose values will be at most two bytes in length.
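Here is a quick sketch of what that looks like in DDL (the database, table space, table and column names are hypothetical, made up for illustration):

   -- For a one-character status flag, CHAR(1) stores 1 byte per row;
   -- VARCHAR(1) would store 1 byte of data plus a 2-byte length (or offset)
   -- indicator - three bytes to hold one byte of information
   CREATE TABLE ORDER_STATUS
     (ORDER_ID    INTEGER NOT NULL,
      STATUS_FLAG CHAR(1) NOT NULL WITH DEFAULT,   -- not VARCHAR(1)
      REGION_CD   CHAR(2) NOT NULL WITH DEFAULT)   -- not VARCHAR(2)
     IN ORDERDB.ORDERTS;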

Having considered values longer than 255 bytes and shorter than 3 bytes, we are left with the byte-length range 3-255 to ponder. Within that range, you could go with CHAR or VARCHAR. What might influence your decision? Here are some thoughts:

  • What is the variability in length of values that will go in the table column? The greater the variability, the more space that could be saved via use of VARCHAR versus CHAR (though table space compression, which is widely used, offsets much of this advantage of VARCHAR).
  • How important is it to make access to the data values in a column absolutely as efficient as possible? Maximum efficiency of access is generally associated with fixed-length columns, though reordered row format (RRF) makes access to VARCHAR columns more efficient than it used to be. How does reordered row format do that? It does that by physically storing variable-length columns at the end of a table's rows (logical column order is not affected by reordered versus basic row format) and preceding those variable-length columns with a series of 2-byte offset indicators - one for each variable-length column. The offset indicator for a given variable-length column shows where in the row the first byte of that column can be found. This is good for performance because to access a value in a variable-length column, Db2 goes straight to that column's offset indicator (easily done, because Db2 knows which of those 2-byte indicators it is going to, and the offset indicators are preceded by whatever fixed-length columns are in the row). Having gotten efficiently to the right offset indicator, Db2 sees exactly where to go to get to the variable-length column's first byte - boom-boom, done. Contrast that with the basic row format (BRF) situation, in which physical column order is the same as logical order in a row. If Db2 wants to access the fourth variable-length column in a BRF row, it can't get there without checking the length of each of the three variable-length columns ahead of the column of interest. Now, does RRF make access to variable-length columns just as efficient as access to fixed-length columns? Not quite, but RRF does reduce the access-efficiency gap between fixed-length and variable-length columns (there is a small illustration of the two row formats following this list).
  • How important is it to minimize the volume of data written to the Db2 log? Where UPDATE operations are concerned, Db2 tends to log fewer bytes when fixed-length character columns are updated than when values in variable-length columns are updated. That said, as reordered row format reduces the access-efficiency penalty associated with variable-length columns (as explained above), so too does RRF reduce the logging-volume penalty that goes along with such columns. How so? Well, when one fixed-length column in a row is updated, just that column change is logged. When a variable-length column is updated and that update changes the row's length (typically the case), data is logged from the first changed byte to the end of the row. Because RRF physically moves variable-length columns to the end of data rows, there will likely be fewer bytes between a variable-length column and the end of its row, and that means fewer bytes written to the log when that variable-length column is updated in a way that changes the column's (and thus, the row's) length. I have to say here that data compression muddies the water a bit, since with compression in effect an update of even a fixed-length column could change a row's length, leading to more bytes logged due to update activity (though that effect can be mitigated by the fact that data associated with changes to compressed tables is logged in compressed form). Still, when all is said and done, it is likely that maximum efficiency from a log-write perspective would be achieved by use of fixed-length rather than variable-length columns, particularly for tables in uncompressed table spaces. You might have - or might want to have - heavily updated tables in uncompressed table spaces when those tables are not too large (the value of compression is a little less compelling for tables that are not so big).
  • Do you want to avoid row-location changes that result from UPDATEs that increase a row's length? When an UPDATE of a variable-length column increases a row's length so that the row will no longer fit in its page, Db2 will relocate the row in the table and will put in the old location a pointer to the row's new location. That's called an indirect reference, and those can have a negative impact on data-access efficiency (REORGs clean those indirect references up). Use of fixed-length character columns tends to result in fewer indirect references for a table (with data compression being a bit of a muddying factor, as it is for log-write volumes, and for the same reason: an update of a fixed-length column in a compressed row can cause that row to get longer). That said, the indirect-reference penalty associated with variable-length columns can be mitigated through use of the PCTFREE FOR UPDATE table space specification that was introduced with Db2 11 for z/OS (I posted an entry on PCTFREE FOR UPDATE to this blog a few years ago, and a short DDL example of the specification follows this list).
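To picture the reordered-row-format point made in the list above, consider a hypothetical table (all names made up for illustration) created with this logical column order:

   CREATE TABLE CUST
     (CUST_NO   CHAR(8)      NOT NULL,   -- fixed-length
      NAME      VARCHAR(40)  NOT NULL,   -- variable-length
      STATE_CD  CHAR(2)      NOT NULL,   -- fixed-length
      COMMENTS  VARCHAR(200))            -- variable-length
     IN CUSTDB.CUSTTS;

   -- Basic row format (BRF): physical order = logical order
   --   CUST_NO | len(NAME) | NAME... | STATE_CD | len(COMMENTS) | COMMENTS...
   --   To reach COMMENTS, Db2 must first read NAME's length indicator.
   --
   -- Reordered row format (RRF): fixed-length columns first, then the
   -- 2-byte offset indicators, then the variable-length columns
   --   CUST_NO | STATE_CD | offset(NAME) | offset(COMMENTS) | NAME... | COMMENTS...
   --   To reach COMMENTS, Db2 goes straight to the second offset indicator
   --   (its position in the row is known) and from there straight to COMMENTS.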
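And with regard to the PCTFREE FOR UPDATE item in the list above, here is a minimal sketch of the specification (the table space name is hypothetical, and the percentage values are just examples - appropriate values depend on your data and update patterns):

   -- Reserve space in each page for length-increasing UPDATEs, so that
   -- grown rows are less likely to be relocated (fewer indirect references)
   ALTER TABLESPACE ORDERDB.ORDERTS
     PCTFREE 5 FOR UPDATE 10;

   -- The same clause can be specified on CREATE TABLESPACE; specifying
   -- FOR UPDATE -1 tells Db2 to calculate the update free space itself,
   -- based on real-time statistics. An ALTER like the one above takes
   -- effect for pages formatted after the change (e.g., by a REORG or LOAD).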

And with that, I've provided just about everything I can think of on the subject of CHAR versus VARCHAR for character string values. I'd sum it up this way: the CHAR versus VARCHAR choice is most interesting for character string values that will be between 3 and 255 bytes in length. For such columns, there are situations that would favor CHAR over VARCHAR, and vice versa. Even so, Db2 features and functions such as table space compression, reordered row format, and PCTFREE FOR UPDATE can reduce the negative effects associated with the use of CHAR or VARCHAR in these various situations. My advice to you: take advantage of Db2 functionality, and make the CHAR or VARCHAR decisions that seem reasonable to you given your application and data environment.