Robert's Db2 blog: Db2 for z/OS CHAR versus VARCHAR, Part 2

Last month, I posted to this blog an entry in which I provided my thoughts on the CHAR (fixed-length) versus the VARCHAR (varying-length) data type for character string columns in Db2 for z/OS tables. Soon after, I received a note from a longtime Db2 for z/OS DBA whom I've known for years. He had read the aforementioned blog post, and that prompted him to share with me a story of a VARCHAR issue with which he was presently dealing. I found it to be a quite interesting story, and I am sharing it by way of this post.

My DBA friend made an interesting discovery while analyzing indexes to see if space savings could be achieved via ALTER with NOT PADDED, for indexes created with a PADDED specification and defined on a key containing at least one varying-length column (more on NOT PADDED and index space savings at the end of this entry). He saw that there was a table column, defined as VARCHAR(50), that appeared to contain 50 bytes of data in every row of the table. Curious, he investigated further and found that, in fact, most of the bytes in most of those column values were trailing blanks - for some of the table's rows, ALL of the bytes in the column were blanks. The use of VARCHAR(50) instead of CHAR(50) was not saving ANY space for this column - in fact, it was causing a little more space to be used versus CHAR(50), owing to the 2-byte offset indicator for the VARCHAR column (assuming the use of reordered row format - with basic row format there would still be a two-byte indicator for each varying-length column, but it would be a length indicator instead of an offset indicator).

Now, a Db2 for z/OS DBA might be inclined to think, on having made such a discovery, "What were those application developers thinking? Why in the world would they pad a value (even a non-existent value) with blanks to completely fill up a VARCHAR column?" I would tell you, in that case, "First of all, chill out. Application developers who write programs that access Db2 for z/OS data are your best friends - don't rush to accuse them of malfeasance. Second, this odd fill-up-the-VARCHAR-column application behavior may not be - in fact very likely is not - the fault of a development team."

"Not the developers' fault?" you might ask. "What do you mean? How could it not be the fault of developers?" Two words: code generator. My DBA friend whose story I'm retelling here discovered that it was a code-generation tool that was padding to-be-inserted values, with blanks, to the maximum length allowed by a VARCHAR column. Not only that, but he reached into his personal memory bank and recalled a very similar situation he had encountered some years ago, when working for another company: similarly, values inserted into VARCHAR columns of Db2 tables were padded with blanks to the maximum allowable length, and similarly, this behavior was traced to the action of a code generation tool, as opposed to code written by application developers.

The DBA named names with regard to the code generation tool used at his site, and the different code generator product used at his former place of work. I won't name those products here, but I recognized the names and a lot of you would, too. My point: you may have, in your Db2 for z/OS environment, the same issue my DBA friend encountered in his. You, too, may have VARCHAR columns in Db2 tables that appear to be as full as they can be with data values, when in fact the columns largely - perhaps mostly - contain blanks (hexadecimal '40' values, if we're talking about EBCDIC-encoded tables).

How could you make this determination? You could do what the DBA at the center of this story did: issue a query similar to the one below, for VARCHAR column C1 of table T1 (and note that the really important information about trailing blanks is provided via the RTRIM built-in Db2 function - the STRIP function was used, as well, to see how prevalent leading blanks were in column values, in addition to trailing blanks):

SELECT
SUM(BIGINT(LENGTH(C1))) AS LENGTH_AS_IS,
SUM(BIGINT(LENGTH(RTRIM(C1)))) AS LENGTH_RTRIM,
SUM(BIGINT(LENGTH(STRIP(C1)))) AS LENGTH_STRIP,
COUNT(*) AS ROW_CT
FROM T1
WITH UR;

That query yielded this result:

LENGTH_AS_IS    LENGTH_RTRIM   LENGTH_STRIP   ROW_CT
---------+---------+---------+---------+---------+-------
1,493,634,404   844,348,683    844,345,206    89,019,782

Wow. Over 43% of the space occupied by values in the table column was filled with trailing blanks that had been added by the code generation tool to actual values as part of insert processing (1,493,634,404 - 844,348,683 = 649,285,721 trailing blanks, and 649,285,721 / 1,493,634,404 = 43.5%). The DBA did some due diligence to verify that data-retrieving programs would continue to function properly if, say, a SELECT of C1 from T1 for a given row returned 'ABCDEFGHIJ' instead of 'ABCDEFGHIJ' followed by 40 blanks. When it was determined that the functionality of data-retrieving programs would not be adversely impacted by the removal of trailing blanks from values in column C1 of table T1, the DBA initiated a process of removing those trailing blanks. Execution of that process has freed up disk space and improved the efficiency of Db2 utilities operating on the table. It has also resulted in leaner indexes.
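The story does not say how the trailing blanks were actually removed. As a minimal sketch of one possible approach, assuming the same table T1 and VARCHAR column C1 used in the query above, an UPDATE along these lines would trim only the rows that carry trailing blanks:

```sql
-- Sketch only, not the DBA's actual process.
-- T1 and C1 are the table and column names used in the query above.
--
-- The LENGTH comparison restricts the update to rows that really have
-- trailing blanks. A predicate such as C1 <> RTRIM(C1) would not do that,
-- because Db2 pads the shorter operand with blanks when it compares
-- character strings of unequal length.
UPDATE T1
   SET C1 = RTRIM(C1)
 WHERE LENGTH(C1) > LENGTH(RTRIM(C1));
```

For a table of this size (about 89 million rows), a single UPDATE would be one very large unit of work. It would probably be safer to run the statement against a limited key range at a time, with a commit after each range, and to follow up with a REORG so that the freed space is actually reclaimed. A value consisting only of blanks becomes a zero-length string, which is one more thing to cover in the due diligence with the application teams.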

Indexes? Yep, and it was an initial focus on indexes that led to the discovery of the padded-out column values. The DBA noticed that a number of indexes at his site with keys that included varying-length columns were defined with the PADDED option, so that the values would be padded to their maximum length in index entries. He figured that altering these indexes to NOT PADDED would save disk space and potentially lead to a reduction in the number of levels for the indexes (fewer index levels means fewer GETPAGEs for processes that access table rows using the index, and GETPAGEs are a major factor in the CPU cost of SQL statement execution). When the DBA issued an ALTER with NOT PADDED for a PADDED index that had a varying-length column in its key, and saw that the index did NOT get smaller as a result, he investigated to see why that was so, and that investigation led to the discovery of the VARCHAR column whose values were padded to max length with trailing blanks. Thanks to the aforementioned process that is removing those trailing blanks, the change to NOT PADDED for associated indexes is now yielding positive results.
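As a rough sketch of how PADDED indexes on varying-length key columns might be located, a catalog query such as the following could be used. It assumes the standard Db2 for z/OS catalog tables SYSIBM.SYSINDEXES, SYSIBM.SYSKEYS, and SYSIBM.SYSCOLUMNS. The index name MYSCHEMA.IX1 in the ALTER statement is hypothetical.

```sql
-- List PADDED indexes together with their varying-length key columns.
SELECT IX.CREATOR   AS IX_CREATOR,
       IX.NAME      AS IX_NAME,
       IX.TBCREATOR AS TB_CREATOR,
       IX.TBNAME    AS TB_NAME,
       K.COLSEQ     AS KEY_POSITION,
       K.COLNAME    AS KEY_COLUMN,
       C.COLTYPE    AS COLUMN_TYPE,
       C.LENGTH     AS MAX_LENGTH
  FROM SYSIBM.SYSINDEXES IX
  JOIN SYSIBM.SYSKEYS K
    ON K.IXCREATOR = IX.CREATOR
   AND K.IXNAME    = IX.NAME
  JOIN SYSIBM.SYSCOLUMNS C
    ON C.TBCREATOR = IX.TBCREATOR
   AND C.TBNAME    = IX.TBNAME
   AND C.NAME      = K.COLNAME
 WHERE IX.PADDED = 'Y'
   AND C.COLTYPE IN ('VARCHAR', 'VARG', 'VARBIN')
 ORDER BY IX.CREATOR, IX.NAME, K.COLSEQ
  WITH UR;

-- Change one such index to NOT PADDED (hypothetical index name).
ALTER INDEX MYSCHEMA.IX1 NOT PADDED;
```

Changing the padding attribute of an index typically leaves the index in a rebuild-pending state, so a REBUILD INDEX would normally be needed before the index is usable again. Check the ALTER INDEX documentation for your Db2 version before scheduling the change.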

You might want to do a little investigating at your site, to see if you have Db2 tables with VARCHAR columns that in fact are not varying-length because their values are all padded with trailing blanks to their maximum length (the query, shown above, that my DBA friend used could be helpful to you). If you find one or more such columns, do the due diligence that I mentioned - work with developers to ensure that removing the trailing blanks would do no harm, program functionality-wise. If you get an "all clear" from that due diligence work, get rid of those trailing blanks and watch the tables and associated indexes (if the latter are defined with NOT PADDED) become more space-efficient, and watch application and utility processes that access the tables and indexes become more CPU-efficient. This can be your story, too.
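For a quick check of a single column, a variant of the DBA's query can report the share of trailing blanks directly. This is a sketch under the same assumptions as before (table T1, VARCHAR column C1). The output column names are made up for illustration.

```sql
-- Sketch only: T1 and C1 stand in for the table and column being checked.
-- If MIN_LEN and MAX_LEN both equal the defined maximum length of the
-- column, every value fills the column completely, which is the pattern
-- described in this post.
SELECT COUNT(*)        AS ROW_CT,
       MIN(LENGTH(C1)) AS MIN_LEN,
       MAX(LENGTH(C1)) AS MAX_LEN,
       SUM(BIGINT(LENGTH(C1)))
         - SUM(BIGINT(LENGTH(RTRIM(C1)))) AS TRAILING_BLANKS,
       -- Percentage of stored bytes that are trailing blanks.
       -- NULLIF avoids a divide-by-zero error when all values are empty.
       DECIMAL(100E0
               * (SUM(BIGINT(LENGTH(C1)))
                  - SUM(BIGINT(LENGTH(RTRIM(C1)))))
               / NULLIF(SUM(BIGINT(LENGTH(C1))), 0),
               5, 2)   AS PCT_TRAILING_BLANKS
  FROM T1
  WITH UR;
```

Like the original query, this scans the whole table, so for large tables it is best run at a quiet time.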

2 comments:

akil, September 2, 2022 at 9:57 AM
Hi Rob,
There is a table with a huge amount of data and existing indexes, and a user wants to create a new index on it. What things do we need to check before creating a new index on an existing table?

Robert, September 2, 2022 at 3:27 PM
It's basically a cost-benefit analysis: will the new index provide sufficient value to offset its cost? The new index will consume disk space, and it will make all inserts and deletes targeting the underlying table a little more expensive (the same will happen to updates that change the value of an indexed column). And of course there's the CPU time required to create the new index. Tools such as IBM's Db2 SQL Performance Analyzer for z/OS can help you determine if a proposed index would help the performance of a query (see https://www.ibm.com/docs/en/dspafz/5.1.0?topic=analysis-creating-dropping-indexes).

Robert



Thursday, October 29, 2020

Db2 for z/OS CHAR versus VARCHAR, Part 2 - a DBA's Story
