Thursday, April 28, 2022

Db2 for z/OS: Answering Some Questions About DBATs (i.e., DDF Threads)

Sometimes, a DBA will email me a question about Db2 for z/OS, and I'll respond in a pretty comprehensive way, and I'll look at that outbound message and think to myself, "Hmm. I may have written most of a blog entry there." This is one of those cases. I recently got a couple of questions about Db2 database access threads, or DBATs (threads associated with access to a Db2 subsystem through the distributed data facility, aka DDF), and I think the questions and answers might be of interest to a lot of people in the Db2 community. I am therefore packaging them in this blog entry. Off we go:


Question: When does a DBAT go from being active to idle, so that it is subject to the idle thread timeout setting for the Db2 subsystem?

Answer: OK, the first thing to understand here is that a DBAT is always active. Even when it's in the DBAT pool, a DBAT is active - it's just in a so-called disconnected state. A DBAT is idle when it is in-use (i.e., paired with a connection, which happens when a transaction comes along by way of that connection) and it’s not doing anything (or, more accurately, the transaction that was using the DBAT appears to Db2 to be doing nothing). It's normal for there to be some idle thread time for a DDF transaction - a client-side program issues a SQL statement, the result is sent back to that program, and the DBAT is briefly idle until the transaction's next SQL statement is issued. No big deal there. It's when idle time becomes really elongated that a DBAT might be affected by the Db2 subsystem's idle thread timeout value. That timeout value is specified via the IDTHTOIN parameter in the Db2 DSNZPARM module. The default value for IDTHTOIN is 120 seconds (check out the value on your Db2 system, and see if it's set to something other than 120).

Normally, at end-of-transaction there is a commit, and at that time the DBAT that had been used in processing the transaction goes back to the DBAT pool and the connection with which the DBAT had been paired goes back to an inactive state (inactive connections, a key contributor to Db2's connection scalability, are a server-side thing, invisible to a connected application - an inactive connection will go back to an active state when the next transaction associated with the connection begins). Can a DBAT in the pool be affected by the Db2 system's idle thread timeout value? No, but it is subject a limit specified by another ZPARM parameter called POOLINAC (more on that to come).

Let's say that a DDF transaction starts but then never commits. That could happen because of a problem on the client application side, or it could be that the developer of the transaction program decided that a commit is not necessary because the transaction is read-only in nature (that in fact would be a not-good decision - every DDF transaction needs to commit, because even a read-only transaction will hold one or more table space or partition locks and one or more claims on database objects, and those locks and claims will not be released without a commit). Because the transaction has not committed, it is perceived by Db2 to be still in-flight, and for that reason the transaction's DBAT can't be separated from the associated connection and returned to the DBAT pool. The apparently (to Db2) in-flight transaction continues to do nothing, and the related DBAT remains idle for a longer and longer period of time. Eventually the IDTHTOIN limit will be reached for the idle thread (unless IDTHTOIN is set to 0, which means a DBAT can remain indefinitely idle), and Db2 terminates the DBAT and the associated connection.

So, to recap: first, a DBAT does not go from active to idle, because a DBAT is always considered to be active - it's an in-use DBAT, as opposed to an in-the-pool DBAT, that can be idle. Second, an in-use DBAT will typically have at least some idle time (time when it seems to Db2 that the transaction associated with the DBAT is not doing anything - or, at least, not doing any SQL-related thing); it's when that "nothing SQL-related is happening" time gets long that the Db2 idle thread timeout limit can be reached for a DDF transaction and its DBAT.

[By the way, I mentioned earlier that when Db2 terminates a DDF transaction and its DBAT due to the idle thread timeout limit being reached, Db2 also terminates the connection with which the DDF transaction had been associated. If you'd prefer for Db2 to preserve the connection while terminating the transaction and the DBAT, you can get that behavior thanks to an enhancement introduced with Db2 12 for z/OS. The enhancement is enabled via specification of EXCEPTION_ROLLBACK as an attribute of a MONITOR IDLE THREADS row in the Db2 table SYSIBM.DSN_PROFILE_ATTRIBUTES. You can find more information about this enhancement in the Db2 for z/OS online documentation, at https://www.ibm.com/docs/en/db2-for-zos/12?topic=threads-monitoring-idle-by-using-profile-tables.]


Question: We had a DDF transaction surge, and as a result the number of DBATs went way up. The surge passed, and several minutes later I checked on the number of DBATs and it was still way high. What's with that?

Answer: There were (I'm pretty sure) two factors involved here. First, the POOLINAC value. That's a ZPARM parameter. If a DBAT in the pool has gone a POOLINAC number of seconds without being reused for a transaction, that DBAT will be subject to termination by Db2. For the Db2 subsystem looked after by the DBA who asked me this question, the POOLINAC value was 900 seconds, considerably higher than the default value of 120 seconds (I personally favor setting POOLINAC to the default value of 120 and leaving it there unless there's a good reason to make a change). A high POOLINAC value will definitely slow down the trimming of the number of pooled DBATs after the passing of a DDF transaction surge, but I think something else was going on, as well.

That "something else" was a change in Db2's termination of "too-long-in-the-pool-without-reuse" DBATs, effected by the application of the fix for Db2 APAR PH36114 (that fix came out in June of 2021, and I say that "I think" the fix was involved in this situation, rather than "I know," because I did not verify that the fix was applied to the Db2 subsystem in question - I'm inferring that based on the behavior reported by the DBA). Here's the deal: prior to the change associated with PH36114, Db2 would check the DBAT pool every two minutes to see if any DBATs in the pool had been there for a POOLINAC number of seconds without being reused. However many "too-long-in-the-pool-without-reuse" DBATs were found, were terminated by Db2. If a lot of DBATs went back to the pool at around the same time following the rapid subsidence of a DDF transaction surge, Db2 might find in a subsequent pooled DBAT purge cycle that a lot of DBATs needed to be terminated at one time due to the POOLINAC limit being exceeded. With the ZPARM parameter REALSTORAGE_MANAGEMENT set to AUTO or ON (and AUTO is the default), terminating a lot of DBATs at one time could put a good bit of pressure on the z/OS LPAR's ESQA resource, which in turn could cause spill-over into ECSA, which in turn could be bad news for an LPAR with only a small cushion of unused ECSA space.

With the fix for PH36114 applied, two things changed in this picture:
  1. Db2 went from checking every two minutes for DBATs with "too-long-in-the-pool-without-reuse" status to doing that every 15 seconds.
  2. In a given purge cycle (again, now once every 15 seconds), Db2 will terminate a maximum of 50 DBATs in the "too-long-in-the-pool-without-reuse" category.
What this means: a big pile of pooled DBATs left over from a since-passed DDF transaction surge will be worked down more frequently and more gradually. That could somewhat elongate the process of finalizing the right-sizing of the DBAT pool for a now-back-to-normal volume of DDF transactions, but it will avoid the pressure on ESQA that could result from the more-aggressive purging of "too-long-in-the-pool-without-reuse" DBATs that Db2 did prior to the PH36114 fix. It's a good trade-off, in my opinion.

Maybe you now know a few things about DBATs that you didn't know before. Thanks for visiting the blog, and I hope you'll return sometime.

4 comments:

  1. thank you Robert for this useful post. I follow your posts for years and appreciate that you are among the fews that go far over the manuals. These twos QA address my 2 misunderstandings on DBAT. I am testing High Performance DBAT and i think that because of this, the savings is less interesting than the savings of CICS thread reuse as the DBAT is « already » reused …

    ReplyDelete
    Replies
    1. Actually, implementing high-performance DBAT functionality can be a significant performance-booster for a Db2 DDF workload (particularly from a CPU efficiency perspective). True, DDF threads (DBATs) are "already reused," even without utilization of hi-perf DBATs, in the sense that they tend to be reused from the DBAT pool versus Db2 having to create new DBATs to service new transactions (in a given time period, the number of DBATs created by Db2 will usually be way smaller than the number of re-uses of existing DBATs); however, hi-perf DBATs leverage thread re-use by keeping allocated to a DBAT the RELEASE(DEALLOCATE) packages executed via the DBAT, and the table space or partition locks acquired in the execution of RELEASE(DEALLOCATE) packages, until thread termination (generally after 200 re-uses of the thread). Without hi-perf DBAT functionality being in effect, those packages and associated "parent" locks will be released at the end of each transaction. Because DDF transactions often have micro-service characteristics that include very low average in-Db2 CPU time per transaction, the proportionate cost of constantly releasing and reacquiring packages and parent locks can be fairly substantial, and eliminating that cost by way of high-perf DBATs can shave quite a bit off of average per-transaction in-Db2 CPU time.

      Robert

      Delete
  2. Regarding DBAT termination (as mentioned in the above post): Once created, why does a DBAT need to be eventually broken down (i.e., relegated to the poolinac 'status', and ultimately terminated)? What's the downside of keeping them around indefinitely (i.e., keep them inactive, or reuse them)?

    ReplyDelete
    Replies
    1. My understanding is that it has to do with maximizing CPU and virtual storage efficiency for a DDF workload. Suppose a DDF transaction surge takes the number of DBATs up to 1500, and the surge passes and you're back to a transaction volume that can be well-served by 200 DBATs. Getting the number of DBATs trimmed down from 1500 to 200 reduces CPU overhead (managing more threads requires more CPU) and Db2's virtual storage footprint (that can be particularly important with regard to the use of common virtual storage areas such as ECSA, which gets some consumption related to DBATs). If you can automatically and non-disruptively get the number of DBATs reduced (after a transaction surge) to a smaller figure while still having enough for the DDF workload, why not do that? I believe that was the thinking of the DDF development team.

      Robert

      Delete