Recently I've encountered some situations in which organizations running Db2 for z/OS in data sharing mode had lock structures that were not sized for maximum benefit. In this blog entry, my aim is to shed some light on lock structure sizing, and to suggest actions that you might take to assess lock structure sizing in your environment and to make appropriate adjustments.
First, as is my custom, I'll provide some background information.
One structure, two purposes
A lot of you are probably familiar with Db2 for z/OS data sharing. That is a technology, leveraging an IBM Z (i.e., mainframe) cluster configuration called a Parallel Sysplex, that allows multiple Db2 subsystems (referred to as members of the data sharing group) to share read/write access to a single instance of a database. Because the members of a Db2 data sharing group can (and typically do) run in several z/OS LPARs (logical partitions) that themselves run (usually) in several different IBM Z servers, Db2 data sharing can provide tremendous scalability (up to 32 Db2 subsystems can be members of a data sharing group) and tremendous availability (the need for planned downtime can be virtually eliminated, and the impact of unplanned outages can be greatly reduced).
One of the things that enables Db2 data sharing technology to work is what's called global locking. The concept is pretty simple: if an application process connected to member DBP1 of a 4-way (for example) Db2 data sharing group changes data on page P1 of table space TS1, a "local" X-lock (the usual kind of Db2 lock associated with a data-change action) on the page keeps other application processes connected to DBP1 from accessing data on the page until the local X-lock is released by way of the application commit that "hardens" the data-change action. All well and good and normal, but what about application processes connected to the other members of the 4-way data sharing group? How do they know that data on page P1 of table space TS1 is not to be accessed until the application process connected to member DBP1 commits its data-changing action? Here's how: the applications connected to the other members of the data sharing group know that they have to wait on a commit by the DBP1-connected application because in addition to the local X-lock on the page in question there is also a global lock on the page, and that global lock is visible to all application processes connected to other members of the data sharing group.
Where does this global X-lock on page P1 of table space TS1 go? It goes in what's called the lock structure of the data sharing group. That structure - one of several that make Db2 data sharing work, others being the shared communications area and group buffer pools - is located in a shared-memory LPAR called a coupling facility, and the contents of the structure are visible to all members of the data sharing group because all the members are connected to the coupling facility LPAR (and, almost certainly, to at least one other CF LPAR - a Parallel Sysplex will typically have more than one coupling facility LPAR so as to preclude a single-point-of-failure situation).
Here's something kind of interesting: a global lock actually goes to two places in the lock structure if it's an X-lock (associated with a data-change action), as opposed to an S-lock (associated with a data-read request). Those two places are the two parts of the lock structure, the lock table and the lock list:
- The lock table can be thought of as a super-fast global lock contention detector. How it works: when a global X-lock is requested on a page (or on a row, if the table space in question is defined with row-level locking), a component of the z/OS operating system for the LPAR in which the member Db2 subsystem runs takes the identifier of the resource to be locked (a page, in this example) and runs it through a hashing algorithm. The output of this hashing algorithm relates to a particular entry in the lock table - basically, the hashing algorithm says, "To see if an incompatible global lock is already held by a member Db2 on this resource, check this entry in the lock table." The lock table entry is checked, and in a few microseconds the requesting Db2 member gets its answer - the global lock it wants on the page can be acquired, or it can't (at least not right away - see the description of false contention near the end of this blog entry). This global lock contention check is also performed for S-lock requests that are associated with data-read actions.
- The lock list is, indeed (in essence), a list of locks - specifically, of currently-held global X-locks, associated with data-change actions. What is this list for? Well, suppose that member DBP1 of a 4-way data sharing group terminates abnormally (i.e., fails - and that could be a result of the Db2 subsystem failing by itself, or terminating abnormally as a result of the associated z/OS LPAR or IBM Z server failing). It's likely that some application processes connected to DBP1 were in the midst of changing data at the time of the subsystem failure, and that means that some data pages (or maybe rows) were X-locked at the time of the failure. Those outstanding X-locks prevent access to data that is in an uncommitted state (because the associated units of work were in-flight at the time of the failure of DBP1), but that blocking of access to uncommitted data is only effective if the other members of the data sharing group are aware of the retained page (or row) X-locks (they are called "retained locks" because they will be held until the failed Db2 subsystem can be restarted to release them - restart of a failed Db2 subsystem is usually automatic and usually completes quite quickly). The other members of the data sharing group are aware of DBP1's retained X-locks thanks to the information in the lock list.
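To make the two parts of the lock structure a little more concrete, here is a toy Python sketch of the ideas just described. It is only an illustration, not how z/OS or Db2 implements any of this: the class name ToyLockStructure, the use of CRC32 as the hashing algorithm, the resource names such as "TS1.P1" and the per-entry dictionaries are all assumptions made up for the example. The sketch shows a resource identifier being hashed to a lock table entry, a contention check against that entry for both X-lock and S-lock requests, and X-locks (only) being recorded in a lock list from which a failed member's retained locks can be read.

```python
import zlib


class ToyLockStructure:
    """Toy model of a lock structure: a hashed lock table plus a lock list."""

    def __init__(self, table_entries):
        self.table_entries = table_entries
        # Lock table: one dict per entry, mapping (member, resource) -> lock mode.
        self.lock_table = [{} for _ in range(table_entries)]
        # Lock list: currently held global X-locks, as (member, resource) pairs.
        self.lock_list = set()

    def _entry_for(self, resource):
        # Stand-in hash (an assumption): map a resource id to a lock table entry.
        return zlib.crc32(resource.encode("utf-8")) % self.table_entries

    def request(self, member, resource, mode):
        """Return True if the lock is granted, False if contention is indicated."""
        entry = self.lock_table[self._entry_for(resource)]
        for (holder, _held_resource), held_mode in entry.items():
            # S is compatible with S; anything involving X is incompatible.
            if holder != member and "X" in (mode, held_mode):
                # Contention is signaled at the entry level, so it can be
                # "false" when two different resources hash to the same entry.
                return False
        key = (member, resource)
        if entry.get(key) != "X":  # never downgrade a held X-lock
            entry[key] = mode
        if mode == "X":
            self.lock_list.add(key)  # only X-locks go on the lock list
        return True

    def release(self, member, resource):
        self.lock_table[self._entry_for(resource)].pop((member, resource), None)
        self.lock_list.discard((member, resource))

    def retained_locks(self, failed_member):
        """X-locks other members must keep honoring after a member fails."""
        return sorted(res for (mem, res) in self.lock_list if mem == failed_member)


structure = ToyLockStructure(table_entries=1024)
print(structure.request("DBP1", "TS1.P1", "X"))  # True: nothing held yet
print(structure.request("DBP2", "TS1.P1", "S"))  # False: DBP1 holds an X-lock
print(structure.retained_locks("DBP1"))          # ['TS1.P1']
```

Notice that the contention check looks only at the lock table entry that the hash points to, not at the specific resource. With a very small lock table (try table_entries=1), requests for different pages land on the same entry and get flagged as contending even though they don't really conflict, and that gives a sense of why the number of lock table entries matters.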