Monday, June 24, 2013

DB2 for z/OS Business Analytics: Moving the Queries to the Data

In the world of business, data has become the coin of the realm. Organizations that effectively leverage their informational assets tend to outperform peers and rivals. Given the value of data in today's marketplace, it's no wonder that companies are keenly focused on data security and availability, and on data delivery mechanisms that will keep pace with business growth. For 30 years, firms evaluating data-serving platforms in light of the aforementioned criteria -- security, availability, and scalability -- have chosen DB2 for z/OS and IBM System z servers. A big percentage of the data generated, and otherwise owned, by large enterprises around the world is stored in DB2 for z/OS databases.

Now, having data is one thing, but to get maximum value from this resource, you have to use it -- to identify new market opportunities, to optimize product mix, to manage risk, to better understand customer needs and buying behaviors. Thus the importance of business analytics, which can be thought of as the means whereby information is made to yield insight. For quite a few years, the conventional wisdom held that business analytics was best done on distributed systems platforms; so, organizations wishing to analyze, for decision support purposes, data stored in DB2 for z/OS databases, routinely copied that data to databases running under Linux, UNIX, or Windows on x86- or RISC-based servers. The copied data would then be accessed by various business intelligence tools.

The argument for performing business analytics in this fashion was largely cost-based. People acknowledged that mainframe systems were great as data-of-record repositories -- the super-sized, super-reliable data vaults on which run-the-business applications were built -- but they assumed that System z was overkill, cost-wise, as a platform for query and reporting applications. Interestingly, as Linux-, UNIX- and Windows-based analytics systems grew and proliferated within enterprises, a funny thing happened: the "inexpensive" alternative started to become notably more expensive, as costs mounted in several areas:
  • Source data extract and change replication. Aside from the costs associated with extract and load processing to build an initial query/reporting database on a distributed systems platform, there is the cost of keeping the copied data up to date, as demanded by BI application users. At some sites, growth in the number of query/reporting databases resulted in a change data replication infrastructure that looked something like a spider web when viewed diagrammatically. This infrastructure could end up being quite resource intensive from a support perspective.
  • Environmental factors. Floor space, electricity, and data center cooling costs mount as x86- and RISC-based analytics servers multiply.
  • Data security and integrity. As more and more copies of data sourced from a DB2 for z/OS database pop up in the enterprise, control and management of the data becomes increasingly problematic. Data copies can become out-of-sync with respect to the source, and data security safeguards present on the mainframe system may be lacking on some query and reporting platforms.

Faced with the escalating costs of a "move the data to the query systems" approach, a growing number of companies have opted to go a different route: they are moving the queries to the data. In other words, enterprise data is analyzed where it lives: on the mainframe.

"Moving the queries to the data" is a beautifully simple concept that is grounded in advanced technology and open standards. It starts with the fact that the leading query and reporting tools -- from IBM and from other vendors -- absolutely can be used with DB2 for z/OS. If your preferred business analytics software utilizes JDBC or ODBC to access relational databases (very often the case), you're in good shape, as IBM provides DB2 drivers for those interfaces (and others, as well).

If you query mainframe data on the mainframe (versus copying it to some other platform for analytics purposes), you might still decide to extract data from a source operational database into a query and reporting database that also resides on a System z server. Why? There are several potential reasons:
  • "Data stability" during daytime query hours. Sometimes, users want to get the same result for a given query whether it is executed at 9 AM or at 4 PM on a given day -- in other words, they don't want the data "changing underneath them." In such cases, updates of data in the query and reporting database are processed on a nightly basis.
  • Database design differences. Whereas the data in the source operational tables is likely organized in traditional third normal form, data to be queried via business intelligence tools might need to be arranged in a star schema manner (a simple illustration follows this list).
  • Data transformation requirements. Data values might need to be transformed in some ways before being loaded or inserted into query and reporting tables.
  • Workload isolation. Particularly in a non-data sharing DB2 for z/OS environment, directing analytics users to a set of tables separate from the production operational tables can increase the degree of physical separation between the business intelligence workload and the "run the business" transactional and batch workload. The separate tables could be assigned to their own buffer pools to reduce contention for server memory resources. They could even be placed in a different DB2 subsystem on another z/OS LPAR, perhaps on the same mainframe server, adding CPU isolation to the memory isolation provided by the separate buffer pools (different processors on a System z server can be dedicated to different z/OS LPARs on that server).
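
To make the database design point above a bit more concrete, here is a minimal sketch of what a star schema arrangement might look like. The table and column names are hypothetical, and the design is deliberately bare-bones; an actual dimensional model would reflect your organization's own business questions:

CREATE TABLE SALES_FACT
 (SALE_DATE_KEY  INTEGER       NOT NULL,
  PRODUCT_KEY    INTEGER       NOT NULL,
  STORE_KEY      INTEGER       NOT NULL,
  QUANTITY_SOLD  INTEGER       NOT NULL,
  SALE_AMOUNT    DECIMAL(11,2) NOT NULL);

CREATE TABLE PRODUCT_DIM
 (PRODUCT_KEY      INTEGER     NOT NULL,
  PRODUCT_NAME     VARCHAR(60) NOT NULL,
  PRODUCT_CATEGORY VARCHAR(30) NOT NULL);

-- BI queries typically join the large fact table to the smaller
-- dimension tables and aggregate the results:
SELECT D.PRODUCT_CATEGORY, SUM(F.SALE_AMOUNT) AS TOTAL_SALES
FROM SALES_FACT F, PRODUCT_DIM D
WHERE F.PRODUCT_KEY = D.PRODUCT_KEY
GROUP BY D.PRODUCT_CATEGORY;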

Whatever the reason for copying operational DB2 for z/OS data into query and reporting tables, that task is made simpler and more efficient when the source and target tables are both managed by DB2 for z/OS.

In some cases, organizations point query and reporting tools directly at DB2 for z/OS operational tables -- in other words, data copying is not part of the picture. Is that really feasible? Yes, it is, thanks largely to the highly advanced workload management capabilities of z/OS, an operating system that for decades has handled highly differentiated workloads (e.g., transactional and batch) concurrently in a single system image. Having analytics users query the same tables as operational applications can be a particularly attractive option when DB2 for z/OS is running in data sharing mode on a Parallel Sysplex cluster configuration. In that setting, a location alias could be used to restrict business analytics users to (for example) two members of a six-member data sharing group, with the operational applications directed to the other four members of the group (by way of another location alias for DRDA requesters, via the subgroup attach feature introduced with DB2 10 for batch jobs, and through the placement of CICS regions for a CICS-DB2 workload). With this type of set-up, the operational and analytics applications could access the same DB2 tables with no contention for memory or CPU resources (assuming that the DB2 members used for the query and reporting workload run on LPARs in the Parallel Sysplex separate from the LPARs on which the DB2 members handling the operational applications run).

Here's another interesting configuration option for mainframe business analytics: when BI queries involve particularly large scans of data and/or are particularly complex, dramatic reductions in query elapsed times can be achieved through the use of an IBM DB2 Analytics Accelerator. These servers extend the capabilities of a DB2 for z/OS system running a business intelligence workload, delivering "shockingly fast" performance for formerly long-running queries (borrowing a phrase used by an IT manager friend of mine at a site that recently started using DB2 Analytics Accelerator technology), while requiring nothing in the way of changes on the client side of the BI application (users continue to direct their queries to the same DB2 location, and the DB2 optimizer determines whether a given query should be processed locally, i.e., on the DB2 for z/OS front-end, or routed to the Analytics Accelerator). This is way cool technology about which I'll write more in a subsequent blog entry.
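
As an illustration of how transparent accelerator routing is to users and applications, here's a minimal, hedged sketch involving the CURRENT QUERY ACCELERATION special register, which indicates whether dynamic queries are candidates for routing to an accelerator (the table and column names are made up, and the available register settings and system-level defaults depend on your DB2 code level, so check the documentation for your environment):

-- Indicate that eligible dynamic queries may be routed to the accelerator
SET CURRENT QUERY ACCELERATION = ENABLE;

-- A large-scan, aggregation-heavy query of the kind that tends to benefit;
-- the DB2 optimizer decides whether it runs locally or on the accelerator
SELECT CUST_REGION, SUM(ORDER_AMOUNT) AS TOTAL_ORDER_AMOUNT
FROM ORDER_HISTORY
GROUP BY CUST_REGION;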

Topping it all off are the functionality enhancements, delivered in each succeeding release of DB2 for z/OS, that are especially useful in a business analytics context. Just in the last couple of releases (Versions 9 and 10), we've gotten OLAP specifications such as rank, dense rank, and moving aggregates; index on expression, which can make predicates that reference expressions on columns (a scalar function applied to a column, for example) indexable; temporal data support, which enables one to add a time dimension to data in a table; the access path repository, a set of new DB2 10 catalog tables (including SYSQUERYOPTS) that enables, among other things, DB2 server-side control of parallelization for dynamic queries at the statement level; a slew of new built-in functions (among them LPAD, MONTHS_BETWEEN, OVERLAY, and TRIM); new data types such as BIGINT, DECFLOAT, and XML; and real-time scoring, which allows for evaluation of newly inserted or updated database values against an SPSS predictive model as part of a transaction's flow, with virtually no impact on transaction elapsed time. And there's more of this to come with DB2 11 for z/OS, now in beta test.
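
To give a flavor of a few of these features, here is a brief, hedged sketch. The table and column names are hypothetical, and exact syntax options vary somewhat by DB2 version and mode, so treat this as illustrative rather than definitive:

-- RANK and a moving average, using OLAP specifications
SELECT SALES_REP,
       SALE_MONTH,
       MONTHLY_TOTAL,
       RANK() OVER (ORDER BY MONTHLY_TOTAL DESC) AS SALES_RANK,
       AVG(MONTHLY_TOTAL) OVER (ORDER BY SALE_MONTH
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MOVING_3_MONTH_AVG
FROM MONTHLY_SALES;

-- An index on an expression, so that a predicate such as
-- WHERE SALARY + BONUS > 100000 can be a matching index predicate
CREATE INDEX TOTAL_COMP_IX ON EMP (SALARY + BONUS);

-- A query of a system-period temporal table "as of" a point in the past
SELECT POLICY_ID, COVERAGE_AMT
FROM POLICY FOR SYSTEM_TIME AS OF TIMESTAMP '2013-01-01-00.00.00'
WHERE POLICY_ID = 1234;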

Is DB2 for z/OS the source DBMS for much of your company's "of record" data? Are you copying this data out to various other platforms for query and reporting purposes? Have you stopped to consider whether moving the queries to the data -- i.e., running your business analytics queries on the source DB2 for z/OS system -- might be a less complicated, more efficient, more secure, and more cost-effective approach? Leveraging DB2 for z/OS and System z as a platform for business analytics has been a winning strategy for many organizations. It might be for yours, as well.

Thursday, June 6, 2013

DB2 10 for z/OS and Enterprise Identity Mapping

You've probably seen this type of commercial many times on television: a frazzled individual vents his or her frustration over the difficulty of accomplishing some task, and finally cries out, "There must be a better way!" At which point the announcer chimes in, with a smooth and confident voice, "There is!" And then you get the product pitch (and it all concludes with the smiling, nodding approval of the formerly rattled, now happy-in-accomplishment user of the promoted product or service).

OK, now imagine you're a security administrator. You've got end users who log into an application server using their network IDs. An application running in that server accesses DB2 for z/OS data via TCP/IP connections (from a DB2 perspective, it is a DRDA requester). In connecting to DB2, the application utilizes a particular ID and password. You had a DBA create a DB2 role and a trusted context to prevent misuse of the application's security credentials, but still you were confronted with a seemingly vexing challenge: for audit purposes, and for purposes of granting DB2 privileges in a more fine-grained fashion versus having all of the application's users exercise the privileges granted to the application's DB2 authorization ID, you want to use RACF IDs to represent application users on the mainframe system. For several reasons, however, the RACF IDs that you use are different in form from the network IDs that the end users utilize for client-side authentication. The users have already done their authentication duty, and you don't want them to have to be concerned with the RACF IDs with which you want to associate their DB2 data access activities; furthermore, you don't want to have to define a different RACF ID for each and every individual application user -- you want the flexibility to assign a particular subset of the users to a single RACF ID.

What to do? Define a different RACF ID for each of the application's users? Ask each user to keep track of, and to use as appropriate, a RACF ID (and password) in addition to his or her network ID? ISN'T THERE A BETTER WAY?!?

Yes, there is. It's called enterprise identity mapping, and it's a capability that you can use with DB2 10 for z/OS. Truth be told, enterprise identity mapping (EIM) could be used in a DB2 9 system, but getting it set up in that environment was a pretty complex undertaking. With DB2 10 (and with some operating system enhancements initially delivered with z/OS 1.11), exploitation of EIM functionality got much easier. In this blog entry, I'll provide a high-level overview of the steps associated with EIM implementation in a DB2 10 setting. More detailed information can be found in the IBM "redbook" titled DB2 10 for z/OS Technical Overview (see section 10.6.2, "z/OS Security Server identity propagation"), and in the DB2 10 for z/OS Managing Security manual (see the information under the heading, "Implementing DB2 support for distributed identity filters"). Basically, there are three parts to the whole: a RACF part, a DB2 part, and an application part.

The RACF part

The key step here is use of the RACMAP command to associate a distributed user identity with a RACF ID. Mike Kearney, a colleague of mine who's a z/OS security expert, leads workshops that cover, among other things, enterprise identity mapping. Mike uses a command like the one following to illustrate the use of RACMAP for EIM (I've changed some of the resource names around):

RACMAP ID(USER1) MAP USERDIDFILTER(name('cn=rsmith,ou=users,o=DSC'))
REGISTRY(name('dg27.dallas.ibm.com:437'))


In this example, 'USER1' is a RACF ID to which you want a distributed user identity to map, 'rsmith' is the distributed user identity of interest, and 'dg27.dallas.ibm.com:437' identifies the registry in which the distributed user name is defined. Note that I could have used a wild card character (an asterisk, or '*') instead of the specific distributed user identity 'rsmith' in the "name" part of the USERDIDFILTER portion of the command, and that's where the flexibility of this approach comes in. Had I done that, the RACMAP command above would have enabled RACF to map any distributed user identity defined in the registry 'dg27.dallas.ibm.com:437' to RACF ID 'USER1'. So, I can do 1-to-1 mapping of distributed user identities to RACF IDs, or I can do many-to-1 mapping, with all distributed user identities defined in a particular registry mapping to a single RACF ID. Use 1-to-1 or many-to-1 mapping, or both, depending on your particular needs.

The DB2 part

As it pertains to DB2 for z/OS, RACF distributed user identity mapping is supported in relation to a DB2 trusted context. That trusted context would define the circumstances in which the privileges granted to a role could be exercised by a user. So, suppose I want a user (or a set of users, referring to the many-to-1 mapping mentioned above) to have the privileges granted to a role, called APPL1_ROLE, when accessing the DB2 database via an application that connects to the DB2 subsystem using the authorization ID 'APPL1SYS', from an application server at IP address 1.2.3.4. Further, I want the end user ID of the person using the application (i.e., the network ID used by that person to authenticate on the client side) to map to RACF user ID 'USER1' (referring to the example in "The RACF part" of this blog entry, above). In that case, my CREATE TRUSTED CONTEXT statement would look something like this:

CREATE TRUSTED CONTEXT APPL1_TRUST                    
BASED UPON CONNECTION USING SYSTEM AUTHID APPL1SYS 
ATTRIBUTES (ADDRESS '1.2.3.4')
ENABLE                                            
WITH USE FOR USER1 ROLE APPL1_ROLE;


The application part

For a client application to do its part in EIM, it needs to 1) get a trusted context-based connection to the DB2 for z/OS server, and 2) request an authorization ID switch for that trusted context-based connection (in doing that, the application passes to DB2 the distributed user identity and the name of the associated registry). These actions can be accomplished by way of the IBM WebSphere Application Server, if the application executes in that environment. If some other application server that does not provide this functionality is being used, the required actions can still be accomplished by way of APIs for Java, CLI/ODBC, and .NET. These APIs support establishing a trusted connection and switching a distributed user identity for such a connection. Here is a link to information about the JDBC APIs for trusted context-based connection creation and distributed user identity switching, in the DB2 10 for z/OS Information Center on the Web:
http://pic.dhe.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db2z10.doc.java/src/tpc/imjcc_c0023797.htm

And here is a link to information on the CLI/ODBC function used to establish a trusted connection:

http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.apdv.cli.doc/doc/r0000575.html

And a link to the CLI/ODBC function used to switch user identities on a trusted connection:

http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/topic/com.ibm.db2.luw.qb.dbconn.doc/doc/t0023258.html
 

Flexibility without complicating things for end users

Put it all together, and you have a flexible and powerful means of mapping distributed user identities to RACF IDs, with no requirement for application end users to do anything different than what they're doing today. That's a better way.