A REAL Disaster Recovery


How to Destroy a Sysplex

To say we had an interesting Business Recovery Exercise this week would be an understatement!

Since bringing our BR (business recovery) / DR (disaster recovery) solution in house, rather than performing it offsite, we’ve had a total of five BR Exercises this year alone.  This is pretty impressive for our shop since we used to go YEARS between BR Exercises.  Now our clients can declare a BR Exercise without prior notice to ensure our infrastructure is sound and solid.

Our infrastructure IS sound and solid…provided no one messes with it!

Two months earlier I was doing what I thought was helpful cleanup in RACF.  I was adding a new PROFILE for a monitoring application.  Our RACF expert had just recently retired and our new RACF person was not yet fully trained or up to speed.

On occasion I would go in and “fix up” some things in RACF, trying to be helpful.  Although I have ADMIN rights to reset PASSWORDS when I’m on-call, I’m not really supposed to mess around in RACF.

But what’s the worst that can happen?

I honestly thought I was doing something good by deleting a VERY suspicious * (G)ENERIC profile.

[Screenshot: RACF profiles]

(* I have my very own screenshot auditing script that captures my workstation screen every minute.  It was able to capture the quiet destruction of the sysplex.)

To me this generic profile seemed a security risk, so I decided to take matters into my own hands (since the new guy surely was not going to) and DELETED this profile!

[Screenshot: RACF delete]

Uh oh!!!

[Screenshot: RACF warning]

“You still have a chance to undo this Paul!!!”

[Screenshot: RACF refresh]

Nope.  Profile is deleted.

Quiet Disaster

What I didn’t realize was that instead of making the system more secure, I had deleted a VERY important PROFILE that’s used at IPL.

As Michael Cairns excellently describes in his article “Addressing Common RACF Configuration Issues”, that * GENERIC PROFILE was the catchall profile.

[The] class SURROGAT profile consisting simply of "**" or "*.*" (sometimes called a catchall profile). It applies to all user IDs that aren't matched by a more specific profile and probably covers your user ID unless steps have been taken to avoid this.
...
Without a catchall generic profile of some kind in the class STARTED, a previously undefined started task will fall back to the contents of ICHRIN03. 
...
If fallback to ICHRIN03 can happen, you need to know what privileges it's granting.

That’s exactly what happened.

We started the Business Recovery Exercise, and on the first IPL the system came to a screeching halt.  Apparently JES2 (Job Entry Subsystem) did not have the authority it needed, and ICHRIN03 was poorly coded.
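To make the fallback concrete, here is a rough Python sketch of the lookup order the quoted article describes: a matching STARTED profile first, then the catchall, then ICHRIN03. Everything in it is an assumption for illustration: the profile names, the user IDs, the table contents, and the use of `fnmatchcase` as a stand-in for RACF's real generic matching rules. It is a toy model, not how RACF is actually implemented.

```python
from fnmatch import fnmatchcase

# Hypothetical STARTED-class profiles, listed from most to least specific.
# Keys are "member.jobname" patterns; values are the user ID to assign.
STARTED_PROFILES = {
    "MONITOR.*": "MONUSER",  # a specific profile for one started task
    "**": "STCUSER",         # the catchall generic profile
}

# Hypothetical contents of the ICHRIN03 started procedures table.
ICHRIN03 = {"OLDTASK": "OLDUSER"}


def resolve_started_user(member, jobname, profiles, table):
    """Return (source, user) for a started task in this simplified model."""
    name = f"{member}.{jobname}"
    # 1. First matching STARTED profile wins (dicts keep insertion order).
    for pattern, user in profiles.items():
        if fnmatchcase(name, pattern):
            return ("STARTED", user)
    # 2. No profile matched: fall back to the ICHRIN03 table.
    if member in table:
        return ("ICHRIN03", table[member])
    # 3. Nothing covers the task; what happens next is outside this sketch.
    return ("NONE", None)


def without_catchall(profiles):
    """Copy of the profiles with the catchall removed, as in the story."""
    return {p: u for p, u in profiles.items() if p != "**"}
```

With the catchall in place, `resolve_started_user("JES2", "JES2", STARTED_PROFILES, ICHRIN03)` resolves through the `**` profile. Run the same call against `without_catchall(STARTED_PROFILES)` and, in this model, JES2 is covered by nothing at all, which is the kind of gap that only shows up at the next IPL.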

But…NOTHING has changed!!!!

Imagine the frustration my colleagues (and I, before the discovery) were experiencing.  Here we were doing our FIFTH BR exercise this year.  It always worked.  It never failed.  We had a perfect mirror of our working production.  Nothing had changed!

To make a long (and, for me, painful) story short, we opened a Severity 1 Service Request with IBM.  This is equivalent to calling 911 or pressing the nuclear panic button when you need IBM support and need it fast!

We were directed to a teleconference with their JES and RACF experts, who with their AWE INSPIRING expertise guided us to the discovery that yes, we were missing that * GENERIC PROFILE in RACF.  Since JES2 at our shop started in a certain sequence, we were unable to re-create this PROFILE on our BR system.

Since this was a mirror of our production, we discovered that we were in fact vulnerable on our PRODUCTION SYSTEM!!!

If we had IPL’d (that is, recycled) any of our production LPARs, there was NO WAY they were coming back up.  JES2 would have run into the same authority error and the entire system would have been, in a manner of speaking…toast!

Luckily we caught this and were able to RECREATE the profile on our PRODUCTION system so we could mirror it over to the BR SYSTEM and finish the exercise.

Takeaway lessons:

  1. NEVER…  EVER…   MESS WITH RACF! (At least without knowing what you’re doing.  My RACF roles have been relinquished to the appropriate people.)
  2. Business / Disaster Recovery Exercises are there for a REASON!  If you’re not doing it at your shop, how do you know you’re not vulnerable?

</CONFESSION AND LESSON>

3 thoughts on “A REAL Disaster Recovery”

  1. There are even subtler ways to shoot yourself in the foot. Literally five years before our near-DR event, I had updated a SCHED00 PPT entry for JES2 with attributes that made JES2 dependent on RACF to read his own HASPPARM file. We were OK for five years until someone *else* made a RACF change that inadvertently removed SYS1.HASPPARM from a generic profile that had previously covered it. So at the next IPL–in production–JES2 got S913 at start-up. We got out of it somehow, but the lesson learned was not to include JES2 in SCHED00. IBM’s default PPT entry, which makes the task privileged, is just fine. Advice: unless you have some overwhelming reason, do not include *any* IBM-provided PPT entry in PARMLIB.

  2. And whatever you do with your catchall profile in STARTED, do not under any circumstances give it an STDATA segment with the specification USERID=MEMBER unless you also code a GROUP=some-group, where some-group exists, has no userids connected, and has no RACF access rights granted via PERMITs anywhere… Having just USERID=MEMBER with nothing else specified leaves you open to a JES Procedure privilege escalation hack.

    Cheers – Mike Cairns
    http://www.racfconsultant.com
