Sunday, September 02, 2007

Taking a Stab at a Shared IT Industry Definition of "Data Center Automation"

The problem with grandiose terms such as "IT automation" and "data center automation" is that they have no shared definition across vendors in the IT industry. The only thing they have in common is that multiple sources refer to them repeatedly, each in a different context and scope. They become part of the hype vernacular generated by vendors and their marketing machines, and eventually a word’s true meaning becomes irrelevant. In such a state, everyone thinks they know what it is, but no one really does and, alas, it becomes so ubiquitous that people don't even bother challenging each other's assumptions regarding its scope.

The term “automation” is rapidly free-falling into just such a state in the IT industry. So here's me taking a stab at level-setting the meaning and scope for automation that (in my humble opinion) is capable of bringing both customers and vendors to the same playing field. Even if not, if it generates some cross-vendor discussion and allows people to challenge each other's perception of what data center automation ought to mean, my purpose would be served.

In a prior blog entry, I refer to 15 specific levels of requirements that need to be addressed for any IT automation solution to be effective. So ladies and gents, here’s that requirements stack. If these requirements are satisfied, you would have reached a state of automation nirvana and somewhere along the way, you would have imbibed (and likely, surpassed) the isolated automation capabilities touted by most IT tools vendors.

The accompanying picture shows Levels 1 to 8 as a prerequisite to automation, and works its way up to Level 15 (autonomics) via a model that fosters shared intelligence across disparate capabilities and functions. It is important to get the context right with each of these levels before assuming one has attained them. And until one has got a particular level right, it is often futile to attempt to go to the next level. Short-cuts have an uncanny way of short-circuiting the process. That's why so many vendors and organizations with fragmented notions of these requirements just don't cut it in real-world automation deployments: such approaches lack a sound methodology to build the proper foundation.

Nuff said. Let’s look at the individual levels now and how they build on one another.

Level 1 pertains to achieving 360-degree monitoring. This capability breaks through the siloed monitoring that exists today in many environments and allows problems to be viewed across multiple tiers, applications and service stacks in a cohesive manner, preferably in the same call sequence utilized by the end-user application. Current monitoring deployments often remind me of the old fable of the seven blind men and the elephant, where each blind man feels a different part of the elephant and perceives the animal to resemble a familiar item. One touches the elephant’s tail and spreads the notion that the animal looks like a rope, whereas another feels the elephant’s foot and tries to convince everyone that the animal resembles a pillar. The lack of a comprehensive and consistent view of the same issue across multiple individuals and groups causes more delays in solving problems than the typical IT manager would admit.

Levels 2 and 3 utilize 360-degree monitoring for proper problem diagnosis, triage and alerting. Rather than leaving the preliminary diagnosis regarding the nature and origin of a problem to human hands and eyes (say, a Tier 1 or Help Desk team), the monitoring software should be able to examine the problem end-to-end, carry out sufficient root cause analysis, narrow down the scope based on the analysis and send a ticket to the right silo/individual. Such precision alerting reduces the need for manual decision-making (and chances of error) regarding which team to assign a ticket to and positively impacts metrics such as first-time-right and on-time-delivery.
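To make precision alerting a little more concrete, here is a minimal sketch in Python. It assumes the monitoring layer already produces one health check per tier, ordered along the end-user call sequence, and it uses a deliberately simple heuristic: the deepest failing tier is treated as the likely root cause. The tier names, team names and the `route_ticket` function are all hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TierCheck:
    tier: str       # e.g. "web", "app", "database"
    healthy: bool   # result of the monitoring probe for this tier


# Hypothetical mapping from tier to the team that owns it.
OWNERS = {
    "web": "web-ops",
    "app": "app-support",
    "database": "dba-team",
    "storage": "storage-team",
}


def route_ticket(checks: List[TierCheck]) -> Optional[str]:
    """Return the team to ticket, or None if every tier is healthy.

    `checks` must be ordered front to back along the call sequence.
    Simplifying assumption: the deepest failing tier is the likely root
    cause, and failures in front of it are symptoms.
    """
    failing = [c for c in checks if not c.healthy]
    if not failing:
        return None
    root = failing[-1]
    # Unknown tiers fall back to the help desk for manual triage.
    return OWNERS.get(root.tier, "help-desk")
```

With the web, application and database tiers all failing and storage healthy, the ticket goes straight to the database team instead of bouncing between silos. Real root cause analysis would need far more than this one rule.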

Level 4 pertains to ad-hoc tasks that administrators often do, such as adding a user or changing the configuration parameters of an application. There are lots of popular tools and point solutions for Levels 1 to 4, ranging from monitoring tools to ad-hoc task GUIs. However, their functionality really ends there, or they try to jump all the way from Level 4 to Level 9 (automation), skipping the steps in between and causing the resultant automation to be of rather limited use.

Level 5 questions the premise of an “ad-hoc task”. Wisdom from the trenches often tells us that there is no such thing as a one-time / ad-hoc task. Everything ends up being repeatable. For instance, when creating a user on one database, one may bring up her favorite ad-hoc task GUI, click here, click there, type in a bunch of command attributes and then hit the Execute button. That works for creating one user. However, when the exact same user needs to be rolled out on 20 servers, it involves a lot of pointing, clicking and typing and leaves the environment vulnerable to human errors. Suddenly the ad-hoc GUI ceases to be effective.

Requirement levels 5 to 9 address this by calling for standard operating procedures (SOPs) that can be applied across multiple environments from a central location. Level 6 requires diverse sub-environments to be categorized such that different activities pertaining to them (including all service requests and incident remediation efforts) can all have standard task recipes. This refers to rolling up disparate physical environments into fewer logical components based on policies and usage attributes including service level requirements.
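As a rough illustration of a standard task recipe applied from a central location, consider the sketch below. The recipe is defined once as a list of steps, and the same steps are then applied to every server. The step strings are placeholders rather than real commands, and the `execute` callable is an assumption standing in for whatever remote-execution mechanism an environment actually uses.

```python
from typing import Callable, Dict, List


def add_user_recipe(username: str, role: str) -> List[str]:
    """Define the task once; the step strings are illustrative placeholders."""
    return [f"create user {username}", f"grant {role} to {username}"]


def apply_recipe(
    servers: List[str],
    steps: List[str],
    execute: Callable[[str, str], bool],
) -> Dict[str, bool]:
    """Apply the same steps to every server and report success per server.

    `execute(server, step)` is a hypothetical hook that runs one step on
    one server and returns True on success. Steps on a server stop at the
    first failure, because all() short-circuits.
    """
    results = {}
    for server in servers:
        results[server] = all(execute(server, step) for step in steps)
    return results
```

Rolling the same user out to 20 servers becomes a single call such as `apply_recipe(servers, add_user_recipe("jsmith", "read_only"), execute)`, with a per-server result to review instead of 20 rounds of pointing and clicking.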

Task recipes, once defined, need to be maintained in a central knowledgebase and need to be in a format wherein they can easily serve as a blue-print for any subsequent automation. Further, the task recipes need to be directly linked to automation routines/workflows in a 1:1 manner such that one can reach the workflows from the SOP and vice versa. In other words, there needs to be a shared intelligence between the SOP and the automation routine. Keeping the two separate (for instance, keeping the task recipes on a SharePoint portal and keeping the automation routines within a scheduling or workflow tool, or worse, as a set of scripts distributed locally on the target servers) without any hard-wired connection and tracking between the two mediums will make it easy for task recipes and automation routines to get out of sync. Such automation tends to be uncontrolled; each user of such automation is left to his/her own devices to leverage it as optimally as possible, and eventually, its utility becomes questionable at best with each user (administrator) customizing the automation routines to their particular environment and their individual preferences. Hitherto noble notions such as shared intelligence and centralized control across task recipes and automation routines (and eventually, consistency in quality of work) go out the window!
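One way to picture the 1:1 link between a task recipe and its automation routine is a single knowledgebase record that holds both, together with enough version information to notice when they drift apart. The sketch below is only that, a sketch: the `Sop` and `Knowledgebase` classes and their field names are hypothetical, and a real implementation would persist the records rather than keep them in memory.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sop:
    sop_id: str
    recipe: str                      # human-readable task recipe
    recipe_version: int
    workflow: Callable[[str], str]   # automation routine for this recipe
    workflow_built_for: int          # recipe version the workflow implements

    def in_sync(self) -> bool:
        return self.recipe_version == self.workflow_built_for


class Knowledgebase:
    """Central store keyed by SOP id, so recipe and workflow travel together."""

    def __init__(self) -> None:
        self._sops: Dict[str, Sop] = {}

    def register(self, sop: Sop) -> None:
        if sop.sop_id in self._sops:
            raise ValueError(f"duplicate SOP id: {sop.sop_id}")
        self._sops[sop.sop_id] = sop

    def revise_recipe(self, sop_id: str, new_text: str) -> None:
        """Changing the recipe bumps its version; the workflow is now stale."""
        sop = self._sops[sop_id]
        sop.recipe = new_text
        sop.recipe_version += 1

    def out_of_sync(self) -> List[str]:
        """List SOP ids whose workflow no longer matches the recipe version."""
        return sorted(s.sop_id for s in self._sops.values() if not s.in_sync())
```

Because the recipe and the workflow live in the same record, a revised recipe immediately shows up in `out_of_sync()` until the workflow is updated to match.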

With shared intelligence comes the ability to track and enforce standard task recipes across different personnel and environments. Level 10 states this exact requirement - the ability to maintain a centralized audit trail describing when an automation routine ran, on what server, who ran it (or what event triggered it), what were the run-results and so on – for ongoing assessment of SOP quality, pruning the task recipes and corresponding automation code, and finally, to ensure enterprise-wide adherence to best practices.
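A centralized audit trail of this kind can be sketched in a few lines. The example below records when a routine ran, on what server, who or what triggered it and whether it succeeded, and then derives a simple failure rate per SOP as one possible input to assessing SOP quality. The class and field names are hypothetical, and the in-memory list stands in for a real central store.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class AuditEntry:
    sop_id: str
    server: str
    triggered_by: str   # a user name or the event that fired the routine
    succeeded: bool
    ran_at: datetime


class AuditTrail:
    def __init__(self) -> None:
        self._entries: List[AuditEntry] = []

    def record(self, sop_id: str, server: str, triggered_by: str,
               succeeded: bool) -> AuditEntry:
        """Append one run to the trail, timestamped in UTC."""
        entry = AuditEntry(sop_id, server, triggered_by, succeeded,
                           datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def failure_rate(self, sop_id: str) -> float:
        """Fraction of recorded runs of this SOP that failed (0.0 if none ran)."""
        runs = [e for e in self._entries if e.sop_id == sop_id]
        if not runs:
            return 0.0
        return sum(1 for e in runs if not e.succeeded) / len(runs)
```

An SOP whose failure rate creeps up across servers is a natural candidate for pruning or rework.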

Level 11 or virtualization allows automation to be applied in an easier manner across the different environment categories defined in level 6. It does so via a “hypervisor” that abstracts and masks the different nuances across multiple categories by having an SOP call wrapper apply a task across different environment types (ideally, done via an expert system). Within that single SOP call, auto-discovery capabilities identify the current state of a target server or application, invoke a decision tree to determine how best to perform a task and then finally, call the appropriate SOP (or sub-SOP, as the case may be) to carry out the task in the pre-defined and pre-approved manner that’s most suited to that environment. In other words, level 11, much like storage virtualization that occurs within a SAN (think EMC Invista, not VMware), calls for multiple disparate environments to be viewed as a single environment (or at the very least, a smaller subset of environments) and dealt with in an easier manner.
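The SOP call wrapper can be sketched as a single entry point that discovers the state of the target and then dispatches to the matching sub-SOP. Everything here is illustrative: the `INVENTORY` dictionary stands in for real auto-discovery, the two platforms are arbitrary examples, and the "decision tree" is reduced to a one-level lookup on a single attribute.

```python
from typing import Callable, Dict

# Hypothetical inventory standing in for real auto-discovery.
INVENTORY: Dict[str, Dict[str, str]] = {
    "db01": {"platform": "oracle"},
    "db02": {"platform": "sqlserver"},
}


def discover(server: str) -> Dict[str, str]:
    """Return the current state of a target server (here, a static lookup)."""
    return INVENTORY[server]


def add_user_oracle(server: str, username: str) -> str:
    return f"[oracle sub-SOP] created {username} on {server}"


def add_user_sqlserver(server: str, username: str) -> str:
    return f"[sqlserver sub-SOP] created {username} on {server}"


# One-level decision table: discovered platform -> pre-approved sub-SOP.
SUB_SOPS: Dict[str, Callable[[str, str], str]] = {
    "oracle": add_user_oracle,
    "sqlserver": add_user_sqlserver,
}


def add_user(server: str, username: str) -> str:
    """Single SOP call that hides platform differences from the caller."""
    state = discover(server)
    sub_sop = SUB_SOPS.get(state["platform"])
    if sub_sop is None:
        raise LookupError(f"no sub-SOP for platform {state['platform']!r}")
    return sub_sop(server, username)
```

The caller only ever invokes `add_user(server, username)`; which sub-SOP runs is decided inside the wrapper.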

Levels 12 to 15 apply increasingly sophisticated analysis and intelligence to a target environment, including event correlation, root cause analysis, and predictive analytics. The goal is to discern problems as they occur or, ideally, before they occur, and to link them back to one or more SOPs that can be triggered automatically to avert an outage, performance degradation or policy violation. Retaining the status quo in this way makes the target environment more autonomic.
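As a toy example of linking a prediction back to an SOP, the sketch below extrapolates a resource-usage trend and fires a remediation routine if the resource is projected to fill up within a chosen horizon. The linear extrapolation is an assumption made purely for illustration and is far simpler than real predictive analytics; the function names and the `sop` callable are hypothetical.

```python
from typing import Callable, List, Optional


def intervals_until_full(samples: List[float],
                         limit: float = 100.0) -> Optional[float]:
    """Estimate how many sampling intervals remain before usage hits `limit`.

    Uses the average slope between the first and last sample. Returns None
    when there are too few samples or usage is not growing.
    """
    if len(samples) < 2:
        return None
    slope = (samples[-1] - samples[0]) / (len(samples) - 1)
    if slope <= 0:
        return None
    return max(0.0, (limit - samples[-1]) / slope)


def maybe_trigger(samples: List[float], horizon: float,
                  sop: Callable[[], None]) -> bool:
    """Run the linked SOP if the limit is projected within `horizon` intervals."""
    eta = intervals_until_full(samples)
    if eta is not None and eta <= horizon:
        sop()
        return True
    return False
```

For usage samples of 70, 75 and 80 percent, the estimate is four more intervals before the resource is full, so a horizon of six intervals triggers the SOP while a horizon of two does not.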

Granted, many companies and administrators may not have the appetite to go all the way up to level 15 for most administrative tasks. Regardless, I would hope that this model provides a clear(er) roadmap for automation than the alternatives: different groups and individuals scraping together a bunch of proprietary scripts and Word documents while keeping the details of how exactly to execute them in their heads or, worse, each group relying on a bunch of disjointed tools, vendors and promises, resulting in a fragmented automation strategy and chaotic (read, unmeasurable) results.

Maybe unmeasurable results are not such a bad thing, especially if you are a large and well-established software vendor, who has little value to deliver, but thrives on FUD to keep the market confused.

If you are an IT manager or administrator and have a favorite software tools vendor, ask them how their automation strategy stacks up against these 15 levels. Drop me a note if you get an answer back.


Anonymous said...

Recently I really needed some help with ITIL automation. I needed something functional and comprehensive that would help me enforce and realize my ITIL initiatives. I looked around online a lot and now I’m blogging about it to see if I can find some helpful tips or at least be pointed in the right direction.

IT automation is really big right now (and it should be) and the network that I will be managing is pretty large and complicated. There are lots of crashes and bugs related to the business software applications that the company runs so they’re looking to me to fix things up. Any ideas anyone?

JerryLovesFlyfishing said...

Your situation calls for a more detailed analysis. The author of this blog, Venkat Devraj, came out to our site a little while ago to do an assessment and suggested some solutions that worked well for us. I would recommend the same approach.