Thursday, May 01, 2008

Six Ways to Tell if an RBA Tool Has Version 2.0 Capabilities

Here’s an update to my last blog post about RBA version 2.0. I promised a reader I would provide a set of criteria for determining whether a given RBA platform offers version 2.0 capabilities, i.e., the ability to create and deploy more dynamic and intelligent workflows that can evolve as the underlying environment changes, as well as accommodate the higher flexibility required for advanced IT process automation. In this piece, I expound mostly on the former requirement, since my prior post discussed the latter in some detail.

If you don’t have time to read the entire blog post, here’s the exec summary version of the six criteria:
1. Native metadata auto-discovery capabilities for target environmental state assessment
2. Policy-driven automation
3. Metadata Injection
4. Rule Sets for 360-degree event visibility and correlation
5. Root cause analysis and remediation
6. Analytics, trending and modeling

Now, let’s look at each of these in more detail:

1. Native metadata auto-discovery capabilities for target environmental state assessment. In this context, the term “metadata” refers to data describing the state of the target environment where automation workflows are deployed. Once this data is collected (in a centralized metadata repository owned by the RBA engine), it effectively abstracts target environmental nuances and can be leveraged by the automation workflows deployed there to branch based on changes encountered. This reduces the amount of deployment-specific hard-coding that needs to go into each workflow. The metadata can also be shared with other tools that may need access to such data to drive their own functionality. Most first-generation RBA tools do not have a metadata repository of their own; they rely on an existing CMDB to pull relevant metadata or, in the absence of a CMDB, the workflows attempt to query such data at runtime.
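To make this concrete, here’s a rough Python sketch of the idea, assuming a hypothetical repository API owned by the RBA engine (the class, attribute names and workflow below are my own illustrations, not any vendor’s actual interface):

    class MetadataRepository:
        """Central store of discovered state, keyed by target and attribute."""
        def __init__(self):
            self._store = {}

        def record(self, target, attribute, value):
            self._store[(target, attribute)] = value

        def lookup(self, target, attribute, default=None):
            return self._store.get((target, attribute), default)

    def restart_db_workflow(target, repo):
        # The workflow branches on repository state instead of hard-coding
        # deployment specifics or probing the target at runtime.
        if str(repo.lookup(target, "db.version", "")).startswith("10"):
            return "run_10g_restart_steps"
        return "run_generic_restart_steps"

The workflow itself stays lean; the environmental state it branches on lives in the repository.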

As I mentioned in my earlier post, the existing metadata in typical CMDBs only scratches the surface of the kind of metadata required to drive intelligent automation. For instance, in the case of a database troubleshooting workflow, the information you may need for automated remediation can range from database configuration data to session information to locking issues. While a CMDB may have the former, you would be hard-pressed to get the latter. Now, RBA 1.0 vendors will tell you that their workflows can collect all such metadata at runtime; however, some of these collections may require administrative/root privileges. Allowing workflows to run with admin privileges (especially for tasks not requiring such access) can be dangerous (read: potential security policy or compliance violation). Alternatively, allowing them to collect metadata during runtime makes for some very bulky workflows that can quickly deplete system resources on the target environments, especially in the case of workflows that are run frequently. Not ideal.

The ideal situation is to leverage built-in metadata collection capabilities within the RBA platform to identify and collect the requisite metadata in the RBA repository (start with the bare minimum and then add additional pieces of metadata, as required by the incremental workflows you roll out). If there is a CMDB in place, the RBA platform should be able to leverage that for configuration-related metadata and then fill in the gaps natively.

Also, the RBA platform needs to have an “open collection” capability. This means the product may come with specific metadata collection capabilities out of the box; however, if users/customers wish to deploy workflows that need additional metadata that is not available out of the box, they should be able to define custom collection routines.
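Here’s what such an open collection interface might look like, as a sketch only (it reuses the hypothetical MetadataRepository from the earlier sketch; the decorator registry and the query_agent transport stub are illustrative assumptions):

    COLLECTORS = {}

    def collector(attribute):
        """Register a collection routine (out-of-the-box or custom) for one attribute."""
        def register(fn):
            COLLECTORS[attribute] = fn
            return fn
        return register

    def query_agent(target, metric):
        """Stub for whatever transport the platform really uses (agent, SSH, SQL)."""
        raise NotImplementedError

    @collector("os.disk_free_mb")        # shipped with the product
    def collect_disk_free(target):
        return query_agent(target, "disk_free_mb")

    @collector("app.queue_depth")        # customer-defined, for a custom workflow
    def collect_queue_depth(target):
        return query_agent(target, "myapp_queue_depth")

    def run_collection(target, repo):
        for attribute, fn in COLLECTORS.items():
            repo.record(target, attribute, fn(target))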

2. Policy-Driven Automation. Unfortunately, not all environmental metadata can be auto-discovered or auto-collected. For instance, attributes such as whether an environment is Development, Test or Production, or what its maintenance window is, are useful pieces of information to have, since the behavior of an automation workflow may have to change based on them. Given the difficulty in discovering these, it may be easier to specify them as policies that can be referenced by workflows. Second-generation RBA tools have a centralized policy layer within the metadata repository wherein specific variables and their values can be specified by users/customers. If a value changes (say, the maintenance window start time moves from Saturday 9 p.m. Mountain Time to Sunday 6 a.m. Mountain Time), it only needs to be updated in one area (the policy screen) and all the downstream workflows that rely on it get the updated value.
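A minimal sketch of such a policy layer, assuming a simple key-value store keyed by target (the targets, policy names and values are made up for illustration):

    POLICIES = {
        ("prod-db-01", "environment"): "Production",
        ("prod-db-01", "maintenance_window"): "Sun 06:00 America/Denver",
    }

    def get_policy(target, name, default=None):
        return POLICIES.get((target, name), default)

    def patch_workflow(target):
        # The workflow consults the policy layer instead of hard-coding the
        # environment type or window; change the policy once and every
        # workflow that references it picks up the new value.
        if get_policy(target, "environment") == "Production":
            window = get_policy(target, "maintenance_window")
            return f"schedule_patch_during({window})"
        return "patch_immediately"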

3. Metadata Injection. Here’s where things get really interesting. Once you gather relevant metadata (either auto-discovered or via policies), there needs to be a way for disparate automation workflows to leverage all that data during runtime. The term “metadata injection” refers to the method by which such metadata is made available to the automation workflows - especially to make runtime decisions and branch into pre-defined sub-processes or steps.

As an example of how this works, the Expert Engine within Stratavia’s Data Palette parses the code within the workflow steps for any metadata variable references and then substitutes (injects) those references with the most current metadata. Workflows also have access to metadata history (for comparisons and trending; more on that below).
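The general technique can be illustrated with a simple substitution pass. To be clear, this sketch mimics the idea of injection (scan a step body for variable references, replace them with current repository values) and not Data Palette’s actual implementation:

    import re

    def inject(step_body, target, repo):
        """Replace ${attribute} references with current repository values."""
        def substitute(match):
            value = repo.lookup(target, match.group(1))
            if value is None:
                raise KeyError(f"no metadata collected for {match.group(1)}")
            return str(value)
        return re.sub(r"\$\{([\w.]+)\}", substitute, step_body)

    # e.g., inject("purge logs older than ${db.log_retention_days} days", target, repo)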

Here’s a quick picture (courtesy of Stratavia Corp) that shows the differences in how workflows are deployed in an RBA version 1 tool versus an RBA 2.0 tool. Note how the metadata repository works as an abstraction layer in the latter case, not only abstracting environmental nuances but also simplifying workflow deployment.


4. Rule Sets for 360-degree event visibility and correlation. This capability allows relevant infrastructure and application events to be made visible to the RBA engine – events from the network, server, hypervisor (if the server is a virtual machine), storage, database and application need to come together for a comprehensive “360-degree view”. This allows the RBA engine to leverage Rule Sets for correlating and comparing values - especially for automated troubleshooting, incident management and triage.

Further, an RBA 2.0 platform should be capable of handling not just simple Boolean logic and comparisons within its rules engine, but also advanced Rule Sets for correlations involving time series, event summaries, rates of change and other complex conditions.
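As a toy illustration, here’s a rule that correlates a database-layer time series from the repository with a storage-layer event stream and fires on rate of change (the metric names, event shape and threshold are all assumptions for the sketch):

    def rate_of_change(samples):
        """samples: list of (timestamp_seconds, value) tuples, oldest first."""
        (t0, v0), (t1, v1) = samples[0], samples[-1]
        return (v1 - v0) / (t1 - t0)

    def evaluate_rules(events, history):
        # Correlate a database trend with a storage event - a small taste
        # of the 360-degree view described above.
        lock_waits_climbing = rate_of_change(history["db.lock_waits"]) > 0.5
        storage_degraded = any(e["source"] == "storage" for e in events)
        if lock_waits_climbing and storage_degraded:
            return "launch_io_contention_triage_workflow"
        return None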

5. Root cause analysis and remediation. This is one of my favorites! According to various analyst studies, IT admin teams spend as much as 80% of problem management time trying to identify root cause and 20% fixing the issue. As many of you know, root cause analysis is not only time consuming, but also places significant stress on team members and business stakeholders (as people are under the gun to quickly find and fix the underlying issue). After numerous “war-room” sessions and finger-pointing episodes, the root cause (sometimes) emerges and gets addressed.

A key goal of second-generation RBA products is to go beyond root cause identification to automating the entire root cause analysis process. This is done by gathering the relevant statistics in the metadata repository to obtain a 360-degree view (mentioned earlier), analyzing them via Rule Sets (also referenced above) and then providing metrics to identify the smoking gun. If the problem is caused by a Known Error, the relevant remediation workflow can be launched.
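A bare-bones sketch of that last step: map Known Error signatures (sets of symptom flags produced by Rule Set evaluation) to remediation workflows. The signatures and workflow names below are hypothetical:

    KNOWN_ERRORS = {
        ("db.archive_dest_full", "db.sessions_blocked"): "purge_archive_logs_workflow",
        ("app.threads_stuck", "db.lock_waits_high"): "kill_blocking_sessions_workflow",
    }

    def diagnose(symptoms):
        """symptoms: a set of flags raised by Rule Set evaluation."""
        for signature, workflow in KNOWN_ERRORS.items():
            if set(signature) <= symptoms:       # every symptom in the signature is present
                return workflow                  # Known Error: launch its remediation
        return "open_incident_for_human_triage"  # no match: escalate to a person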

Many first-generation RBA tools assume this kind of root cause analysis is best left to a performance monitoring tool that the customer may already have deployed. So what happens if the customer doesn’t have one deployed? What happens if a tool capable of such analysis is deployed, but not all the teams are using it? Usually, each IT team has its own point solution(s) that looks at the problem in a siloed manner, which commonly leads to finger-pointing. I have rarely seen a single performance analysis tool that is universally adopted by all IT admin teams within a company, with everyone working off the same set of metrics to identify and deal with root cause. If IT teams have such a tough time identifying root cause manually, should companies just forget about trying to automate that process? Maybe not… With RBA 2.0, regardless of what events disparate monitoring tools may present, such data can be aggregated (either natively and/or from those existing monitoring tools) and evaluated via a Rule Set to identify recurring problem signatures, which can then be promptly dealt with via a corresponding workflow.

All that most monitoring systems do is present a problem or potential problem (yellow/red lights), send out an alert and/or run a script. Combining root cause analysis capabilities and automation workflows within the RBA platform helps improve service levels and reduces alert floods (frequently caused by the monitoring tools themselves), unnecessary tickets and incorrect escalations.

Improving service levels – what a unique concept! Hey wait a minute, isn’t that what automation is really supposed to do? And yet, it amazes me how many companies don’t want to go there and just continue dealing with incidents and problems manually. RBA 2.0 begins to weaken that resistance.

6. Analytics, trending and modeling. These kinds of capabilities are nice to have and, if leveraged in an RBA context, can be really powerful. Once relevant statistics and KPIs are available within the metadata repository, Rule Sets and workflows should be able to access history and summary data (pre-computed based on predictive algorithms and models) to understand trends and patterns and deal with issues before they become incidents.

These can be simplistic models (no PhD required to deploy them) and yet avert many performance glitches, process failures and downtime. For instance, if disk space is expected to run out in the next N days, it may make sense to automatically open a ticket in Remedy. But if space is expected to run out in N hours, it may make sense to actually provision incremental storage on a pre-allocated device or to perform preventative space maintenance (e.g., back up and delete old log files to free up space), in addition to opening/closing the Remedy ticket. A good way to understand these events and take timely action is to compare current activity with historical trends and link the outcome to a remediation workflow.
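To show just how simple such a model can be, the disk-space example above needs little more than a linear extrapolation over repository history (the thresholds and workflow names are illustrative assumptions):

    def hours_until_full(samples, capacity_mb):
        """samples: list of (timestamp_hours, used_mb) tuples, oldest first."""
        (t0, u0), (t1, u1) = samples[0], samples[-1]
        growth_per_hour = (u1 - u0) / (t1 - t0)
        if growth_per_hour <= 0:
            return float("inf")                  # usage flat or shrinking
        return (capacity_mb - u1) / growth_per_hour

    def space_remediation(samples, capacity_mb):
        hours_left = hours_until_full(samples, capacity_mb)
        if hours_left < 24:
            return "provision_storage_and_open_ticket"  # hours away: act now
        if hours_left < 24 * 7:
            return "open_remedy_ticket"                 # days away: humans have runway
        return None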

5 comments:

Anonymous said...

Venkat,

Thanks for these insights.
Really good!

I have a couple of questions which slightly digress, but who better than you to ask!

In your experience what has been the adoption level of outsourcing firms for RBA products?
In scenarios where enterprises purchase an RBA product, are most of them running insourced Data Center Operations?

Venkat Devraj said...

Ashutosh,
Thanks for your questions, and sorry about the delay in responding. I just got back from a (somewhat short) vacation overseas and am still struggling to lose the jet-lag! Anyway, here’s my take on the areas you bring up:

Question 1: "In your experience what has been the adoption level of outsourcing firms for RBA products?"
My Response: It tends to vary based on the outsourcing firm. For instance, my company Stratavia has been working with some of the larger US-based ones, helping them compare and evaluate the right RBA solution for their customers. On the other hand, we haven’t engaged much with the offshore outsourcing firms (not because of lack of trying on our part!). Since labor is relatively cheap there (currently), they tend to subscribe to the notion that they can build it themselves rather than buy. Contrary to its deceptively simple appearance, an RBA product is not just a “workflow-based scripting tool”. Especially in the case of RBA version 2, there is a powerful engine driving these workflows, managing events, abstracting environmental nuances, auditing usage, and so on. This engine is what takes a lot of time to build right. Given the outsourcers’ affinity for projects they can monetize in the short term, my opinion is that they should focus on their core competency, i.e., generating revenue through services, and leave building commercial software to firms that do so for a living.

The other interesting trend I’m seeing with some of the outsourcers is that they sometimes engage with multiple RBA vendors and run a so-called formal process (frequently referred to as a “lab evaluation”). However, that process frequently turns into a pure R&D effort, with the outsourcer team members speculating about different types of RBA scenarios and how a product should handle them. These “projects” rarely see the light of day. They go on for several months, sometimes years, and flounder around without a dedicated team. (They are run by technicians who roll on and off the evaluation process due to customer billing responsibilities.) There is no formal project manager, nor are there any well-established business drivers or crisp success criteria. I personally was involved in such an effort a year ago and it was a colossal waste of time for everyone involved! (The words “lab evaluation” make me wake up in a cold sweat even today…)

The outsourcing firms that have successfully implemented RBA solutions tend to do this one client at a time. They present the vendor(s) with specific use-cases they wish to automate for a specific IT team within a chosen customer (use cases such as server provisioning, application patching, database refreshes, etc.) and evaluate the results based on pre-defined success criteria.

So to answer your question, in my experience the adoption level seems higher for US-based outsourcing firms, whereas the offshore ones tend to be somewhat stuck in their lab evaluations and internal product build-out attempts.

Question 2: "In scenarios where enterprises purchase an RBA product, are most of them running insourced Data Center Operations?"
My Response: Good question, but the reverse seems to be true. For example, more than half of Stratavia’s customers are fully or partially outsourced to a Top 10 outsourcing vendor. Why would a company with outsourced IT purchase an RBA product? For starters, such organizations, especially ones that have utilized outsourcing for multiple years, have reached a level of IT maturity where they realize cheaper labor and/or better processes by themselves are not enough to sustain enhanced service levels. To quote one of my customers at a Fortune 10 company, all too frequently, outsourced companies end up getting “the mess for less”. RBA products guarantee conformance to established processes and enable consistency in service quality across multiple bodies. They allow more complex work to be pushed downstream to less expensive IT resources such as Help Desk personnel or off-shore teams – without service level degradation. Clearly a winning proposition for everyone involved. If the outsourcing partner is not bringing in this kind of innovation, customer CIOs themselves are writing the check and expecting the outsourcer to leverage the technology. Why? Because the improvement in service levels pays for it. And customers do expect a corresponding reduction in service costs from the outsourcing company over a period of time. But most importantly, owning the RBA technology allows the customer to own the IP and avoid outsourcing vendor lock-in. The “tribal knowledge” traditionally in the heads of the outsourcing vendor’s personnel is now transferred to the product. This gives the customer significant leverage with the outsourcing vendor during monthly/quarterly performance reviews and especially at contract renewal time.

Outsourcing vendors need to wake up and smell the coffee. Rather than viewing efficiency-enabling technologies as a threat to their revenue model (read: billing for bodies), they need to view these technologies as a way to introduce higher innovation and thought-leadership to their customer base and delight them with demonstrably high service levels. An outsourcing company’s investment in RBA technologies will more than pay for itself via better customer satisfaction levels and contract renewal rates, not to mention better margins on fixed-price infrastructure support contracts.

Hope this helps.

-- Venkat

Anonymous said...

Venkat

Thanks for your comments. I am not sure when you replied, as I did not get a mail alert, but I happened to stroll by your blog and found your response. Your comments surely help.

Working for one of the largest offshore-based IT Infra players, I can surely understand your views. Many companies are averse to investing in tools if there are ways to do the same without the tool, albeit with more effort. It is all about the perceived value - if a pair of hands can do what they understand RBA delivers, they may find it better to get that pair of hands, who can probably do some other things as well.

Of course, the depth, accuracy and comprehensiveness which automation brings may not be delivered, but unless the contract and the customer incentivize value-through-automation, most do not see the pressing need to tread this path. Also, all said and done, the labor arbitrage, though shrinking, still holds. So there is no real motive for most to go for RBA.

Apart from customer drivers and a non-existent labor arbitrage, the other thing that may force the players is peer pressure. I do not see that happening either. I know many of the biggies in IT outsourcing (non-offshore-based players) are also pretty conservative in adopting tools like these. They will surely go for them sooner rather than later (I personally believe it is inevitable), though most are also leveraging offshore-based delivery locations to tap the cheap labor pool. The other driver will be the tool framework guys like HP and BMC touting their new wares soon after they integrate RBA into their product portfolios, but again, the traction will probably be with enterprises directly rather than with outsourced service providers.

It is very interesting to note that many of your customers have outsourced their operations. I did not anticipate that. These companies, I would suspect, would also be early adopters and probably rank higher in IT IQ than others.

Thanks again and look forward to your posts in future.

Ashutosh

Tarry Singh said...

Excellent article there, Venkat!

I'd love to cover you on RBA when possible.

-Tarry

Venkat Devraj said...

Tarry - Thanks for the comments. Happy to talk with you in more detail. You can reach me at vdevraj at stratavia dot com.