Friday, December 28, 2007

Gartner Show Highlights Need for More Intelligence in Data Center Automation

Gartner’s recent Data Center Conference was a blast! Between technical sessions, booth duty and a joint presentation on Data Center Economics with a client (Xcel Energy), I managed to sneak in a round of golf at Bali Hai (first time I ever played with a caddie…). While attendance (maybe 3,000 to 4,000) was not the highest among recent events, the conference put the spotlight squarely on three key areas: virtualization, automation and green data centers (in that order). Automation has long been treated as the red-headed stepchild compared to virtualization, so it was good to see it finally beginning to garner its fair share of attention (and IT budgets!).

There were dozens of tools and technologies with an automation theme. Not surprisingly, automation is also basking in the ripple effects of virtualization and green data centers. With virtualization boosting the number of “server environments”, each of which needs to be individually monitored and managed, automation is increasingly regarded as the other side of the virtualization coin. Similarly, with awareness of green data centers increasing, I’m beginning to see innovation even in areas as obscure as automating the hibernation/shutdown of machines during periods of non-use to reduce energy consumption.

So given the plethora of tools and products, is every automation capability that is needed already invented and available? Not so. One of the biggest things lacking is “intelligence” in automation. Take run book automation toolsets, for example. While many of them offer a nice GUI for administrators to define and orchestrate ITIL processes, they also introduce a degree of rigidity into the process. For instance, if there are 3 run books or process workflows that deal with server maintenance, the Maintenance Window is often hard-coded within each of the workflows. If the Window changes, all 3 workflows have to be individually updated, which creates significant maintenance overhead. Similarly, if a workflow includes a step to open a ticket in BMC Remedy, the ticketing system’s details such as product type and version are often hard-coded. If the customer upgrades the Remedy ticketing system or migrates from Remedy to HP Peregrine, the workflow no longer functions! An intelligent process workflow engine would avoid these traps.

An intelligent automation engine is often characterized by the following attributes:

  • Centralized policy-driven automation – Policies allow human input to be recorded in a single (central) location and made available to multiple downstream automation routines. The Maintenance Window example above is a great candidate for such policy-driven automation (see the sketch after this list). Besides service level policies (such as the Maintenance Window), areas such as configuration management, compliance, and security are well suited for being cast as policies.


  • Metadata injection & auto-discovery – Metadata is data describing the state of the environment. It is important for automation routines to have access to state data and especially be notified when there is a state change. For example, there is no point in starting the midnight “backup_to_tape” process as a 4-way stream when 2 of the 4 tape drives have had a sudden failure or are offline. The automation engine needs to be aware of what is available so it can launch dependent processes to optimally leverage existing resources. Such state data can be auto-discovered either natively, or via an existing CMDB, if applicable.


  • Event correlation and root cause analysis – The ability to acknowledge and correlate events from multiple sources, and to leverage that information to identify problem patterns and root cause(s), makes automated error resolution more precise.


  • Rules processing – Being able to process complex event-driven rule-sets (not just simple boolean logic) allows triggering of automation in the right environments at the right times.


  • Analytics and modeling – Being able to apply dynamic thresholding as well as analytical and mathematical models to metadata to discern future resource behavior is key for averting performance hiccups and application downtime.
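
To make the first two attributes a little more concrete, here is a minimal sketch in Python. It is purely illustrative (not a description of Data Palette or any other product), and names such as get_policy, discover_tape_drives and the policy entries themselves are hypothetical stand-ins for whatever a real automation engine would provide. The point is that the workflow reads the Maintenance Window from a central policy store and checks discovered state before launching work, instead of hard-coding either.

    # Illustrative sketch only: a workflow that consults central policies and
    # discovered state instead of hard-coding them. All names are hypothetical.
    from datetime import datetime, time

    # --- central policy store: defined once, referenced by many workflows ---
    POLICIES = {
        "maintenance_window": {"start": time(1, 0), "end": time(5, 0)},   # 1am-5am
        "ticketing_system":   {"product": "Remedy", "version": "7.0"},
    }

    def get_policy(name):
        """Look up a policy from the central store at run time."""
        return POLICIES[name]

    # --- metadata / auto-discovery: the current state of the environment ---
    def discover_tape_drives():
        """Return the tape drives currently online; in a real engine this
        would come from auto-discovery or an existing CMDB."""
        return ["drive0", "drive1"]            # only two of four drives are up

    def in_maintenance_window(now=None):
        window = get_policy("maintenance_window")
        current = (now or datetime.now()).time()
        return window["start"] <= current <= window["end"]

    def run_backup_to_tape():
        """Size the backup streams to the drives actually available."""
        drives = discover_tape_drives()
        if not drives:
            raise RuntimeError("No tape drives online; deferring backup")
        print(f"Starting backup as a {len(drives)}-way stream on {drives}")

    def server_maintenance_workflow():
        if not in_maintenance_window():
            print("Outside the maintenance window; rescheduling")
            return
        run_backup_to_tape()
        ticket = get_policy("ticketing_system")
        print(f"Opening change ticket in {ticket['product']} {ticket['version']}")

    server_maintenance_workflow()

If the Maintenance Window moves or the ticketing system is swapped out, only the central policy entries change; none of the workflows that reference them have to be edited.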


During the Gartner show, I walked by many of the booths looking at what these supposedly “bleeding edge” vendors offered. Suffice it to say, I wasn’t dazzled. Most of the current crop of process automation and orchestration tools are, unfortunately, reduced to providing asinine glue that pieces together multiple scripts and 3rd party tool interfaces and calls it “automation”. The customers I work with are increasingly interested in seeing offerings that leverage the above capabilities instead. Other than Data Palette, I don’t know of any products that do so. But then, you already knew that!

    Wednesday, October 17, 2007

    Data Palette 4.0 is out! Why should you care?

    On October 1, Stratavia released version 4 of Data Palette. I have been looking forward to this event for the following reasons:

    1. Version 4 extends Data Palette’s automation capabilities to areas outside the database - to cover the entire IT life-cycle. This extension centers around three core areas:

    • provisioning, including server, storage and application provisioning;
    • service request and change management automation (e.g., operating system patches, application rollouts, system upgrades, user account maintenance, data migration, etc.); and
    • alert remediation and problem avoidance.

    2. It expands Data Palette’s predictive analytics to storage management and capacity planning.

    3. Further, it allows these predictions to be combined with root cause data and event-correlation rule sets to enable decision automation for fast-changing heterogeneous environments.

    I realize this release of Data Palette casts a rather wide net in the data center automation realm. However, this is not mere bravado from Stratavia, and it is certainly not mere coolness (“coolness” being a bunch of widgets that make for good market-speak, but don’t quite find their way into customer environments). If anything, it reflects the fact that data center automation is just not doable without a broad spectrum of capabilities. For an automation strategy to succeed, it needs to cast a comprehensive net, connecting all the diverse areas of functionality via a shared backbone (the latter being especially key).

    Let’s look at a couple of contrasts to understand this better. Take Opsware for example. They started out with basic server provisioning and then realized there’s more to automation. So they gobbled up a few more vendors (Rendition Networks for network provisioning, Creekpath Systems for storage provisioning and iConclude for run book automation) that couldn’t quite hack it on their own. Then before Opsware could expand any further, they ended up being bought by HP.

    Another example is BMC – a relatively late entrant to the IT automation space. After wallowing in enterprise monitoring and incident management tool-sets for years, BMC made headlines a few months ago when it acquired run book automation vendor RealOps. It followed that up with the recent purchase of network automation provider Emprisa. I’m sure we will continue to see more purchases from HP, BMC and others as they fill out the automation offerings in their portfolios (and try to stick them all together with bubble-gum, prayers and wads of marketing dollars!).

    Buying a bunch of siloed (read, disjointed) tools does not an automation solution make! Why? Because there is no shared intelligence across all these tools. There is no singular policy engine commonly referenced by all of them. There are no central data collection, event detection and correlation abilities. When you define policies or rules in one tool, you often need to repeat the same definition in different ways in the other tools, significantly increasing maintenance overhead. When a policy or rule changes, one needs to go to N different places (N being all the locations where they are defined separately) – assuming it is even feasible to recall exactly all the areas that need to be updated. If you have more than a couple of dozen automation routines in place, good luck trying to reconcile what these things do, what areas they touch and what policies are duplicated across them.

    In the case of Data Palette, this issue is addressed by its central Expert Engine architecture. This engine allows shared use of policy and metadata definitions. They get defined once and can be referenced across multiple automation routines and event rule sets. In other words, all the different areas of functionality and touch-points speak the same language.

    Stratavia has dubbed its 4.0 release the industry’s “most intelligent” data center automation platform. Obviously, intelligence can be a subjective term. I actually see it as the most comprehensive data center automation platform of its kind (look Ma, no bubble-gum…).

    Sunday, September 02, 2007

    Taking a Stab at a Shared IT Industry Definition of "Data Center Automation"

    The problem with certain grandiose terms such as "IT automation" and "data center automation" is that they have no shared definition across vendors in the IT industry. The only thing that's common is their repeated use by multiple sources, all in different contexts and scopes. They become part of the hype vernacular generated by different vendors and their marketing machines, and eventually the term’s true meaning becomes irrelevant. In such a state, everyone thinks they know what it is, but no one really does, and alas, it becomes so ubiquitous that people don't even bother challenging each other's assumptions regarding its scope.

    The term “automation” is rapidly free-falling into just such a state in the IT industry. So here's my stab at level-setting a meaning and scope for automation that (in my humble opinion) can bring both customers and vendors onto the same playing field. Even if it doesn't, if it generates some cross-vendor discussion and allows people to challenge each other's perception of what data center automation ought to mean, my purpose will have been served.

    In a prior blog entry, I referred to 15 specific levels of requirements that need to be addressed for any IT automation solution to be effective. So ladies and gents, here’s that requirements stack. If these requirements are satisfied, you will have reached a state of automation nirvana, and somewhere along the way you will have imbibed (and likely surpassed) the isolated automation capabilities touted by most IT tools vendors.

    The accompanying picture shows Levels 1 to 8 as a pre-requisite to automation, and works its way up to Level 15 (autonomics) via a model that fosters shared intelligence across disparate capabilities and functions. It is important to get the context right with each of these levels before assuming one has attained them. And until one has got a particular level right, it is often futile to attempt the next level. Short-cuts have an uncanny way of short-circuiting the process. That's why so many vendors and organizations with fragmented notions of these requirements just don't cut it in real-world automation deployments – specifically because such approaches lack a sound methodology to build the proper foundation.

    Nuff said. Let’s look at the individual levels now and how they build on one another.

    Level 1 pertains to achieving 360-degree monitoring. This capability breaks through the siloed monitoring that exists today in many environments and allows problems to be viewed across multiple tiers, applications and service stacks in a cohesive manner – preferably, in the same call sequence utilized by the end-user application. Current monitoring deployments often remind me of the old fable of the blind men and the elephant (each blind man would feel a different part of the elephant and perceive the animal to resemble a familiar item; one would touch the elephant’s tail and spread the notion that the animal looks like a rope, whereas another would feel the elephant’s foot and try to convince everyone that the animal resembled a pillar). The lack of a comprehensive and consistent view of the same issue across multiple individuals and groups causes more delays in solving problems than the typical IT manager would admit.

    Levels 2 and 3 utilize 360-degree monitoring for proper problem diagnosis, triage and alerting. Rather than leaving the preliminary diagnosis regarding the nature and origin of a problem to human hands and eyes (say, a Tier 1 or Help Desk team), the monitoring software should be able to examine the problem end-to-end, carry out sufficient root cause analysis, narrow down the scope based on that analysis, and send a ticket to the right silo/individual. Such precision alerting reduces the need for manual decision-making (and the chance of error) regarding which team to assign a ticket to, and positively impacts metrics such as first-time-right and on-time-delivery.
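
    To illustrate what Levels 2 and 3 add on top of raw monitoring, here is a small, purely hypothetical Python sketch: symptoms collected from several tiers are correlated, a likely root cause is picked and the ticket is routed to the matching team rather than dumped on a generic Tier 1 queue. The symptom data, correlation rules and team names are all made up for illustration.

        # Hypothetical sketch of precision alerting: correlate symptoms from
        # several tiers, pick a likely root cause, route the ticket accordingly.

        SYMPTOMS = [
            {"tier": "app",      "metric": "response_time_ms", "value": 4200},
            {"tier": "database", "metric": "active_sessions",  "value": 480},
            {"tier": "storage",  "metric": "avg_io_wait_ms",   "value": 95},
        ]

        # Correlation rules: which combination of symptomatic tiers implicates which team.
        RULES = [
            ({"storage", "database", "app"}, "storage",  "Storage team"),
            ({"database", "app"},            "database", "DBA team"),
            ({"app"},                        "app",      "Application support"),
        ]

        def diagnose(symptoms):
            """Return (root_cause, team) for the first rule whose tiers are all symptomatic."""
            tiers_seen = {s["tier"] for s in symptoms}
            for required_tiers, cause, team in RULES:
                if required_tiers <= tiers_seen:
                    return cause, team
            return "unknown", "Tier 1 help desk"       # fall back to manual triage

        cause, team = diagnose(SYMPTOMS)
        print(f"Probable root cause: {cause}; ticket routed to: {team}")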

    Level 4 pertains to ad-hoc tasks that administrators often do, such as adding a user or changing the configuration parameters of an application. There are lots of popular tools and point solutions for Levels 1 to 4, ranging from monitoring tools to ad-hoc task GUIs. However, their functionality really ends there (or they try to jump all the way from Level 4 to Level 9, automation, skipping the steps in between and leaving the resultant automation of rather limited use).

    Level 5 questions the premise of an “ad-hoc task”. Wisdom from the trenches tells us that there is no such thing as a one-time / ad-hoc task; everything ends up being repeatable. For instance, when creating a user on one database, one may bring up her favorite ad-hoc task GUI, click here, click there, type in a bunch of command attributes and then hit the Execute button. That works for creating one user. However, when the exact same user needs to be rolled out on 20 servers, it involves a lot of pointing, clicking and typing and leaves the environment vulnerable to human error. Suddenly the ad-hoc GUI ceases to be effective.

    Requirement levels 5 to 9 address this by calling for standard operating procedures (SOPs) that can be applied across multiple environments from a central location. Level 6 requires diverse sub-environments to be categorized such that different activities pertaining to them (including all service requests and incident remediation efforts) can all have standard task recipes. This refers to rolling up disparate physical environments into fewer logical components based on policies and usage attributes including service level requirements.

    Task recipes, once defined, need to be maintained in a central knowledgebase, in a format that can easily serve as a blueprint for any subsequent automation. Further, the task recipes need to be directly linked to automation routines/workflows in a 1:1 manner, such that one can reach the workflows from the SOP and vice versa. In other words, there needs to be shared intelligence between the SOP and the automation routine. Keeping the two separate – for instance, keeping the task recipes on a SharePoint portal and the automation routines within a scheduling or workflow tool, or worse, as a set of scripts distributed locally on the target servers – without any hard-wired connection and tracking between the two mediums makes it easy for task recipes and automation routines to get out of sync. Such automation tends to be uncontrolled; each user is left to his or her own devices to leverage it as optimally as possible, and eventually its utility becomes questionable at best, with each administrator customizing the automation routines to their particular environment and individual preferences. Noble notions such as shared intelligence and centralized control across task recipes and automation routines (and eventually, consistency in quality of work) go out the window!
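
    As a rough illustration of that 1:1 linkage (again, a sketch under assumptions of my own, not a description of any particular product), the recipe and the routine that implements it can live in the same record, so one cannot drift without the other. The create_db_user function below is a hypothetical stand-in for the real work.

        # Hypothetical sketch: an SOP (task recipe) stored alongside the routine
        # that implements it, then applied to many targets from one central place.

        def create_db_user(server, username):
            """Stand-in for the real work (e.g., issuing CREATE USER through a DB API)."""
            print(f"[{server}] creating user {username}")

        SOP_CATALOG = {
            "SOP-101": {
                "title":  "Create application database user",
                "recipe": ["Verify the user does not already exist",
                           "Create the user with the standard profile",
                           "Grant the approved role set"],
                "routine": create_db_user,        # the workflow linked 1:1 to the recipe
            },
        }

        def run_sop(sop_id, targets, **params):
            sop = SOP_CATALOG[sop_id]
            print(f"Running {sop_id}: {sop['title']}")
            for server in targets:
                sop["routine"](server, **params)

        # The same recipe rolled out to 20 servers from one location.
        servers = [f"db{i:02d}" for i in range(1, 21)]
        run_sop("SOP-101", servers, username="app_reader")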

    With shared intelligence comes the ability to track and enforce standard task recipes across different personnel and environments. Level 10 states this exact requirement – the ability to maintain a centralized audit trail describing when an automation routine ran, on what server, who ran it (or what event triggered it), what the results were and so on – for ongoing assessment of SOP quality, pruning of the task recipes and corresponding automation code, and finally, ensuring enterprise-wide adherence to best practices.
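
    The audit trail itself does not have to be elaborate. A minimal sketch (the field names here are my own, purely illustrative) might capture something like this for every run and append it to a central store:

        # Hypothetical sketch of a centralized audit record written after each run.
        import json
        from datetime import datetime, timezone

        def record_audit(routine, server, triggered_by, result):
            entry = {
                "routine":      routine,
                "server":       server,
                "triggered_by": triggered_by,     # a user name or an event id
                "result":       result,           # e.g. "success" / "failed"
                "ran_at":       datetime.now(timezone.utc).isoformat(),
            }
            # A flat file stands in for a central repository here.
            with open("automation_audit.log", "a") as audit_log:
                audit_log.write(json.dumps(entry) + "\n")

        record_audit("SOP-101", "db07", "event:disk_alert_4411", "success")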

    Level 11, or virtualization, allows automation to be applied in an easier manner across the different environment categories defined in Level 6. It does so via a “hypervisor” that abstracts and masks the nuances across multiple categories by having an SOP call wrapper apply a task across different environment types (ideally, via an expert system). Within that single SOP call, auto-discovery capabilities identify the current state of a target server or application, invoke a decision tree to determine how best to perform the task, and finally call the appropriate SOP (or sub-SOP, as the case may be) to carry it out in the pre-defined and pre-approved manner most suited to that environment. In other words, Level 11, much like the storage virtualization that occurs within a SAN (think EMC Invista, not VMware), calls for multiple disparate environments to be viewed as a single environment (or at the very least, a smaller subset of environments) and dealt with more easily.
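
    A toy version of that single SOP call wrapper might look like the sketch below (Python, with the server inventory, platform names and sub-SOPs all invented for illustration). The caller sees one entry point; discovery and a simple decision tree pick the right sub-SOP for each target.

        # Hypothetical sketch of a Level 11 "hypervisor" wrapper: one SOP call,
        # with discovery and a decision tree choosing the right sub-SOP per target.

        def discover_environment(server):
            """Stand-in for auto-discovery; returns the platform running on a server."""
            inventory = {"db01": "oracle_10g_rac", "db02": "oracle_9i", "db03": "sqlserver_2005"}
            return inventory[server]

        def patch_oracle_rac(server):
            print(f"{server}: rolling patch, one RAC node at a time")

        def patch_oracle_single(server):
            print(f"{server}: patch during the agreed outage window")

        def patch_sqlserver(server):
            print(f"{server}: apply service pack via the standard job")

        SUB_SOPS = {
            "oracle_10g_rac": patch_oracle_rac,
            "oracle_9i":      patch_oracle_single,
            "sqlserver_2005": patch_sqlserver,
        }

        def apply_patch(server):
            """The single SOP call; the mapping above hides the per-environment differences."""
            environment = discover_environment(server)
            SUB_SOPS[environment](server)

        for server in ("db01", "db02", "db03"):
            apply_patch(server)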

    Levels 12 to 15 allow increasingly sophisticated levels of analysis and intelligence to be applied to a target environment – event correlation, root cause analysis and predictive analytics – to discern problems as they occur or, ideally, before they occur. Those findings are linked back to one or more SOPs, which can be automatically triggered to avert an outage, performance degradation or policy violation and retain the status quo, thereby making the target environment more autonomic.
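
    As a deliberately simple stand-in for the predictive piece, the sketch below fits a straight line to recent disk-usage samples and triggers a clean-up SOP if the projection crosses a threshold a few days out. A real engine would use far richer models; the numbers and the SOP name here are invented.

        # Hypothetical sketch: project a metric forward and trigger an SOP before
        # the threshold is breached, rather than alerting after the fact.

        def linear_forecast(samples, steps_ahead):
            """Fit a least-squares line to equally spaced samples and extrapolate."""
            n = len(samples)
            xs = range(n)
            mean_x = sum(xs) / n
            mean_y = sum(samples) / n
            numer = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
            denom = sum((x - mean_x) ** 2 for x in xs)
            slope = numer / denom
            intercept = mean_y - slope * mean_x
            return slope * (n - 1 + steps_ahead) + intercept

        def trigger_sop(name):
            print(f"Triggering SOP: {name}")

        # Daily disk-usage samples (percent full) and a 90% threshold.
        usage = [71, 73, 74, 77, 79, 82, 84]
        projected = linear_forecast(usage, steps_ahead=3)      # three days out
        print(f"Projected usage in 3 days: {projected:.1f}%")
        if projected >= 90:
            trigger_sop("archive-and-purge-old-data")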

    Granted, many companies and administrators may not have the appetite to go all the way up to Level 15 for most administrative tasks. Regardless, I would hope this model provides a clear(er) roadmap for automation than different groups and individuals cobbling together a bunch of proprietary scripts and Word documents and keeping the knowledge of exactly how to execute them inside their heads – or worse, each group relying on a bunch of disjointed tools, vendors and promises, resulting in a fragmented automation strategy and chaotic (read, unmeasurable) results.

    Maybe unmeasurable results are not such a bad thing, especially if you are a large and well-established software vendor, who has little value to deliver, but thrives on FUD to keep the market confused.

    If you are an IT manager or administrator and have a favorite software tools vendor, ask them how their automation strategy stacks up against these 15 levels. Drop me a note if you get an answer back.

    Thursday, August 09, 2007

    Huge Disparity Between Generic IT Automation Trends and Database Automation Initiatives

    Let me start out with an apology for having been somewhat of an irregular blogger over the last few months. The good thing is, I haven’t been slacking off (at least not the entire time). Stratavia’s sustained growth has kept me on my toes. We just closed our Series B round and brought in several more customers. I have been working to set up an internal automation task force (ATF) within some of the larger customers to target areas and activities that would yield the most efficiencies via standardization and automation. This ATF effort has been received well as it is helping larger organizations gain value more rapidly from Data Palette and their overall automation efforts (in weeks rather than months). Also, I was fortunate enough to be able to present two papers at Collaborate 2007 in Vegas (If you attended one of those sessions – thank you!) and helped with Stratavia’s sponsorship of the recent Gartner Symposium in San Francisco.

    At the symposium, I was rather surprised to see the space reserved for “data center operations” bustling with several new vendors and their automation offerings. Having focused on database automation for a while and being only too familiar with the typical DBA shop’s lukewarmness in embracing process automation, I thought the overall IT automation space was still nascent. Boy, was I mistaken!

    As I compare generic IT automation trends with the prevailing action (or lack thereof) in the database automation arena, I’m literally filled with despair for the latter. At the Collaborate 2007 event, the sessions that drew the most attendees seemed to be either on Oracle RAC (real application clusters) or performance tuning – which tells me the typical DBA is still enamored with learning how to configure clustered databases and/or turning performance knobs. She is in uncharted territory when it comes to process and decision automation – in fact, most DBAs still seem to be unfamiliar with the term run book automation (a sad statistic… especially in this day and age). Their knowledge of automation seems largely limited to “lights out” monitoring and patching – the kind provided by toys… oops! I mean tools like Oracle’s Enterprise Manager/Grid Control and Microsoft MOM. (Hello, wake up, it’s not the 1990s anymore!)

    On the other hand, there has been a ton of interesting activity in the generic IT automation space, as evidenced by the recent $1.65 billion purchase of Opsware by HP and the acquisition of RealOps by BMC. The only downside to these transactions is that some of these products, or rather the capabilities in these product portfolios, will likely die a premature death as the considerably larger acquirers lumber along without knowing how to properly position and market these new-found competencies. (I just can’t see the larger vendors spending the requisite time and effort to evangelize and market these offerings effectively. Maybe I’m wrong, and with these 800-pound entities entering this space, they will bring more legitimacy and avoid the need for an “evangelical sale” altogether. That would be a nice position for the IT automation industry to be in.) Hopefully the data management and database administration spaces will catch up with the rest of the IT automation industry soon.

    Anyway, generic IT automation is not without its share of problems. The biggest one I see is that many CIOs continue to equate IT automation with just provisioning and patching. Heck, there’s so much more, and by limiting the scope of automation, companies are not gaining its full value. When you limit your scope, you inherently begin to seek and attract point solutions. No doubt, you need to start somewhere, and provisioning is as good an area as any to get started, but try not to be too narrowly focused. Also, don’t just sit there basking in the results of a provisioning deployment for months and years. Get your Phase 1 scoped and deployed, measure results, make the appropriate tweaks and then move on to the next phase and the next set of “high-bar” tasks. And most importantly, make sure the automation platform you select has the capability to accommodate all of these high-bar activities without the need to acquire more tools.

    The advent of run book automation (which, BTW, is a nice descriptive name for process automation coined by Gartner) has really taken overall IT automation to the next level. Readers of my prior blog entries have heard me rave about decision automation. I personally see decision automation as the next evolution of run book automation – “run book automation on steroids”, the steroids being capabilities such as auto-discovery, metadata injection, root cause analysis and predictive analytics, things that are rather essential for truly automating decisions. Decision automation takes run book automation beyond vanilla process orchestration, or being merely the glue for carrying out ITIL-based activities in a specific sequence, by incorporating a model for extracting and leveraging collective intelligence across processes.

    Over the last year, Stratavia has been marrying run book automation with decision automation capabilities, especially in and around the database. That is proving to be a real game changer for many customers because a huge gap has existed in this area. Interestingly, many of Stratavia’s customers tend to be companies that already have provisioning and run book automation tools such as Opsware and RealOps, but have used them in limited/isolated areas such as server patching, software releases, and so on. By augmenting run book automation with root cause analysis capabilities (which is beyond the typical provisioning and run book automation tools’ competency), companies can better relate to recurring and disruptive problem patterns and reduce the number of incidents by converting reactive alerts into proactive maintenance processes. That affects typical SLA metrics like on-time-delivery and first-time-right, and makes the business more agile via speedier project deployments and enhanced time-to-market. The CIOs of these organizations are not content with merely rolling out an enterprise monitoring and ticketing solution, but are relentless in wanting to see themselves better aligned with the Service-Oriented or Value-Driven levels of Gartner’s operational maturity model and the advantages it brings their customers and employees.

    Unfortunately this is not a universal trend yet. Many companies have yet to look at IT automation in a holistic way. The right way to go about it is not to jump directly into automation, address a point area or task and then rest on your laurels. I respectfully suggest the right way to automate comprises 15 distinct steps, and these steps have to be implemented in exactly the right order. Companies and products that try to short-cut the process (for example, by going directly to automation without first building standardized and centralized operating procedures) end up not realizing the value they aimed for. (What is this “15-step” business, you ask? Well, more on that in a future blog entry...)

    So what are the vendor/product choices for companies looking to implement IT automation the “right way”? I would urge you to consider Stratavia’s Data Palette without the slightest hesitation. A recent article in TheDeal (available at www.thedeal.com; the site requires a subscription to view content) has Gartner’s David Williams describing Stratavia as the only company he knows of that brings run book automation capability to databases. (The exact statement in the article is: “Stratavia makes run book automation software for an IT database, or software that enables a user to define a set of processes and integrate the necessary components to execute it automatically, according to Gartner Inc. analyst David Williams. The six-year-old company is one of the only companies Williams said he knows of to focus specifically on the underlying database.” If you are interested in the entire article, drop me a line.) I don’t know about you, but to me, that’s a revealing statement from a well-respected analyst. And BTW, that’s what some of my customers are telling me as well – after evaluating other monitoring tools and database automation point solutions such as the Oracle Enterprise Managers and Grid Controls of the world, they are choosing to move ahead with a Data Palette deployment.

    Do you know of any other process automation or decision automation products for databases? If so, please point me towards them; I would love to learn more and write about them.

    Sunday, April 08, 2007

    An “ideal database health check” – what does that really mean?

    A good friend of mine, a senior Oracle DBA, recently looked at the automated database health check report we provide our customers and pointed out an interesting fact. He said "out of the 40 or so metrics you guys look at, at least 10 are outdated!" When I asked him to expound further, he responded “well, your report is still looking at cache hit ratios and such. The prevailing sentiment in the DBA community (especially the Oracle DBA community) is that hit ratios are useless. What truly matters are the wait stats and trace output.”

    Now, is that really so? I queried about a dozen of my other DBA friends on what they thought and unfortunately, the message was mixed. Hours of Google searches showed a disappointing trend: the term “health check” is somewhat abused and no two health check utilities out there are alike. Every DBA group that uses one seems to kinda make one up on the fly. I wish there were a standard health check one could rely on.

    Anyway, my limited research makes me wonder – why isn’t there an easily available standard health check? And more importantly, what really constitutes an “ideal” health check? Here are my two cents on the topic:

    An ideal health check should provide an overview of a database’s stability across three major areas for consideration:

    • Availability
    • Performance
    • Scalability

    Now, one could argue that all 3 of the above areas just point to one thing: performance. However, from a practical standpoint, I would like to treat these 3 areas as distinct. For instance, in my mind, the well-being of a hot standby maintained via a Data Guard physical configuration is required for ensuring high availability, not necessarily for performance; whereas ensuring sufficient buffer caches have been configured facilitates optimal performance more so than availability. Also, in certain cases, performance- and availability-related actions may be at odds with one another. For instance, one may configure a redo-log switch every 10 minutes to ensure the primary database and hot standby are within 10-15 minutes of each other (which helps availability), whereas the same action may result in higher I/O and negatively impact performance (especially if the system is already I/O saturated).

    So, going back to the “ideal healthcheck”, one should ensure that there are no existing or upcoming issues that are impeding or are likely to impede any of these three areas. Below is a “top 40” list of things that would help ensure that is indeed the case. Comments are placed in-line, next to the statistic name/type, in italics.

    Category: Availability
    1. Database space (Should consider both database space and OS space; would be ideal to also consider growth trends.)
    2. Archive log space
    3. Dump area space (bdump, cdump, adump, udump, etc.)
    4. Archive logs being generated as per SLA (Note: This requires the health check utility to have an understanding of the SLA pertaining to standby database upkeep.)
    5. Archive logs being propagated as per SLA
    6. Snapshot/Materialized view status
    7. Status of DBMS Jobs
    8. Replication collisions
    9. Backup status
    10. Online redo logs multiplexed (On different mount-points.)
    11. Control file multiplexed (On different mount-points.)
    12. Misc errors (potential bugs) in alert.log

    Category: Performance
    13. Disparate segment types (tables, indexes, etc.) in same tablespace
    14. SYSTEM tablespace being granted as default or temp tablespace
    15. Temporary tablespace not being a true temp tablespace
    16. Deadlock related errors in alert.log
    17. Non-symmetric segments or non-equi sized extents in tablespace (For dictionary managed tablespaces.)
    18. Invalid objects
    19. Any event that incurred a wait over X seconds (“X” to be defined by user during healthcheck report execution. Default value could be 5 seconds. Obviously, for this value to be available, some kind of stats recording mechanism needs to be in place. In our case, Data Palette is used to collect these stats so the health check report can query the Data Palette repository for wait events and corresponding durations.)
    20. Hit ratios: DB buffer cache, redo log, SQL area, dictionary/row cache, etc. (While there is mixed opinion on whether these are useful, I like to include them for DBAs that do rely on them to identify whether any memory shortage exists in the database instance and to adjust the related resource(s) accordingly. While I do have an opinion on this matter, my goal is not to argue whether this stat is useful or not; instead, it’s to provide it to the people that need it – and there are quite a few folks that still value hit ratios.)
    21. I/O / disk busy (I/O stats, at the OS and database levels.)
    22. CPU load average or queue size
    23. RAM usage
    24. Swap space usage
    25. Network bandwidth usage (Input errors, output errors, queue size, collisions, etc.)
    26. Multi-threaded settings (Servers, dispatchers, circuits, etc.)
    27. RAC related statistics (False pings, cache fusion and interconnect traffic, etc. – based on the Oracle version.)

    Category: Scalability (Note: For ensuring there are no scalability related issues, the health check generating mechanism ideally should be able to relate to current resource consumption trends and apply predictive algorithms to discern whether there will be contention or shortfall. In the absence of such predictive capabilities, a basic health check routine can still use thresholds to determine whether a resource is close to being depleted.)

    28. Sessions
    29. Processes
    30. Multi-threaded resources (dispatchers, servers, circuits, etc.)
    31. Disk Space
    32. Memory structures (locks, latches, semaphores, etc.)
    33. I/O
    34. CPU
    35. RAM
    36. Swap space
    37. Network bandwidth
    38. RAC related statistics (False pings, cache fusion and interconnect traffic, etc. – based on the Oracle version.)
    39. Understanding system resources consumed by non-DB processes running on the same server/domain (3rd party applications such as ETL jobs, webservers, app servers, etc.)
    40. Understanding system resources consumed by DB-related processes running outside their normal scheduled window (Applications such as backup processes, archive log propagation, monitoring (OEM) agents, etc. This requires the health check utility to know which processes are related to the database and their normal execution time/frequency.)

    The health check utility can show the above areas in red, yellow or green depending on whether a statistic needs urgent attention (red), needs review (yellow) or is healthy (green). Accordingly, the health of the database can be numerically quantified based on where each statistic shows up (a statistic in green status earns 1 point, a yellow earns 0.5 and a red earns 0).
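
    The scoring itself is trivial to implement. Here is a small sketch (the statistic names and statuses are made up) that turns red/yellow/green statuses into the numeric health figure described above:

        # Hypothetical sketch of the red/yellow/green scoring described above.
        POINTS = {"green": 1.0, "yellow": 0.5, "red": 0.0}

        # A handful of statistic statuses, as a health check run might produce them.
        statuses = {
            "database_space":    "green",
            "archive_log_space": "yellow",
            "backup_status":     "green",
            "invalid_objects":   "red",
            "cpu_load":          "green",
        }

        score = sum(POINTS[status] for status in statuses.values())
        health_pct = 100 * score / len(statuses)
        print(f"Health score: {score}/{len(statuses)} ({health_pct:.0f}%)")
        for stat, status in sorted(statuses.items(), key=lambda kv: POINTS[kv[1]]):
            if status != "green":
                print(f"  needs attention: {stat} is {status}")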

    The above list can be expanded to accommodate additional configurations such as a custom standby setup or database audit requirements, and statistics pertaining to more complex environments such as RAC and Advanced Replication. Entire categories can even be added as appropriate. For example, “Security” would be a good category to add, with statistics pertaining to user-related information including account lockouts, password change status, tablespace quotas, audit usage, virtual private database configuration, and so on.

    Database- and instance-level health checks need to be shown separately, regardless of RAC. Statistics such as database space would show up at the database level, whereas statistics such as memory settings and hit ratios would appear in the instance health check report.

    Lastly, it would be nice if the health check could include a recommendation on how to resolve a statistic that’s showing up in yellow or red, or even better, offer a link to an automated standard operating procedure that would help fix the situation.

    I have meant this to be an open list. So any thoughts from the community on further augmenting (or better yet, standardizing) this list would be appreciated.


    Tuesday, February 06, 2007

    Using Decision Automation in Disaster Recovery

    DM Review recently published my article on “Using Decision Automation in Database Administration” (http://www.datawarehouse.com/article/?articleId=6876). Right after that, I got this interesting question from a reader enquiring about leveraging decision automation in disaster recovery (DR). Specifically, she asks: “A good application I can think of for decision automation is disaster recovery. Can you outline a model for how one could go about implementing it?” I wanted to share my response with everyone. So here I go…

    Once the right DR model is designed, automation in general is invaluable in rolling it out and maintaining it – ensuring it works as per spec (“spec” being the enterprise service level requirements pertaining to uptime, failover and performance). However, the one area that decision automation is specifically suited for is establishing a centralized command/control mechanism that allows failover to occur under the right circumstances and, in the process, deals with false alarms.

    The foremost (and most natural) reaction of many IT administrators when they face an outage (even a simple disk-array failure, much less a true disaster) is panic. During this state, the last thing you want them to do is think on their feet to figure out whether to fail over to the DR system and, if so, go about making that happen manually. Especially if there are other unanticipated hiccups in the process (which happen more often than not at that very time, given Murphy's law), that usually results in service level violations or, worse, a failed DR initiative.


    To better appreciate this, imagine a nation’s nuclear arsenal being subject to manual controls and individual whims rather than a fully automated and centralized command/control system. If a human being (be it the president, the prime minister or your local fanatic dictator) or, worse, multiple humans had to make the launch call and then deploy manually, the system would be prone to mood swings and other assorted emotions and errors, resulting in the weapons being deployed prematurely, or not at all even when the situation calls for it. Further, once the decision is made to deploy or not deploy, manual errors can delay or prevent the required outcome.

    These are the situations where decision automation can come to the rescue. Decision automation can completely do away with the manual aspects of deciding whether or not to initiate failover and how (partial or full). It can coldly look at the facts it has accumulated and allow pre-determined business logic to figure out next steps and initiate an automated response sequence as required. Let's look at what that means.

    Let's say the service level agreement requires a particular system or database to provide 99.999% uptime (five 9s). That means the database can incur at most about five minutes of unplanned downtime during the year (0.001% of 525,600 minutes, or roughly 5.3 minutes).

    Let’s make the problem even more interesting. Let's pretend the company is somewhat cash-poor. Instead of spending gobs of money on geo-mirroring solutions like EMC SRDF and other hardware/network level replication (even “point-in-time copy” generators like EMC’s SnapView, TimeFinder or NetApp’s SnapVault), it has invested in a simple 2-node Linux cluster to act as the primary system and two other SMP machines to host a couple of different standby databases. The first standby database is kept 15 minutes behind the primary database and the other standby remains 30 minutes behind the primary. (The reason for the additional delay in the latter case is to provide a longer window of time to identify and stop any user errors or logical corruption from being propagated to the second standby.) The primary cluster is hosted in one location (say, Denver) and the other two standby databases are being hosted in geographically disparate data centers with data propagation occurring over a WAN.

    Given this scenario, at least two different types of failover options emerge:
    - Failover from the Primary cluster to the first standby
    - Failover from the Primary to the second standby

    The underlying mechanisms to initiate and process each option are different as well. Also, given that the failover occurs from a two-node clustered environment to a single node, there may be a service level impact that needs to be considered. But first things first!

    Rolling out this DR infrastructure should be planned, standardized, centralized and automated (not “scripted”, but automated via a robust run book automation platform). You may think, why the heck not just do it manually? Because building any DR infrastructure, no matter how simplistic or arcane, is never a one-time thing! (Not unlike most IT tasks…) Once failover occurs, failback has to follow eventually. The failover/failback cycle has to be repeated multiple times during fire-drills as well as during an actual failure scenario. Once failover to a secondary database happens, the primary database has to be rebuilt and restored to become a secondary system itself – or, in the above situation, since the primary is a cluster (which presumably provides more horsepower, and which operates at reduced capacity once it fails over to a single node), failback has to happen to reinstate the cluster as the primary server. So with frequent cycles of failover, rebuilding and failback, you don't want people doing all these tasks manually. Inconsistency in quality of work (depending on which DBA is doing which task) could let inefficiencies creep in, resulting in a flawed DR infrastructure, and human errors could compromise the validity of the infrastructure.

    So the entire process of building, testing, maintaining and auditing the DR infrastructure needs to be standardized and automated. Also included in the “maintenance” part is the extraction of the archived/transaction logs every 15 minutes, propagating them to the first standby server and applying them. Similarly, in the case of the second standby, the logs have to be applied every 30 minutes to ensure the fixed latency is maintained. In certain versions of certain database platforms (namely, Oracle and DB2), there are DBMS-supplied options to establish the same. Regardless, the right mechanisms and components have to be chosen and deployed in a standardized and automated manner. Period.
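
    A very rough sketch of that maintenance piece might look like the following: a single routine, parameterized by each standby's configured lag, that finds archived logs old enough to ship, copies them over and applies them. The function names (fetch_new_archive_logs, copy_to_standby, apply_on_standby) and the sample log list are placeholders for whatever mechanism the DBMS actually provides; this is illustrative, not a working log-shipping tool.

        # Hypothetical sketch of automated log shipping for standbys kept at a
        # fixed lag (15 and 30 minutes in the example above).
        from datetime import datetime, timedelta

        def fetch_new_archive_logs(primary):
            """Placeholder: return (log_name, created_at) pairs not yet shipped."""
            now = datetime.now()
            return [("arch_1001.log", now - timedelta(minutes=40)),
                    ("arch_1002.log", now - timedelta(minutes=20)),
                    ("arch_1003.log", now - timedelta(minutes=5))]

        def copy_to_standby(log, standby):
            print(f"copying {log} to {standby}")

        def apply_on_standby(log, standby):
            print(f"applying {log} on {standby}")

        def ship_and_apply(primary, standby, lag_minutes):
            """Ship and apply only the logs old enough to preserve the configured lag."""
            cutoff = datetime.now() - timedelta(minutes=lag_minutes)
            for log, created_at in fetch_new_archive_logs(primary):
                if created_at <= cutoff:
                    copy_to_standby(log, standby)
                    apply_on_standby(log, standby)

        # Scheduled by the automation platform every few minutes:
        ship_and_apply("denver-cluster", "standby1", lag_minutes=15)
        ship_and_apply("denver-cluster", "standby2", lag_minutes=30)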

    Now, the DR infrastructure could be well deployed, tested and humming, but at the critical time of failover, it needs to be evaluated what kind of failure is being experienced, what the impact to the business is, how best to contain the problem symptoms, how to transition to a safe system and how to deal with the impact on performance levels. Applying that to our example, it needs to be determined (a sketch of this decision tree follows the list):
    - Is the problem contained in a single node of the cluster? To use an Oracle example, if it’s an instance crash, then services (including existing connections in some cases, such as via Transparent Application Failover or TAF) can be smoothly migrated to the other cluster node.


    - If the database itself is unavailable due to the shared file-system having crashed (in spite of its RAID configuration), then the cluster has to be abandoned and services need to be transferred to the first standby (the one that’s 15 minutes behind). As part of the transition, any newer archived log files that haven’t yet been applied need to be copied and applied to the first standby. If the parts of the filesystem that hold the online redo logs are not impacted, they need to be archived as well to extract the most current transactions, then copied over and applied. Once the standby is synchronized to the fullest extent possible, it needs to be brought up along with any related database and application services, such as the listener process. Applications need to be rerouted to the first standby (which is now the primary database), either implicitly by reconfiguring any middleware, or explicitly by choosing a different application configuration file/address, or even via lower level mechanisms such as IP address takeover, wherein the standby server takes over the public IP address of the cluster.

    - If for any reason, the first standby is not reliable (say, logical corruption has spread to that server as well or it is not starting up in the manner expected due to other problems), the decision needs to be made to go to the second standby, carry out the above described process and bring up the necessary services.
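
    Pulling those three cases together, a decision automation routine for this scenario boils down to something like the sketch below. Every probe (instance_alive, shared_storage_ok, standby_healthy) is a placeholder for real health checks, the hard-coded return values simulate one particular failure, and the actions are reduced to print statements; the point is simply that the branching logic is agreed upon in advance and executed without a human in the loop.

        # Hypothetical sketch of the failover decision tree described above.

        def instance_alive(node):
            return node != "node1"                  # pretend node1's instance crashed

        def shared_storage_ok():
            return False                            # pretend the shared filesystem is gone

        def standby_healthy(standby):
            return standby == "standby2"            # pretend standby1 is unreliable too

        def relocate_services(to_node):
            print(f"Instance failure only: relocating services to {to_node} (e.g., via TAF)")

        def fail_over_to(standby):
            print(f"Shipping remaining logs, activating {standby}, rerouting applications")

        def decide_and_act():
            # Case 1: problem confined to one cluster node -> stay on the cluster.
            if shared_storage_ok():
                surviving = [n for n in ("node1", "node2") if instance_alive(n)]
                if surviving:
                    relocate_services(surviving[0])
                    return
            # Case 2: the cluster/storage is gone -> prefer the 15-minute standby.
            if standby_healthy("standby1"):
                fail_over_to("standby1")
                return
            # Case 3: the first standby is unreliable -> fall back to the 30-minute standby.
            if standby_healthy("standby2"):
                fail_over_to("standby2")
                return
            raise RuntimeError("No viable failover target; escalate to humans")

        decide_and_act()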

    The real challenge is not just having to make all these decisions on the fly based on the type and scope of the failure and the state of the different primary and standby systems; it is having to make them and complete the entire failover process within the required time (5 minutes, in our example). Organizations cannot realistically expect a panic-stricken human being to carry all this out quickly on the fly (yet, ironically, that’s exactly what happens 80% of the time!).

    IT professionals sometimes say, “I don’t trust automation; I need to be able to do things myself to be able to trust it.” Well, would YOU entrust such a delicate process, fraught with constant analyzing/re-analyzing and ad-hoc decision making, to a human being who may be sick, on vacation and out of the office, or who may have left the company? Would you be content merely documenting the process and hoping someone else (other than the person who designed and built the solution) on the IT team can perform it without problems? I, for one, wouldn’t.

    The range of situations for applying decision automation is manifold, and implementing a centralized command/control system for initiating failover in the DR process just happens to be an ideal candidate.