Friday, May 15, 2009

Implementing a Simple Internal Database or Application Cloud – Part II

Based on the overview of private clouds in my prior blog, here’s the 5-step recipe for launching your implementation:

  1. Identify the list of applications that you want to deploy on the cloud;

  2. Document and publish precise end-user requirements and service levels to set the right expectations – for example, time in hours to deliver a new database server;

  3. Identify underlying hardware and software components – both new and legacy – that will make up the cloud infrastructure;

  4. Select the underlying cloud management layer for enabling existing processes, and connecting to existing tools and reporting dashboards;

  5. Decide if you wish to tie in access to public clouds for cloud bursting, cloud covering or simply backup/archival purposes.

Identify the list of applications that you want to deploy on the cloud
The primary reason for building a private cloud is control – not only in terms of owning and maintaining proprietary data within a corporate firewall (mitigating any ownership, compliance or security concerns), but also deciding what application and database assets to enable within the cloud infrastructure. Public clouds typically give you the option of an x86 server running either Windows or Linux. On the database front, it’s usually either Oracle or SQL Server (or the somewhat inconsequential MySQL). But what if your core applications are built to use DB2 UDB, Sybase or Informix? What if you rely on older versions of Oracle or SQL Server? What if your internal standards are built around AIX, Solaris or HP/UX systems? What if your applications need specific platform builds? Having a private cloud gives you full control over all of these areas.

The selection of applications and databases to reside on the cloud should be governed by a single criterion – popularity; i.e., which applications are likely to proliferate the most in the next 2 years, across development, test and production. List your top 5 or 10 applications and their infrastructure – regardless of operating system and database type(s) and version(s).

Document precise requirements and SLAs for your cloud
My prior blog entry talked about broad requirements (such as self-service capabilities, real-time management and asset reuse), but break those down into detailed requirements that your private cloud needs to meet. Share those with your architecture / ops engineering peers, as well as target users (application teams, developer/QA leads, etc.) and gather input. For instance, a cloud deployment that I’m currently involved with is working to meet the following manifesto, arrived at after a series of meticulous workshops attended by both IT stakeholders and cloud users:


In the above example, BMC Remedy was chosen as the front-end for driving self-service requests largely because the users were already familiar with using that application for incident and change management. In its place, you can utilize any other ticketing system (e.g., HP/Peregrine or EMC Infra) to present a friendly service catalog to end users. Up-and-coming vendor Service-now extends the notion of a service catalog to include a full-blown shopping cart and corresponding billing. Also, depending on which cloud management software you utilize, you may have additional (built-in) options for presenting a custom front-end to your users – whether they are IT-savvy developers, junior admin personnel located off-shore or actual application end-users.


Identify underlying cloud components
Once you have your list of applications and corresponding requirements laid out, you can begin defining which hardware and software components you are going to use. For instance, can you standardize on the x86 platform for servers with VMware as the virtualization layer? Or do you have a significant investment in IBM AIX, HP or Sun hardware? Each platform tends to have its own virtualization layer (e.g., AIX LPARs, Solaris Containers, etc.), all of which can be utilized within the cloud. Similarly, for the storage layer, can you get away with just one vendor offering – say, NetApp filers – or do you need to accommodate multiple storage options such as EMC and Hitachi? Again, the powerful thing about private clouds is – you get to choose! During a recent cloud deployment, the customer required us to utilize EMC SANs for production application deployments, and NetApp for development and QA.

Also, based on application use profiles and corresponding availability and performance SLAs, you may need to include clustering or facilities for standby databases (e.g., Oracle Data Guard or SQL log shipping) and/or replication (e.g., GoldenGate, Sybase Replication Server).

Now as you read this, you are probably saying to yourself – “Hey wait a minute… I thought this was supposed to be a recipe for a ‘simple cloud’. By the time I have identified all the requirements, applications and underlying components (especially with my huge legacy footprint), the cloud will become anything but simple! It may take years and gobs of money to implement anything remotely close to this…” Did I read your mind accurately? Alright, let’s address this question below.


Select the right cloud management layer
Based on all the above items, the scope of the cloud and underlying implementation logistics can become rather daunting – making the notion of a “simple cloud” seem unachievable. However, here’s where the cloud management layer comes to the rescue. A good cloud management layer keeps things simple via three basic functions:

  1. Abstraction;
  2. Integration; and
  3. Out-of-the-box automation content
Cloud management software ties in existing services, tools and processes to orchestrate and automate specific cloud management functions end-to-end. Stratavia’s Data Palette is an example of intelligent cloud management software. Data Palette is able to accommodate diverse tasks and processes – such as asset provisioning, patching, data refreshes and migrations, resource metering, maintenance and decommissioning – due to significant out-of-the-box content. (Stratavia refers to them as Solution Packs, which are basically discrete products that can be plugged into the underlying Data Palette platform.) All such content is externally referenceable via its Web Service API, making it easy to integrate with existing tools and 3rd party service catalogs such as Service-now or Remedy.
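To make the integration pattern concrete, here’s a minimal sketch of how a service catalog item might call a cloud management layer’s web service API to kick off a workflow. The endpoint URL, operation and field names below are hypothetical stand-ins – they illustrate the hand-off, not Data Palette’s actual API.

```python
# Conceptual sketch only: a 3rd-party service catalog calling a cloud management
# layer's web service API to launch an out-of-the-box automation workflow.
# The URL, operation name and parameters are hypothetical, not a real product API.
import json
import urllib.request

def launch_workflow(workflow_name, parameters):
    """Ask the cloud management layer to run a named automation workflow."""
    payload = {"workflow": workflow_name, "parameters": parameters}
    req = urllib.request.Request(
        "https://cloud-mgmt.example.com/api/workflows/run",   # hypothetical endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())       # e.g. {"job_id": "1234", "status": "queued"}

# A Remedy or Service-now catalog item could map straight onto a call like:
# launch_workflow("database_refresh", {"source": "PRODDB1", "target": "QADB3"})
```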

Data Palette does not impose restrictions on server, operating system and infrastructure components within the cloud. Its database Solution Packs support various flavors and versions of Oracle, SQL Server, DB2, Sybase and Informix running on UNIX (Solaris, HP/UX and AIX), Linux and Windows. Storage components such as NetApp are managed via the Zephyr API (ZAPI) and OnTapi interfaces.

However, in addition to out-of-the-box integration and automation capabilities, the primary means of keeping complexity at bay is the abstraction layer. Data Palette uses a metadata repository that is populated via native auto-discovery (or via integration with pre-deployed CMDBs and monitoring tools) to gather a set of configuration and performance metadata that identifies the current state of the infrastructure, along with centrally defined administrative policies. This central metadata repository makes it possible for the right automated procedures to be executed on the right kind of infrastructure – avoiding mistakes (typically associated with static workflows from classic run book automation products) such as executing an Oracle9i-specific data refresh method on a recently upgraded Oracle10g database – without the cloud administrator or user having to track and reconcile such infrastructural changes and manually tweak the automation workflows. Such metadata-driven automation keeps the workflows dynamic, allowing the automation to scale seamlessly to hundreds or thousands of heterogeneous servers, databases and applications.
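Here’s a toy illustration of that metadata-driven dispatch idea (not Data Palette’s internals – the repository contents, function names and version check are all made up for the example):

```python
# A minimal sketch of metadata-driven dispatch: the automation engine consults a
# central metadata repository for the asset's *current* discovered state before
# picking a procedure, so a database that was quietly upgraded from 9i to 10g
# automatically gets the 10g refresh method. Everything here is illustrative.

METADATA_REPOSITORY = {
    # populated by auto-discovery or CMDB integration; values are made up
    "findb01": {"db_type": "oracle", "version": "10.2.0.4", "role": "qa"},
    "hrdb02":  {"db_type": "oracle", "version": "9.2.0.8",  "role": "dev"},
}

def refresh_oracle9i(target):
    print(f"refreshing {target} with the 9i export/import method")

def refresh_oracle10g(target):
    print(f"refreshing {target} with the 10g Data Pump method")

def run_data_refresh(target):
    """Pick the refresh procedure based on discovered metadata, not a static workflow."""
    version = METADATA_REPOSITORY[target]["version"]
    if version.startswith("9."):
        refresh_oracle9i(target)
    else:
        refresh_oracle10g(target)

run_data_refresh("findb01")   # uses the 10g method because discovery says 10.2.0.4
```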

Metadata collections can also be extended to specific (custom) application configurations and behavior. Data Palette allows Rule Sets to be applied to incoming collections to identify and respond to maintenance events and service level violations in real-time, making the cloud autonomic (in terms of self-configuring and self-healing attributes), with detailed resource metering and administrative dashboards.

Data Palette’s abstraction capabilities also extend to the user interface, wherein specific groups of cloud administrators and users are maintained via multi-tenancy (called Organizations), Smart Groups™ (dynamic groupings of assets and resources), and role-based access control.

Optionally tie in access to public clouds for Cloud Bursting, Cloud Covering or backup/archival purposes
Now that you are familiar with the majority of the ingredients for a private cloud rollout, the last item worth considering is whether to extend the cloud management layer’s set of integrations to tie into a public cloud provider – such as GoGrid, Flexiscale or Amazon EC2. Based on specific application profiles, you may be able to use a public cloud for Cloud Bursting (i.e., selectively leveraging a public cloud during peak usage) and/or Cloud Covering (i.e., automating failover to a public cloud). If you are not comfortable with the notion of a full-service public cloud, you can consider a public sub-cloud (for specific application silos, e.g., “storage only”) such as Nirvanix or EMC Atmos for storing backups in semi-online mode (disk space offered by many of the vendors in this space is relatively cheap – typically 20 cents per GB per month). Most public cloud providers offer an extensive API set that the internal cloud management layer can easily tap into (e.g., check out wiki.gogrid.com). In fact, depending on your internal cloud ingredients, you can take the notion of Cloud Covering to the next level and swap applications running on the internal cloud to the external cloud and back (kind of an inter-cloud VMotion operation, for those of you who are familiar with VMware’s handy VMotion feature). All it takes is an active account (with a credit card) with one of these providers to ensure that your internal cloud has a pre-set path for dynamic growth when required – a nice insurance policy to have for any production application – assuming the application’s infrastructure and security requirements are compatible with that public cloud.
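For what it’s worth, the bursting decision itself boils down to a few lines of logic. The sketch below is assumption-laden – the 85% threshold, the eligibility flag and the PublicCloudClient stand-in are mine, not any provider’s API:

```python
# A rough sketch of the cloud-bursting decision. PublicCloudClient is a
# hypothetical stand-in for a provider SDK (GoGrid, Flexiscale, EC2, etc.);
# the threshold and eligibility flag are assumptions, not product behavior.
from dataclasses import dataclass

BURST_THRESHOLD = 0.85          # burst when the internal pool is 85% utilized

@dataclass
class InternalPool:
    used_capacity: float
    total_capacity: float

class PublicCloudClient:        # stand-in for a real provider SDK
    def provision(self, image, size):
        print(f"provisioning {size} instance from image {image} in the public cloud")
        return {"image": image, "size": size}

def maybe_burst(pool, client, app_profile):
    """Spill an eligible workload to the public cloud during peak usage."""
    utilization = pool.used_capacity / pool.total_capacity
    if utilization < BURST_THRESHOLD:
        return None                                   # internal cloud can absorb the load
    if not app_profile.get("public_cloud_eligible"):  # security/compliance gate
        return None
    return client.provision(app_profile["image"], app_profile["size"])

# Example: an eligible web tier bursts once the internal pool crosses 85%
maybe_burst(InternalPool(90, 100), PublicCloudClient(),
            {"public_cloud_eligible": True, "image": "web-tier-v3", "size": "large"})
```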

Wednesday, January 28, 2009

Implementing a Simple Internal Database or Application Cloud - Part I

A “simple cloud”? That comes across as an oxymoron of sorts since there’s nothing seemingly simple about cloud computing architectures. And further, what do DBAs and app admins have to do with the cloud, you ask? Well, cloud computing offers some exciting new opportunities for both Operations DBAs and Application DBAs – models that are relatively easy to implement, and bring immense value to IT end-users and customers.

The typical large data center environment has already embraced a variety of virtualization technologies at the server and storage levels. Add-on technologies offering automation and abstraction via service oriented architecture (SOA) are now allowing them to extend these capabilities up the stack – towards private database and application sub-clouds. These developments seem more pronounced in the banking, financial services and managed services sectors. However, while working on Data Palette automation projects at Stratavia, every once in a while I do come across IT leaders, architects and operations engineering DBAs in other industries as well, who are beginning to envision how specific facets of private cloud architectures can enable them to service their users and customers more effectively (while also compensating for the workload of colleagues who have exited their companies due to the ongoing economic turmoil). I wanted to specifically share here some of the progress in database and application administration with regard to cloud computing.

So, for those database and application admins who haven’t had a lot of exposure to cloud computing (which BTW, is a common situation since most IT admins and operations DBAs are dealing with boatloads of “real-world hands-on work” rather than participating in the next evolution of database deployments), let’s take a moment to understand what it is and its relative benefits. An “application” in this context refers to any enterprise-level app – both 3rd party (say, SAP or Oracle eBusiness Suite) and home-grown N-Tier apps that have a fairly large footprint. Those are the kind of applications that get maximum benefit from the cloud. Hence I use the word “data center asset” or simply “asset” to refer to any type of database or application. However at times, I do resort to specific database terminology and examples, which can be extrapolated to other application and middleware types as well.

Essentially a cloud architecture refers to a collection of data center assets (say, database instances, or just schemas to allow more granularity) that are dynamically provisioned and managed throughout their lifecycle – based on pre-defined service levels. This lifecycle covers multiple areas starting with deployment planning (e.g., capacity, configuration standards, etc.), provisioning (installation, configuration, patching and upgrades) and maintenance (space management, logfile management, etc.) extending all the way to incident and problem management (fire-fighting, responding to brown-outs and black-outs), and service request management (e.g., data refreshes, app cloning, SQL/DDL release management, and so on). All of these facets are managed centrally such that the entire asset pool can be viewed and controlled as one large asset (effectively virtualizing that asset type into a “cloud”).
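To make that lifecycle a bit more tangible, here’s a toy state machine for a managed asset. The state names and allowed transitions are my own simplification of the paragraph above, not a prescribed model:

```python
# The lifecycle described above, reduced to a toy state machine: each managed
# asset (a database instance, schema, app tier) moves through these states, and
# the allowed transitions are what the cloud layer enforces centrally.
# The state names are illustrative, chosen to mirror the paragraph.

LIFECYCLE = {
    "planned":        ["provisioned"],                    # capacity/config standards
    "provisioned":    ["in_service"],                     # install, configure, patch
    "in_service":     ["in_service", "archived"],         # maintenance, incidents, requests
    "archived":       ["decommissioned", "in_service"],   # semi-online, can be restored
    "decommissioned": [],
}

def transition(asset, current_state, new_state):
    """Move an asset to a new lifecycle state, enforcing the allowed transitions."""
    if new_state not in LIFECYCLE[current_state]:
        raise ValueError(f"{asset}: cannot go from {current_state} to {new_state}")
    print(f"{asset}: {current_state} -> {new_state}")
    return new_state

state = transition("qa_schema_42", "planned", "provisioned")
state = transition("qa_schema_42", state, "in_service")
```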

Here’s a picture representing a fully baked database cloud implementation:

As I had mentioned in a prior blog entry, there are multiple components that have come together to enable a cloud architecture. But more on that later. Let’s look at database/application-specific attributes of a cloud (you could read it as a list of requirements for a database cloud).

  • Self-service capabilities: Database instances or schemas need to be capable of rapidly being provisioned based on user specifications by administrators, or by the users themselves (in selective situations – areas where the administrators feel comfortable giving control to the users directly). This provisioning can be done on existing or new servers (the term “OS images” is more appropriate given that most of the “servers” would be virtual machines rather than real bare metal) with appropriate configuration, security and compliance levels. Schema changes or SQL/DDL releases can be rolled out in a scheduled manner, or on-demand. The bulk of these releases, along with other service requests (such as refreshes, cloning, etc.) should be capable of being carried out by project teams directly – with the right credentials (think, role-based access control; a minimal sketch of this idea follows this list).
  • Real-time infrastructure: I'm borrowing a term from Gartner (specifically, distinguished analyst Donna Scott's vocabulary) to describe this requirement. Basically, the assets need to be maintained in real-time per specific deployment policies (such as development environment versus QA or Stage), tablespaces and datafiles created per specific naming / size conventions and filesystem/mount-point affinity (accommodating specific SAN or NAS devices, different LUN properties and RAID levels for reporting/batch databases versus OLTP environments), data backed up at the requisite frequency per the right backup plan (full, incremental, etc.), resource usage metered, failover/DR occurring as needed, and finally, archived and de-provisioned based on either a specific time-frame (specified at the time of provisioning) or on-demand -- after the user or administrator indicates that the environment is no longer required (or after a specific period of inactivity). All of this needs to be subject to administrative/manual oversight and controls (think, dashboards and reports, as well as ability to interact with or override automated workflow behavior).
  • Asset type abstraction and reuse: One should be able to mix-and-match these asset types. For instance, one can rollout an Oracle-only database farm or a SQL Server-only estate. Alternatively, one can also span multiple database and application platforms allowing the enterprise to better leverage their existing (heterogeneous) assets. Thus, the average resource consumer (i.e., the cloud customer) shouldn’t have to be concerned about what asset types or sub-types are included therein – unless they want to override default decision mechanisms. The intra-cloud standard operating procedures take those physical nuances into account, thereby effectively virtualizing the asset type.
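As promised in the first bullet above, here’s a bare-bones sketch of self-service provisioning gated by role-based access control. The roles, environments and provisioning step are illustrative only:

```python
# A minimal sketch of the self-service idea: a provisioning request is honored
# only if the requester's role permits it in that environment. Roles,
# environments and the provision step are all illustrative.

ALLOWED = {
    # role       -> environments where self-service provisioning is allowed
    "dba":        {"dev", "qa", "stage", "prod"},
    "developer":  {"dev"},
    "qa_lead":    {"dev", "qa"},
}

def provision_database(requester_role, environment, db_type="oracle"):
    """Provision a database instance if role-based access control allows it."""
    if environment not in ALLOWED.get(requester_role, set()):
        raise PermissionError(f"{requester_role} may not self-provision in {environment}")
    # in a real cloud this step would call the provisioning workflow; here we just report it
    print(f"provisioning a {db_type} instance in {environment} for {requester_role}")

provision_database("developer", "dev")     # allowed
# provision_database("developer", "prod")  # would raise PermissionError
```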
The benefits of a database cloud include empowering users to carry out diverse activities in self-service mode in a secure, role-based manner, which, in turn, enhances service levels. Activities such as having a database provisioned or a test environment refreshed can often take multiple hours or even days. Those can be reduced to a fraction of their normal time – reducing latency especially in situations where there need to be hand-offs and task turn-over across multiple IT teams. In addition, the resource-metering and self-managing capabilities of the cloud allow better resource utilization, avoid resource waste, improve performance levels, and reduce outages and other sources of unpredictability.

A cloud, while regarded as bleeding edge by some organizations, is being viewed by larger organizations as critical – especially in the current economic situation. Rather than treating each individual database or application instance as a distinct asset and managing it per its individual requirements, a cloud model allows virtual asset consolidation, thereby allowing many assets to be treated as one and promoting unprecedented economies of scale in resource administration. So as companies continue to scale out data and assets, but cannot afford to correspondingly scale up administrative personnel, the cloud helps them achieve non-linear growth.

Hopefully the attributes and benefits of a database or application cloud (and the tremendous underlying business case) become apparent here. My next blog entry (or two) will focus on the requisite components and the underlying implementation methods to make this model a reality.

Friday, January 09, 2009

Protecting Your IT Operations from Failing IT Services Firms

The recent news about India-based IT outsourcing major Satyam and its top management’s admissions of accounting fraud bring forth shocking and desperate memories of an earlier time – when multiple US conglomerates such as Enron, Arthur Andersen, Tyco, etc. fell under similar circumstances, bringing down with them the careers and aspirations of thousands of employees, customers and investors. Ironically, Satyam (the name means “truth” in Sanskrit), whose management had been duping investors for several years (by their own admission), received the Recognition of Commitment award from the US-based Institute of Internal Auditors in 2006, and was featured in Thomas Friedman’s best-seller “The World is Flat”. Indeed, how the mighty have fallen…

As one empathizes with those affected, the key question that comes to mind is, how do we prevent another Satyam? However that line of questioning seems rather idealistic. The key question should probably be, how can IT outsourcing customers protect themselves from these kinds of fallouts? Given how flat the world is, an outsourcing vendor’s (especially one as ubiquitous as Satyam in this market) fall from grace has reverberations throughout the global IT economy - directly in the form of failed projects, and indirectly in the form of lost credibility for customer CIOs who rely on these outsourcing partners for their critical day-to-day functioning.

Having said that, here are some key precautionary measures (each building on the previous one) that companies can take to protect themselves and their IT operations, beyond standard efforts such as using multiple IT partners, structured processes and centralized documentation/knowledge-bases.
· Move from time & material (T&M) arrangements to fixed-priced contracts
· Move from static knowledge-bases to automated standard operating procedures (SOPs)
· Own the IP associated with process automation

Let’s look at how each of these affords higher protection in situations such as the above:
· Moving from T&M arrangements to fixed price contracts - T&M contracts rarely provide incentive to the IT outsourcing vendor to bring in efficiencies and innovation. The more hours that are billed, the more revenue they make – so,where’s the motivation to reduce the manual labor? On the other hand, T&M labor makes customers vulnerable to loss of institutional knowledge and gives them little to no leverage when negotiating rates or contract renewals because switching out a vendor (especially one that holds much of the “tribal knowledge”) is easier said than done.

With fixed price contracts, the onus of ensuring quality and timely delivery is on the IT services vendor (doing so profitably requires the use of as little labor as possible) and subsequently, one finds more structure (such as better documentation and process definition) and higher use of innovation and automation. All of this works in the favor of the customer and, in the case of a contractor or the vendor no longer being available, makes it easier for a replacement to hit the ground running.

· Moving from static knowledge-bases to automated SOPs – It is no longer enough to have standard operating procedures documented within Sharepoint-type portals. It is crucial to automate these static run books and documented SOPs via data center automation technologies, especially newer run book automation product sets (a.k.a. IT process automation platforms) that allow definition and utilization of embedded knowledge within the process workflows. These technologies allow contractors to move static process documentation to workflows that use this environmental knowledge to actually perform the work. Thus, the current process knowledge no longer merely resides in peoples’ heads, but gets moved to a central software platform, thereby mitigating the impact of losing key contractor personnel or vendors.

· Owning the IP associated with such process automation platforms – Frequently, companies that are using outsourced services ask “why should I invest in automation software? I have already outsourced our IT work to company XYZ. They should be buying and using such software. Ultimately, we have no control over how they perform the work anyway…” The Satyam situation is a classic example of why it behooves end-customers to actually purchase and own the IP related to process automation software, rather than deferring it to the IT services partner. When the process IP is defined within a software platform that the customer owns, it becomes feasible to switch contractors and/or IT services firms. If the IT services firm owns the technology deployment, the corresponding IP walks out the door with the vendor, preventing the customer from retaining the benefit of the embedded process knowledge.

It is advisable for the customer to have some level of control and oversight over how the work is carried out by the vendor. It is fairly commonplace for the customer to insist on the use of specific tools and processes such as ticketing systems, change control mechanisms, monitoring tools and so on. The process automation engine shouldn’t be treated any differently. The bottom line is, whoever has the process IP carries the biggest stick during contract renewals. If owning the technology is not feasible for the customer, at least make sure that the embedded knowledge is in a format that can be retrieved and reused by the next IT services partner that replaces the current one.

Friday, January 02, 2009

Zen and the Art of Automation Continuance

The new year is a good time to start thinking about automation continuance. Most of us initiate automation projects with a focus on a handful of areas – such as provisioning servers or databases, automating the patch process, and so on. While this kind of focus is helpful in ensuring a successful outcome for that specific project, it also has the effect of reducing overall ROI for the company – because once the project is complete, people move on to other day-to-day work patterns (relying on their usual manual methods), instead of continuing to identify, streamline and automate other repetitive and complex activities.

Just recently I was asked by a customer (a senior manager at a Fortune 500 company that has been using data center automation technologies, including HP/Opsware and Stratavia's Data Palette, for almost a year) "how do I influence my DBAs to truly change their behavior? I thought they had tasted blood with automation, but they keep falling back to reactive work. How do I move my team closer to spending the majority of their time on proactive work items such as architecture, performance planning, providing service level dashboards, etc.?” Sure enough, their DBA team started out impressively, automating over half-a-dozen tasks such as database installs, startup/shutdown processes, cloning, tablespace management, etc.; however, over the course of the year, their overall reactive workload seems to have relapsed.

Indeed, it can seem an art to keep IT admins motivated towards continuing automation.

A good friend of mine in the Oracle DBA community, Gaja Krishna Vaidyanatha, coined the phrase “compulsive tuning disorder” to describe DBA behavior that involves spending a lot of time tweaking parameters, statistics and such in the database, almost to the point of negative returns. A dirty little secret in the DBA world is that this affliction frequently extends to areas beyond performance tuning and can be referred to as “compulsive repetitive work disorder”. Most DBAs I work with are aware of their malady, but do not know how to break the cycle. They see repetitive work as something that regularly falls on their plate and that they have no option but to carry out. Some of those activities may be partially automated, but overall, the nature of their work doesn’t change. In fact, they don’t know how to change, nor are they incented or empowered to figure it out.

Given this scenario, it’s almost unreasonable for managers to expect DBAs to change just because a new technology has been introduced in the organization. It almost requires a different management model, laden with heaps of work redefinition, coaching, oversight, re-training and, to cement the behavior modification, a different compensation model. In other words, the surest way to bring about change in people is to change the way they are paid. Leaving DBAs to their own devices and expecting change is not being fair to them. Many DBAs are unsure how any work pattern changes will impact their users’ experience with the databases, or even whether that change will cost them their jobs. It’s just too easy to fall back to their familiar ways of reactive work patterns. After all, the typical long hours of reactive work show one as a hardworking individual, provide a sense of being needed and foster notions of job security.

In these tough economic times, however, sheer hard work doesn’t necessarily translate to job security. Managers are seeking individuals that can come up with smart and innovative ways for non-linear growth. In other words, they are looking to do more with the same team - without killing off that team with super long hours, or having critical work items slip through the cracks.

Automation is the biggest enabler of non-linear growth. With the arrival of the new year, it is a good time to be talking about models that advocate changes to work patterns and corresponding compensation structures. Hopefully you can use the suggestions below to guide and motivate your team to get out of the mundane rut and continue down the path of more automation (assuming, of course, that you have invested in broader application/database automation platforms such as Data Palette that are capable of accommodating your path).

1. Establish a DBA workbook with weights assigned to different types of activity. For instance, “mundane activity” may get a weight of say 30, whereas “strategic work” (whatever that may be for your environment) may be assigned a weight of 70. Now break down both work categories into specific activities that are required in your environment. Make streamlining and automating repetitive task patterns an intrinsic part of the strategic work category. Check your ticketing system to identify accurate and granular work items. Poll your entire DBA team to fill in any gaps (especially if you don’t have usable ticketing data). As a starting point, here’s a DBA workbook template that I had outlined in a prior blog.

2. Introduce a variable compensation portion to the DBAs’ total compensation package (if you don’t already have one) and link that to the DBA workbook - specifically to the corresponding weights. Obviously, this will require a method to verify whether the DBAs are indeed living up to the activity mix in the workbook. Make sure that there are activity IDs and cause codes for each work pattern (whether it’s an incident, service request or whatever). Get maniacal about having DBAs create a ticket for everything they do and properly categorize that activity. Also integrate your automation platform with your ticketing system so you can measure what kinds of mundane activities are being carried out in a lights-out manner. For instance, many Stratavia customers establish ITIL-based run books for repetitive DBA activities within Data Palette. As part of these automated run-books, tickets get auto-created/auto-updated/auto-closed. That in turn will ensure that automated activities, as well as manual activities, get properly logged and relevant data is available for end-of-month (or quarterly or even annual) reconciliation of work goals and results – prior to paying out the bonuses. (A simple reconciliation sketch follows this list.)

If possible, pay out the bonuses at least quarterly. Getting them (or not!) will be a frequent reminder to the team regarding work expected of them versus the work they actually do. If there are situations that truly require the DBAs to stick to mundane work patterns, identify them and get the DBAs to streamline, standardize and automate them in the near future so they no longer pose a distraction from preferred work patterns.
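And here’s the simple reconciliation sketch promised above: pull categorized ticket hours for the period and compare the actual strategic/mundane split against the workbook’s target weights. The categories, hours and weights are purely illustrative:

```python
# Workbook reconciliation sketch: compare the actual mix of categorized ticket
# hours against the workbook's target weights. All numbers are illustrative.

TARGET_WEIGHTS = {"strategic": 70, "mundane": 30}   # from the DBA workbook

ticket_hours = {                                    # e.g. exported from the ticketing system
    "strategic": {"automation buildout": 60, "capacity planning": 25},
    "mundane":   {"space management": 40, "failed jobs": 30, "refreshes": 45},
}

def workbook_scorecard(hours, targets):
    totals = {cat: sum(tasks.values()) for cat, tasks in hours.items()}
    grand_total = sum(totals.values())
    for category, target_pct in targets.items():
        actual_pct = 100.0 * totals[category] / grand_total
        print(f"{category:10s} target {target_pct}%  actual {actual_pct:.0f}%")

workbook_scorecard(ticket_hours, TARGET_WEIGHTS)
# strategic  target 70%  actual 42%   <- the gap to discuss (and link to variable comp)
# mundane    target 30%  actual 58%
```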

Many companies already have bonus plans for their DBAs and other IT admins. However, they link those plans to areas such as company sales, profits or EBITDA levels. Get away from that! Those areas are not “real” to DBAs. IT admins rarely have direct control over company revenue or spending levels. Such linking, while safer for the company of course (i.e., no revenue/profits, no bonuses), does not serve it well in the long run. It does not influence employee behavior other than telling them to do “whatever it takes” to keep business users happy and the cash register ringing, which in turn promotes reactive work patterns. There is no motivation or time for IT admins to step back and think strategically. But changing bonuses and variable compensation criteria to areas that IT admins can explicitly control – such as sticking to a specific workbook with more onus on strategic behavior – brings about the positive change all managers can revel in, and in turn, better profits for the company.

Happy 2009!

PS => I do have a more formal white-paper on this subject titled “5 Steps to Changing DBA Behavior”. If you are interested, drop me a note at “dbafeedback at stratavia dot com”. Cheers!

Sunday, September 07, 2008

Quantifying DBA Workload and Measuring Automation ROI

I mentioned in a prior blog entry that I would share some insight on objectively measuring DBA workload to determine how many DBAs are needed in a given environment. Recently, I received a comment to that posting (which I’m publishing below verbatim), the response to which prompted me to cover the above topic as well and make good on my word.

Here’s the reader’s comment:
Venkat,
I was the above anonymous poster. I have used Kintana in the past for doing automation of E-Business support tasks. It was very good. The hard part was to put the ROI for the investment.

My only concern about these analysts who write these reports is that they are not MECE (Mutually Exclusive, Collectively exhaustive). They then circulate it to the CIO's who use it to benchmark their staff with out all the facts.

So in your estimate, out of the 40 hours a DBA works in a week (hahaha), how many hours can the RBA save?

The reason I ask is that repetitive tasks take only 10-20% of the DBA's time. Most of the time is spent working on new projects, providing development assistance, identify issues in poorly performing systems and so on. I know this because I have been doing this for the past 14 years.

Also, from the perspective of being proactive versus reactive, let us take two common scenario's. Disk Failure and a craxy workload hijacking the system. The users would know it about the same time you know it too. How would a RBA help there?

Thanks
Mahesh
========

Here’s my response:

Mahesh,

Thanks for your questions. I’m glad you liked the Kintana technology. In fact, if you found that (somewhat antiquated) tool useful, chances are you will absolutely fall in love with some of the newer run book automation (RBA) technologies, specifically database automation products that comply with RBA 2.0 norms like Data Palette. Defining a business case prior to purchasing/deploying new technology is key. Similarly, measuring the ROI gained (say, on a quarterly basis) is equally relevant. Fortunately, both of these can be boiled down to a science with some upfront work. I’m providing a few tips below on how to accomplish this, as I simultaneously address your questions.

Step 1] Identify the # of DBAs in your organization – both onshore and offshore. Multiply that number by the blended average DBA cost. Take for instance, a team of 8 DBAs – all in the US. Assuming the average loaded cost per DBA is $120K/year, we are talking about $960K being spent per year on the DBAs.

Step 2] Understand the average work pattern of the DBAs. At first blush it may seem that only 10-20% of a DBA’s workload is repeatable. My experience reveals otherwise. Some DBAs may say that “most of their time is spent on new projects, providing development assistance, identify issues in poorly performing systems and so on.” But ask yourself, what does that really mean? These are all broad categories. If you get more granular, you will find repeatable task patterns in each of them. For instance, “working on new projects” may involve provisioning new dev/test databases, refreshing schemas with production data, etc. These are repeatable, right? Similarly, “identifying issues in poorly performing systems” may involve a consistent triage/root cause analysis pattern (especially since many senior DBAs tend to have a methodology for dealing with this in their respective environments) and can be boiled down to a series of repeatable steps.

It’s amazing how many of these activities can be streamlined and automated, if the underlying task pattern is identified and mapped out on a whiteboard. Also, I find that rather than asking DBAs “what do you do…”, ticketing systems often reveal a better picture. I recently mined 3 months’ worth of tickets from a Remedy system for an organization with about 14 DBAs (working across Oracle and SQL Server) and the following picture emerged (all percentages are rounded):

- New DB Builds: 210 hours (works out to approx. 3% of overall DBA time)
- Database Refreshes/Cloning: 490 hours (7%)
- Applying Quarterly Patches: 420 hours (6%)
- SQL Server Upgrades (from v2000 to v2005): 280 hours (4%)
- Dealing with failed jobs: 140 hours (2%)
- Space management: 245 hours (3.5%)
- SOX related database audits and remediation: 280 hours (4%)
- … (remaining data truncated for brevity…)

Now you get the picture… When you begin to add up the percentages, it should total 100% (otherwise you have gaps in your data; interview the DBAs to fill those gaps.)
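For reference, here’s how those percentages fall out of the raw data – ticket hours divided by the team’s available hours for the sample period. The 2,000 hours/DBA/year denominator is my assumption (the usual working-hours figure), which lines up with the numbers above:

```python
# How the percentages above are derived: ticket hours divided by the team's
# available hours for the sample period (14 DBAs, ~2,000 hours/DBA/year,
# 3-month sample). The 2,000-hour figure is an assumed working-hours baseline.

DBAS = 14
HOURS_PER_DBA_PER_YEAR = 2000
SAMPLE_MONTHS = 3

available_hours = DBAS * HOURS_PER_DBA_PER_YEAR * SAMPLE_MONTHS / 12   # 7,000 hours

ticket_hours = {
    "New DB builds": 210,
    "Refreshes/cloning": 490,
    "Quarterly patches": 420,
    "SQL Server upgrades": 280,
    "Failed jobs": 140,
    "Space management": 245,
    "SOX audits/remediation": 280,
}

for task, hours in ticket_hours.items():
    print(f"{task:25s} {100 * hours / available_hours:4.1f}%")

covered = sum(ticket_hours.values()) / available_hours
print(f"accounted for so far: {covered:.0%}")   # the gap is the truncated remainder
```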

Step 3] Once I have this data, I pinpoint the top 3 activities - not isolated issues like dealing with disk failure, but the routine tasks that the DBAs need to do multiple times each week, or perhaps each day – like the morning healthcheck, dealing with application faults such as blocked locks, transaction log cleanup, and so on.

Also, as part of this process, if I see that the top items’ description pertains to a category such as “working on new projects…”, I break down the category into a list of tangible tasks such as “provisioning a new database”, “compliance-related scanning and hardening”, etc.

Step 4] Now I’m ready to place each of those top activities under the microscope and begin to estimate how much of it follows a specific pattern. Note that 100% of any task may not be repeatable, but that doesn’t mean it cannot be streamlined across environments and automated - even a 20-30 percent gain per activity has huge value!

Once you add up the efficiency numbers, a good RBA product should yield anywhere from 20 to 40 percent in overall efficiency gains - about $200K to $400K of higher productivity in the case of our example with 8 DBAs - which means they can take on more databases without additional head-count or support the existing databases in a more comprehensive manner (step 5 below). These numbers should be treated as the cornerstone for a solid business case, and for measuring value post-implementation – task by task, process by process.

(Note: The data from Step 2 can also be used to determine how many DBAs you actually need in your environment to handle the day-to-day workload. If you only have the activities listed and not the corresponding DBA time in your ticketing system, no worries… Have one of your mid-level DBAs (someone not too senior, not too junior) assign his/her educated guess to each activity in terms of the number of hours it would take him/her to carry out each of those tasks. Multiply that by the number of times each task is listed in the ticketing system and derive a weekly or monthly total. Multiply that by 52 or 12 to determine the total # of DBA hours expended for those activities per year. Divide that by 2,000 (avg. # of hours/year/DBA) and you have the requisite # of DBAs needed in your environment. Use a large sample-set from the ticketing system (say, a year) to avoid short-term discrepancies. If you don’t have a proper ticketing system, no problem – ask your DBA colleagues to track what they are working on within a spreadsheet for a full week or month. That should give you a starting point to objectively analyze their workload and build the case for automation technology or adding more head-count, or both!)
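Put together, Steps 1, 2 and 4 (plus the note above) reduce to a few lines of arithmetic. The sketch below uses the inputs already mentioned where available; the per-task hours and occurrence counts are hypothetical placeholders for your own ticket data:

```python
# Back-of-the-envelope math from Steps 1 and 4, plus the headcount formula from
# the note above. The DBA count and loaded cost are the ones used in this post;
# the per-task hours and occurrence counts are hypothetical placeholders.

DBAS = 8
LOADED_COST_PER_DBA = 120_000
annual_dba_spend = DBAS * LOADED_COST_PER_DBA                  # $960,000 (Step 1)

low, high = 0.20 * annual_dba_spend, 0.40 * annual_dba_spend   # Step 4's 20-40% range
print(f"productivity recovered: ${low:,.0f} - ${high:,.0f} per year")   # ~ $200K-$400K

def required_dbas(hours_per_task, occurrences_per_year, hours_per_dba_year=2000):
    """Headcount formula from the note: total task hours / 2,000 hours per DBA per year."""
    total = sum(hours_per_task[t] * occurrences_per_year[t] for t in hours_per_task)
    return total / hours_per_dba_year

# hypothetical estimates from a mid-level DBA, and yearly occurrence counts from tickets
hours_per_task = {"db build": 6, "refresh": 8, "quarterly patch": 5}
per_year       = {"db build": 120, "refresh": 200, "quarterly patch": 160}
print(f"DBAs needed for these tasks alone: {required_dbas(hours_per_task, per_year):.1f}")
```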

Step 5] Now sit down with your senior DBAs (your thought-leaders) and identify all the tasks/activities that they would like to do more of to stay ahead of the curve, avoid some of those frequent incidents and make performance more predictable and scalable - activities such as more capacity planning, proactive maintenance and healthchecks, more architecture/design work, working more closely with Development to avoid some of those resource-intensive SQL statements, using features/capabilities in newer DBMS versions, auditing backups and DR sites more thoroughly, defining standard DBA workbooks, etc. Also specify how that will help the business – in terms of reduced performance glitches and service interruptions, fewer war-room sessions and higher uptime, better statutory compliance and so on. The value-add from increased productivity should become part of the business case. One of my articles talks more about where the additional time (gained via automation) can be fruitfully expended to increase operational excellence.

My point here is, there’s no such thing in IT as a “one time activity”. When you go granular and start looking at end-to-end processes, you see a lot of commonalities. And then all you have to do is summarize the data, crunch the numbers, and boom - you get the true ROI potential nailed! Sounds simple, huh? It truly is.

Last but not least, regarding your two examples: “disk failure” and “a crazy workload hijacking the system” – those may not necessarily be the best examples to start with when you begin to build an automation efficiency model. You need to go with the 80-20 rule - start with the 20% of the task patterns that take up 80% of your time. You refer to your use cases as “common scenarios”, but I’m sure you don’t have the failed disk problem occurring too frequently. If these issues do happen frequently in your environment (at least in the short term) and you have no control over them, other than reacting in a certain way, then as Step 3 suggests, let’s drill into how you react to them. That’s the process you can streamline and automate.

Let me use the “crazy workload” example to expound further. Say I’m the DBA working the early Monday morning shift and I get a call (or a ticket) from Help Desk stating that a user is complaining about “slow performance”. So I (more or less) carry out the following steps:

1. Identify which application is it (billing, web, SAP, Oracle Financials, data warehouse, etc.)
2. Identify all the tiers associated with it (web server, app server, clustered DB nodes, etc.)
3. Evaluate what kind of DB the app is using (say, a 2-node Oracle 10.2 RAC)
4. Run a healthcheck on the DB server (check on CPU levels, free memory, swapping/paging, network traffic, disk space, top process list, etc.) to see if anything is amiss
5. Run a healthcheck on the DB and all the instances (sessions, SQL statements, wait events, alert-log errors, etc.)
6. If everything looks alright from steps 4 and 5, I update the ticket to state the database looks fine and reassign the ticket to another team (sys admin, SAN admin, web admin team, or even back to the Help Desk for further analysis of the remaining tiers in the application stack).
7. If I see a process consuming copious amounts of I/O or CPU on a DB server, I check to see if it’s a DB-related process or some ad-hoc process a sys admin has kicked off (say, an ad-hoc backup in the middle of the day!). If it’s a database process, I check and see what that session is doing inside the database – running a SQL statement or waiting on a resource, etc. Based on what I see, I may take additional steps such as run an OS or DB trace on it – until I eliminate a bunch of suspects and narrow down the root cause. Once I ascertain symptoms and the cause, I may kill the offending session to alleviate the issue and get things back to normal - if it’s a known issue (and I have pre-approval to kill it). If I can’t resolve it then and there, I may gather the relevant stats, update the ticket and reassign it to the group that has the authority to deal with it.

As the above example shows, many of the steps above (specifically, 1 to 7) can be modeled as a “standard operating procedure” and automated. If the issue identified is a known problem, you can build a rule in the RBA product (assuming the product supports RBA 2.0 norms) to pinpoint the problem signature and link it to a workflow that will apply the pre-defined fix, along with updating/closing the ticket. If the problem is not a known issue, the workflow can just carry out steps 1 to 7, update the ticket with relevant details there and assign it to the right person or team. Now I don’t need to do these steps manually every time I get a call stating “there seems to be a performance problem in the database…” and more importantly, if it’s truly a database problem, I can now deal with the problem even before the end user experiences it and calls the Help Desk.
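To make that concrete, here’s a bare-bones runbook skeleton for the triage above. Every check returns canned data (a real workflow would shell out to the server and query the database), and the known-issue table and ticket actions are illustrative:

```python
# A skeleton of the triage steps above as an automated runbook. The checks are
# stand-ins returning canned data; the point is the shape: gather facts, match a
# known signature, fix or escalate -- and update the ticket either way.

def server_healthcheck(host):          # step 4 stand-in: CPU, memory, disk, top processes
    return {"cpu_pct": 95, "top_process": "ora_p003", "disk_ok": True}

def database_healthcheck(db):          # step 5 stand-in: sessions, waits, alert log
    return {"blocking_sessions": [], "runaway_sql": "SELECT /* adhoc */ ..."}

KNOWN_ISSUES = {"SELECT /* adhoc */ ...": "kill_session"}   # pre-approved fixes

def triage(ticket):
    host, db = ticket["db_host"], ticket["database"]        # steps 1-3 assumed resolved
    server = server_healthcheck(host)
    database = database_healthcheck(db)
    if server["cpu_pct"] < 80 and not database["runaway_sql"]:
        return ("reassign", "help_desk", "DB tier looks healthy")      # step 6
    fix = KNOWN_ISSUES.get(database["runaway_sql"])                    # step 7
    if fix == "kill_session":
        return ("close", "dba", f"killed session running {database['runaway_sql']}")
    return ("reassign", "dba_oncall", "diagnostics attached, manual review needed")

print(triage({"db_host": "findb01", "database": "FINPRD"}))
```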

In certain other situations, when Help Desk gets a phone call about performance issues, they can execute the same triage workflow and either have a ticket created/assigned automatically or solve the issue at their level if appropriate. This kind of remediation avoids the need for further escalation of the issue and in many cases, avoids incorrect escalations from the Help Desk (how many times have you been paged for a performance problem that’s not caused by the database?). If the problem cannot be automatically remediated by the DBA (e.g., failed disk), the workflow can open a ticket and assign it to the Sys Admin or Storage team.

This kind of scenario not only empowers the Help Desk and lets them be more effective, but also reduces the workload for Tier 2/3 admin staff. Last but not least, it reduces a significant amount of false positive alerts that the DBAs have to deal with. In one recent example, the automation deployment team I was working with helped a customer’s DBA team go from over 2,000 replication-related alerts a month (80% of them were false positives, but needed to be looked at and triaged anyway…) to just over 400. I don’t know about you, but to me, that’s gold!

One final thing: this may sound somewhat Zen, but do look at an automation project as an ongoing journey. By automating 2 or 3 processes, you may not necessarily get all the value you can. Start with your top 2 or 3 processes, but once those are automated, audit the results, measure the value and then move on to the next top 2 or 3 activities. Continue this cycle until the law of diminishing returns kicks in (usually that involves 4-5 cycles) and I guarantee your higher-ups and your end-users alike will love the results. (More on that in this whitepaper.)

Wednesday, September 03, 2008

Clouds, Private Clouds and Data Center Automation

As part of Pacific Crest’s Mosaic Expert team, I had the opportunity to attend their annual Technology Leadership Forum in Vail last month. I participated in half-a-dozen panels and was fortunate to meet with several contributors in the technology research and investment arena. Three things seemed to rank high on everyone’s agenda: cloud computing and its twin enablers - virtualization and data center automation. The cloud juggernaut is making everyone want a piece of the action – investors want to invest in the next big cloud (pun intended!), researchers want to learn about it and CIOs would like to know when and how to best leverage it.

Interestingly, even “old-world” hosting vendors like Savvis and Rackspace are repurposing their capabilities to become cloud computing providers. In a similar vein, InformationWeek recently reported that some of the telecom behemoths with excess data center capacity, like AT&T and Verizon, have jumped into the fray with Synaptic Hosting and Computing as a Service - their respective cloud offerings. And to add to the mix, terms such as private clouds are floating around to refer to organizations that are applying SOA concepts to data center management, making server, storage and application resources available as a service for users, project teams and other IT customers to leverage (complete with resource metering and billing) – all behind the corporate firewall.

As already stated in numerous publications, there are obvious concerns around data security, compliance, performance and uptime predictability. But the real question seems to be: what makes an effective cloud provider?

Google’s Dave Girouard was a keynote presenter at Pacific Crest and he touched upon some of the challenges facing Google as they opened up their Google Apps offering in the cloud. In spite of pouring hundreds of millions of dollars into cloud infrastructure, they are still grappling with stability concerns. It appears that the size of the company and the type of cloud (public or private) are less relevant than the technology components and corresponding administrative capabilities behind the cloud architecture.

Take another example: Amazon. They are one of the earliest entrants to cloud computing and have the broadest portfolio of services in this space. Their AWS (Amazon Web Services) offering includes storage, queuing, database and a payment gateway in addition to core computing resources. Similar to Google, they have invested millions of dollars, yet are prone to outages.

In my opinion, while concerns over privacy, compliance and data security are legitimate and will always remain, the immediate issue is around scalability and predictability of performance and uptime. Clouds are being touted as a good way for smaller businesses and startups to gain resources, as well as for businesses with cyclical resource needs (e.g., retail) to gain incremental resources at short notice. I believe the current crop of larger cloud computing providers such as Amazon, Microsoft and Google can do a way better job with compliance and data security than the average startup/small business. (Sure, users and CIOs need to weigh their individual risk versus upside prior to using a particular cloud provider.) However, for those businesses that rely on the cloud for their bread-and-butter operations, whether cyclical or year-round, uptime and performance considerations are crucial. If the service is not up, they don’t have a business.

Providing predictable uptime and performance always boils down to a handful of areas. If provisioned and managed correctly, cloud computing has the potential to be used as the basis for real-time business (rather than being relegated to the status of backup/DR infrastructure.) But the key questions that CIOs need to ask their vendors are: what is behind the so-called cloud architecture? How stable is that technology? How many moving parts does it have? Can the vendor provide component-level SLAs and visibility? As providers like AT&T and Verizon enter the fray, they can learn a lot from Amazon and Google’s recent snafus and leverage technologies that simplify the environment and enable it to operate in lights-out mode – the difference between a reliable cloud offering and one that’s prone to failures.

The challenge however, as Om Malik points out on his GigaOm blog, is that much of cloud computing infrastructure is fragile because providers are still using technologies built for a much less strenuous web. Data centers are still being managed with a significant amount of manual labor. “Standards” merely imply processes documented across reams of paper and plugged into Sharepoint-type portals. No doubt, people are trained to use these standards. But documentation and training don’t always account for those operators being plain forgetful, or sick, on vacation or leaving the company and being replaced (temporarily or permanently) with other people who may not have the same operating context within the environment. Analyst studies frequently refer to the fact that over 80% of outages are due to human errors.

The problem is, many providers, while issuing weekly press releases proclaiming their new cloud capabilities, haven’t really transitioned their data center management from manual to automated. They may have embraced virtualization technologies like VMware and Hyper-V, but they are still grappling with the same old methods combined with some very hard-working and talented people. Virtualization makes deployment fast and easy, but it also significantly increases the workload for the team that’s managing that new asset behind the scenes. Because virtual components are so much easier to deploy, the result is server and application sprawl, and demands for work activities such as maintenance, compliance, security, incident management and service request management go through the roof. Companies (including the well-funded cloud providers) do not have the luxury of indefinitely adding head-count, nor is throwing more bodies at the problem always a good idea. They need to examine each layer in the IT stack and evaluate it for cloud readiness. They need to leverage the right technology to manage that asset throughout its lifecycle in lights-out mode – right from provisioning to upgrades and migrations, and everything in between.

That’s where data center automation comes in. Data center automation technologies have been around now for almost as long as virtualization and are proven to have the kind of maturity required for reliable lights-out automation. Data center automation products from companies such as HP (on the server, storage and network levels) and Stratavia (on the server, database and application levels) make a compelling case for marrying both physical and virtual assets behind the cloud with automation to enable dynamic provisioning and post-provisioning life-cycle management with reduced errors and stress on human operators.

Data center automation is a vital component of cloud computing enablement. Unfortunately, service providers (internal or external) that make the leap from antiquated assets to virtualization to the cloud without proper planning and deployment of automation technologies tend to provide patchy services, giving the cloud model a bad name. Think about it… Why can some providers offer dynamic provisioning and real-time error/incident remediation in the cloud, while others can’t? How can some providers be agile in getting assets online and keeping them healthy, while others falter (or don’t even talk about it)? Why do some providers do a great job with offering server cycles or storage space in the cloud, but a lousy job with databases and applications? The difference is, well-designed and well-implemented data center automation - at every layer across the infrastructure stack.

Wednesday, July 09, 2008

So, what's your “Database to DBA” Ratio?

The “Database to DBA” ratio is a popular metric for measuring DBA efficiency in companies. (Similarly, in the case of other IT admins, the corresponding “managed asset to admin” ratio (such as, "Servers to SA" ratio in the case of systems administrators) seems to be of interest.) What does such a metric really mean? Ever so often, I come across IT Managers bragging that they have a ratio of “50 DB instances to 1 DBA” or “80 DBs to 1 DBA”... -- Is that supposed to be good? And conversely, is a lower ratio such as “5 to 1” necessarily bad? Compared to what? In response, I get back vague assertions such as “well, the average in the database industry seems to be “20 to 1”. Yeah? Sez who??

Even if such a universal metric existed in the database or general IT arena, would it have any validity? A single DBA may be responsible for a hundred databases. But maybe 99 of those databases are generally “quiet” and never require much attention. Other than the daily backups, they pretty much run by themselves. But the remaining database could be a monster and may consume copious amounts of the DBA's time. In such a case, what is the true database to DBA ratio? Is it really 100 to 1 or is it merely 1 to 1? Given such scenarios, what is the true effectiveness of a DBA?

The reality is, a unidimensional *and* subjective ratio, based on so-called industry best practices, never reveals the entire picture. A better method (albeit also subjective) to evaluate and improve DBA effectiveness would be to establish the current productivity level ("PL") as a baseline, initiate ways to enhance it and carry out comparisons on an ongoing basis against this baseline. Cross-industry comparisons seldom make sense, however the PL from other high-performing IT groups in similar companies/industries may serve as a decent benchmark.

Let's take a moment to understand the key factors that should shape the PL. In this regard, an excellent paper titled “Ten Factors Affect DBA Staffing Requirements” written by two Gartner analysts, Ed Holub and Ray Paquet, comes to mind. Based somewhat on that paper, I’m listing below a few key areas that typically influence your PL:

1. Rate of change (in the environment as indicated by new rollouts, app/DDL changes, etc.)
2. Service level requirements
3. Scope of DBA services (do the DBAs have specific workbooks, or are the responsibilities informal)
4. # of databases under management
5. Database sizes
6. Data growth rate
7. Staff skills levels
8. Process maturity (are there well-defined standard operating procedures for common areas such as database installation, configuration, compliance, security, maintenance and health-checks)
9. Tools standardization
10. Automation levels

In my mind, these factors are most indicative of the overall complexity of a given environment. Now let’s figure out this PL model together. Assign a score from 1 (low) to 10 (high) in each of the above areas as it pertains to *your* environment. Go on, take an educated guess.

Areas 1 to 6 form what I call the Environmental Complexity Score. Areas 7 to 10 form the Delivery Maturity Score. Now lay out an X-Y Line graph with the former plotted on the Y-axis and the latter plotted on the X-axis.

Your PL depends on where you land. If you picture the X-Y chart as comprising 4 quadrants (left top, left bottom, right top and right bottom), the left top is "Bad", the left bottom is "Mediocre", the right top is "Good" and the right bottom is "Excellent".


Bad indicates that your environment complexity is relatively high, but the corresponding delivery maturity is low. Mediocre indicates that your delivery maturity is low, but since the environment complexity is also relatively low, it may not be a huge issue. Such environments probably don't see issues crop up frequently and there is no compelling need to improve delivery maturity. Good indicates that your environmental complexity is high, but so is your delivery maturity. Excellent indicates that your delivery maturity is high even with the environment complexity being low. That means you are truly geared to maintain service levels even if the environment gets more complex in the future.
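If it helps, here’s the quadrant assignment spelled out in a few lines. The 1-10 scale is from the list above; using 5 as the split point between quadrants is my own assumption:

```python
# The quadrant logic above, spelled out. Scores use the 1-10 scale described
# earlier; areas 1-6 feed Environmental Complexity, areas 7-10 feed Delivery
# Maturity. Using 5 as the midpoint to split the quadrants is an assumption.

def productivity_level(scores):
    """scores: dict mapping area number (1..10) to a 1-10 score."""
    complexity = sum(scores[i] for i in range(1, 7)) / 6      # Environmental Complexity (Y-axis)
    maturity = sum(scores[i] for i in range(7, 11)) / 4       # Delivery Maturity (X-axis)
    if maturity < 5:
        return "Bad" if complexity >= 5 else "Mediocre"       # left top / left bottom
    return "Good" if complexity >= 5 else "Excellent"         # right top / right bottom

example = {1: 8, 2: 7, 3: 6, 4: 9, 5: 7, 6: 6,   # a fairly complex environment...
           7: 4, 8: 3, 9: 5, 10: 2}              # ...with immature delivery
print(productivity_level(example))               # -> "Bad" (left-top quadrant)
```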

Another thing that Excellent may denote is that your delivery maturity helps keep environment complexity low. For instance, higher delivery maturity may enable your team to be more proactive and business-driven – actively implementing server/db consolidation initiatives to keep server or database counts low. Or the team may be able to actively implement robust data archival and pruning mechanisms to keep overall database sizes constant even in the face of high data growth rates.


So, now you have a Productivity Level that provides a simplistic, yet comprehensive indication of your team's productivity, as opposed to the age-old "databases to DBA" measure. Also, by actively addressing the areas that make up Delivery Maturity, you have the opportunity to enhance your PL.

But this PL is still subjective. If you would like to have a more objective index around your team's productivity and more accurately answer the question "how many DBAs do I need today?", there is also a way to accomplish that. But more on that in a future blog.