Sunday, September 07, 2008

Quantifying DBA Workload and Measuring Automation ROI

I mentioned in a prior blog entry that I would share some insight on objectively measuring DBA workload to determine how many DBAs are needed in a given environment. Recently, I received a comment on that posting (which I’m publishing below verbatim), and responding to it prompted me to cover the above topic as well and make good on my word.

Here’s the reader’s comment:
Venkat,
I was the above anonymous poster. I have used Kintana in the past for doing automation of E-Business support tasks. It was very good. The hard part was to put the ROI for the investment.

My only concern about these analysts who write these reports is that they are not MECE (Mutually Exclusive, Collectively Exhaustive). They then circulate it to the CIOs, who use it to benchmark their staff without all the facts.

So in your estimate, out of the 40 hours a DBA works in a week (hahaha), how many hours can the RBA save?

The reason I ask is that repetitive tasks take only 10-20% of the DBA's time. Most of the time is spent working on new projects, providing development assistance, identifying issues in poorly performing systems and so on. I know this because I have been doing this for the past 14 years.

Also, from the perspective of being proactive versus reactive, let us take two common scenarios: disk failure and a crazy workload hijacking the system. The users would know about it at about the same time you do. How would an RBA help there?

Thanks
Mahesh
========

Here’s my response:

Mahesh,

Thanks for your questions. I’m glad you liked the Kintana technology. In fact, if you found that (somewhat antiquated) tool useful, chances are you will absolutely fall in love with some of the newer run book automation (RBA) technologies, specifically database automation products that comply with RBA 2.0 norms, like Data Palette. Defining a business case prior to purchasing/deploying new technology is key. Similarly, measuring the ROI gained (say, on a quarterly basis) is equally relevant. Fortunately, both of these can be boiled down to a science with some upfront work. I’m providing a few tips below on how to accomplish this, as I simultaneously address your questions.

Step 1] Identify the # of DBAs in your organization – both onshore and offshore. Multiply that number by the blended average DBA cost. Take, for instance, a team of 8 DBAs – all in the US. Assuming the average loaded cost per DBA is $120K/year, we are talking about $960K being spent per year on the DBAs.
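
If you prefer a script to a spreadsheet, the Step 1 arithmetic is a one-liner. Here’s a minimal sketch in Python, using the example figures above (substitute your own head-count and blended cost):

# Step 1: annual spend on the DBA team (example figures from above)
num_dbas = 8                   # onshore + offshore head-count
loaded_cost_per_dba = 120000   # average loaded cost per DBA per year, in USD

annual_dba_spend = num_dbas * loaded_cost_per_dba
print(f"Annual DBA spend: ${annual_dba_spend:,}")   # -> $960,000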

Step 2] Understand the average work pattern of the DBAs. At first blush it may seem that only 10-20% of a DBA’s workload is repeatable. My experience reveals otherwise. Some DBAs may say that “most of their time is spent on new projects, providing development assistance, identifying issues in poorly performing systems and so on.” But ask yourself, what does that really mean? These are all broad categories. If you get more granular, you will find repeatable task patterns in each of them. For instance, “working on new projects” may involve provisioning new dev/test databases, refreshing schemas with production data, etc. These are repeatable, right? Similarly, “identifying issues in poorly performing systems” may involve a consistent triage/root cause analysis pattern (many senior DBAs, in particular, tend to have a methodology for dealing with this in their respective environments) and can be boiled down to a series of repeatable steps.

It’s amazing how many of these activities can be streamlined and automated, if the underlying task pattern is identified and mapped out on a whiteboard. Also, I find that rather than asking DBAs “what do you do…”, ticketing systems often reveal a better picture. I recently mined 3 months’ worth of tickets from a Remedy system for an organization with about 14 DBAs (working across Oracle and SQL Server) and the following picture emerged (all percentages are rounded):

- New DB Builds: 210 hours (works out to approx. 3% of overall DBA time)
- Database Refreshes/Cloning: 490 hours (7%)
- Applying Quarterly Patches: 420 hours (6%)
- SQL Server Upgrades (from v2000 to v2005): 280 hours (4%)
- Dealing with failed jobs: 140 hours (2%)
- Space management: 245 hours (3.5%)
- SOX-related database audits and remediation: 280 hours (4%)
- … (remaining data truncated for brevity…)

Now you get the picture… When you begin to add up the percentages, they should total 100% (otherwise you have gaps in your data; interview the DBAs to fill those gaps).
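
If you want to do the same kind of mining yourself, the number-crunching is the easy part. Here’s a minimal sketch, assuming you’ve already exported the tickets to a CSV – the file name and the "category"/"hours" column names are hypothetical, and the 500 hours per DBA per quarter is simply the 2,000-hours-a-year figure (used later in this post) divided by four. The ranked output also hands you the “top activities” list that Step 3 below starts from:

# Sketch: summarize ticket hours by task category and rank them.
# Assumes a CSV export from the ticketing system with (hypothetical)
# "category" and "hours" columns -- adjust names to match your export.
import csv
from collections import defaultdict

TOTAL_DBA_HOURS = 14 * 500   # 14 DBAs x ~500 hours each per quarter

hours_by_category = defaultdict(float)
with open("tickets_q1.csv", newline="") as f:
    for row in csv.DictReader(f):
        hours_by_category[row["category"]] += float(row["hours"])

covered = sum(hours_by_category.values())
print(f"Ticket data covers {covered / TOTAL_DBA_HOURS:.0%} of DBA time")

# Rank categories -- the top handful become the candidates for Step 3.
for category, hours in sorted(hours_by_category.items(),
                              key=lambda kv: kv[1], reverse=True):
    print(f"{category:40s} {hours:6.0f} h  ({hours / TOTAL_DBA_HOURS:.1%})")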

Step 3] Once I have this data, I pinpoint the top 3 activities - not isolated issues like dealing with disk failure, but the routine tasks that the DBAs need to do multiple times each week, or perhaps each day – like the morning healthcheck, dealing with application faults such as blocking locks, transaction log cleanup, and so on.

Also, as part of this process, if a top item’s description pertains to a broad category such as “working on new projects…”, I break the category down into a list of tangible tasks such as “provisioning a new database”, “compliance-related scanning and hardening”, etc.

Step 4] Now I’m ready to place each of those top activities under the microscope and begin to estimate how much of it follows a specific pattern. Note that not every task will be 100% repeatable, but that doesn’t mean it cannot be streamlined across environments and automated - even a 20-30 percent gain per activity has huge value!

Once you add up the efficiency numbers, a good RBA product should allow you to gain anywhere from 20 to 40 percent in overall efficiency - about $200K to $400K of higher productivity in the case of our example with 8 DBAs - which means they can take on more databases without additional head-count, or support the existing databases in a more comprehensive manner (Step 5 below). These numbers should be treated as the cornerstone of a solid business case, and for measuring value post-implementation – task by task, process by process.
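
If you want to see how those percentages become the dollar figure, here’s a sketch of the same math. The annual hours are just the quarterly ticket figures above multiplied by four; the “automatable” fractions are purely illustrative placeholders for your own Step 4 estimates:

# Sketch: estimate annual hours and dollars recovered through automation.
# Annual hours = quarterly figures from the ticket data above x 4;
# the "automatable" fractions are illustrative Step 4 estimates only.
HOURLY_RATE = 120000 / 2000          # loaded cost per DBA-hour (~$60)

tasks = {
    # task:                       (annual hours, automatable fraction)
    "Database refreshes/cloning":  (1960, 0.70),
    "Applying quarterly patches":  (1680, 0.60),
    "Space management":            (980,  0.50),
}

saved_hours = sum(hours * frac for hours, frac in tasks.values())
print(f"Hours recovered per year: {saved_hours:,.0f}")
print(f"Productivity value:       ${saved_hours * HOURLY_RATE:,.0f}")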

(Note: The data from Step 2 can also be used to determine how many DBAs you actually need in your environment to handle the day-to-day workload. If you only have the activities listed and not the corresponding DBA time in your ticketing system, no worries… Have one of your mid-level DBAs (someone not too senior, not too junior) assign an educated guess to each activity in terms of the number of hours it would take him/her to carry it out. Multiply that by the number of times each task is listed in the ticketing system and derive a weekly or monthly total. Multiply that by 52 or 12 to determine the total # of DBA hours expended for those activities per year. Divide that by 2,000 (avg. # of hours/year/DBA) and you have the # of DBAs needed in your environment. Use a large sample-set from the ticketing system (say, a year) to avoid short-term discrepancies. If you don’t have a proper ticketing system, no problem – ask your DBA colleagues to track what they are working on within a spreadsheet for a full week or month. That should give you a starting point to objectively analyze their workload and build the case for automation technology, more head-count, or both!)
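
Here’s that back-of-the-envelope head-count math in script form as well – the task names, occurrence counts and per-occurrence hour estimates are hypothetical stand-ins for whatever your ticketing system and your mid-level DBA actually give you:

# Sketch: derive required DBA head-count from annual task counts and a
# mid-level DBA's per-occurrence hour estimates (all inputs hypothetical;
# only a few tasks shown -- extend with the full list from your tickets).
HOURS_PER_DBA_PER_YEAR = 2000

# task: (occurrences per year, estimated hours per occurrence)
workload = {
    "New DB build":       (40,  6.0),
    "Refresh/clone":      (480, 4.0),
    "Quarterly patching": (48,  9.0),
    "Failed-job triage":  (600, 0.5),
    "Space management":   (500, 2.0),
}

annual_hours = sum(count * hrs for count, hrs in workload.values())
dbas_needed = annual_hours / HOURS_PER_DBA_PER_YEAR
print(f"Annual workload: {annual_hours:,.0f} DBA-hours "
      f"-> {dbas_needed:.1f} DBAs needed for these tasks alone")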

Step 5] Now sit down with your senior DBAs (your thought-leaders) and identify all the tasks/activities that they would like to do more of, to stay ahead of the curve, avoid some of those frequent incidents and make performance more predictable and scalable - activities such as more capacity planning, proactive maintenance and healthchecks, more architecture/design work, working more closely with Development to avoid some of those resource-intensive SQL statements, using features/capabilities in newer DBMS versions, auditing backups and DR sites more thoroughly, defining standard DBA workbooks, etc. Also specify how that will help the business – in terms of reduced performance glitches and service interruptions, fewer war-room sessions and higher uptime, better statutory compliance and so on. The value-add from increased productivity should become part of the business case. One of my articles talks more about where the additional time (gained via automation) can be fruitfully expended to increase operational excellence.

My point here is, there’s no such thing in IT as a “one-time activity”. When you go granular and start looking at end-to-end processes, you see a lot of commonalities. And then all you have to do is summarize the data, crunch the numbers, and boom - you have the true ROI potential nailed! Sounds simple, huh? It truly is.

Last but not least, regarding your two examples: “disk failure” and “a crazy workload hijacking the system” – those may not necessarily be the best examples to start with when you begin to build an automation efficiency model. You need to go with the 80-20 rule - start with the 20% of the task patterns that take up 80% of your time. You refer to your use cases as “common scenarios”, but I’m sure you don’t have the failed-disk problem occurring too frequently. If these issues do happen frequently in your environment (at least in the short term) and you have no control over them, other than reacting in a certain way, then as Step 3 suggests, let’s drill into how you react to them. That’s the process you can streamline and automate.

Let me use the “crazy workload” example to expound further. Say I’m the DBA working the early Monday morning shift and I get a call (or a ticket) from Help Desk stating that a user is complaining about “slow performance”. So I (more or less) carry out the following steps:

1. Identify which application it is (billing, web, SAP, Oracle Financials, data warehouse, etc.)
2. Identify all the tiers associated with it (web server, app server, clustered DB nodes, etc.)
3. Evaluate what kind of DB the app is using (say, a 2-node Oracle 10.2 RAC)
4. Run a healthcheck on the DB server (check on CPU levels, free memory, swapping/paging, network traffic, disk space, top process list, etc.) to see if anything is amiss
5. Run a healthcheck on the DB and all the instances (sessions, SQL statements, wait events, alert-log errors, etc.)
6. If everything looks alright from steps 4 and 5, I update the ticket to state the database looks fine and reassign the ticket to another team (sys admin, SAN admin, web admin team, or even back to the Help Desk for further analysis of the remaining tiers in the application stack).
7. If I see a process consuming copious amounts of I/O or CPU on a DB server, I check to see if it’s a DB-related process or some ad-hoc process a sys admin has kicked off (say, an ad-hoc backup in the middle of the day!). If it’s a database process, I check what that session is doing inside the database – running a SQL statement or waiting on a resource, etc. Based on what I see, I may take additional steps such as running an OS or DB trace on it – until I eliminate a bunch of suspects and narrow down the root cause. Once I ascertain the symptoms and the cause, I may kill the offending session to alleviate the issue and get things back to normal - if it’s a known issue (and I have pre-approval to kill it). If I can’t resolve it then and there, I may gather the relevant stats, update the ticket and reassign it to the group that has the authority to deal with it.

As the above example shows, the steps above (1 to 7) can be modeled as a “standard operating procedure” and automated. If the issue identified is a known problem, you can build a rule in the RBA product (assuming the product supports RBA 2.0 norms) to pinpoint the problem signature and link it to a workflow that will apply the pre-defined fix, along with updating/closing the ticket. If the problem is not a known issue, the workflow can just carry out steps 1 to 7, update the ticket with the relevant details and assign it to the right person or team. Now I don’t need to do these steps manually every time I get a call stating “there seems to be a performance problem in the database…” and more importantly, if it’s truly a database problem, I can now deal with it even before the end user experiences it and calls the Help Desk.
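
To make the “standard operating procedure” idea concrete, here is a bare-bones sketch of steps 1 to 7 as a scripted workflow. It is deliberately generic: the helper functions (identify_application, db_server_healthcheck and so on) are placeholders I made up for whatever your monitoring/RBA tooling exposes, not a real product API, and they return canned data so the sketch is self-contained:

# Sketch: steps 1-7 of the triage SOP as a scripted workflow.
# The helpers below are placeholders, not a real product API.

def identify_application(ticket):
    return ticket.get("application", "unknown")            # step 1

def map_application_tiers(app):
    return ["web server", "app server", "database"]        # step 2 (placeholder)

def identify_database(app):
    return "2-node Oracle 10.2 RAC"                        # step 3 (placeholder)

def db_server_healthcheck():
    # step 4: CPU, memory, swap, disk space, top processes (canned values)
    return {"cpu_pct": 95, "top_process": {"name": "oracle_p123", "is_db": True}}

def db_healthcheck():
    # step 5: sessions, SQL, wait events, alert-log errors (canned values)
    return {"alert_log_errors": [], "blocking_sessions": ["SID 42"]}

def triage(ticket):
    app = identify_application(ticket)
    tiers = map_application_tiers(app)
    db = identify_database(app)
    os_stats = db_server_healthcheck()
    db_stats = db_healthcheck()

    healthy = (os_stats["cpu_pct"] < 80
               and not db_stats["blocking_sessions"]
               and not db_stats["alert_log_errors"])
    if healthy:
        # step 6: database looks fine -- update and reassign the ticket
        return "reassign", f"{db} for {app} looks healthy; passing to sysadmin/web team"

    culprit = os_stats["top_process"]
    if not culprit["is_db"]:
        return "reassign", f"Non-DB process {culprit['name']} is hogging the host"

    # step 7: match against known problem signatures here; with pre-approval
    # the workflow could remediate (e.g. kill the offending session) and close
    # the ticket, otherwise attach the gathered stats and escalate.
    return "remediate-or-escalate", f"DB process {culprit['name']} under investigation"

if __name__ == "__main__":
    action, note = triage({"application": "billing"})
    print(action, "-", note)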

In certain other situations, when the Help Desk gets a phone call about performance issues, they can execute the same triage workflow and either have a ticket created/assigned automatically or solve the issue at their level if appropriate. This kind of remediation avoids the need for further escalation of the issue and, in many cases, avoids incorrect escalations from the Help Desk (how many times have you been paged for a performance problem that’s not caused by the database?). If the problem cannot be remediated at the DBA’s level (e.g., a failed disk), the workflow can open a ticket and assign it to the Sys Admin or Storage team.

This kind of scenario not only empowers the Help Desk and lets them be more effective, but also reduces the workload for Tier 2/3 admin staff. Last but not least, it cuts down significantly on the false-positive alerts that the DBAs have to deal with. In one recent example, the automation deployment team I was working with helped a customer’s DBA team go from over 2,000 replication-related alerts a month (80% of them were false positives, but needed to be looked at and triaged anyway…) to just over 400. I don’t know about you, but to me, that’s gold!

One final thing: this may sound somewhat Zen, but do look at an automation project as an ongoing journey. By automating 2 or 3 processes, you may not necessarily get all the value you can. Start with your top 2 or 3 processes, but once those are automated, audit the results, measure the value and then move on to the next top 2 or 3 activities. Continue this cycle until the law of diminishing returns kicks in (usually that involves 4-5 cycles) and I guarantee your higher-ups and your end-users alike will love the results. (More on that in this whitepaper.)

2 comments:

Anonymous said...

Venkat
Got it. This is Activity-Based Costing 101. I did that when I managed my team. I have some interesting spreadsheets on this. All said and done, documentation of work by the DBAs and the amount of time spent on the work is crucial. "Stuff" has to be defined.

The amount of "stuff" will depend significantly on the rate of change and the maturity of the development organization and the Gartner Report does not mention that.

From a performance-monitoring perspective, DBAs would be well served if they adopted the methods of Hotsos (http://www.hotsos.com/).

Don't waste time looking at Buffer Cache Hit Ratios.

Mahesh

Ashutosh said...

Hi Venkat

Your post was bang on!!! Having managed large infrastructure programs, I always push my teams to get to this level of activity break-up to see how we can continuously improve service levels and reduce effort.

One of the things I am passionate about is doing the 80-20 problem-management analysis to identify the top 3 issues which cause incidents in each service tower. Cull them, analyze them and see how we can avoid them or fix them next time with less effort.

I envision the data center of the future where 80% of all tasks will be automated and there would be a small team to oversee the exceptions that come up. It is a journey to that destination and we are surely getting there, with some of the RBA products gradually getting into the mainstream. One reason for slow adoption may be (and you can comment on this better) that it ultimately falls to the mid-level or senior tech leads to analyze and recommend this technology. Many of these tools would make them redundant or reduce their "team size", and so may not be attractive for reasons other than the purely technical ones that look good on paper.