Wednesday, July 09, 2008

So, what's your “Database to DBA” Ratio?

The “Database to DBA” ratio is a popular metric for measuring DBA efficiency in companies. (Similarly, in the case of other IT admins, the corresponding “managed asset to admin” ratio (such as, "Servers to SA" ratio in the case of systems administrators) seems to be of interest.) What does such a metric really mean? Ever so often, I come across IT Managers bragging that they have a ratio of “50 DB instances to 1 DBA” or “80 DBs to 1 DBA”... -- Is that supposed to be good? And conversely, is a lower ratio such as “5 to 1” necessarily bad? Compared to what? In response, I get back vague assertions such as “well, the average in the database industry seems to be “20 to 1”. Yeah? Sez who??

Even if such a universal metric existed in the database or general IT arena, would it have any validity? A single DBA may be responsible for a hundred databases. But maybe 99 of those databases are generally “quiet” and never require much attention. Other than the daily backups, they pretty much run by themselves. But the remaining database could be a monster and may consume copious amounts of the DBA's time. In such a case, what is the true database to DBA ratio? Is it really 100 to 1 or is it merely 1 to 1? Given such scenarios, what is the true effectiveness of a DBA?

The reality is, a unidimensional *and* subjective ratio, based on so-called industry best practices, never reveals the entire picture. A better method (albeit also subjective) to evaluate and improve DBA effectiveness would be to establish the current productivity level ("PL") as a baseline, initiate ways to enhance it and carry out comparisons on an ongoing basis against this baseline. Cross-industry comparisons seldom make sense, however the PL from other high-performing IT groups in similar companies/industries may serve as a decent benchmark.

Let's take a moment to understand the key factors that should shape the PL. In this regard, an excellent paper titled “Ten Factors Affect DBA Staffing Requirements” written by two Gartner analysts, Ed Holub and Ray Paquet comes to mind. Based somewhat on that paper, I’m listing below a few key areas that typically influence your PL:

1. Rate of change (in the environment as indicated by new rollouts, app/DDL changes, etc.)
2. Service level requirements
3. Scope of DBA services (do the DBAs have specific workbooks, or are the responsibilities informal)
4. # of databases under management
5. Database sizes
6. Data growth rate
7. Staff skills levels
8. Process maturity (are there well-defined standard operating procedures for common areas such as database installation, configuration, compliance, security, maintenance and health-checks)
9. Tools standardization
10. Automation levels

In my mind, these factors are most indicative of the overall complexity of a given environment. Now let’s figure out this PL model together. Assign a score between 1 (low) to 10 (high) in each of the above areas as it pertains to *your* environment. Go on, take an educated guess.

Areas 1 to 6 form what I call the Environmental Complexity Score. Areas 7 to 10 form the Delivery Maturity Score. Now lay out an X-Y Line graph with the former plotted on the Y-axis and the latter plotted on the X-axis.

Your PL depends on where you land. If you picture the X-Y chart as comprising 4 quadrants (left top, left bottom, right top and right bottom), the left top is "Bad", the left bottom is "Mediocre", the right top is "Good" and the right bottom is "Excellent".

Bad indicates that your environment complexity is relatively high, but the corresponding delivery maturity is low. Mediocre indicates that your delivery maturity is low, but since the environment complexity is also relatively low, it may not be a huge issue. Such environments probably don't see issues crop up frequently and there is no compelling need to improve delivery maturity. Good indicates that your environmental complexity is high, but so is your delivery maturity. Excellent indicates that your delivery maturity is high even with the environment complexity being low. That means you are truly geared to maintain service levels even if the environment gets more complex in the future.

Another thing that Excellent may denote is that your delivery maturity helps keep environment complexity low. For instance, higher delivery maturity may enable your team to be more proactive/business-driven, actively implement server/db consolidation initiatives to keep server or database counts low. Or the team may be able to actively implement robust data archival and pruning mechanisms to keep overall database sizes constant even in the face of high data growth rates.

So, now you have a Productivity Level that provides a simplistic, yet comprehensive indication of your team's productivity, as opposed to the age-old "databases to DBA" measure. Also, by actively addressing the areas that make up Delivery Maturity, you have the opportunity to enhance your PL.

But this PL is still subjective. If you would like to have a more objective index around your team's productivity and more accurately answer the question "how many DBAs do I need today?", there is also a way to accomplish that. But more on that in a future blog.


Ashutosh said...

Very interesting! I remember doing a similar exercise a couple of years back on "how may network administrators for x switches and y routers".

As you also mentioned, tools and automation are a great way to reduce the FTE count when it comes to service delivery and operations. Am curious if you also explored the % in effort/FTE reduction which an RBA tool would typically result in. Would be great to hear your thoughts on this.

Venkat Devraj said...

Thanks for the comment. Just to clarify, when I refer to the asset to admin ratio, it is from the perspective of the mundane aspects of service delivery and support. The goal is not necessarily to reduce FTE head-count, but to redeploy FTEs towards more strategic areas (planning, design, performance, availability and scalability optimization, etc.) - areas that are frequently disregarded (one of my pet peeves), but time and effort applied there pays off in spades! Hence I prefer to look at it as productivity gains, rather than strict FTE reduction.

In my experience, the productivity gains hover around 30% plus in the first 12 months with the application of a suitable RBA solution.

Hope this info helps,

Anonymous said...

The article from Gartner misses a big point. The maturity of the development organization is one of the big factors impacting DBA staffing. The higher the maturity (writing code that performs well and code that does not fail often -low defect ratio) the less amount of work the DBA has to do to identify and troubleshoot problems.

No amount of RBA will fix that...

Venkat Devraj said...

You are right in that app development maturity (or lack thereof) does impact DBA staffing. RBA cannot and *is not meant* to address that. All that RBA will do is to help you automate repetitive work patterns. If, in your company, the DBAs are stuck with application code that causes them to have to do a ton of troubleshooting, the DBAs may not be able to fix the application code in the short term, nor can they eliminate this type of troubleshooting. But they most definitely can leverage RBA 2.0 to recognize and deal with problem patterns before business users are impacted, and without continuing to suck up significant DBA and other admin time.

Obviously, we can't control everything in IT, but let's be smart about what we can indeed control!

Mahesh said...

I was the above anonymous poster. I have used Kintana in the past for doing automation of E-Business support tasks. It was very good. The hard part was to put the ROI for the investment.

My only concern about these analysts who write these reports is that they are not MECE (Mutually Exclusive, Collectively exhaustive). They then circulate it to the CIO's who use it to benchmark their staff with out all the facts.

So in your estimate, out of the 40 hours a DBA works in a week (hahaha), how many hours can the RBA save?

The reason I ask is that repetitive tasks take only 10-20% of the DBA's time. Most of the time is spent working on new projects, providing development assistance, identify issues in poorly performing systems and so on. I know this because I have been doing this for the past 14 years.

Also, from the perspective of being proactive versus reactive, let us take two common scenario's. Disk Failure and a craxy workload hijacking the system. The users would know it about the same time you know it too. How would a RBA help there?


Venkat Devraj said...

Thanks for reading my blog and for your comments. You pose some great questions. I'm afraid my response is going to be a tad lengthy. Rather than trying to fit it all in here, I will post it as a separate blog entry in the next day or two.