Tuesday, November 28, 2006

DBA Script Exodus to Python or Ruby Expected in the Near Term

During the last year and a half, I have been observing an increase in the number of DBAs heading in the direction of learning and using “new age” scripting languages, by which I mean Ruby and Python specifically. This migration is not a new phenomenon, however; the move to such new age scripting was preceded by earlier migrations, from the “proprietary age” to the “axial age” and in turn to the “open age”. And if anything, this move towards Ruby and Python is not just a passing fad; I expect it to turn into a full-fledged exodus, enabling higher scripter productivity and fostering more automation in the DBA world.

During the proprietary age, I’m sure most of us encountered DBAs specializing in a single (proprietary) DBMS language like PL/SQL or T-SQL. The main problem with that approach was the “only tool being a hammer” situation (where, if the only tool you have is a hammer, every problem begins to look like a nail). DBAs started using their favorite proprietary language everywhere, even to address scripting needs outside the database. For instance, in a Windows/SQL Server environment, I would run into T-SQL routines doing OS-level backups via NetBackup. Granted, you could do backups this way; however, backups done via T-SQL had to be driven off a SQL Server environment. That was fine for backing up database servers (and even then, only in environments comprising predominantly SQL Server); for other (non-database) servers such as application, mail and web servers, it was grossly inefficient and somewhat of a misuse of T-SQL’s capabilities.

I saw this leading to the axial age, wherein DBAs began using DBMS-independent scripting languages such as Korn Shell and VBScript. These routines served as the “axis” (axes) connecting a variety of operating system, database and application related functionality. This worked pretty well for DBMS and non-DBMS tasks alike; however, these scripts were usually confined to a particular operating system platform. For instance, Korn Shell scripts wouldn’t work on Windows and VBScript wouldn’t function on UNIX without a lot of heavy lifting. (Certain 3rd parties came up with specialized wrappers to allow shell scripts to work on Windows, however that required separate purchase and installation of such layers, and they had their fair share of problems in deployment, invocation and runtime exception handling.)

Thus came the open age, wherein DBAs could go with an “open” scripting language such as Perl, Tcl/Tk or even PHP that would run on a plethora of operating system platforms (Windows and/or anything ending with an “x”), including thin clients. Most of these languages were great for writing one-off scripts; however, they lacked the strengths of a full-blown language such as C# or Java, especially real object-oriented capabilities like inheritance and polymorphism. Hence while their applicability was high, their reusability remained rather limited. DBAs had to pretty much start from scratch each time they wrote newer routines. At most, they could take an existing script and hack it until it met their requirements (turning it into a whole different script in the process). There was no library of reusable code routines they could reference to put together newer functionality from existing code blocks. That greatly restrained DBA productivity and also resulted in a plethora of one-off scripts, each of which needed to reside on each target server, increasing complexity and maintenance overhead.

The new age scripting languages have emerged as a viable solution. Python and Ruby both provide an eclectic and formidable mix of the capabilities of the open age languages, combined with the versatility of the axial languages. They retain all the power of their scripting predecessors, yet put almost the entire power of Java and C# (more, some would argue) in the DBA’s hands. DBAs can truly write code libraries once and reference them (note, reference them, not copy and hack them) many times. A single line of Python or Ruby can do a lot, resulting in tightly written code with clean, elegantly stated constructs and documented classes and methods that is fairly easy to understand. For instance, look at the following line of Ruby code:

1.upto(3) { databaseStartup.retry }

One doesn’t need years of scripting experience to understand what it’s attempting to do. One doesn’t have to mess around with complex structures and code just to manipulate a simple string. And best of all, there appears to be unprecedented growth in the communities of followers around both Python and Ruby. Larger efforts based on these languages, such as Ruby on Rails and Django, are gaining mainstream support, resulting in significant value-add to these languages via more reusable code libraries, classes, methods and documentation. Based on this momentum, neither of these languages appears to be going away anytime soon. So if you are a DBA that’s already embraced one or both of these scripting languages, kudos; you now have more options in your toolbox to create and deploy industrial strength automation, especially coupled with a centralization platform like Data Palette*. However if you are still on the fence or worse, dedicated to playing with isolated Korn shell scripts and T-SQL routines, it may be time to consider giving yourself an upgrade!
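To make the reusability point concrete, here is a minimal Python sketch of the same idea: a retry helper written once in a shared library and referenced (not copied and hacked) by any script that needs it. The database_startup routine named in the usage comment is hypothetical; substitute your own.

import time

def retry(action, attempts=3, delay=5):
    """Call action() up to 'attempts' times, pausing between failed tries."""
    for i in range(attempts):
        try:
            return action()
        except Exception:
            if i == attempts - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(delay)

# Usage, assuming database_startup is your own startup routine:
#   retry(database_startup, attempts=3)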


(* Incidentally, Data Palette’s SOP automation framework allows automation routines to be written in any scripting language, including T-SQL if that’s your flavor of choice. However, it is my experience that the reusability and power of each SOP increase manifold when the SOP is written in Python or Ruby. Have fun scripting in the new age!)

Wednesday, November 22, 2006

Need for orchestration managers in IT automation

I was having a conversation earlier this week with the chief architect of a multinational (Top 3) IT services company regarding our company’s Data Palette automation product. He is no stranger to IT complexity, since his company manages the IT departments and business processes of literally hundreds of organizations in the private, public and government sectors. He asked me an interesting question: could Data Palette be used as an “orchestration platform”, calling different standard operating procedures not just for database administration per se, but for managing other components of the IT stack, such as system-level and network-level products?

That took me a bit by surprise, since all along I had been focused on just database management while talking with him. I told him that Data Palette has traditionally been used to call other DB-related product APIs and 3rd party routines from within different SOPs (Data Palette’s standard operating procedures) whenever that 3rd party or DBMS product routine is the best way to accomplish a given task. For instance, a Data Palette SOP could call StatsPack for Oracle DBAs who are used to utilizing that functionality for macro-level performance diagnosis, or, in a more complex situation, an SOP could be executed upon receiving an SNMP trap from enterprise monitoring tools like HP Openview or Mercury Sitescope. He mentioned that while this was useful, Data Palette could potentially serve a larger role in the environment: its SOP Module could be deployed as a central administration and audit console, providing an abstraction layer around other niche applications.
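To illustrate the idea (and only the idea; Data Palette’s actual SOP API is not shown here), a hypothetical Python sketch of such an orchestration layer might look like the following: a central dispatcher mapping incoming traps to SOPs across the whole stack. All trap and SOP names below are made up.

# Hypothetical sketch only; not Data Palette's real API.
SOP_REGISTRY = {
    "tablespaceFull": "add_space_sop",          # database layer
    "filesystemFull": "extend_filesystem_sop",  # system layer
    "linkDown":       "failover_route_sop",     # network layer
}

def dispatch(trap_name, target_host):
    sop = SOP_REGISTRY.get(trap_name)
    if sop is None:
        raise ValueError("no SOP registered for trap: %s" % trap_name)
    # A real platform would hand this off to its SOP engine; we just report it.
    print("would run %s against %s" % (sop, target_host))

dispatch("tablespaceFull", "dbserver01")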

Come to think of it, there's really nothing to stop an organization from using Data Palette (or any similar product) in such a capacity, and I told him as much. In fact, this approach would help companies in two ways:
1. Allow companies that rely on a plethora of tools to simplify their environment by orchestrating tool usage via a central automation platform (Data Palette was the catalyst in my conversation, but any similar automation product could be used in this capacity).

2. Allow teams situated around the world, as well as newer hires, to get up to speed faster by having to learn just one product interface, as opposed to having to come up to speed on several tools and point solutions.

Thinking further about it, one of the reasons automation efforts are seen as unrealistic in many complex environments is that multiple people do things their own way, using whichever tool they feel is appropriate for their situation. An orchestration platform allows everyone (not just DBAs or specific administration staff) to be on the same page about which tools are meant for which use, and to stick to that standard via a single command-and-control interface. If a team member really has a favorite tool for, say, server provisioning, they can even introduce that tool to a larger audience via such orchestration and make it part of the standard. This encourages team members to have a conversation about why a certain approach or tool is better for a task and to win over others to their way of operating.

The chief architect’s suggestion is a stellar one that will allow IT environments with higher complexity to embrace automation in a faster and more meaningful manner. These are the kinds of innovative thinkers and strong champions that our IT industry in general, and automation efforts in particular, need at this moment in dealing with complexity caused by silos of people and ongoing changes. It’s kinda like having your cake and eating it too: you retain the tools and technologies you are most familiar with, and weave them into larger scale automation efforts.

Sunday, November 12, 2006

Autonomic Oracle 10g Installs? Forget about it!

DBAs and users alike eagerly await the benefits of automation and autonomics, which should translate into less routine administration and faster completion of work. An autonomic database environment is marked by self-managing software, from installs and maintenance all the way to disaster recovery. But unfortunately, it seems software vendors like Oracle do everything they can to keep this goal elusive, breeding an ever-increasing population of overworked and under-appreciated DBAs who need to stay married to their Blackberries and do a lot of nocturnal heavy lifting.

Take my own very recent example. Last week, I was asked to install an Oracle 10g Release 2 database on a Linux RHEL 4 64-bit x86 server for a new Web application that needs to go live shortly. The install and database creation needed to be completed prior to Monday morning. Simple enough. I was so confident about this process, which I had probably done dozens of times with many different versions of Oracle, that I didn’t even start on the work until Saturday night (last night).

I installed the database in silent mode via the Unattended Install SOP within Data Palette (SOP = Standard Operating Procedure, a set of documented task plans and corresponding automation routines and workflows within the Data Palette automation platform). I was happy to see it work like a charm! (Look Ma, no hands...) Then I was off to create the database. Since I didn’t have a canned SOP to do the task (StrataVia is due to release a Database Creation SOP shortly), I figured I would do it manually. I started up sqlplus with the credentials ‘/ as sysdba’. Since I was logged into the box as oracle:dba, I expected to get to the SQL> command-line prompt so I could execute the CREATE DATABASE command. (Being an old-world sort of DBA, I prefer the command-line to GUI tools like the Database Configuration Assistant that take 3 minutes just to start up.)

Immediately, I got an error message: “error while loading shared libraries: libaio.so.1: cannot open shared object file: No such file or directory
ERROR: ORA-12547: TNS:lost contact”.

Thinking that for some reason, the libaio RPMs weren’t available, I ran the “ldd” command to figure out the shared library dependencies:

$ ldd $ORACLE_HOME/bin/oracle

Here’s a subset of the output:
libaio.so.1 => file not found
libdl.so.2 => /lib64/libdl.so.2 (0x0000003d1b700000)
libm.so.6 => /lib64/tls/libm.so.6 (0x0000003d1b900000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003d1bf00000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003d1ef00000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003d1b400000)
/lib64/ld-linux-x86-64.so.2 (0x000000552aaaa000)

Aha! I found the problem (the “file not found” line above). The shared library that Oracle was expecting was nowhere to be found. I figured maybe our Sys Admin had not installed all the prerequisite RPMs beforehand. The Data Palette Unattended Install SOP does check for these RPMs and optionally installs them; however, I figured somehow one had been missed, since our Sys Admin prefers to install RPMs manually. So I manually checked for the existence of the libaio RPM:

$ rpm -q libaio
libaio-0.3.105-2

OK, so it did exist. This was now puzzling. Why wasn’t Oracle using the libaio rpm that was on the system?

One thought was to just disable Async I/O and relink oracle so it didn’t try to utilize the libaio shared library.

(Side note, but one way I have done this in the past for 10g R2 is as follows:
$ su - oracle
$ . oraenv
$ [ Ensure that all databases using this ORACLE_HOME and related services are down ]
$ cd $ORACLE_HOME/rdbms/lib
$ make -f ins_rdbms.mk async_off

[ Now update relevant init.ora parameters such as disk_asynch_io=false and filesystemio_options=none ].

Now one can confirm that async I/O is not in use by Oracle (or any other apps on the machine) by grepping the /proc/slabinfo file for “kioctx” and “kiocb” (egrep "kioctx|kiocb" /proc/slabinfo) and ensuring the first couple of numeric columns show zeros.

If other apps might need async I/O, you can check whether Oracle alone is still using async I/O by rerunning the ldd command (ldd $ORACLE_HOME/bin/oracle | grep libaio) and ensuring the output no longer refers to the libaio file. The output of the nm command (nm $ORACLE_HOME/bin/oracle | grep io_getevent) can also be verified to ensure no object file symbols refer to libaio.) [END OF SIDE NOTE.]
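(One more aside: the slabinfo check above can be codified in a few lines of Python. This is a sketch under the assumption that the second and third columns of /proc/slabinfo are the active and total object counts, the usual layout; reading the file may require root on some kernels.

def async_io_in_use(slabinfo="/proc/slabinfo"):
    """Return True if any kioctx/kiocb slab objects are allocated."""
    for line in open(slabinfo):
        fields = line.split()
        if fields and fields[0] in ("kioctx", "kiocb"):
            # fields[1] = active objects, fields[2] = total objects
            if int(fields[1]) > 0 or int(fields[2]) > 0:
                return True
    return False

print("async I/O in use: %s" % async_io_in_use())
)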

In this particular situation, I couldn’t do away with async I/O, since this Web application was expected to be I/O intensive and I wanted to keep both asynchronous I/O and direct I/O enabled. So the only option was to dig in, see why Oracle was not accepting the existing libaio RPM, and see if I could install the right RPM version.

I called up our Sys Admin, explained what I was running into and requested sudo access so I could deal with this shared library problem. I told him I would only reinstall or relink libraries pertaining to Oracle. As he graced me with sudo access, he reminded me that he would check the sudo logs to ensure I didn't mess with anything else. He also mentioned that he didn’t think the problem had to do with the Red Hat OS, and that all RPMs were backward compatible. He even referred to the Oracle Installation Guide, which states that one should ensure that version XYZ or higher of an RPM exists before proceeding with the install.

Fair enough; the Oracle documentation can’t ever be wrong, right? I started Googling the “error while loading shared libraries: libaio.so.1” error message and found quite a few 3rd party sites talking about having the libaio-0.3.96-3 RPM. However, I had a newer version of this RPM. If the Oracle documentation was accurate, I shouldn’t have been having this problem.

I then sudo’d in, copied the .src file for the older RPM from the Red Hat site and tried installing it. However, it errored out, stating that a newer version was already installed. Then I retried using the “--replacepkgs” flag. Again, to no avail. Finally, after some vigorous head-scratching, I uninstalled (erased) the existing newer RPM and installed the older version.

That did the trick! The ldd and rpm -q commands revealed the following:

$ ldd $ORACLE_HOME/bin/oracle
libaio.so.1 => /usr/lib64/libaio.so.1 (0x0000002a96f43000)

$ rpm -q libaio
libaio-0.3.96-3

After this, I could create the database I wanted with the appropriate configuration. But by the time it was all done, it was past 3:30 in the morning. I couldn’t believe I had just spent almost 6 hours on this insignificant issue. Autonomic installs – yeah right! I couldn’t believe Oracle wouldn’t even bother updating its software install archive and documentation regarding this issue, so that our Sys Admin would have installed the right RPMs in the first place (or so that the Data Palette Unattended Install SOP could be updated to check for and install the right RPM version).

Based on this experience, I feel that even with their latest 10g release, Oracle has done very little to make their software self-managing, such that even novices can install and maintain it with ease. Yet during the recent OpenWorld 2006, they were already announcing Oracle 11g and its whopping 482 new features! (http://www.oracle.com/technology/events/oracle-openworld-2006/oow_tuesday.html)

I think I echo the sentiment of many tired DBAs when I say: give us a break, guys! We don’t wanna see yet another whizbang release loaded with cool marketing features (a la grid computing) that businesses aren’t quite ready to use. We don’t care about the newest release of OEM/Grid Control and its RAC monitoring features (which, by the way, also have some serious bugs). Instead, just give us a release that is stable when it comes to basic functionality (the 20% of functionality that's used 80% of the time) and that provides a layer of abstraction around mundane administration. Make your documentation current. Make the database a little smarter, to spare us the late nights and abuse associated with installing and managing your product, so that DBAs can work on things that are more relevant to our business and the businesses of our customers and users, and not worry about which version of a shared library Oracle is expecting in order to function smoothly.

For my part I have to state, at the risk of sounding perverse, that I was actually somewhat glad to go through this pain. After this was done, I poured myself another cup of coffee, documented the problem and solution, and sent it off to the Data Palette Engineering team at StrataVia so they could build this problem scenario and the corresponding solution into their new Database Creation SOP, so that other users of Data Palette (especially less experienced personnel) don't have to battle the same issues I did last night. But talk about loss of sleep and wasted productivity the next day… Yeesh!!!
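Incidentally, the heart of that write-up boils down to a preflight check that any install routine could run. Here is a minimal Python sketch under this post's assumptions (RHEL 4 x86-64, Oracle 10gR2, and libaio-0.3.96-3 as the known-good version); it is illustrative only, not the actual SOP code.

import subprocess

KNOWN_GOOD = "libaio-0.3.96-3"  # the version that finally worked in this case

def installed_libaio():
    p = subprocess.Popen(["rpm", "-q", "libaio"],
                         stdout=subprocess.PIPE, universal_newlines=True)
    return p.communicate()[0].strip()

def preflight_ok():
    found = installed_libaio()
    if found != KNOWN_GOOD:
        print("WARNING: found %s but this install wants %s" % (found, KNOWN_GOOD))
        return False
    return True

if not preflight_ok():
    raise SystemExit("fix the libaio RPM before launching the installer")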

Tuesday, November 07, 2006

State of database tools is appalling!

When one thinks of “database tools”, an overly crowded image comprising lots and lots of GUI utilities and point solutions comes to mind. Given all these tools, one would think the database tools market is a mature one, such that everything to be invented in this area has already been invented and is available. Right? On the contrary, nothing could be further from the truth.

Almost all of the tools in the market today focus on just two things:
1. Monitoring and alerting
2. Administrative GUIs to carry out ad-hoc/one-off tasks.

There are a handful of large companies out there that specialize in this area, and they produce all kinds of tools – utilities to point and click and create users, compare schemas, add space, roll out releases, tune SQL, produce wait event graphs and other performance metrics, and so on. But at the end of the day, all of these tools belong to one or both of the above categories in various permutations and combinations. (You name a tool and I will tell you which of the above categories it focuses on.) There is no earth-shattering innovation here. Add to this the fact that many DBAs prefer the flexibility of the command-line anyway for the two categories above, and many of these tools eventually morph into shelfware, grossly under-delivering on the investment made in them.

Don’t get me wrong – these large tools providers do their fair share of R&D (approx. 18 to 28% of their overall revenue, as per last year’s SEC filings of heavyweights BMC, Quest Software and Embarcadero… whoo hoo!). But it looks like most, if not all, of their R&D is devoted to figuring out better and nicer looking GUIs for monitoring. That’s great! I’m a big fan of usability improvements that result in nice intuitive interfaces. However, think of this: if I want to create a user in one database, I might as well use the command-line (unless I’m too lazy to memorize the syntax or look it up in the online SQL Reference guide). If I’m dealing with, say, half a dozen databases where I need to create the same user, I would love to use these fancy GUI tools and point and click my way to glory. But what happens if I need to propagate this same new user and privileges to a couple dozen databases, or even, hold your breath, a hundred databases? That would be a lot of pointing and clicking, no matter how easy to use the tool is (a simple scripted alternative is sketched below). Now add to these 100 databases a variety of mundane/repeatable tasks – installs, configs, refreshes and cloning, maintenance plans, patches, upgrades, migrations, managing load processes, DDL releases, and so on. That makes for a very busy DBA team working around the clock, and no amount of point solutions and nice GUIs can change that. None of these mainstream tools are even architected for that kind of functionality and task replication.
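For contrast, here is a rough Python sketch of that 100-database user rollout. The driver choice (cx_Oracle), host names, credentials and privileges are all assumptions for illustration; a real routine would pull its targets from a central inventory and handle errors and auditing.

import cx_Oracle  # hypothetical driver choice; any DB-API driver would do

# A hundred hypothetical targets; in practice, pull these from an inventory.
SERVERS = ["db%02d.example.com" % i for i in range(1, 101)]

DDL = [
    "CREATE USER report_ro IDENTIFIED BY changeme",
    "GRANT CREATE SESSION TO report_ro",
]

for host in SERVERS:
    conn = cx_Oracle.connect("admin", "secret", "%s/ORCL" % host)
    cur = conn.cursor()
    for stmt in DDL:
        cur.execute(stmt)  # DDL auto-commits in Oracle
    cur.close()
    conn.close()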

Now some of these tools vendors are getting smart. They are introducing newer "automation" features. Yaay! But guess what: most of these automation features are built on top of the monitoring functionality they previously had, and they are restricted to certain vanilla automation, or to allowing a script to be kicked off whenever an alert fires. In the latter case, the script still has to be written and tested by a DBA. DBAs in complex, heterogeneous environments are already super busy. They don’t have the time to write these complex scripts. The most they may do is look on Google for a particular script and, if found, pick it up. But again, they don’t have the bandwidth to test these recently downloaded scripts in all the database environments they support. And oh, by the way, the DBA who wrote or downloaded a particular script may be well versed in how it works, but none of the other DBAs on the team are likely to be familiar with it. So now each DBA starts building her own toolbox of scripts. Most of these scripts are not documented in a central location, and there is no pre-defined (commonly agreed upon) methodology regarding which script to use in what circumstances. Worse yet, each of these scripts usually needs to reside locally on each of the (100) database servers. Even if one line of code needs to change in one script to accommodate a change in the environment, someone needs to manually log into each of the DB servers and make the change. Not a bad approach to follow, if fat-fingering and typos were non-existent!!!

Now you begin to see what many veteran DBAs and IT managers have been seeing for years – the DBA tools industry is in a state of despair, without any light at the end of the tunnel. All that’s coming into the tunnel is more monitoring tools – they have become more lightweight, they now carry out extra audit functions, and so on. But at the end of the day, all they do is monitoring and ad-hoc administration.

In the meantime, some of the DBMS vendors have been introducing automation capabilities into their core products in recent years. However, the bulk of that automation is also vanilla in nature and hence fairly limited in its capability. Let's understand what this "vanilla automation" means. Take, for instance, Oracle's datafile auto-extend capability. Oracle has had this feature for some time now. However, the feature is quite set in its ways: it just makes the datafile grow until it reaches a pre-set limit or until the underlying mount-point/drive gets full. It does not accommodate the sys admin policies prevalent in most organizations, such as implicit quotas on different mount-points (due to diverse applications and databases sharing the same storage devices as the DBMS) and gracefully growing onto the next appropriate mount-point (see the sketch below). In other words, vanilla automation capabilities built into DBMS products and 3rd party tools do not necessarily accommodate custom IT policies and environmental rules. Any kind of custom policy needs a script to be built by the DBA, and that approach runs into the script-related problems mentioned above. Worse yet, some of these tools require the DBA to learn a proprietary scripting language before they can build any custom automation capability (a notable example is BMC Patrol). Yeah, like they have the time to do that...
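To make the policy-aware alternative concrete, here is a small Python sketch of the missing piece: picking the next mount-point that still has headroom under a site-specific quota. The mount-point names, quota fractions and the follow-on ADD DATAFILE step are all made-up illustrations.

import os

# Site policy: maximum fraction of each mount-point the database may consume.
QUOTA = {"/u01/oradata": 0.85, "/u02/oradata": 0.85, "/u03/oradata": 0.90}

def usage(path):
    s = os.statvfs(path)
    return 1.0 - float(s.f_bavail) / s.f_blocks

def next_mount_point():
    for mp in sorted(QUOTA):
        if usage(mp) < QUOTA[mp]:
            return mp
    return None  # every mount-point is over quota: time to page a human

mp = next_mount_point()
if mp:
    print("add the next datafile under %s" % mp)
else:
    print("no mount-point has headroom; escalate")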

Some of the DBMS vendors have also begun to provide DBA tools separate from their core DBMS product. But these are not exactly stellar either! For instance, Oracle has been dutifully releasing its Enterprise Manager product since version 7. Its “intelligent agent” technology has been difficult to install, unreliable and prone to frequent failure. At times, it would suck up more system resources than the database itself (a la BMC Patrol or an HP Openview DBSPI). I know of an old school DBA who has been burnt by OEM more than once in a production environment. He's sworn never to touch OEM with a barge pole, no matter how much it evolves! It's lost all credibility with him and others like him. He feels Oracle should stick to building databases rather than trying to do it all! Anyway, the latest incarnation of OEM, Grid Control, is certainly more reliable and powerful than its predecessors. However, its power mostly applies just to Oracle, and at that, only to the latest versions of Oracle. For instance, one of the most powerful capabilities of Grid Control is dynamic RAC node provisioning and configuration for Oracle10g, but that doesn’t work for Oracle8i and Oracle9i. OEM claims to support SQL Server and IBM DB2 as well, but only for monitoring (no automation features are included), and that too for a hefty surcharge! Well, that’s no surprise, right? Think about it: why would Oracle make it easy for customers to manage its competitors’ products? A lot of my customers (though they use Oracle databases for some of their applications) think that Oracle, being the crafty company it always has been, is in the business of selling databases and applications and, along the way, cutting off the oxygen for competing products while ostensibly supporting them.

So what’s a customer to do? Continue to participate in the great monitoring rollout from these behemoths and hire human DBAs by the truckload to do the real work when the pagers go off? Sure, you need high quality DBAs. But merely adding head-count doesn’t scale. Talk to the companies that have hundreds of DBAs. They have already tried that approach and are consumed by their own hodgepodge of problems…

There is hope yet. Some smaller innovative companies are rising to the challenge – namely, companies like Opalis, Appworx, Realops, Opsware (this one isn’t so small any more!) and StrataVia (warning: implicit marketing push, since this is the company I work for) with its Data Palette product. All of these companies are beginning to address the above problems in their particular area of focus – application administration, systems administration, database administration, et al. Take a look at what they have to offer and see if it’s suitable for your organization. Maybe it is, maybe it is not. But you owe it to yourself and your shareholders to see if you can use their newer offerings to go beyond mere monitoring and ad-hoc administration and actually improve the operating efficiency of your IT environment.

The coming years should be interesting: we will see which of these innovative companies and products the market adopts and embraces, and which ones wither away. But more than anything, it will be intriguing to see how long the 800-pound DBMS tools behemoths can continue to survive by focusing on improving monitoring and ad-hoc administration alone.