Sunday, November 12, 2006

Autonomic Oracle 10g Installs? Forget about it!

DBAs and users alike eagerly await the benefits of automation and autonomics which will translate into less routine administration and faster completion of work. An autonomic database environment is marked by self-managing software including installs and maintenance, all the way to disaster recovery. But unfortunately, it seems software vendors like Oracle do everything they can to keep this goal elusive and breed an ever-increasing population of mundane and under-appreciated DBAs that need to stay married to their Blackberries and do a lot of nocturnal heavylifting.

Take my own very recent example. Last week, I was requested to install an Oracle 10g Release 2 database on a Linux RHEL 4 64-bit x86 server for a new Web application that needs to go live shortly. The install and database creation needed to be completed prior to Monday morning. Simple enough. I was so confident about this process which I had probably done dozens of times with many different versions of Oracle, I didn’t even start on the work until Saturday night (last night).

I installed the database in silent mode via the Unattended Install SOP within Data Palette (SOP = Standard Operating Procedure, a set of documented task plans and corresponding automation routines and workflows within the Data Palette automation platform). I was happy to see it work like a charm! (Look Ma, no hands...). Then I was off to create the database. Since I didn’t have a canned SOP to do the task (StrataVia is due to release a Database Creation SOP shortly), I figured I would do it manually. I started up sqlplus with the credentials ‘/ as sysdba’. Since I was logged into the box as oracle:dba, I expected to get to the SQL> command-line prompt so I could execute the CREATE DATABASE command. (Being an old-world sort of DBA, I prefer the command-line to the GUI tools like the Database Assistant that take 3 minutes just to start up.)

Immediately, I got an error message: “error while loading shared libraries: libaio.so.1: cannot open shared object file: No such file or directory
ERROR: ORA-12547: TNS:lost contact”.

Thinking that for some reason, the libaio RPMs weren’t available, I ran the “ldd” command to figure out the shared library dependencies:

$ ldd $ORACLE_HOME/bin/oracle

Here’s a subset of the output:
libaio.so.1 => file not found
libdl.so.2 => /lib64/libdl.so.2 (0x0000003d1b700000)
libm.so.6 => /lib64/tls/libm.so.6 (0x0000003d1b900000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003d1bf00000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003d1ef00000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003d1b400000)
/lib64/ld-linux-x86-64.so.2 (0x000000552aaaa000)

Aa ah! I found the problem (in bold above). The shared library that Oracle was expecting was not to be found. Maybe I figured our Sys Admin had not installed all the prerequisite RPMs beforehand. The Data Palette Unattended Install SOP does check for these RPMs and optionally install them, however I figured somehow one had been missed since our Sys Admin prefers to install RPMs manually. So I manually checked for the existence of the libaio RPM:

$ rpm –q libaio
libaio-0.3.105-2

OK, so it did exist. This was now puzzling. Why wasn’t Oracle using the libaio rpm that was on the system?

One thought was to just disable Async I/O and relink oracle so it didn’t try to utilize the libaio shared library.

(Side note, but one way I have done this in the past for 10g R2 is as follows:
$ su – oracle
$ . oranev
$ [ Ensure that all databases using this ORACLE_HOME and related services are down ]
$ cd $ORACLE_HOME/rdbms/lib
$ make –f ins_rdbms.mk async_off

[ Now update relevant init.ora parameters such as disk_asynch_io=false and filesystemio_options=none ].

Now one can confirm that async I/O is not in use by Oracle (or any other apps on the machine) by grepping the /proc/slabinfo file for “kioctx” and “kiocb” (egrep "kioctxkiocb" /proc/slabinfo) and ensuring the first couple of columns show zeros.

If other apps might need async I/O, you can check if Oracle is still using async I/O by rerunning the ldd command (ldd $ORACLE_HOME/bin/oracle grep libaio) to ensure that the output doesn’t still refer to the libaio file. The output of the nm command (nm $ORACLE_HOME/bin/oracle grep io_getevent) can also be verified to ensure any object file symbols are not referring to LIBAIO.) [END OF SIDE NOTE.]

In this particular situation, I couldn’t refrain from using AIO since that this Web application was expected to be I/O intensive and I wanted to keep both asynch I/O and direct I/O enabled. So the only option was to dig in and see why Oracle was not accepting the existing libaio RPM and see if I could install the right RPM version.

I called up our Sys Admin and explained what I was running into and requested for sudo access so I could deal with this shared library problem. I told him I would only reinstall or relink any libraries pertaining to oracle. As he graced me with sudo access, he reminded me that he would check the sudo logs to ensure I didn't mess with anything else, and also mentioned that he didn’t think the problem had to do with the Red Hat OS, and that all RPMs were backward compatible. He even referred to the Oracle Installation Guide that stated that ensure that the XYZ or higher version of an RPM existed before proceeding with the install.

Fair enough, the Oracle documentation can’t ever be wrong, right? I started Googling the “error while loading shared libraries: libaio.so.1:” error message and found quite a few 3rd party sites talking about having the libaio-0.3.96-3 RPM. However I had a newer version of this RPM. If the Oracle documentation was accurate, I shouldn’t be having this problem.

I then sudo’d in, copied the .src file for the older RPM from the Red Hat site and tried installing it. However it errored out stating a newer version already was installed. Then I retried using the “--replacepkgs” flag. Again to no avail. Finally after some vigorous head-scratching, I uninstalled (erased) the existing newer RPM and installed the older version.

That did the trick! The ldd and rpm -q commands revealed the following:

$ ldd $ORACLE_HOME/bin/oracle
libaio.so.1 => /usr/lib64/libaio.so.1 (0x0000002a96f43000)

$ rpm –q libaio
libaio-0.3.96-3

After this, I could create the database I wanted with the appropriate configuration. But by the time it was all done, it was past 3:30 in the morning. I couldn’t believe I had just spent almost 6 hours on this insignificant issue. Autonomic installs – yeah right! I couldn’t believe Oracle wouldn’t even bother updating its software install archive and documentation regarding this issue so our Sys Admin would have installed the right RPMs in the first place (or update the Data Palette Unattended Install SOP to check for and install the right RPM version).

Based on this experience, I feel even with their latest 10g release, Oracle has done very little in making their software self-managing such that even novices can install and maintain it with ease. During the recent OpenWorld 2006, they were already announcing Oracle 11g and its whopping 482 new features! (http://www.oracle.com/technology/events/oracle-openworld-2006/oow_tuesday.html).

I think I echo the sentiment of many tired DBAs when I say: give us a break guys! We don’t wanna see yet another new whizbang release loaded with cool marketing features (aka grid computing) that businesses aren’t quite ready to use. We don’t care about the newest release of OEM/Grid Control and its RAC monitoring features (which by the way, also have some serious bugs). Instead just give us a release that is stable when it comes to basic functionality (the 20% functionality that's used 80% of the time) and such that a layer of abstraction is provided around mundane administration. Make your documentation current, make the database a little more smarter to spare us the late nights and abuse associated with installing and managing your product so that DBAs can work on things that are more relevant to our, our customers’ and users' businesses and not worry about which version of a shared library Oracle is expecting in order to function smoothly.

For my part I have to state, at the risk of sounding perverse, I was actually somewhat glad to go through this pain. After this was done, I poured myself another cup of coffee and started documenting this problem and solution and sent it off to the Data Palette Engineering team at StrataVia so they could build in this problem scenario and the corresponding solution into their new Database Creation SOP so other users of Data Palette (especially less experienced personnel) do not have to battle the same issues I did last night. But talk about loss of sleep and wasted productivity the next day… Yeesh!!!

4 comments:

Anonymous said...

Thanks for taking the time to write this problem up. I just ran into the same thing and your solution worked perfectly. Keep up the good work.

Unknown said...

Thanks a lot posting this article. It was very helpful resolving the issue and saved quite a lot of time and effort. I appreciate this good work.

Anonymous said...

Hi, FYI Oracle has now published as a known issue for Linux RH64bit, Oracle MetaLink "Note 394297.1: Libaio.So.1: Cannot Open Shared Object File and ORA-12547: TNS:Lost Contact"

Thank you for the excellent documentation of the solution and the explanations along the way.

Good Luck Out There--
G** from Oracle Support [after a customer pointed out this solution.]

Anonymous said...

Thank you for the excellent technical analysis, great work!!
Kenn