Monday, July 05, 2010

Preparing the IT Organization for the Internal Cloud

A recent article on Cloud Computing quoting IBM’s Ric Telford caught my attention. He boldly predicts that the term Cloud Computing will become obsolete over the next 5 years while the underlying methodology will become standard practice for IT. While this level of IT maturation truly seems a few years out, I find internal or private clouds to be increasingly prevalent in certain pockets of industries such as financial services, energy and managed services. On the other hand, IT departments in many other industries such as manufacturing, retail and telecom continue to struggle to enable such capabilities. The struggles have not emanated necessarily from lack of desire, budget or technology leadership, but from cultural challenges such as the following:

(a) Perceived shift in power dynamics. Internal cloud and self-service delivery models make it unclear where traditional IT Operations stops and where end-users (service consumers) start driving the delivery process. Many IT Operations personnel fear that end-users will need to be granted administrator privileges to enable self-service, and that it will be difficult to control end-user activity. They also worry that end-users will not adhere to well-defined IT standards and that the requisite training and oversight will dramatically increase Operations’ workload.

(b) The lines between IT silos become blurry. When an end-user group asks for an IT service, the service is traditionally broken down into specific requests in the form of tickets and assigned to specific silos such as Sys Admins, Storage Admins, DBAs, etc. The requests are then delivered individually, often with lots of coordination back and forth, with the final service being received and validated by the end-user team. While a cloud-based delivery model aims to improve such outdated modes of functioning, it is also susceptible to severe internal bottlenecks and confusion about what the right unit of delivery is and who needs to own it: the application admins at the top of the stack, or the sys admins at the bottom. Meanwhile, average service recipients do not care about the individual silos that make up their request; they speak in the "language" of the application whereas traditional IT Ops speaks in the language of servers, storage, databases and other infrastructure elements. Thus the lack of a common denominator for requesting and delivering self-service capabilities arrests the momentum of internal cloud deployments in many organizations.

In other words, the true barrier to enabling a private cloud is not technology but cultural apprehension. Old world IT bridges the gap by throwing more bodies and silos at the problem – repetitive meetings and water cooler conversations among different IT personnel and application teams provide the framework for delivery. It’s a lot of wasted effort, but the old world process does deliver the services (albeit in days, weeks or months!).
 
CIOs that are ultimately successful in having IT groups embrace internal cloud and self-service models solve these bottlenecks in a way that’s amenable to the overall organization culture. Some CIOs do it via baby steps by retaining existing IT structure and policies and offering specific infrastructure components such as servers and databases in self-service mode to project teams. In other words, they attempt to transform IT by offering self-service capabilities on a silo by silo basis. However that is akin to a restaurant delivering specific pre-cooked ingredients to customers and expecting them to compile those ingredients into the requisite meal. Obviously such “self-service” (if it can still be labeled that) IT delivery models require the end-user to be fairly sophisticated in asking for the right resources and being able to assemble them all together themselves. This also assumes that the end-user is disciplined enough to ask for these ingredients in the right quantity (i.e., they are not wasteful or overly cautious). These assumptions rarely hold true causing the success of such internal cloud strategies to be stunted at best.
 
A better approach might be to take the time upfront to define a common denominator (across IT silos) and corresponding units for self-service – via a macro-level reference architecture. Take any candidate end-user request or task that can be delivered in self-service mode: e.g., application provisioning, code releases, creating data copies or backups, restoring snapshots, creating users, resetting passwords, resolving incidents, etc. All of these require a universal definition of what’s to be delivered to the end-user. All underlying nuances such as IT silos, their lines of control and associated complexities need to be abstracted from the end-user. The more comprehensively IT is hidden behind the curtain, the better the private cloud implementation.
 
Based on my work with multiple IT organizations across different industries, I find that the most pragmatic unit for self-service comes from the top-most layer of the IT stack – the application. However, utilizing application-specific units is easier said than done since in most large organizations (the kind that can really benefit from a private cloud), IT operations doesn’t speak the aforementioned “application language”. Even if it does, there are several applications in use and most applications have multiple parts and pieces (i.e., they are all “composite applications”). In order to make application-specific units more palatable to all IT silos, the underlying descriptors of the units have to reference physical infrastructure attributes and map them to their corresponding logical (app-specific) definitions within the reference architecture.
 
Most applications that have been developed over the last decade are 3-tiered. They have a web server tier, an application server tier and a database tier, along with associated sub-components such as a load-balancer tied to a farm of web servers. Hence this popular 3-tier “application template” acts as a robust common denominator for IT. By building self-service capabilities around such a template, most applications within the enterprise are addressed. Of course, there are certain applications that employ just one or two tiers: e.g., server-based batch applications, client/server reporting applications and so on. As long as the tiers used by these disparate applications are a subset of the chosen application template, they are adequately covered. Along the same lines, some applications employ unique middleware components such as a transaction processing monitor, a message bus, queues, SMTP services, etc. Obviously such applications would fall outside the 3-tier template and would need to be dealt with separately. However, with a bit of deeper investigation, I usually find that such applications are vastly outnumbered by the standard application stack that employs some or all of the 3 tiers named above. Hence such exceptions do not make an appreciable difference (in most cases) to the notion of using a single master application template in the reference architecture.
 
By applying the 80-20 rule, one can keep the application template relatively straightforward in the initial iterations of the private cloud implementation and focus on those applications and tasks whose delivery in self-service mode will have the biggest impact. I have seen this approach work well even in larger organizations with multiple customers or lines of business (LOBs) because the underlying definition (of the makeup) of the application remains consistent across these disparate LOBs.
 
Once the master application template is defined, each unique application type can be assigned a profile that describes the tiers it employs. For instance, AppProfile1 (or simply, Profile1) can refer to applications that always have a web server, application server and a database instance, whereas Profile2 can refer to applications that utilize a 2-tier model (just an app server and a database), and so on. Using concepts of polymorphism, all these profiles should refer to the master application template such that any changes to the master template are immediately reflected across all profiles. During future iterations, additional templates and profiles can be defined to refer to applications that may have additional tiers such as a message bus. These definitions can be laid out on paper (and later implemented using an automation tool), or ideally, set up within an automation platform such as Stratavia’s Data Palette from the very beginning.
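To make the template and profile relationship concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `MASTER_TEMPLATE` and `AppProfile`, the tier keys, and the choice of component per tier are hypothetical and are not taken from Data Palette. The sketch also uses a plain run-time lookup rather than class inheritance, which is just one way to get the "change once, reflected everywhere" behavior described above.

```python
from dataclasses import dataclass

# Hypothetical master template: tier name -> standard component for that tier.
MASTER_TEMPLATE = {
    "web": "Apache Tomcat",
    "app": "WebLogic",
    "db": "Oracle DBMS",
}


@dataclass(frozen=True)
class AppProfile:
    """A profile only names the tiers it uses; details stay in the template."""
    name: str
    tiers: tuple

    def __post_init__(self):
        # A profile is valid only if its tiers are a subset of the template.
        unknown = set(self.tiers) - set(MASTER_TEMPLATE)
        if unknown:
            raise ValueError(
                f"{self.name}: tiers outside the template: {sorted(unknown)}"
            )

    def resolve(self):
        """Look up each tier in the template at call time, not definition time."""
        return {tier: MASTER_TEMPLATE[tier] for tier in self.tiers}


PROFILE1 = AppProfile("Profile1", ("web", "app", "db"))  # full 3-tier stack
PROFILE2 = AppProfile("Profile2", ("app", "db"))         # 2-tier model
```

Because `resolve()` reads the template on every call, a change to the master template shows up in every profile without touching the profiles themselves. An application that needs a tier the template lacks (say, a message bus) is rejected up front, which matches the idea of handling such exceptions separately.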
 
Regardless of the application profiles (the fewer the better in the initial phase), the deployment needs to adhere to one key point: keep the individual components for each tier hidden from the end-user while giving IT teams full control over those components. By “components”, I’m referring to the following:
 
(i) infrastructure layers: i.e., servers (virtual or physical), storage (SAN, NAS, etc., depending on IT standards and budget sizes), networking capacity, etc.; and

(ii) software layers: the web server, application server and database instance. Compared to physical infrastructure like servers and storage, each of these software layers often encompasses a “best of breed” configuration across a broader variety of vendors, proprietary platforms and open source frameworks (such as Apache Tomcat, Microsoft IIS, WebLogic, WebSphere, Oracle DBMS, SQL Server, SAP Sybase, VMware SpringSource, etc.), and hence tends to be much more complex and time consuming to provision, configure and manage on an ongoing basis. Hence the overall success of the deployment calls for limiting the number of application templates and profiles during its initial phase.

Each application profile should have a description of what quantities of which specific components are needed by each tier - for initial provisioning as well as subsequent (incremental) provisioning. For instance, a ‘web server’ tier in the Sales Force Automation (SFA) suite may comprise a virtual machine with X amount of CPU, Y amount of disk space and a standard network configuration running Apache Tomcat. When an authorized end-user (e.g., a QA Manager) asks for the SFA application to be delivered to a new QA environment, the above web server unit gets delivered along with the corresponding application server and database instance – all of them pre-configured in the right units and ready for the end-user to access! However when an authorized end-user asks for more web servers to boost performance in the production SFA environment, what gets delivered is N units of only the web server tier, with specific pre-defined configuration to tie it to the corresponding (previously existing) application server and/or database. All of these mappings may be held in a CMDB (ideally!) or in some kind of operations management database and referenced at run-time by the automation workflows that are responsible for service delivery.
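The difference between initial and incremental provisioning can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CPU and disk figures stand in for the unspecified X and Y above, and the names `TierUnit`, `SFA_PROFILE`, `initial_provision` and `incremental_provision` are hypothetical rather than part of any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierUnit:
    """One deliverable unit of a tier, with its pre-defined resource quantities."""
    component: str
    cpu_cores: int
    disk_gb: int


# Hypothetical quantities for the SFA suite (placeholders for X and Y).
SFA_PROFILE = {
    "web": TierUnit("Apache Tomcat", cpu_cores=2, disk_gb=20),
    "app": TierUnit("application server", cpu_cores=4, disk_gb=40),
    "db": TierUnit("database instance", cpu_cores=8, disk_gb=200),
}


def initial_provision(profile, environment):
    """New environment: deliver one unit of every tier in the profile."""
    return [(environment, tier, unit) for tier, unit in profile.items()]


def incremental_provision(profile, environment, tier, count):
    """Existing environment: deliver `count` extra units of a single tier."""
    if count < 1:
        raise ValueError("count must be at least 1")
    unit = profile[tier]  # KeyError if the tier is not part of the profile
    return [(environment, tier, unit) for _ in range(count)]
```

A QA Manager's request for a new SFA environment would map to `initial_provision(SFA_PROFILE, "qa")`, while a request for more production web capacity would map to something like `incremental_provision(SFA_PROFILE, "production", "web", 3)`. In a real deployment the profile would come from the CMDB or operations database rather than a literal in code.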
 
The application templates and profiles with quantitative resource descriptions allow end users to receive services in an application-friendly language, while allowing IT Ops to provision and control individual pieces of server, network, disk space, database, etc., using their existing IT standards and vocabulary. All underlying components blend together into a single pre-defined unit, enabling the well-oiled execution of a self-service delivery model within the private cloud implementation.
 
Using application provisioning as an example, the graphic below depicts how various application profiles comprising multiple software and infrastructure layers can be defined.
 

 
Regarding the other previously stated bottleneck (i.e., the perceived shift in power dynamics), while IT Ops personnel are behind the curtain, it doesn’t mean they are any less influential. Using automation products such as Data Palette, they are able to exert control both in the initial stages of service definition (along with the Operations Engineering or Architecture groups), as well as during the ongoing service delivery process by monitoring and tuning the underlying automation workflows (again, in conjunction with Operations Engineering) as depicted in the graphic below. In other words, they define and control what automation gets deployed and the exact sequence of steps for delivering a particular component of an application in the environments they manage. Specific product capabilities such as multi-tenancy, granular privileges and role-based access control allow them to grant end-users the ability to see, request and perform specific activities in self-service mode without also granting any administrative privileges they could potentially misuse.
 
 
 
As the IT Ops personnel continually assess areas of inefficiency, they can refine the workflows and roll out improvements with little to no impact on end users. Such efforts, while changing what IT Ops does on a day-to-day basis, help to deliver a better end-user experience, helping companies leapfrog from the IT of yesteryear to the “industrialization of IT” that IBM’s Telford refers to.

Tuesday, February 23, 2010

What’s the “Right Way” to Automate the Application Stack?

Cloud computing is forcing IT organizations to rethink automation. Early adopters started out with the delivery of servers, storage and network connectivity in self-service mode via an internal cloud. While this helped reduce service delays, real service level improvements weren’t forthcoming. Application owners and end users were still experiencing significant elapsed times between service request and delivery. It was becoming obvious to the IT thought-leaders that merely provisioning and offering up “ping, power and pipe” quickly wasn’t going to have a meaningful impact. That just shifted the bottleneck up the stack into the application layers. The new focus of the cloud revolves around rapid deployment and mass management of applications. Without the ability to provision and manipulate discrete application services, the value of the cloud is stunted.

As the recent acquisition of Phurnace indicates, the traditional data center tools vendors have started figuring it out and are attempting to offer solutions that deliver automation aimed at the application layers. The advent of these vendors and their mainstay strength in server provisioning is bringing forth multiple interesting approaches to application automation. Approaches vary from the composite policy-based (but hands-off) VMware vApps strategy to the automation of selective app admin tasks for specific app components such as Java code deployment (e.g., Phurnace) to broader automation platforms with out-of-the-box modular content to address the entire administrative lifecycle of various app components (e.g., Stratavia). The key question that emerges for customers is “what’s the right way to automate applications?” Conventional wisdom says if your only weapon is a hammer, every problem looks like a nail. Armed with robust server virtualization and provisioning toolsets, some of the larger vendors are approaching the application stack with (no surprises!) a focus on provisioning. However at the risk of sounding clichéd, I have to say that automating application administration is a very different paradigm. 

You see, at the server layer and below (including storage and networking), significant admin time is taken up in provisioning. Post-provisioning activities such as patching and configuration management are often handled via provisioning – i.e., by reimaging the OS with an updated (patched/reconfigured) image. So all in all, provisioning is the key administrative function at these lower layers.

However, as you go above the server, you encounter discrete application layers such as web servers, application servers and databases – to name the most popular components. Each of these dictates a different operations lifecycle that is not dominated by provisioning and configuration management. In fact, provisioning takes up barely 15-20% of the typical App Admin’s time. The remaining 80-85% is spent on other post-provisioning activities such as maintenance, incident response and service requests (the graphic below provides examples of these task categories).

But then the traditional vendors ask, why do these App Admins do all these things in the first place? Why can’t they, like Sys Admins and Network Admins, handle these other activities via Provisioning/Re-provisioning? For instance, instead of applying a new patch or performing a maintenance operation or doing a code release, why can’t the App Admin just provision a new image that has the desired changes? That will eliminate the need for these other post-provisioning activities and free up App Admin time.

This type of argument does not work because it goes against the grain of real-world application management. Instances of an application component frequently develop a unique fingerprint over the course of their use. This fingerprint is based on several factors including security requirements, performance adjustments, and the experience and skill-level (a.k.a. best practices) of the Admins managing them. To make things worse, these fingerprints can be dynamic in nature. The same application may look different at different times of the day. For instance, a database may serve as a transactional data-store during regular business hours and may be configured specifically to facilitate smaller read/write I/O operations, whereas at night time, the same database may be converted into a batch database with different buffer sizes, log file locations, etc. to facilitate bulk writes. Bland categorization and re-imaging of the application server (say, to apply a new application patch) by the Sys Admins causes much of this dynamic application fingerprint to be lost - creating in turn a lot more work for the App Admins (to restore the fingerprint as much as they can – assuming they themselves remember all the changes and can get it right the first time!)
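One way to picture what re-imaging throws away is to compare a running instance's settings against the settings baked into the image. The Python sketch below does only that; the setting names and values are invented for illustration and do not correspond to any particular database product.

```python
def fingerprint_drift(baseline, live):
    """Return the live settings that differ from, or are absent in, the baseline image.

    Each entry maps a setting name to a (baseline_value, live_value) pair;
    baseline_value is None when the image does not define the setting at all.
    """
    return {
        key: (baseline.get(key), value)
        for key, value in live.items()
        if key not in baseline or baseline[key] != value
    }


# Hypothetical settings baked into the standard image.
IMAGE_BASELINE = {"buffer_pool_mb": 512, "log_dir": "/var/log/db", "max_sessions": 200}

# Hypothetical night-time (batch) configuration of the same database.
NIGHT_CONFIG = {
    "buffer_pool_mb": 4096,     # enlarged for bulk writes
    "log_dir": "/bulk/logs",    # logs moved for the batch window
    "max_sessions": 200,        # unchanged
    "audit_trail": "on",        # added later for a security requirement
}

drift = fingerprint_drift(IMAGE_BASELINE, NIGHT_CONFIG)
```

Here `drift` holds three entries (`buffer_pool_mb`, `log_dir` and `audit_trail`), and each one is a change the App Admin would have to remember and reapply after a blanket re-image. Since the fingerprint varies by time of day, a single snapshot would not even capture all of it.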

The other challenge is that a single server frequently hosts multiple application types and instances. This is commonly encountered even in large-scale production environments. Each of these instances may have its own maintenance window and need to be patched, reconfigured, upgraded and managed individually based on instance-specific standards and dependencies. Reimaging (the entire server) doesn’t afford granular control of individual instances. Now one may argue that this was more prevalent in the olden days when servers were expensive, and that with virtualization now commoditized, each application instance can reside on its own server image, thereby eliminating this problem. But reality goes deeper than that. It’s not just server resource optimization that called for multiple application types and instances to reside on the same server; cohabitation was also tied to performance, security and other considerations. For example, in the case of a performance sensitive application that utilizes federated databases, a DBA may elect to keep some of the associated databases on the same physical OS to minimize the context switching and network latency incurred by physically separating the databases onto different servers. Regardless of whether the servers and underlying network adaptors are physical or virtual, the location of the underlying databases can make a big difference in terms of response time, with a data-join operation completing in sub-seconds versus minutes. Thus, without a proper understanding of the various application types, related design considerations (such as transaction types, application access methods, data volume, affinity, partitioning, etc.) and best practices, choosing the wrong automation method can result in degraded service levels.

Attempting to offer application management in the cloud with just an application image provisioning model is akin to showing up to a gun fight with a rock. Proponents of this model can claim that rapid provisioning and policy-based reimaging of the relevant application components is the new way of application management in the cloud. While this approach may work for a handful of admin functions, it does not offer granular control or a pragmatic framework for most mandatory post-provisioning tasks (represented in the graph above) and hence will be discarded by savvy App Admins. Only solution providers that have a true application management DNA (with a deep understanding of task patterns associated with various application components) and that offer automated application management capabilities out-of-the-box can win legitimate mind-share in the near term and sustainable market share in the long run.

Thursday, November 12, 2009

Reference Architecture for Delivering IT as a Service

I’m back now after a 4-month “hiatus” that comprised multiple customer engagements – productive activity that keeps me (and my rants here) relevant. A key area I have been working on is using Stratavia’s Data Palette to help my customers deliver IT as a service within their organizations – or to be more precise, helping them deliver applications as a service. From an “end-user” perspective such as project managers, application team leads and QA managers (i.e., the recipients of these services), individual components of the infrastructure (plain servers or raw storage) don’t matter; it’s about having fully baked apps appropriately packaged and delivered - including the webserver, the application server and the database layers.

The obvious reason CIOs are looking to upgrade their IT delivery capabilities is to improve business efficiency and agility while, of course, reducing costs. A less frequently cited but equally vital reason is to keep up with the competition! For instance, every financial services firm out there has already built or is building an internal cloud. In fact, larger organizations across all industry verticals are taking the next step to attain scalability via newer delivery models such as self-service and cloud computing. But truly gaining tangible benefit from such scalable models is a challenge. And currently, application administration is the weakest link in the chain!

The whole premise behind cloud computing, of being able to rapidly mass-deploy applications in the cloud, frequently comes to a screeching halt due to the way IT currently operates. Most IT administration teams are just not geared up to provision and manage scores of complex heterogeneous application tiers in an agile manner – unless more and more manpower is added, and even that doesn’t scale after a certain point. Sure, there is help from the conventional systems management tools vendors like HP, BMC, IBM, EMC, VMware and Cisco. Automation products from these venerable vendors are able to help organizations reduce server, network and storage provisioning time from multiple weeks to a few hours. But then the bottleneck just shifts upstream into the application layers: specifically the middleware and the database tiers. End-to-end automation of these tiers is a prerequisite to large-scale application deployments on the cloud.

Conventional server provisioning and runbook automation products do not have application-level smarts nor native (application-specific) automation functionality to be of help here. Apart from some very basic application binary installs, they cannot be used to automate the complex activities across the application operations lifecycle, depicted alongside - unless one ends up writing and maintaining millions of lines of custom script-code (job security, anyone?).

Stratavia provides IT organizations with a way to break this logjam at the database and application tiers. Stratavia does this via its Data Palette automation platform, along with a portfolio of automation modules, called DCA Apps, that can be plugged into the underlying platform. The DCA Apps include solutions for the entire operations lifecycle of the database and middleware tiers represented in the graphic above. The solution allows companies to obtain the following benefits (while complementing prior investments such as HP/Opsware and BMC/BladeLogic), thus truly enabling IT to be delivered as a service.

- Streamlining and improving IT operations

  • Standardize IT processes across heterogeneous platforms and assets

  • Reduce delay between service request and delivery; Improve service level metrics such as “first-time-right” and “on-time-delivery”

  • Establish & control delivery quality across multi-tiered skill sets; Enable non-SMEs to carry out complex operations

  • Improve service delivery with “self service” capability in key areas such as application build provisioning, code releases and migrations

  • Remove compliance & support risks due to variety of version, patch & configuration requirements

- Increasing efficiencies

  • Reduce IT Admin time spent on mundane activities

  • Increase Asset to Admin ratio

The reference architecture that Stratavia enables to accomplish these objectives is shown in the schematic below.

The right side of the schematic shows the major tiers that make up the entire application stack. The middle portion shows the role of the Data Palette automation fabric in both orchestrating and performing the administrative activities across the entire operations lifecycle. (The Data Palette platform includes the orchestration capabilities, while the DCA Apps perform the administration activities.) These lifecycle activities include provisioning and patching, configuration and compliance management, recurring maintenance (e.g., log pruning, backups, healthchecks, index rebuilds, table reorgs, partition shuffles, etc.), incident response (false positive alert suppression and white noise reduction, problem diagnosis and root cause analysis, auto-resolving known errors, etc.) and frequent service requests (e.g., code releases, database refreshes and cloning, upgrades, adding/modifying user accounts, adding space, restoring an application snapshot, failover, etc.). Data Palette also provides out-of-the-box integration adaptors to auto-cut tickets, update a CMDB, and interface with various systems management toolsets in order to adhere to standard ITIL processes while carrying out these activities (not dissimilar to a DBA or App Admin who performs this work manually).

The fabric also helps in abstracting the backend component-level complexities from the end-users.

On the left side and the top, the schematic illustrates 4 classes of users:

  • Non-Technical End Users: This class of users refers to the application and business end-users. These users are typically not too IT operations-savvy (nor should they have to be!) and conventionally request resources via a Help Desk / ticketing system such as BMC Remedy, HP Service Manager or Service-Now.com. Once the ticket is created, it is assigned to the appropriate technicians and may traverse multiple IT operations groups before the request is fulfilled. Data Palette enables self-service capabilities in this scenario by presenting a Service Catalog front-end to these users. Frequently, Data Palette’s native adaptors are used to integrate with existing ticketing systems so that these end-users do not have to be exposed to the Data Palette console (and do not have to learn yet another tool or interface!). The Service Catalog is established within the system they are already familiar with, wherein they can put in their request along with relevant details such as service name, required duration, billing code, etc. Once the request is saved, it can be auto-routed to a manager for approval. The ticket creation or approval action triggers an automated workflow (within a Data Palette DCA App) that provisions the service and makes it available to the end-user, updating and closing out the ticket once the service is brought online. The service is usually multi-layered and can comprise multiple sub-workflows that provision a database instance, install an Apache web server and a WebLogic app server, create user accounts in the database and so on. The minutiae are abstracted from the end-user.

  • IT Operators – This category of users refers to Tier 1 personnel such as Help Desk operators, NOC personnel and even outsourced/offshore administrators in some cases. These users tend to be the preliminary points of contact for alerts from different monitoring tools or problem calls from end users. Data Palette empowers these IT operators to be able to carry out automated incident triage and even auto-remediation of recurring incidents thereby reducing the need to escalate to IT Operations Administrators (Tier 2 personnel). IT Operators are not SMEs, but have a greater degree of awareness of the IT environment and hence can have direct access to the Data Palette console (i.e., bypass the previously mentioned Service Catalog) along with the ability to execute specific workflows in certain environments - managed via Data Palette’s multi-tenancy and role-based access control.

  • IT Operations Administrators – These are the Tier 2 SMEs – the DBAs and the Application Admins that have privileges within Data Palette to deploy automation services, along with the ability to set relevant Policies and metadata to properly influence Data Palette’s automation behavior in the environments they manage.

  • IT Operations Engineers – These are Tier 3 SMEs - the IT Operations Engineers (also referred to as Applications or Database Systems Engineers or Architects) that have the ability to define automation services by configuring the Data Palette DCA Apps and the corresponding workflows, along with any site-specific pre and post Steps. They decide which automation service should be available to which user-type across the enterprise, what parameters should be entered (balancing ease-of-use against flexibility and control), which toolsets to integrate with, which metadata to leverage, and so on. Accordingly, their role within Data Palette is a super-set of the prior users allowing them to read, write (update) and execute automation workflows and corresponding service definitions.
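The four user classes above can be read as a ladder of increasingly broad permissions, scoped by environment. The Python sketch below is one possible reading and not Data Palette's actual access model: the permission names, the role-to-permission mapping and the `Grant` type are all assumptions made for illustration.

```python
from dataclasses import dataclass

READ, EXECUTE, WRITE = "read", "execute", "write"

# Hypothetical mapping of user class to console permissions.
ROLE_PERMISSIONS = {
    "end_user": frozenset(),                          # Service Catalog only, no console
    "operator": frozenset({EXECUTE}),                 # Tier 1: run specific workflows
    "administrator": frozenset({READ, EXECUTE}),      # Tier 2: deploy and run automation
    "engineer": frozenset({READ, WRITE, EXECUTE}),    # Tier 3: define the workflows
}


@dataclass(frozen=True)
class Grant:
    """A role assignment limited to the environments a user is responsible for."""
    role: str
    environments: frozenset


def is_allowed(grant, action, environment):
    """Permit an action only inside a granted environment and within the role."""
    return (
        environment in grant.environments
        and action in ROLE_PERMISSIONS[grant.role]
    )


noc_operator = Grant("operator", frozenset({"qa"}))
db_engineer = Grant("engineer", frozenset({"qa", "production"}))
```

With these grants, `is_allowed(noc_operator, EXECUTE, "qa")` is true, while the same operator can neither edit a workflow in QA nor run one in production. The engineer's permission set contains every other role's set, mirroring the "super-set of the prior users" described above.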

Finally, the bottom portion of the reference architecture shows Data Palette integrating with existing enterprise monitoring and popular configuration and compliance audit toolsets such as MS SCOM, HP OVO, Patrol, Tivoli, Tripwire, EMC Ionix Configuration Manager and Guardium via its integration adaptors for these product sets. These products (and others like them) are frequently already deployed by enterprises for performance monitoring and for scanning OS, database and application configurations, and can be set up to invoke a Data Palette remediation workflow (via Data Palette’s web service APIs) to address drifts and SLA violations. Data Palette’s configuration repair and incident resolution workflows can fix the violations in online mode, or schedule the repair during the appropriate maintenance window based on pre-defined policies. Administrators can specify, on an environment-by-environment basis, which violations should be repaired immediately, which should be scheduled, and which can be safely ignored.
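The environment-by-environment policy described above amounts to a lookup from (environment, violation type) to an action. A minimal Python sketch follows; the environment names, violation types and the `decide` helper are hypothetical, and the default of scheduling unknown cases is my own assumption rather than documented product behavior.

```python
from enum import Enum


class Action(Enum):
    REPAIR_NOW = "repair_now"   # fix in online mode
    SCHEDULE = "schedule"       # defer to the next maintenance window
    IGNORE = "ignore"           # safe to leave as-is


# Hypothetical policy table set up by the administrators.
POLICY = {
    ("production", "sla_violation"): Action.REPAIR_NOW,
    ("production", "config_drift"): Action.SCHEDULE,
    ("qa", "config_drift"): Action.REPAIR_NOW,
    ("dev", "config_drift"): Action.IGNORE,
}


def decide(environment, violation, default=Action.SCHEDULE):
    """Return the policy action, falling back to a conservative default."""
    return POLICY.get((environment, violation), default)
```

A monitoring tool that detects drift would pass the environment and violation type to something like `decide("production", "config_drift")` and then either invoke the remediation workflow straight away, queue it for the maintenance window, or drop the event.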

Enterprise features of the Data Palette platform such as multi-tenancy, RBAC (role-based access control), single sign-on, LDAP integration and Smart Groups (wherein multiple intra-cloud assets can be addressed and manipulated as a single entity), along with self-configuring, out-of-the-box automation content for the database and application tiers, make the above architecture and corresponding value eminently attainable. (As a point of reference, a Proof of Concept takes 3 days to 2 weeks depending on the scope; a broader Pilot including integrations with existing toolsets can be implemented in 2 to 4 weeks.)

Email me at vdevraj at stratavia dot com if you would like a detailed whitepaper on this solution architecture.

Monday, July 20, 2009

Why aren't databases getting migrated to VMware?

During a recent customer CTO focus group meeting, a key topic discussed was the perceived unwillingness (or should one say, inability) of many larger organizations to move their databases onto virtual servers.

The majority of databases, especially production systems, continue to run on physical servers. Given the emphasis in today's economy on data center consolidation and the need to enable newer IT delivery models such as self-service and private cloud initiatives, migrating databases from expensive underutilized physical servers to shared virtual environments should be a no-brainer. However, IT managers find themselves swimming upstream as soon as they broach the topic of database migrations with their application users or, for that matter, their DBAs.

Migrating databases to virtual environments is difficult for several reasons - especially given their complexity and mission-critical nature, the extended time required to perform those migrations gracefully and the general lack of tolerance from application users to corresponding maintenance/outage windows. A database has hooks deep into the underlying operating system as well as the overlaying application stack. Even relatively “minor” migrations wherein both the source and target platforms and versions are exactly the same can impact database performance and stability if all the associated factors are not duly considered. Let's explore some of these factors.

A database migration in this context is defined as moving a database from a physical to a virtual environment. Whenever possible, the source and target environments retain the same operating system (OS) attributes. But frequently, those may need to be changed as well. For instance, when migrating an Oracle or DB2 database running on an IBM pSeries server with AIX 5.x to a virtual environment on VMware, the underlying operating system type has to change to a Linux flavor on the x86 platform. Accordingly, the following are the two main options for a database migration:
• The same OS on both the physical (source) and virtual (target) environments
• A change in the OS type or version in the target virtual environment.

The latter use case is obviously more complex than the former. Regardless, with both use cases, there are several factors, decision points and associated actions that need to be taken into consideration. The sequence of actions in the schematic below (Figure 1) illustrates a few of the core issues associated with such a migration for Oracle, SQL Server and DB2.


As Rows 2 and 4 in Figure 1 in particular reveal, there are specific actions that need to be taken both on the source (physical) and target (virtual) environments. Furthermore, all of these actions need to be taken in line with corporate standards and best practices, including checking for change control approvals, carrying out the work within pre-defined maintenance windows, and rolling back specific actions upon failure or other environment conditions such as the maintenance window being exceeded.
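
The rollback requirement can be sketched as follows. This is a simplified illustration in Python rather than any product's workflow engine; the step names are hypothetical, and each step is assumed to come with its own undo action. The runner stops when a step fails or when the maintenance window has ended, then undoes the completed steps in reverse order.

```python
import time


def run_with_rollback(steps, window_end, clock=time.time):
    """Run (name, do, undo) steps in order.

    If a step raises, or the maintenance window has ended before the next
    step starts, undo the completed steps in reverse order.
    Returns (succeeded, log).
    """
    done = []
    log = []
    try:
        for name, do, undo in steps:
            if clock() >= window_end:
                raise TimeoutError("maintenance window exceeded")
            do()
            done.append((name, undo))
            log.append("done: " + name)
    except Exception as exc:
        log.append("failed: " + str(exc))
        for name, undo in reversed(done):
            undo()
            log.append("rolled back: " + name)
        return False, log
    return True, log


def failing_import():
    raise RuntimeError("import failed")


# Hypothetical migration steps; the lambdas stand in for real work.
steps = [
    ("quiesce source", lambda: None, lambda: None),
    ("export data", lambda: None, lambda: None),
    ("import data", failing_import, lambda: None),
]
ok, log = run_with_rollback(steps, window_end=time.time() + 3600)
print(ok)
for line in log:
    print(line)
```

The window is only checked between steps in this sketch, so a long-running step is never interrupted midway; a real workflow would need a policy for that case as well.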

All too often, IT managers and generic IT administrators (read, non-DBAs) mistakenly believe that the scope of database migrations is limited to provisioning a new target database server, copying the data contents from the source to the target and finally, re-pointing the applications to the new target environment. In this context, one needs to differentiate between a database server and a database instance. In the case of the former, one can use existing provisioning tools to set up the requisite platform and version build, patch-level, kernel parameters, file-systems and other operating system-related aspects (equating to Row 3 in Figure 1 above) and utilize standard run book automation (RBA) tools to interact with change control and ticketing mechanisms (equating to Rows 1 and 5 in Figure 1). However, these provisioning and RBA tools fall short when it comes to the setup and management of much of the database internals (as shown in Rows 2 and 4 in Figure 1) – tasks that make up the core of the migration process. For instance, tools such as BMC BladeLogic, HP Opsware SAS and VMware vCenter Server can rapidly provision target database server environments, but lack the database instance and application specific context and the corresponding automation content to discern and establish crucial post-server-provisioning aspects. Conventional RBA tools such as BMC BladeLogic Orchestration Manager, HP Operations Orchestrator and VMware vCenter Orchestrator have the capability to interact with change control and ticketing systems to perform the peripheral tasks in Rows 1 and 5 in Figure 1, but lack knowledge of database internals, hence requiring all the steps indicated in Rows 2 and 4 to be developed from scratch – a process that can take multiple years.

Similarly, virtual server migration tools such as VMware vCenter Converter (a.k.a. VMware P2V) and Novell PlateSpin PowerConvert that migrate a server in its entirety also don't do justice to the task at hand, because a database server may comprise multiple databases/instances supporting several different applications, and the DBA may need to selectively migrate a few databases/instances at a time (based on application user and change management approvals) rather than the entire database server. P2V-like tools don't take these intra-database structures and other application-level dependencies into account. Also, these tools only deal with x86 servers running Windows and Linux. Non-x86 hardware platforms and other operating systems such as AIX and Solaris are woefully ignored.

A database instance has its own unique set of requirements and configuration options that need to be defined and managed. Take for example, an Oracle database environment. A typical mid-sized to large company (the kind that will benefit most from automation) may have dozens, even hundreds of database instances across different versions (e.g., 9i, 10g, 11g) and configurations (standalone, RAC, Data Guard, etc.). New databases have to be set up for multiple environments (production, QA, stage, etc.) and applications (OLTP, reporting, warehouse, batch, etc.). Based on the OS platform, database version, configuration, usage factors and size, different data transfer methods have to be selectively applied during a migration (e.g., RMAN duplicate, Transportable Tablespaces, Data Pump, Export/Import, etc.). Once the data is extracted, it needs to be imported into the target environment using the appropriate method(s) and reconfigured for user and application access. Structures such as stored procedures, triggers and other objects need to be transferred as well using appropriate mechanisms. External stored procedures need to be recompiled and relinked on the target. Depending on the source and target operating systems, issues such as endian formats, byte sizes and character types need to be considered. Shared libraries have to be installed and clustering related parameters have to be defined. Backup methods need to be reconfigured and rescheduled. Agents need to be reinstalled. Maintenance and other scheduled jobs have to be set up. Database links need to be re-established in the case of federated or distributed databases. Thus, several configuration, security and performance related factors need to be evaluated and appropriately addressed.
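
As a rough illustration of how the choice of data transfer method depends on these factors, here is a small Python sketch. The rules are deliberately simplified and are not Oracle guidance; the `Migration` fields and the 500 GB "large database" threshold are assumptions made for the example. The one hard constraint encoded is that Data Pump first shipped with Oracle 10g, so a 9i source has to fall back to the original Export/Import utilities.

```python
from dataclasses import dataclass


@dataclass
class Migration:
    source_version: str   # e.g. "9i", "10g", "11g"
    same_platform: bool   # source and target share OS and hardware platform
    same_endian: bool     # source and target share byte order
    size_gb: int


def choose_transfer_method(m, large_gb=500):
    """Pick a transfer method using simplified, illustrative rules."""
    if m.same_platform:
        return "RMAN duplicate"
    if m.source_version == "9i":
        # Data Pump is not available before 10g.
        return "Export/Import"
    if m.size_gb >= large_gb:
        # Moving datafiles avoids a full logical unload/reload.
        if m.same_endian:
            return "Transportable Tablespaces"
        return "Transportable Tablespaces + endian conversion"
    return "Data Pump"


# AIX on pSeries (big-endian) to Linux on x86 (little-endian), 2 TB database.
print(choose_transfer_method(Migration("10g", False, False, 2000)))
# Same platform on both sides.
print(choose_transfer_method(Migration("11g", True, True, 200)))
# Old 9i source moving to a different platform.
print(choose_transfer_method(Migration("9i", False, False, 50)))
```

A real decision would also weigh the configuration (standalone, RAC, Data Guard), the available outage window and the usage profile, all of which this sketch leaves out.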

Many of these considerations have prevented DBAs from viewing automation as a reliable method to perform migrations. And performing them manually is not a realistic option since they can take up several thousands of DBA hours and carry a significant risk of performance degradation or downtime if human errors creep in. These complex issues prevent most databases from being migrated to virtual environments, in spite of the obvious cost advantages associated with virtualization. There is adequate fear prevalent amongst IT managers to keep their hands off databases and leave them running on mammoth underutilized physical servers.

That’s where database automation products such as Stratavia’s Data Palette can help. Specifically, Data Palette provides value over conventional server provisioning and RBA technologies as well as P2V-type tools in two key ways:
- due to its comprehensive database automation content, it can handle complex migration use cases out-of-the-box

- due to its ability to collect, persist and embed metadata within its automation workflows (to provide environmental awareness to those workflows), it can handle heterogeneity and ongoing changes in platform builds, versions, application attributes and usage profiles in a much more scalable manner. It allows a consistent set of migration workflows to be used across disparate environments from a central console, making the automation easy to deploy, maintain and run.

Migrating databases from physical servers to virtual machines becomes straightforward with Data Palette due to the ability of its workflow engine to span both physical and virtual environments (within a single workflow) - allowing the entire spectrum of activities related to database migration (as illustrated in Figure 1) to be automated end-to-end with the press of a button.

This automation content coupled with Data Palette’s role-based access control and Service Catalog type user interface allows senior DBAs as well as junior or offshore database operations personnel and even non-DBAs for that matter (e.g., trusted, but non-database-savvy IT personnel such as systems administrators and application project leads) to perform these migrations in self-service mode without requiring them to have knowledge of underlying database processes, and without local administrative access on the source and target servers - in a manner that meets all DBA and change control approvals.

Post-migration, a key Data Palette feature called Smart Groups™ allows the database instances to retain all of their prior monitoring, maintenance and other scheduled activity in a seamless manner without requiring any manual intervention. These advanced management capabilities and out-of-the-box automation content allow complex database physical-to-virtual migration projects to be completed in days, rather than weeks or months.

Friday, May 15, 2009

Implementing a Simple Internal Database or Application Cloud – Part II

Based on the overview of private clouds in my prior blog, here’s the 5-step recipe for launching your implementation:

  1. Identify the list of applications that you want to deploy on the cloud;

  2. Document and publish precise end-user requirements and service levels to set the right expectations – for example, time in hours to deliver a new database server;

  3. Identify underlying hardware and software components – both new and legacy – that will make up the cloud infrastructure;

  4. Select the underlying cloud management layer for enabling existing processes, and connecting to existing tools and reporting dashboards;

  5. Decide if you wish to tie in access to public clouds for cloud bursting, cloud covering or simply, backup/archival purposes.

Identify the list of applications that you want to deploy on the cloud
The primary reason for building a private cloud is control – not only in terms of owning and maintaining proprietary data within a corporate firewall (mitigating any ownership, compliance or security concerns), but also deciding what application and database assets to enable within the cloud infrastructure. Public clouds typically give you the option of an x86 server running either Windows or Linux. On the database front, it’s usually either Oracle or SQL Server (or the somewhat inconsequential MySQL). But what if your core applications are built to use DB2 UDB, Sybase or Informix? What if you rely on older versions of Oracle or SQL Server? What if your internal standards are built around AIX, Solaris or HP-UX system(s)? What if your applications need specific platform builds? Having a private cloud gives you full control over all of these areas.

The selection of applications and databases to reside on the cloud should be governed by a single criterion – popularity; i.e., which applications are likely to proliferate the most in the next 2 years, across development, test and production. List your top 5 or 10 applications and their infrastructure – regardless of operating system and database type(s) and version(s).

Document precise requirements and SLAs for your cloud
My prior blog entry talked about broad requirements (such as self-service capabilities, real-time management and asset reuse); now break those down into the detailed requirements that your private cloud needs to meet. Share those with your architecture / ops engineering peers, as well as target users (application teams, developer/QA leads, etc.) and gather input. For instance, a cloud deployment that I’m currently involved with is working to meet the following manifesto, arrived at after a series of meticulous workshops attended by both IT stakeholders and cloud users:


In the above example, BMC Remedy was chosen as the front-end for driving self-service requests largely because the users were already familiar with using that application for incident and change management. In its place, you can utilize any other ticketing system (e.g., HP/Peregrine or EMC Infra) to present a friendly service catalog to end users. Up-and-coming vendor Service-now extends the notion of a service catalog to include a full-blown shopping cart and corresponding billing. Also, depending on which cloud management software you utilize, you may have additional (built-in) options for presenting a custom front-end to your users – whether they are IT-savvy developers, junior admin personnel located off-shore or actual application end-users.


Identify underlying cloud components
Once you have your list of applications and corresponding requirements laid out, you can begin defining which hardware and software components you are going to use. For instance, can you standardize on the x86 platform for servers with VMware as the virtualization layer? Or do you have a significant investment in IBM AIX, HP or Sun hardware? Each platform tends to have its own virtualization layer (e.g., AIX LPARs, Solaris Containers, etc.), all of which can be utilized within the cloud. Similarly, for the storage layer, can you get away with just one vendor offering – say, NetApp filers, or do you need to accommodate multiple storage options such as EMC and Hitachi? Again, the powerful thing about private clouds is – you get to choose! During a recent cloud deployment, the customer required us to utilize EMC SANs for production application deployments, and NetApp for development and QA.

Also, based on application use profiles and corresponding availability and performance SLAs, you may need to include clustering or facilities for standby databases (e.g., Oracle Data Guard or SQL log shipping) and/or replication (e.g., GoldenGate, Sybase Replication Server).

Now as you read this, you are probably saying to yourself – “Hey wait a minute… I thought this was supposed to be a recipe for a ‘simple cloud’. By the time I have identified all the requirements, applications and underlying components (especially with my huge legacy footprint), the cloud will become anything but simple! It may take years and gobs of money to implement anything remotely close to this…” Did I read your mind accurately? Alright, let’s address this question below.


Select the right cloud management layer
Based on all the above items, the scope of the cloud and underlying implementation logistics can become rather daunting – making the notion of a “simple cloud” seem unachievable. However, here’s where a cloud management layer comes to the rescue. A good cloud management layer keeps things simple via three basic functions:

  1. Abstraction;
  2. Integration; and
  3. Out-of-the-box automation content
Cloud management software ties in existing services, tools and processes to orchestrate and automate specific cloud management functions end-to-end. Stratavia’s Data Palette is an example of intelligent cloud management software. Data Palette is able to accommodate diverse tasks and processes – such as asset provisioning, patching, data refreshes and migrations, resource metering, maintenance and decommissioning – due to significant out-of-the-box content. (Stratavia references them as Solution Packs, which are basically discrete products that can be plugged into the underlying Data Palette platform.) All such content is externally referenceable via its Web Service API for easy integration with existing tools – making it easy to integrate with 3rd party service catalogs such as Service-Now or Remedy.

Data Palette does not impose restrictions on server, operating system and infrastructure components within the cloud. Its database Solution Packs support various flavors and versions of Oracle, SQL Server, DB2, Sybase and Informix running on UNIX (Solaris, HP-UX and AIX), Linux and Windows. Storage components such as NetApp are managed via the Zephyr API (ZAPI) and OnTapi interfaces.

However, in addition to out-of-the-box integration and automation capabilities, the primary reason complexity is kept at bay is the use of an abstraction layer. Data Palette uses a metadata repository that is populated via native auto-discovery (or via integration with pre-deployed CMDBs and monitoring tools) to gather a set of configuration and performance metadata that identifies the current state of the infrastructure, along with centrally defined administrative policies. This central metadata repository makes it possible for the right automated procedures to be executed on the right kind of infrastructure – avoiding mistakes (typically associated with static workflows from classic run book automation products) such as executing an Oracle9i specific data refresh method on a recently upgraded Oracle10g database – without the cloud administrator or user having to track and reconcile such infrastructural changes and manually adjust/tweak the automation workflows. Such metadata-driven automation keeps the automation workflows dynamic, allowing the automation to scale seamlessly to hundreds or thousands of heterogeneous servers, databases and applications.
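
The idea of metadata-driven automation can be sketched in a few lines of Python. This is not how Data Palette is implemented; the in-memory `METADATA` dictionary stands in for a metadata repository, and the asset names and procedure descriptions are hypothetical. The point is that the workflow looks up the asset's current state at run time instead of hard-coding it.

```python
# Hypothetical metadata repository: asset name -> discovered attributes.
METADATA = {
    "crm_db": {"db_type": "oracle", "version": "10g"},
    "hr_db": {"db_type": "oracle", "version": "9i"},
}

# Hypothetical refresh procedures keyed by (database type, version).
REFRESH_PROCEDURES = {
    ("oracle", "9i"): "refresh via Export/Import",
    ("oracle", "10g"): "refresh via Data Pump",
}


def plan_refresh(asset, metadata=METADATA):
    """Look up the asset's current state and return the matching procedure."""
    attrs = metadata[asset]
    key = (attrs["db_type"], attrs["version"])
    if key not in REFRESH_PROCEDURES:
        raise LookupError("no refresh procedure registered for %s %s" % key)
    return REFRESH_PROCEDURES[key]


print(plan_refresh("hr_db"))  # refresh via Export/Import

# After an upgrade, rediscovery updates the metadata; the workflow is untouched.
METADATA["hr_db"]["version"] = "10g"
print(plan_refresh("hr_db"))  # refresh via Data Pump
```

A static workflow would have kept calling the 9i procedure after the upgrade; here the same call picks the 10g procedure because only the metadata changed.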

Metadata collections can also be extended to specific (custom) application configurations and behavior. Data Palette allows Rule Sets to be applied to incoming collections to identify and respond to maintenance events and service level violations in real time, making the cloud autonomic (in terms of self-configuring and self-healing attributes), with detailed resource metering and administrative dashboards.
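
Conceptually, applying a rule set to an incoming collection looks something like the Python sketch below. The rule format, metric names, thresholds and event names are all invented for illustration and do not reflect Data Palette's actual Rule Set syntax.

```python
# Each hypothetical rule: (metric name, threshold, event raised when exceeded).
RULE_SET = [
    ("tablespace_used_pct", 90, "extend_tablespace"),
    ("response_time_ms", 2000, "sla_violation"),
]


def evaluate(collection, rules=RULE_SET):
    """Return the events triggered by one incoming metadata collection."""
    events = []
    for metric, threshold, event in rules:
        value = collection.get(metric)
        if value is not None and value > threshold:
            events.append(event)
    return events


print(evaluate({"tablespace_used_pct": 94, "response_time_ms": 350}))
# ['extend_tablespace']
```

Each returned event would then be handed to the matching remediation workflow, which is what gives the cloud its self-healing behavior.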

Data Palette’s abstraction capabilities also extend to the user interface, wherein specific groups of cloud administrators and users are maintained via multi-tenancy (called Organizations), Smart Groups™ (dynamic groupings of assets and resources), and role-based access control.

Optionally tie in access to public clouds for Cloud Bursting, Cloud Covering or backup/archival purposes
Now that you are familiar with the majority of the ingredients for a private cloud rollout, the last item worth considering is whether to extend the cloud management layer’s set of integrations to tie into a public cloud provider – such as GoGrid, Flexiscale or Amazon EC2. Based on specific application profiles, you may be able to use a public cloud for Cloud Bursting (i.e., selectively leveraging a public cloud during peak usage) and/or Cloud Covering (i.e., automating failover to a public cloud). If you are not comfortable with the notion of a full service public cloud, you can consider a public sub-cloud (for specific application silos e.g., “storage only”) such as Nirvanix or EMC Atmos for storing backups in semi-online mode (disk space offered by many of the vendors in this space is relatively cheap – typically 20 cents per GB per month). Most public cloud providers offer an extensive API set that the internal cloud management layer can easily tap into (e.g., check out wiki.gogrid.com). In fact, depending on your internal cloud ingredients, you can take the notion of Cloud Covering to the next level and swap applications running on the internal cloud to the external cloud and back (kind of an inter-cloud VMotion operation, for those of you who are familiar with VMware’s handy VMotion feature). All it takes is an active account (with a credit card) with one of these providers to ensure that your internal cloud has a pre-set path for dynamic growth when required – a nice insurance policy to have for any production application – assuming the application’s infrastructure and security requirements are compatible with that public cloud.

Wednesday, January 28, 2009

Implementing a Simple Internal Database or Application Cloud - Part I

A “simple cloud”? That comes across as an oxymoron of sorts since there’s nothing seemingly simple about cloud computing architectures. And further, what do DBAs and app admins have to do with the cloud, you ask? Well, cloud computing offers some exciting new opportunities for both Operations DBAs and Application DBAs – models that are relatively easy to implement, and bring immense value to IT end-users and customers.

The typical large data center environment has already embraced a variety of virtualization technologies at the server and storage levels. Add-on technologies offering automation and abstraction via service oriented architecture (SOA) are now allowing them to extend these capabilities up the stack – towards private database and application sub-clouds. These developments seem more pronounced in the banking, financial services and managed services sectors. However, while working on Data Palette automation projects at Stratavia, every once in a while I do come across IT leaders, architects and operations engineering DBAs in other industries as well who are beginning to envision how specific facets of private cloud architectures can enable them to service their users and customers more effectively (while also compensating for the workload of colleagues who have exited their companies due to the ongoing economic turmoil). I wanted to specifically share here some of the progress in database and application administration with regard to cloud computing.

So, for those database and application admins that haven’t had a lot of exposure to cloud computing (which BTW, is a common situation since most IT admins and operations DBAs are dealing with boatloads of “real-world hands-on work” rather than participating in the next evolution of database deployments), let’s take a moment to understand what it is and its relative benefits. An “application” in this context refers to any enterprise-level app - both 3rd party (say, SAP or Oracle eBusiness Suite) as well as home-grown N-Tier apps that have a fairly large footprint. Those are the kind of applications that get maximum benefit from the cloud. Hence I use the term “data center asset” or simply “asset” to refer to any type of database or application. However, at times I do resort to specific database terminology and examples, which can be extrapolated to other application and middleware types as well.

Essentially a cloud architecture refers to a collection of data center assets (say, database instances, or just schemas to allow more granularity) that are dynamically provisioned and managed throughout their lifecycle – based on pre-defined service levels. This lifecycle covers multiple areas starting with deployment planning (e.g., capacity, configuration standards, etc.), provisioning (installation, configuration, patching and upgrades) and maintenance (space management, logfile management, etc.) extending all the way to incident and problem management (fire-fighting, responding to brown-outs and black-outs), and service request management (e.g., data refreshes, app cloning, SQL/DDL release management, and so on). All of these facets are managed centrally such that the entire asset pool can be viewed and controlled as one large asset (effectively virtualizing that asset type into a “cloud”).

Here’s a picture representing a fully baked database cloud implementation:

As I had mentioned in a prior blog entry, there are multiple components that have come together to enable a cloud architecture. But more on that later. Let’s look at database/application specific attributes of a cloud (you could read it as a list of requirements for a database cloud).

  • Self-service capabilities: Database instances or schemas need to be capable of rapidly being provisioned based on user specifications by administrators, or by the users themselves (in selective situations – areas where the administrators feel comfortable giving control to the users directly). This provisioning can be done on existing or new servers (the term “OS images” is more appropriate given that most of the “servers” would be virtual machines rather than real bare metal) with appropriate configuration, security and compliance levels. Schema changes or SQL/DDL releases can be rolled out in a scheduled manner, or on-demand. The bulk of these releases, along with other service requests (such as refreshes, cloning, etc.) should be capable of being carried out by project teams directly – with the right credentials (think, role-based access control).
  • Real-time infrastructure: I'm borrowing a term from Gartner (specifically, distinguished analyst Donna Scott's vocabulary) to describe this requirement. Basically, the assets need to be maintained in real-time per specific deployment policies (such as development environment versus QA or Stage), tablespaces and datafiles created per specific naming / size conventions and filesystem/mount-point affinity (accommodating specific SAN or NAS devices, different LUN properties and RAID levels for reporting/batch databases versus OLTP environments), data backed up at the requisite frequency per the right backup plan (full, incremental, etc.), resource usage metered, failover/DR occurring as needed, and finally, archived and de-provisioned based on either a specific time-frame (specified at the time of provisioning) or on-demand -- after the user or administrator indicates that the environment is no longer required (or after a specific period of inactivity). All of this needs to be subject to administrative/manual oversight and controls (think, dashboards and reports, as well as ability to interact with or override automated workflow behavior).
  • Asset type abstraction and reuse: One should be able to mix-and-match these asset types. For instance, one can roll out an Oracle-only database farm or a SQL Server-only estate. Alternatively, one can also span multiple database and application platforms, allowing the enterprise to better leverage its existing (heterogeneous) assets. Thus, the average resource consumer (i.e., the cloud customer) shouldn’t have to be concerned about what asset types or sub-types are included therein – unless they want to override default decision mechanisms. The intra-cloud standard operating procedures take those physical nuances into account, thereby effectively virtualizing the asset type.
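
The archival and de-provisioning rule described under "Real-time infrastructure" above can be sketched as a small Python function. The parameter names and the 30-day inactivity default are assumptions for illustration; an actual cloud would take these from its deployment policies.

```python
from datetime import datetime, timedelta


def should_deprovision(now, expires_at, last_used, released=False,
                       max_idle=timedelta(days=30)):
    """Return the reason an asset is due for archival, or None if it stays."""
    if released:
        # The user or administrator said the environment is no longer needed.
        return "released on demand"
    if expires_at is not None and now >= expires_at:
        # The time-frame given at provisioning time has run out.
        return "lease expired"
    if now - last_used >= max_idle:
        return "inactive"
    return None


now = datetime(2009, 1, 28)
print(should_deprovision(now, datetime(2009, 3, 1), datetime(2009, 1, 20)))
print(should_deprovision(now, datetime(2009, 1, 15), datetime(2009, 1, 20)))
print(should_deprovision(now, None, datetime(2008, 12, 1)))
```

In keeping with the manual-oversight requirement, the function only reports a reason; whether to act on it automatically or queue it for an administrator's approval is left to the caller.
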
The benefits of a database cloud include empowering users to carry out diverse activities in self-service mode in a secure, role-based manner, which in turn enhances service levels. Activities such as having a database provisioned or a test environment refreshed can often take multiple hours or days. Those can be reduced to a fraction of their normal time – reducing latency especially in situations where there need to be hand-offs and task turn-over across multiple IT teams. In addition, the resource-metering and self-managing capabilities of the cloud allow better resource utilization and avoid resource waste, improving performance levels, reducing outages and removing other sources of unpredictability from the equation.

A cloud, while viewed as bleeding edge by some organizations, is being viewed by larger organizations as critical – especially in the current economic situation. Rather than treating each individual database or application instance as a distinct asset and managing it per its individual requirements, a cloud model allows virtual asset consolidation, thereby allowing many assets to be treated as one and promoting unprecedented economies of scale in resource administration. So as companies continue to scale out data and assets, but cannot afford to correspondingly scale up administrative personnel, the cloud helps them achieve non-linear growth.

Hopefully the attributes and benefits of a database or application cloud (and the tremendous underlying business case) become apparent here. My next blog entry (or two) will focus on the requisite components and the underlying implementation methods to make this model a reality.

Friday, January 09, 2009

Protecting Your IT Operations from Failing IT Services Firms

The recent news about India-based IT outsourcing major Satyam and its top management’s admissions of accounting fraud bring forth shocking and desperate memories of an earlier time – when multiple US conglomerates such as Enron, Arthur Andersen, Tyco, etc. fell under similar circumstances, bringing down with them the careers and aspirations of thousands of employees, customers and investors. Ironically Satyam (the name means “truth” in the mother language, Sanskrit), whose management have been duping investors for several years now (by their own admission) had received the Recognition of Commitment award from the US-based Institute of Internal Auditors in 2006, and was featured in Thomas Friedman’s best-seller “The World is Flat”. Indeed, how the mighty have fallen…

As one empathizes with those affected, the key question that comes to mind is, how do we prevent another Satyam? However that line of questioning seems rather idealistic. The key question should probably be, how can IT outsourcing customers protect themselves from these kinds of fallouts? Given how flat the world is, an outsourcing vendor’s (especially one as ubiquitous as Satyam in this market) fall from grace has reverberations throughout the global IT economy - directly in the form of failed projects, and indirectly in the form of lost credibility for customer CIOs who rely on these outsourcing partners for their critical day-to-day functioning.

Having said that, here are some key precautionary measures (in an evolving order) companies can take to protect themselves and their IT operations beyond standard sane efforts such as using multiple IT partners, use of structured processes and centralized documentation/knowledge-bases.
· Move from time & material (T&M) arrangements to fixed-priced contracts
· Move from static knowledge-bases to automated standard operating procedures (SOPs)
· Own the IP associated with process automation

Let’s look at how each of these affords higher protection in situations such as the above:
· Moving from T&M arrangements to fixed price contracts - T&M contracts rarely provide incentive to the IT outsourcing vendor to bring in efficiencies and innovation. The more hours that are billed, the more revenue they make – so, where’s the motivation to reduce the manual labor? On the other hand, T&M labor makes customers vulnerable to loss of institutional knowledge and gives them little to no leverage when negotiating rates or contract renewals because switching out a vendor (especially one that holds much of the “tribal knowledge”) is easier said than done.

With fixed price contracts, the onus of ensuring quality and timely delivery is on the IT services vendor (doing so profitably requires use of as little labor as possible) and consequently, one finds more structure (such as better documentation and process definition) and higher use of innovation and automation. All of this works in favor of the customer and, should a contractor or the vendor no longer be available, makes it easier for a replacement to hit the ground running.

· Moving from static knowledge-bases to automated SOPs – It is no longer enough to have standard operating procedures documented within SharePoint-type portals. It is crucial to automate these static run books and documented SOPs via data center automation technologies, especially newer run book automation product sets (a.k.a. IT process automation platforms) that allow knowledge of the environment to be defined and embedded within the process workflows. These technologies allow contractors to move static process documentation into workflows that use this environmental knowledge to actually perform the work. Thus, the current process knowledge no longer merely resides in people’s heads, but gets moved to a central software platform, thereby mitigating the loss of key contractor personnel or vendors.

· Owning the IP associated with such process automation platforms – Frequently, companies that are using outsourced services ask, “Why should I invest in automation software? I have already outsourced our IT work to company XYZ. They should be buying and using such software. Ultimately, we have no control over how they perform the work anyway…” The Satyam situation is a classic example of why it behooves end-customers to actually purchase and own the IP related to process automation software, rather than deferring it to the IT services partner. When the process IP is defined within a software platform that the customer owns, switching contractors and/or IT services firms becomes feasible. If the IT services firm owns the technology deployment, the corresponding IP walks out the door with the vendor, preventing the customer from getting the benefit of the embedded process knowledge.

It is advisable for the customer to have some level of control and oversight over how the work is carried out by the vendor. It is fairly commonplace for the customer to insist on the use of specific tools and processes such as ticketing systems, change control mechanisms, monitoring tools and so on. The process automation engine shouldn’t be treated any differently. The bottom line is, whoever has the process IP carries the biggest stick during contract renewals. If owning the technology is not feasible for the customer, at least make sure that the embedded knowledge is kept in a format from which it can be retrieved and reused by the next IT services partner that replaces the current one.
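
To make the idea of "retrievable and reusable" process knowledge a little more concrete, here is a minimal sketch in Python of an SOP captured as data rather than as a document. It is not modeled on any particular run book automation product; the names `Sop`, `Step`, `register_action` and the two sample actions are all hypothetical, invented purely for illustration. The point it tries to show is that the same definition can both drive the work and be exported to a neutral format (JSON here) that a replacement vendor could load.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Callable, Dict, List

# Hypothetical registry mapping an action name to the code that performs it.
ACTIONS: Dict[str, Callable[[dict], str]] = {}


def register_action(name: str):
    """Decorator that adds a function to the action registry under `name`."""
    def wrap(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        ACTIONS[name] = fn
        return fn
    return wrap


# Illustrative stand-ins: a real platform would call out to servers here.
@register_action("check_disk")
def check_disk(params: dict) -> str:
    return f"checked disk usage on {params['host']} (threshold {params['threshold_pct']}%)"


@register_action("rotate_logs")
def rotate_logs(params: dict) -> str:
    return f"rotated logs under {params['path']} on {params['host']}"


@dataclass
class Step:
    name: str
    action: str                      # key into ACTIONS
    params: dict = field(default_factory=dict)
    notes: str = ""                  # the "tribal knowledge" for this step


@dataclass
class Sop:
    title: str
    owner: str                       # who holds the process IP
    steps: List[Step] = field(default_factory=list)

    def run(self) -> List[str]:
        """Execute each step in order and collect its result message."""
        return [ACTIONS[s.action](s.params) for s in self.steps]

    def to_json(self) -> str:
        """Export the whole procedure in a vendor-neutral format."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "Sop":
        """Rebuild a procedure from a previous export."""
        data = json.loads(text)
        steps = [Step(**s) for s in data["steps"]]
        return cls(title=data["title"], owner=data["owner"], steps=steps)


if __name__ == "__main__":
    sop = Sop(
        title="Weekly disk cleanup",
        owner="customer",
        steps=[
            Step("Check usage", "check_disk",
                 {"host": "db01", "threshold_pct": 80},
                 notes="db01 fills up fastest after month-end batch jobs"),
            Step("Rotate logs", "rotate_logs",
                 {"host": "db01", "path": "/var/log/app"}),
        ],
    )
    exported = sop.to_json()         # what the customer keeps
    for line in Sop.from_json(exported).run():
        print(line)
```

In this sketch the exported JSON, not the vendor’s staff, carries the step order, parameters and notes, which is the property the customer would want to insist on regardless of which platform is actually used.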