Monday, July 05, 2010

Preparing the IT Organization for the Internal Cloud

A recent article on Cloud Computing quoting IBM’s Ric Telford caught my attention. He boldly predicts that the term Cloud Computing will become obsolete over the next 5 years while the underlying methodology becomes standard practice for IT. While this level of IT maturation does seem a few years out, I find internal or private clouds to be increasingly prevalent in certain pockets of industries such as financial services, energy and managed services. On the other hand, IT departments in many other industries such as manufacturing, retail and telecom continue to struggle to enable such capabilities. The struggles have not necessarily emanated from a lack of desire, budget or technology leadership, but from cultural challenges such as the following:

(a) Perceived shift in power dynamics. Internal cloud and self-service delivery models make it unclear where traditional IT Operations stops and where end-users (service consumers) start driving the delivery process. Many IT Operations personnel fear that end-users will need to be granted administrator privileges to enable self-service and that it will be difficult to control end-user activity, that end-users will not adhere to well-defined IT standards, and that the requisite training and oversight will dramatically increase Operations’ workload.

(b) The lines between IT silos become blurry. When an end-user group asks for an IT service, that service is traditionally broken down into specific requests in the form of tickets and assigned to specific silos such as Sys Admins, Storage Admins, DBAs, etc. The requests are then delivered individually, often with lots of back-and-forth coordination, with the final service being received and validated by the end-user team. While a cloud-based delivery model aims to improve such outdated modes of functioning, it is also susceptible to severe internal bottlenecks and to confusion about what the right unit of delivery is and who needs to own it: the application admins at the top of the stack, or the sys admins at the bottom. Meanwhile, the average service recipient does not care about the individual silos that make up a request; they speak in the "language" of the application, whereas traditional IT Ops speaks in the language of servers, storage, databases and other infrastructure elements. The lack of a common denominator for requesting and delivering self-service capabilities thus stalls the deployment of internal cloud capabilities in many organizations.

In other words, the true barrier to enabling a private cloud is not technology, but cultural apprehensions that old-world IT bridges by throwing more bodies and silos at the problem: repetitive meetings and water-cooler conversations among IT personnel and application teams become the framework for delivery. It’s a lot of wasted effort, but the old-world process does deliver the services (albeit in days, weeks or months!)
 
CIOs that ultimately succeed in having IT groups embrace internal cloud and self-service models solve these bottlenecks in a way that’s amenable to the overall organization’s culture. Some CIOs take baby steps by retaining the existing IT structure and policies and offering specific infrastructure components such as servers and databases to project teams in self-service mode. In other words, they attempt to transform IT by offering self-service capabilities on a silo-by-silo basis. However, that is akin to a restaurant delivering specific pre-cooked ingredients to customers and expecting them to assemble those ingredients into the requisite meal. Obviously such a “self-service” IT delivery model (if it can still be labeled that) requires the end-user to be fairly sophisticated in asking for the right resources and able to assemble them all together themselves. It also assumes that the end-user is disciplined enough to ask for these ingredients in the right quantities (i.e., that they are neither wasteful nor overly cautious). These assumptions rarely hold true, leaving the success of such internal cloud strategies stunted at best.
 
A better approach might be to take the time upfront to define a common denominator (across IT silos) and corresponding units for self-service via a macro-level reference architecture. Take any candidate end-user request or task that can be delivered in self-service mode: e.g., application provisioning, code releases, creating data copies or backups, restoring snapshots, creating users, resetting passwords, resolving incidents, etc. All of these require a universal definition of what’s to be delivered to the end-user. All underlying nuances such as IT silos, their lines of control and associated complexities need to be abstracted away from the end-user. The more comprehensively IT is hidden behind the curtain, the better the private cloud implementation.
 
Based on my work with multiple IT organizations across different industries, I find that the most pragmatic unit for self-service comes from the top-most layer of the IT stack: the application. However, utilizing application-specific units is easier said than done, since in most large organizations (the kind that can really benefit from a private cloud) IT Operations doesn’t speak the aforementioned “application language”. Even where it does, there are several applications in use, and most applications have multiple parts and pieces (i.e., they are all “composite applications”). To make application-specific units more palatable to all IT silos, the underlying descriptors of the units have to reference physical infrastructure attributes and map them to their corresponding logical (app-specific) definitions within the reference architecture.
 
Most applications developed over the last decade are 3-tiered. They have a web server tier, an application server tier and a database tier, along with associated sub-components such as a load-balancer tied to a farm of web servers. Hence this popular 3-tier “application template” acts as a robust common denominator for IT. By building self-service capabilities around such a template, most applications within the enterprise are addressed. Of course, certain applications employ just one or two tiers: e.g., server-based batch applications, client/server reporting applications and so on. As long as the tiers used by these disparate applications are a subset of the chosen application template, they are adequately covered. Along the same lines, some applications employ unique middleware components such as a transaction processing monitor, a message bus, queues, SMTP services, etc. Such applications would obviously fall outside the 3-tier template and would need to be dealt with separately. However, with a bit of deeper investigation, I usually find that such applications are vastly outnumbered by the standard application stacks that employ some or all of the 3 tiers named above. Hence such exceptions do not, in most cases, make an appreciable difference to the notion of using a single master application template in the reference architecture.
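To make the subset idea concrete, here is a minimal sketch (in Python) that treats the master template as a set of tier names and checks whether a candidate application is covered by it. The application names and tier labels are purely illustrative and not drawn from any particular product or customer.

MASTER_TEMPLATE = {"web", "app", "database"}

candidate_apps = {
    "sales_force_automation":  {"web", "app", "database"},   # full 3-tier
    "nightly_batch_loader":    {"database"},                  # single tier
    "reporting_client_server": {"app", "database"},           # 2-tier
    "trading_gateway":         {"app", "message_bus"},        # relies on middleware
}

for name, tiers in candidate_apps.items():
    covered = tiers <= MASTER_TEMPLATE   # subset test against the master template
    status = "covered by the master template" if covered else "handle as an exception"
    print(f"{name}: {status}")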
 
By applying the 80-20 rule, one can keep the application template relatively straightforward in the initial iterations of the private cloud implementation and focus on those applications and tasks to be delivered in self-service mode that will have the biggest impact. I have seen this approach work well even in larger organizations with multiple customers or lines of business (LOBs), because the underlying definition (of the makeup) of the application remains consistent across these disparate LOBs.
 
Once the master application template is defined, each unique application type can be assigned a profile that describes the tiers it employs. For instance, AppProfile1 (or simply, Profile1) can refer to applications that always have a web server, an application server and a database instance, whereas Profile2 can refer to applications that utilize a 2-tier model (just an app server and a database), and so on. Using the concept of polymorphism, all of these profiles should refer back to the master application template so that any change to the base template is immediately reflected across all profiles. During future iterations, additional templates and profiles can be defined for applications that have additional tiers such as a message bus. These definitions can be laid out on paper (and later implemented using an automation tool) or, ideally, set up within an automation platform such as Stratavia’s Data Palette from the very beginning.
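As a rough illustration of the polymorphism point, the sketch below (again hypothetical Python, not Data Palette syntax) derives each profile from a single master template class, so a change to the base behavior is inherited by every profile.

class MasterApplicationTemplate:
    # Tiers an application may draw from; changing this base class (e.g., adding
    # a mandatory monitoring step per tier) propagates to every derived profile.
    tiers = []

    def provision(self, environment):
        for tier in self.tiers:
            print(f"[{environment}] provisioning {tier} tier per IT standards")


class Profile1(MasterApplicationTemplate):
    # Applications that always have a web server, app server and database instance.
    tiers = ["web", "app", "database"]


class Profile2(MasterApplicationTemplate):
    # 2-tier applications: just an app server and a database.
    tiers = ["app", "database"]


# The caller works against the base type; which tiers actually get built is
# resolved from the concrete profile at run time.
for profile in (Profile1(), Profile2()):
    profile.provision("QA")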
 
Regardless of the application profiles (the fewer the better in the initial phase), the deployment needs to adhere to one key point: keep the individual components for each tier hidden from the end-user while giving IT teams full control over those components. By “components”, I’m referring to the following:
 
(i) infrastructure layers: i.e., servers (virtual or physical), storage (SAN, NAS, etc., depending on IT standards and budget sizes), networking capacity, etc.; and

(ii) software layers: the web server, application server and database instance. Compared to physical infrastructure like servers and storage, each of these software layers often encompasses a “best of breed” configuration spanning a broader variety of vendors, proprietary platforms and open source frameworks (such as Apache Tomcat, Microsoft IIS, WebLogic, WebSphere, Oracle DBMS, SQL Server, SAP Sybase, VMware SpringSource, etc.), and hence tends to be much more complex and time-consuming to provision, configure and manage on an ongoing basis. This is another reason the overall success of the deployment calls for limiting the number of application templates and profiles during the initial phase.

Each application profile should have a description of what quantities of which specific components are needed by each tier, both for initial provisioning and for subsequent (incremental) provisioning. For instance, a ‘web server’ tier in the Sales Force Automation (SFA) suite may comprise a virtual machine with X amount of CPU, Y amount of disk space and a standard network configuration, running Apache Tomcat. When an authorized end-user (e.g., a QA Manager) asks for the SFA application to be delivered to a new QA environment, the above web server unit gets delivered along with the corresponding application server and database instance, all of them pre-configured in the right units and ready for the end-user to access! However, when an authorized end-user asks for more web servers to boost performance in the production SFA environment, what gets delivered is N units of only the web server tier, with specific pre-defined configuration to tie them to the corresponding (previously existing) application server and/or database. All of these mappings may be held in a CMDB (ideally!) or in some kind of operations management database and referenced at run-time by the automation workflows that are responsible for service delivery.
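The sketch below shows, in hypothetical Python, how an automation workflow might consume such a profile at run time; the resource figures, host names and CMDB contents are made-up placeholders rather than Data Palette’s actual data model.

SFA_PROFILE = {
    "web":      {"count": 1, "vcpu": 2, "disk_gb": 40,  "software": "Apache Tomcat"},
    "app":      {"count": 1, "vcpu": 4, "disk_gb": 80,  "software": "application server"},
    "database": {"count": 1, "vcpu": 8, "disk_gb": 500, "software": "database instance"},
}

# Stand-in for the CMDB / operations management database that maps each
# environment to its previously provisioned components.
CMDB = {("SFA", "production"): {"app": "sfa-prd-app01", "database": "sfa-prd-db01"}}


def deliver(tier, spec, environment, attach_to=None):
    wiring = f", wired to {attach_to}" if attach_to else ""
    print(f"[{environment}] {spec['count']} x {tier} tier "
          f"({spec['vcpu']} vCPU, {spec['disk_gb']} GB, {spec['software']}){wiring}")


def provision_new_environment(app, environment):
    # Initial provisioning: every tier in the profile, pre-configured end to end.
    for tier, spec in SFA_PROFILE.items():
        deliver(tier, spec, environment)


def scale_tier(app, environment, tier, units):
    # Incremental provisioning: N more units of one tier, tied at run time to
    # the existing components recorded in the CMDB.
    existing = CMDB[(app, environment)]
    spec = dict(SFA_PROFILE[tier], count=units)
    deliver(tier, spec, environment, attach_to=existing)


provision_new_environment("SFA", "QA")            # e.g., the QA Manager's request
scale_tier("SFA", "production", "web", units=3)   # e.g., boosting production capacity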
 
The application templates and profiles with quantitative resource descriptions allow end users to receive services in an application-friendly language, while allowing IT Ops to provision and control individual pieces of server, network, disk space, database, etc., using their existing IT standards and vocabulary. All underlying components blend together into a single pre-defined unit, enabling the well-oiled execution of a self-service delivery model within the private cloud implementation.
 
Using application provisioning as an example, the graphic below depicts how various application profiles comprising multiple software and infrastructure layers can be defined.
 

 
Regarding the other previously stated bottleneck (i.e., the perceived shift in power dynamics), the fact that IT Ops personnel are behind the curtain doesn’t mean they are any less influential. Using automation products such as Data Palette, they are able to exert control both in the initial stages of service definition (along with the Operations Engineering or Architecture groups) and during the ongoing service delivery process, by monitoring and tuning the underlying automation workflows (again, in conjunction with Operations Engineering) as depicted in the graphic below. In other words, they define and control what automation gets deployed and the exact sequence of steps for delivering a particular component of an application in the environments they manage. Specific product capabilities such as multi-tenancy, granular privileges and role-based access control allow them to grant end-users the ability to see, request and perform specific activities in self-service mode without also granting administrative privileges that could potentially be misused.
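To illustrate that access model in the abstract (the role names and actions below are hypothetical and not Data Palette’s actual schema), a role simply maps to the self-service actions a user may request, while the automation engine, not the end-user, holds the administrative credentials.

ROLE_PERMISSIONS = {
    "qa_manager":       {"provision_environment", "refresh_test_data", "restart_app_tier"},
    "release_engineer": {"deploy_code", "restart_app_tier"},
    "it_ops_admin":     {"*"},   # also defines and tunes the workflows themselves
}


def request_action(user_role, action, target):
    allowed = ROLE_PERMISSIONS.get(user_role, set())
    if "*" in allowed or action in allowed:
        # The workflow executes under service credentials owned by IT Ops;
        # the requester never receives administrator privileges.
        print(f"{user_role}: '{action}' on {target} queued for automated execution")
    else:
        print(f"{user_role}: '{action}' on {target} denied")


request_action("qa_manager", "provision_environment", "SFA-QA")
request_action("qa_manager", "deploy_code", "SFA-QA")   # outside the role, so denied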
 
 
 
As IT Ops personnel continually assess areas of inefficiency, they can refine the workflows and roll out improvements with little to no impact on end users. Such efforts, while changing the job description of what IT Ops does on a day-to-day basis, help deliver a better end-user experience and enable companies to leapfrog from the IT of yesteryear to the “industrialization of IT” that IBM’s Telford refers to.

Tuesday, February 23, 2010

What’s the “Right Way” to Automate the Application Stack?

Cloud computing is forcing IT organizations to rethink automation. Early adopters started out with the delivery of servers, storage and network connectivity in self-service mode via an internal cloud. While this helped reduce service delays, real service level improvements weren’t forthcoming. Application owners and end users were still experiencing significant elapsed times between service request and delivery. It was becoming obvious to the IT thought-leaders that merely provisioning and offering up “ping, power and pipe” quickly wasn’t going to have a meaningful impact. That just shifted the bottleneck up the stack into the application layers. The new focus of the cloud revolves around rapid deployment and mass management of applications. Without the ability to provision and manipulate discrete application services, the value of the cloud is stunted.

As the recent acquisition of Phurnace indicates, the traditional data center tools vendors have started figuring it out and are attempting to offer solutions that deliver automation aimed at the application layers. The advent of these vendors and their mainstay strength in server provisioning is bringing forth multiple interesting approaches to application automation. Approaches vary from the composite, policy-based (but hands-off) VMware vApps strategy, to the automation of selective app admin tasks for specific app components such as Java code deployment (e.g., Phurnace), to broader automation platforms with out-of-the-box modular content addressing the entire administrative lifecycle of various app components (e.g., Stratavia). The key question that emerges for customers is “what’s the right way to automate applications?” Conventional wisdom says that if your only tool is a hammer, every problem looks like a nail. Armed with robust server virtualization and provisioning toolsets, some of the larger vendors are approaching the application stack with (no surprise!) a focus on provisioning. However, at the risk of sounding clichéd, I have to say that automating application administration is a very different paradigm.

You see, at the server layer and below (including storage and networking), significant admin time is taken up in provisioning. Post-provisioning activities such as patching and configuration management are often handled via provisioning – i.e., by reimaging the OS with an updated (patched/reconfigured) image. So all in all, provisioning is the key administrative function at these lower layers.

However, as you go above the server, you encounter discrete application layers such as web servers, application servers and databases, to name the most popular components. Each of these dictates a different operations lifecycle that is not dominated by provisioning and configuration management. In fact, provisioning barely takes up 15-20% of the typical App Admin’s time. The remaining 80% or so is spent on other post-provisioning activities such as maintenance, incident response and service requests (the graphic below provides examples of these task categories).

But then the traditional vendors ask: why do these App Admins do all these things in the first place? Why can’t they, like Sys Admins and Network Admins, handle these other activities via provisioning/re-provisioning? For instance, instead of applying a new patch, performing a maintenance operation or doing a code release, why can’t the App Admin just provision a new image that has the desired changes? That would eliminate the need for these other post-provisioning activities and free up App Admin time.

This type of argument does not work because it goes against the grain of real-world application management. Instances of an application component frequently develop a unique fingerprint over the course of their use. This fingerprint is based on several factors, including security requirements, performance adjustments, and the experience and skill level (a.k.a. best practices) of the Admins managing them. To make things worse, these fingerprints can be dynamic in nature. The same application may look different at different times of the day. For instance, a database may serve as a transactional data store during regular business hours and be configured specifically to facilitate smaller read/write I/O operations, whereas at night the same database may be converted into a batch database with different buffer sizes, log file locations, etc. to facilitate bulk writes. Bland categorization and re-imaging of the application server (say, to apply a new application patch) by the Sys Admins causes much of this dynamic application fingerprint to be lost, creating in turn a lot more work for the App Admins, who must restore the fingerprint as best they can (assuming they themselves remember all the changes and can get it right the first time!)
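As a simplified sketch of that day/night example (the parameter names and values are invented for illustration and not tied to any particular DBMS), the same instance is re-tuned on a schedule rather than re-imaged:

DAYTIME_OLTP  = {"buffer_pool_mb": 4096,  "log_dir": "/fastdisk/redo", "workload": "small read/write I/O"}
NIGHTLY_BATCH = {"buffer_pool_mb": 16384, "log_dir": "/bulkdisk/redo", "workload": "bulk writes"}


def apply_profile(instance, settings):
    # In practice this would invoke the DBMS's own reconfiguration commands;
    # the point is that the instance keeps its identity and accumulated
    # fingerprint, which a fresh re-image would wipe out.
    for param, value in settings.items():
        print(f"{instance}: set {param} = {value}")


apply_profile("sfa-prd-db01", DAYTIME_OLTP)    # 6 AM: transactional profile
apply_profile("sfa-prd-db01", NIGHTLY_BATCH)   # 10 PM: batch profile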

The other challenge is that a single server frequently hosts multiple application types and instances. This is commonly encountered even in large-scale production environments. Each of these instances may have its own maintenance window and need to be patched/reconfigured/upgraded and managed individually based on instance-specific standards and dependencies. Reimaging the entire server doesn’t afford granular control of individual instances. One may argue that this was more prevalent in the olden days when servers were expensive, and that with virtualization now commoditized, each application instance can reside on its own server image, thereby eliminating the problem. But reality goes deeper than that. It’s not just server resource optimization that called for multiple application types and instances to reside on the same server; cohabitation was also tied to performance, security and other considerations. For example, in the case of a performance-sensitive application that utilizes federated databases, a DBA may elect to keep some of the associated databases on the same physical OS to minimize the context switching and network latency incurred when the databases are physically separated onto different servers. Regardless of whether the servers and underlying network adapters are physical or virtual, the location of the underlying databases can make the difference between a data-join operation responding in sub-seconds versus minutes. Thus, without a proper understanding of the various application types, their related design considerations (such as transaction types, application access methods, data volume, affinity, partitioning, etc.) and best practices, choosing the wrong automation method can result in degraded service levels.
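A small sketch (with hypothetical hosts, instances and windows) of why instance-level granularity matters: several database instances share one host, each with its own maintenance window, so patching has to be scheduled per instance rather than per server.

HOST_INSTANCES = {
    "dbhost-07": [
        {"name": "sfa_oltp",   "maintenance_window": "Sun 02:00-04:00"},
        {"name": "sfa_batch",  "maintenance_window": "Sat 22:00-23:00"},
        {"name": "hr_reports", "maintenance_window": "Wed 01:00-02:00"},
    ]
}


def patch_instance(instance):
    print(f"patch {instance['name']} during {instance['maintenance_window']}")


# Instance-level automation touches each instance inside its own window;
# re-imaging dbhost-07 would force a single outage on all three at once.
for instance in HOST_INSTANCES["dbhost-07"]:
    patch_instance(instance)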

Attempting to offer application management in the cloud with just an application image provisioning model is akin to showing up to a gunfight with a rock. Proponents of this model can claim that rapid provisioning and policy-based reimaging of the relevant application components is the new way of managing applications in the cloud. While this approach may work for a handful of admin functions, it does not offer granular control or a pragmatic framework for most mandatory post-provisioning tasks (represented in the graphic above) and hence will be discarded by savvy App Admins. Only solution providers that have a true application management DNA (with a deep understanding of the task patterns associated with various application components) and that offer automated application management capabilities out of the box can win legitimate mind-share in the near term and sustainable market share in the long run.