Thursday, November 12, 2009

Reference Architecture for Delivering IT as a Service

I’m back now after a 4-month “hiatus” that comprised multiple customer engagements – productive activity that keeps me (and my rants here) relevant. A key area I have been working on is using Stratavia’s Data Palette to help my customers deliver IT as a service within their organizations – or to be more precise, helping them deliver applications as a service. From the perspective of “end users” such as project managers, application team leads and QA managers (i.e., the recipients of these services), individual components of the infrastructure (plain servers or raw storage) don’t matter; it’s about having fully baked apps appropriately packaged and delivered, including the webserver, the application server and the database layers.

The obvious reason CIOs are looking to upgrade their IT delivery capabilities is to improve business efficiency and agility while, of course, reducing costs. A less frequently cited but equally vital reason is to keep up with the competition! For instance, every financial services firm out there has already built or is building an internal cloud. In fact, larger organizations across all industry verticals are taking the next step to attain scalability via newer delivery models such as self-service and cloud computing. But truly gaining tangible benefit from such scalable models is a challenge. And currently, application administration is the weakest link in the chain!

The whole premise behind cloud computing, being able to rapidly mass-deploy applications in the cloud, frequently comes to a screeching halt due to the way IT currently operates. Most IT administration teams are just not geared up to provision and manage scores of complex heterogeneous application tiers in an agile manner, unless more and more manpower is added, and even that doesn’t scale after a certain point. Sure, there is help from the conventional systems management tools vendors like HP, BMC, IBM, EMC, VMware and Cisco. Automation products from these venerable vendors are able to help organizations reduce server, network and storage provisioning time from multiple weeks to a few hours. But then the bottleneck just shifts upstream into the application layers: specifically the middleware and the database tiers. End-to-end automation of these tiers is a prerequisite to large-scale application deployments on the cloud.

Conventional server provisioning and runbook automation products have neither the application-level smarts nor the native (application-specific) automation functionality to be of help here. Apart from some very basic application binary installs, they cannot be used to automate the complex activities across the application operations lifecycle, depicted alongside, unless one ends up writing and maintaining millions of lines of custom script-code (job security, anyone?).

Stratavia provides IT organizations with a way to break this logjam at the database and application tiers. Stratavia does this via its Data Palette automation platform, along with a portfolio of automation modules, called DCA Apps, that can be plugged into the underlying platform. The DCA Apps include solutions for the entire operations lifecycle of the database and middleware tiers represented in the graphic above. The solution allows companies to obtain the following benefits (while complementing prior investments such as HP/Opsware and BMC/BladeLogic), thus truly enabling IT to be delivered as a service.

- Streamlining and improving IT operations

  • Standardize IT processes across heterogeneous platforms and assets

  • Reduce delay between service request and delivery; Improve service level metrics such as “first-time-right” and “on-time-delivery”

  • Establish & control delivery quality across multi-tiered skill sets; Enable non-SMEs to carry out complex operations

  • Improve service delivery with “self service” capability in key areas such as application build provisioning, code releases and migrations

  • Remove compliance & support risks due to variety of version, patch & configuration requirements

- Increasing efficiencies

  • Reduce IT Admin time spent on mundane activities

  • Increase Asset to Admin ratio

The reference architecture that Stratavia uses to accomplish these objectives is as follows:

The right side of the schematic above shows the major tiers that make up the entire application stack. The middle portion shows the role of the Data Palette automation fabric in both orchestrating and performing the administrative activities across the entire operations lifecycle. (The Data Palette platform includes the orchestration capabilities, while the DCA Apps perform the administration activities.)

These lifecycle activities include provisioning and patching, configuration and compliance management, recurring maintenance (e.g., log pruning, backups, healthchecks, index rebuilds, table reorgs, partition shuffles, etc.), incident response (false positive alert suppression and white noise reduction, problem diagnosis and root cause analysis, auto-resolving known errors, etc.) and frequent service requests (e.g., code releases, database refreshes and cloning, upgrades, adding/modifying user accounts, adding space, restoring an application snapshot, failover, etc.). Data Palette also provides out-of-the-box integration adaptors to auto-cut tickets, update a CMDB, and interface with various systems management toolsets in order to adhere to standard ITIL processes while carrying out these activities (not dissimilar to a DBA or App Admin who performs this work manually).

The fabric also helps in abstracting the backend component-level complexities from the end-users.

On the left side and the top, the schematic illustrates 4 classes of users:

  • Non-Technical End Users: This class of users refers to the application and business end-users. These users are typically not too IT operations-savvy (nor should they have to be!) and conventionally request resources via a Help Desk / ticketing system such as BMC Remedy, HP Service Manager or Service-Now.com. Once the ticket is created, it is assigned to the appropriate technicians and may traverse multiple IT operations groups before the request is fulfilled. Data Palette enables self-service capabilities in this scenario by presenting a Service Catalog front-end to these users. Frequently, Data Palette’s native adaptors are used to integrate with existing ticketing systems so that these end-users do not have to be exposed to the Data Palette console (and do not have to learn yet another tool or interface!). The Service Catalog is established within the system they are already familiar with, wherein they can put in their request along with relevant details such as service name, required duration, billing code, etc. Once the request is saved, it can be auto-routed to a manager for approval. The ticket creation or approval action triggers an automated workflow (within a Data Palette DCA App) that provisions the service and makes it available to the end-user, updating and closing out the ticket once the service is brought online. The service is usually multi-layered and can comprise multiple sub-workflows that provision a database instance, install an Apache webserver and a WebLogic app server, create user accounts in the database, and so on. The minutiae are abstracted from the end-user.

  • IT Operators – This category of users refers to Tier 1 personnel such as Help Desk operators, NOC personnel and even outsourced/offshore administrators in some cases. These users tend to be the preliminary points of contact for alerts from different monitoring tools or problem calls from end users. Data Palette empowers these IT operators to carry out automated incident triage and even auto-remediation of recurring incidents, thereby reducing the need to escalate to IT Operations Administrators (Tier 2 personnel). IT Operators are not SMEs, but they have a greater degree of awareness of the IT environment and hence can have direct access to the Data Palette console (i.e., bypass the previously mentioned Service Catalog) along with the ability to execute specific workflows in certain environments, managed via Data Palette’s multi-tenancy and role-based access control.

  • IT Operations Administrators – These are the Tier 2 SMEs: the DBAs and the Application Admins who have privileges within Data Palette to deploy automation services, along with the ability to set relevant Policies and metadata to properly influence Data Palette’s automation behavior in the environments they manage.

  • IT Operations Engineers – These are the Tier 3 SMEs (also referred to as Applications or Database Systems Engineers or Architects) who have the ability to define automation services by configuring the Data Palette DCA Apps and the corresponding workflows, along with any site-specific pre- and post-Steps. They decide which automation service should be available to which user-type across the enterprise, what parameters should be entered (balancing ease-of-use against flexibility and control), which toolsets to integrate with, which metadata to leverage, and so on. Accordingly, their role within Data Palette is a superset of the prior users’ roles, allowing them to read, write (update) and execute automation workflows and corresponding service definitions.
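To make the self-service flow from the first bullet a little more concrete, here is a minimal Python sketch of the idea: a ticket is approved, the approval triggers a workflow made up of sub-workflows, and the ticket is closed once the service is online. This is not Data Palette code or its API; every name here (`Ticket`, `SERVICE_CATALOG`, `fulfill`, the step functions and the `three-tier-app` service) is a hypothetical stand-in chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Ticket:
    """Hypothetical service request as it might arrive from a ticketing system."""
    service_name: str
    requester: str
    billing_code: str
    status: str = "open"  # open -> approved -> closed
    log: List[str] = field(default_factory=list)


# Hypothetical sub-workflows; real ones would do the actual provisioning work.
def provision_database(ticket: Ticket) -> None:
    ticket.log.append("database instance provisioned")


def install_webserver(ticket: Ticket) -> None:
    ticket.log.append("webserver installed")


def install_app_server(ticket: Ticket) -> None:
    ticket.log.append("app server installed")


def create_db_accounts(ticket: Ticket) -> None:
    ticket.log.append(f"database accounts created for {ticket.requester}")


# Illustrative catalog: each service maps to an ordered list of sub-workflows.
SERVICE_CATALOG: Dict[str, List[Callable[[Ticket], None]]] = {
    "three-tier-app": [
        provision_database,
        install_webserver,
        install_app_server,
        create_db_accounts,
    ],
}


def approve(ticket: Ticket) -> None:
    """Manager approval step; in practice this happens in the ticketing system."""
    ticket.status = "approved"


def fulfill(ticket: Ticket) -> Ticket:
    """Run every sub-workflow for the requested service, then close the ticket."""
    if ticket.status != "approved":
        raise ValueError("ticket must be approved before provisioning")
    for step in SERVICE_CATALOG[ticket.service_name]:
        step(ticket)
    ticket.status = "closed"
    return ticket


if __name__ == "__main__":
    request = Ticket("three-tier-app", requester="qa_manager", billing_code="QA-001")
    approve(request)
    fulfill(request)
    print(request.status, request.log)
```

The point of the sketch is the shape of the flow: the requester only supplies a service name and a few details, while the ordered sub-workflows stay hidden behind the catalog entry.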

Finally, the bottom portion of the reference architecture shows Data Palette integrating with existing enterprise monitoring and popular configuration and compliance audit toolsets such as MS SCOM, HP OVO, Patrol, Tivoli, Tripwire, EMC Ionix Configuration Manager and Guardium via its integration adaptors for these product sets. These products (and others like them) are frequently already deployed by enterprises for performance monitoring and for scanning OS, database and application configurations, and they can be set up to invoke a Data Palette remediation workflow (via Data Palette’s web service APIs) to address drifts and SLA violations. Data Palette’s configuration repair and incident resolution workflows can fix the violations in online mode, or schedule the repair during the appropriate maintenance window based on pre-defined policies. Administrators can specify, on an environment-by-environment basis, which violations should be repaired immediately, which ones need to be scheduled, and which ones can be safely ignored.
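As a rough illustration of that last point, the per-environment decision (repair now, schedule, or ignore) can be thought of as a simple policy lookup. The Python sketch below is only an assumption about how such a policy table might look; the environment names, violation types, `POLICIES` table and `decide` function are all hypothetical and are not taken from Data Palette.

```python
from enum import Enum
from typing import Dict, Tuple


class Action(Enum):
    REPAIR_NOW = "repair_now"   # fix the violation in online mode
    SCHEDULE = "schedule"       # defer to the next maintenance window
    IGNORE = "ignore"           # safe to leave alone


# Hypothetical policy table keyed by (environment, violation type).
POLICIES: Dict[Tuple[str, str], Action] = {
    ("production", "sla_violation"): Action.REPAIR_NOW,
    ("production", "config_drift"): Action.SCHEDULE,
    ("development", "config_drift"): Action.IGNORE,
}


def decide(environment: str, violation: str,
           default: Action = Action.SCHEDULE) -> Action:
    """Return the configured action, falling back to a conservative default."""
    return POLICIES.get((environment, violation), default)


if __name__ == "__main__":
    for env, violation in [("production", "sla_violation"),
                           ("development", "config_drift"),
                           ("qa", "config_drift")]:
        print(env, violation, decide(env, violation).value)
```

Defaulting unknown combinations to scheduling rather than immediate repair is a design choice made for this sketch; a real deployment would pick whatever default its change-control process requires.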

Enterprise features of the Data Palette platform such as multi-tenancy, RBAC (role-based access control), single sign-on, LDAP integration and Smart Groups (wherein multiple intra-cloud assets can be addressed and manipulated as a single entity), along with self-configuring, out-of-the-box automation content for the database and application tiers, make the above architecture and corresponding value eminently attainable. (As a point of reference, a Proof of Concept takes 3 days to 2 weeks depending on the scope; a broader Pilot including integrations with existing toolsets can be implemented in 2 to 4 weeks.)

Email me at vdevraj at stratavia dot com if you would like a detailed whitepaper on this solution architecture.