Wednesday, July 3, 2013

WebSphere Test Practices. Part 1 of 3: Challenges.



This next retrospective relates to Kevin’s great post on Automation and Cloud for System Verification Test and is broken into 3 parts: 
1.  WebSphere Test Challenges
2.  WebSphere Test Transformation
3.  How this relates to DevOps and Continuous Delivery 

Like Rational System Verification Test, the WebSphere development organization found the use of Patterns and Cloud to be of great value.  Kevin’s scenario focused on automating very complicated deployments, enabling test scenarios that otherwise could not have been executed.  This next story focuses on the elasticity of Cloud to enable high-volume automated test execution, or Continuous Test, as part of the development and build process. 

Overview and scope
The WebSphere organization, feature set and code base are relatively large.  From an organizational perspective there were over 600 developers and 200 engineers involved in test and release engineering.  The supporting infrastructure was significant: the test organization owned and maintained around 3,000+ cores, 500+ z/OS systems on 10+ LPARs, and 45+ iSeries LPARs.  The continuous test effort now runs over 1.7 million functional test cases every day, plus more than 16 hours of continuous security variations.  25+ OS variations and at least 8 database variations are needed to perform a meaningful regression suite.  The delivery process for the product was broken into a sequence of phases: design, development, functional test, system test, performance test, media test, and so on.  The point here is that delivering such a large offering required a significant testing effort, took a long time, and made process changes challenging. 

Challenges and objectives
What happened within WebSphere is not unique.  Over the initial 5 or so years of development the product moved very quickly, with an ever-increasing number of resources available as success was demonstrated.  This trend, however, reached a tipping point where the cost of maintaining and testing the current feature set competed with the ability to deliver on new customer requirements.  We had reached a point where, regardless of the resources we applied, testing the product took a long time.  As often happens this feeds on itself as you attempt to fit more and more ‘must have’ content into the current release.  It was time to optimize this process … let's take a closer look at the costs we were absorbing.   

The cost of a regression grows exponentially with the time it takes to detect it.  A regression is easy to fix when it is introduced: the change is fresh, has no other changes built on top of it, and the people involved are still available.  Using our waterfall-style delivery process it took us, on average, 3 months to find a regression.  That needed to come down to within a single day.    

Executing a functional regression of the Application Server took 6 weeks with around 70 full-time-equivalent employees.  This process consistently bled over into other phases, and over 75% of our Integration or System Verification scenarios would be blocked by a basic functional failure at some point.  We had to get to the point where we could execute a functional regression of the application server with little to no human cost, and within hours, not weeks. 

We were hardware constrained.  We had a lot of machines, but try finding one to use.  Though our lab showed only 6% of our infrastructure in use at any given point in time, it was always assigned.  Teams were spending time justifying new hardware requests, overestimating what they needed, and a bit of hoarding was going on.  We needed self-service access to infrastructure, and monitoring to govern misuse. 

What else was costing us time and money?  Organizational boundaries.  We had many organizations responsible for a particular delivery: development teams, functional test teams, system persona teams, performance teams, hardware teams, test automation teams … the list goes on and on.  As code transitions between teams there is a significant cost.  Certain teams become bottlenecks, and often one team's objectives or incentives do not align with another's.  Development would throw code over the wall and see it as Test's job to test it … Test would not gather enough information when things did not work … teams were blocked by a lack of infrastructure … we had 4 different automation infrastructures custom built for specific purposes.  These boundaries existed for some good reasons, but they were slowing us down too much and could not function in a world where resources were shrinking, not growing. 

In the next section we will look at some of the things we did to address these problems. 

Automation and Cloud for System Integration test

This is a quick overview of very recent efforts within the IBM Rational development organization to employ cloud technology coupled with aggressive provisioning, product install and configuration automation to improve and streamline our product System Integration test processes with an eye towards Continuous Delivery.

Motivation and challenges


It's a common story for sure, but our product delivery teams here at IBM Rational are very keen to constantly improve product quality.  They are also committed to shorter and shorter product cycles ... essentially wanting to create, evolve and deliver our products faster and with increasing product quality.  Piece o’ cake, right?  Nope.

Our system verification and integration testing is central to overall product quality and requires install, configuration and test-scenario execution of our products on a huge variety of OS platforms, databases and physical topologies with large numbers of internal and third-party integrations.

When we started down this path to shorter development cycles we were primarily installing and configuring test systems by hand on physical hardware.   This was doable given our traditional long product cycle.  However, we saw ourselves quickly approaching a very hard wall when asked to support shorter cycles.  This meant that, instead of going through this complex system provision, install, configure and scenario test cycle by hand x times a year, we would have to do it 2x times a year, then 4x, etc.  We needed to change our way of doing things drastically ... and soon.

Our gamble centered on two key themes: Golden Topologies, and cloud-based systems coupled with install and configuration automation.  The Golden Topology approach lets us focus on a finite set of test systems out of the essentially infinite combinations possible within our set of supported platforms, databases, integrations, etc.  The rest of this post describes our approach to cloud-based delivery of ready-to-test systems, and we will provide another article focused on our Golden Topology strategy in the near future.  
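To make the combinatorial pressure concrete, here is a minimal Python sketch; the platform names and counts are purely illustrative rather than our actual support matrix, and "S1" is a hypothetical third pick.

```python
from itertools import product

# Illustrative platform names and counts only; the real support matrix
# is far larger than this toy example.
operating_systems = ["Linux", "Windows", "AIX", "Solaris"]
databases = ["DB2", "Oracle", "SQL Server", "Derby"]
app_servers = ["WebSphere", "Tomcat"]
deployment_styles = ["single-server", "distributed", "clustered"]

full_matrix = list(product(operating_systems, databases, app_servers, deployment_styles))
print(f"Full combination matrix: {len(full_matrix)} test systems")  # 4 * 4 * 2 * 3 = 96

# A Golden Topology approach picks a small, representative subset to test deeply.
# E1 and E3 are described later in this post; "S1" is hypothetical.
golden_topologies = {
    "E1": ("Linux", "DB2", "WebSphere", "distributed"),
    "E3": ("Linux", "Oracle", "WebSphere", "clustered"),
    "S1": ("Windows", "Derby", "Tomcat", "single-server"),
}
print(f"Golden Topologies under test: {len(golden_topologies)}")
```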

Summary of implementation


When we began this effort we had been driving two proof-of-concept (POC) private cloud systems, each allowing self-provisioning of test systems.  The first could provide one or more independent VMs to be deployed, at which point Build Forge was employed to orchestrate the execution of scripts that deployed the products under test and then stitched the VMs into a coherent system.

The second POC private cloud was based on IBM Workload Deployer (IWD).  This ended up as our preferred cloud implementation for two reasons.  First, IWD supports the concept of virtual system patterns (VSPs).  A VSP can be considered a template for a logical set of VMs with relationships between them.  With a VSP defined, you can deploy the entire system of needed VMs in one operation.  Second, orchestration and automation code can be attached directly to the VSP (in the form of script packages), and when this code executes as part of deployment it has topology metadata directly available.  These capabilities allow us to provide one-stop, push-button deployment of very complex system topologies.  
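To make the idea concrete, here is a small hypothetical Python sketch of what a VSP captures; this is not the IWD API, just a mental model of parts, script packages, and a single deploy operation that hands topology metadata to the automation.

```python
from dataclasses import dataclass, field
from typing import Callable

# Conceptual sketch only: this is NOT the IBM Workload Deployer API, just a
# hypothetical Python model of what a virtual system pattern captures.

@dataclass
class ScriptPackage:
    name: str
    # A script package receives topology metadata (roles, host names, ports)
    # for every VM in the deployed pattern.
    run: Callable[[dict], None]

@dataclass
class Part:
    role: str                                    # e.g. "JTS", "DB2", "IHS"
    image: str                                   # base OS/middleware image to clone
    scripts: list = field(default_factory=list)  # automation attached to this part

@dataclass
class VirtualSystemPattern:
    name: str
    parts: list = field(default_factory=list)

    def deploy(self) -> dict:
        # One operation: provision every VM in the template, then run each
        # attached script package with the full topology metadata in hand.
        topology = {p.role: f"{p.role.lower()}.example.test" for p in self.parts}
        for part in self.parts:
            for script in part.scripts:
                script.run(topology)
        return topology
```

The key design point is that the attached automation runs with the whole topology in view, which is what lets a single deployment stitch many VMs into a coherent, ready-to-test system.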


I’ll provide a quick example of a moderately complex topology we use to system test our Rational CLM product suite.  If you are not familiar with it, CLM (Collaborative Lifecycle Management) is a solution we deliver as a single composite product consisting of Rational Team Concert (RTC), Rational Quality Manager (RQM) and Rational Requirements Composer (RRC).  One of the Golden Topologies we employ to test CLM is a fully distributed, enterprise-class topology on Linux with DB2 as the database, WebSphere as the application server and IHS as a reverse proxy.  This is also affectionately referred to as “E1”.



This topology employs a separate VM for each of the three CLM applications (CCM, RQM and RRC) and another for the shared Jazz Team Server (JTS).  An additional VM is dedicated to the reverse proxy and yet another is reserved for DB2.  To set up and configure this topology by hand and then install all products and configure them to work together with a common JTS and DB server is quite an undertaking even for a seasoned software engineer.

With VSPs we can take that seasoned engineer’s ability to configure such a system and burn it into automation for anyone to use.  With this VSP, any team member can push a few buttons to self provision their own instance of E1!
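Continuing the hypothetical sketch from above, E1 might be described as six parts; the image names and the single install script are invented placeholders rather than our real pattern contents.

```python
# Hypothetical description of the E1 Golden Topology using the sketch above.
install_clm = ScriptPackage("install-clm-app",
                            lambda topo: print("configure against JTS at", topo["JTS"]))

e1 = VirtualSystemPattern("E1", parts=[
    Part("JTS", "linux-websphere", [install_clm]),   # shared Jazz Team Server
    Part("CCM", "linux-websphere", [install_clm]),   # change and configuration management app
    Part("RQM", "linux-websphere", [install_clm]),   # quality management app
    Part("RRC", "linux-websphere", [install_clm]),   # requirements app
    Part("IHS", "linux-ihs"),                        # reverse proxy
    Part("DB2", "linux-db2"),                        # database server
])

topology = e1.deploy()  # one push-button operation instead of days of manual setup
```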

Results and return on investment


When we started releasing this capability to our test teams, the uptake was huge and immediate.  This introduced a series of brand new problems around system capacity management that we can talk about in another post.

We immediately felt that this capability was improving our ability to produce stable test systems more rapidly and with far fewer set-up and configuration errors.  Being engineers, we decided to define some metrics and start measuring.

What we found was pretty amazing.  For example, one of our more complex test system topologies for CLM is a horizontal WebSphere cluster with Oracle as the backend and WebSphere Proxy Server out front.  This, of course, is “E3”.  We surveyed some internal teams to get a sense of the time required to manually set up such a system from scratch.  This is what we found:


Time to provision, install and configure E3:

  Expert User             11 hours
  Novice User             30 hours*
  Non-experienced User    96 hours*
  Average User            42 hours

*Note: novice and non-experienced users will inevitably be stealing cycles from the experts.

So assume an average time per manual deployment of 42 hours.  Now consider that any user, whether expert, novice or completely inexperienced, can deploy the same stable system in about 3 hours total using a VSP.  Also consider that it takes the user only five minutes to launch the VSP deployment process; the next 2 hours and 55 minutes can be spent doing something constructive while waiting for the system to become available.

When we're talking about short development cycles where these deployments are needed in rapid succession, the savings add up fast.  At a savings of 40 hours per system deployment, you're into savings measured in person-years very quickly.  For example, here's a quick business value assessment of patterns of similar complexity to our clustered pattern:

Let's assume four teams (RTC, RQM, RRC and CLM) working concurrently, each deploying a new system for each of three test topologies four times a month for one year.

Four deployments per month for four teams for three topologies for one year:



4 x 4 x 3 x 12 = 576 deployments

Forty hours savings per deployment:



576 x 40 = 23,040 hours

Assuming a full person year (FTE) is 2080 hours:



23,040 / 2080 ~= 11 FTEs
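
For completeness, here is the same back-of-the-envelope arithmetic as a tiny script, using only the assumptions stated above:

```python
# Back-of-the-envelope ROI from the figures above.
teams = 4                      # RTC, RQM, RRC and CLM
topologies = 3                 # three golden test topologies
deploys_per_month = 4
months = 12
hours_saved_per_deploy = 40    # ~42 hours by hand vs ~3 mostly unattended hours with a VSP
fte_hours_per_year = 2080

deployments = deploys_per_month * teams * topologies * months   # 576
hours_saved = deployments * hours_saved_per_deploy              # 23,040
print(f"{deployments} deployments, {hours_saved:,} hours saved "
      f"~= {hours_saved / fte_hours_per_year:.0f} FTEs")
```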


This represents a fairly conservative estimate of savings we have realized for just three patterns. The return on investment is even more impressive when you consider that an expert virtual system pattern (VSP) developer can create a pattern of this complexity within a few weeks.

Beyond these very tangible savings is the fact that consistent, rapid, stable deployments enable new ways of working.  When teams start treating these very complex test systems as disposable, doors open to very innovative approaches to system test.  You can see why VSPs are a strategic element in our approach to continuous delivery.  More on that in a future post.

What is this Blog all about

There is a lot of interest around DevOps.  To some degree its objectives and value propositions have a lot in common with those offered by Agile practices.  The major difference is the emergence of cloud technologies providing infrastructure and platform as a service.  These capabilities provide new tools in the toolbox to support continuous integration, test and delivery practices, as well as making infrastructure as code and on-demand deployment a realistic objective. 

Companies or projects adopting these practices tend to fall into one of two camps: small innovative projects with little legacy code or capital, and large enterprise systems looking for ways to bring incremental value at the rate markets are demanding.  For the heck of it I’ll call the first set innovators and the second set optimizers. 

This blog will be about a few random things, but primarily I intend it to focus on optimizers.  We’ll take a look at the challenges that face large existing systems, and at practices that move these systems in a direction that continues to meet changing market demands.  The term ‘teaching elephants to dance’ comes to mind, and to some degree this will be an exercise in seeing how we can teach optimizers to live in the space that innovators so easily thrive in today.  I think there will naturally be some lessons in here for innovators that are beginning to scale out. 

To me, Continuous Delivery, Agile, DevOps and sustainable innovation come down to the ability to trust in your repeatable quality processes.  How can you go fast and have high quality?  You can only go fast if you always have high quality.  A second common theme is removing barriers, specifically organizational hand-offs and barriers to resources such as infrastructure.  So to start with, I thought I’d take a look at some test and automation efforts from the not-so-distant past and see how they apply to DevOps.