This next retrospective is related to Kevin’s great post on
Automation and Cloud for System Verification Test and is broken into 3 parts:
1. WebSphere Test Challenges
2. WebSphere Test Transformation
3. How does this relate to DevOps and Continuous Delivery
Similar to Rational System Verification Test, the WebSphere Development
Organization found the usage of Patterns and Cloud to be of great value.
Kevin’s scenario focused on automating very complicated deployments that
allowed for the execution of test scenarios that otherwise could not be
contained. This next story focuses on the elasticity of Cloud to enable high
volume automated test execution, or Continuous Test, as a part of the
development and build process.
Overview and scope
The WebSphere organization, feature set and code base are relatively large.
From an organization perspective there were over 600 developers and 200
engineers involved in test and release engineering. The infrastructure to
support this was fairly significant, with the test organization owning and
maintaining 3000+ cores, 500+ z/OS systems on 10+ LPARs, and 45+ iSeries
LPARs. The continuous test effort now runs over 1.7 million functional test
cases every day, and 16+ hours of continuous security variations. 25+ OS
variations and at least 8 Database variations are needed to perform a
meaningful regression suite. The delivery process for the product was broken
into a sequence of phases: design, development, functional test, system test,
performance test, media test … The point here is that delivery of such a large
offering required a significant amount of testing effort, took a long time,
and made process changes challenging.
Challenges and objectives
What happened within WebSphere is not unique. Over the initial 5 or so years
of development the product moved very quickly, with an ever increasing number
of resources available as success was demonstrated. This trend, however,
reached a tipping point where the cost of maintaining and testing the current
feature set competed with the ability to deliver on new customer requirements.
We had reached a point where, regardless of the resources we applied, testing
the product took a long time. As often happens, this feeds into itself as you
attempt to fit more and more ‘must have’ content into the current release. It
was time to optimize this process … let’s have a slightly closer look at the
costs we were absorbing.
The cost of a regression grows exponentially with the time it takes to detect
that regression. This is because it is easy to fix a regression when it is
introduced: the change is fresh, it does not have other changes built on top
of it, and the people involved are available. Using our waterfall style
delivery process, it took us 3 months on average to find a regression. This
needed to come down to within a single day.
Executing a functional regression of the Application Server took 6 weeks with
around 70 Full Time Equivalent employees. This process consistently bled over
into other phases, and over 75% of our Integration or System Verification
Scenarios would be blocked by a basic functional failure at some point in
time. We had to get to the point where we could execute a functional
regression of the application server with little to no human cost, and within
hours, not weeks.
We were hardware constrained. We had a lot of machines, but try finding one
to use. Though our lab showed only 6% of our infrastructure was in use at any
given point in time, it was always assigned. Teams were spending time
justifying new hardware requests and overestimating what they needed, and
there was a bit of hoarding going on. We needed self-service access to
infrastructure and monitoring to govern misuse.
What else was costing us time and money? Organizational boundaries. We had
many organizations responsible for a particular delivery: Development teams,
Functional Test teams, System Persona teams, Performance Teams, Hardware
teams, Test Automation Teams … the list goes on and on. As code transitions
between teams there is a significant cost. Certain teams become bottlenecks,
and often one team has a different set of objectives or incentives than
another, so they do not align. Development would throw code over the wall and
see it as ‘Test’s’ job to test it … Test would not gather enough information
when things did not work … teams were blocked by a lack of infrastructure …
we had 4 different automation infrastructures custom built for specific
purposes. These boundaries existed for some good reasons, but we had reached a
point where they were slowing us down too much and could not function in a
world where resources were shrinking, not growing.
In the next section we will look at some of the things we
did to address these problems.