Rebecca Harris
- Jul 19, 2019
- 8 min read

The Unreasonable Effectiveness of Loading Test Data Automatically

Updated: Aug 20, 2019

Introduction

When creating a test automation suite, one aspect you need to cover off is the loading of data into the system under test. For each of your tests, the main aim is to set up the system in an initial state, providing the context in which to run that test.

At a recent client, my team was tasked with creating just such a test suite for that organisation’s core set of applications. As we built out the test suite, we implemented facilities for automating the loading of data into the applications under test, and utilised those facilities via libraries in our automated tests.

All of this is just as you would expect in a test automation initiative. What we did not expect was just how many indirect benefits and opportunities arose as a result of creating automated tools and libraries for loading data. There were, of course, benefits from creating automated tests, too, but I want to focus here on the surprising wins we had in creating those data load facilities.

Photo by Franck V. on Unsplash

The Old Way of Working

Prior to creating an automated test suite, our organisation had a team of testers who would follow a large number of manual scripts to test the applications. This would happen during the testing phase prior to each release, in a purpose-built, system testing environment. The manual testers would all share that environment. Each testing phase lasted up to six weeks. This has been a fairly typical way of working in most large organisations, though that is now beginning to change in many places.

In order to run one of the manual scripts, a tester would often need to find some data in the system as a starting point for their test. Generally, this would mean searching for a customer in the system with a specific set of attributes matching what they required for their test. This saved going through the laborious process needed to create and modify a new customer and their associated data, across more than one application. There were also delays built into the system for the creation of new customers. Some of these delays were incidental, needed to synchronise data between applications, while some were intentional parts of the business process (cooling-off periods, etc.). Creating live customer data in just the right state could take some time and effort.

Oftentimes, the testers needed to create other data linked to their selected customer, representing insurance policies and claims, as well as a variety of associated membership information. Typically this involved manually interacting with the core applications to create the necessary pieces to play with for that customer. This was fairly cumbersome and somewhat error prone, making repeatability of the manual tests a hit and miss affair.

Photo by Kotagauni Srinivas on Unsplash

A Better Way: The Nuts and Bolts

As an integral part of implementing our test automation suite, we created a set of code libraries for loading data into the various core applications. Initially, the only intention was to use these libraries in our own tests.

Wherever possible, we tried to call APIs exposed by the core applications to load data. This was preferable to either writing lower level database scripts or driving data through the application GUIs. The latter approaches tend to be brittle in the face of changes in the applications. They also tend to force dependencies on more complex and application-specific tooling (for interacting with GUIs or databases) than is needed for calling an API. In some cases, the internal mapping from the application to the data stored in the database was complex enough to preclude any thought of loading data directly. Where there was no API for the functionality we required, we resorted to driving data load through the application’s GUI, and put in a request for the developers to prioritise building an equivalent API. Once implemented, we could replace our brittle GUI loading code.
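One way to keep that later swap cheap is to hide the loading route behind a single interface, so tests never know whether data arrived via an API or the GUI. The following is only a minimal sketch under assumptions: the class names, the `/customers` endpoint, the `post` callable and the `screen` wrapper are all hypothetical and do not come from the actual project.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    surname: str
    given_name: str


class CustomerLoader(ABC):
    """The only thing tests depend on: put this customer into the system."""

    @abstractmethod
    def load(self, customer: Customer) -> str:
        """Load the customer and return the identifier assigned to it."""


class ApiCustomerLoader(CustomerLoader):
    """Preferred route: one call to an API exposed by the application."""

    def __init__(self, post):
        # `post` is a hypothetical callable (path, payload) -> dict,
        # standing in for whatever HTTP client is in use.
        self._post = post

    def load(self, customer: Customer) -> str:
        response = self._post(
            "/customers",  # hypothetical endpoint
            {"surname": customer.surname, "givenName": customer.given_name},
        )
        return response["id"]


class GuiCustomerLoader(CustomerLoader):
    """Stop-gap route: drive the screens until an equivalent API exists."""

    def __init__(self, screen):
        # `screen` is a hypothetical GUI automation wrapper.
        self._screen = screen

    def load(self, customer: Customer) -> str:
        self._screen.open("New customer")
        self._screen.fill("Surname", customer.surname)
        self._screen.fill("Given name", customer.given_name)
        return self._screen.submit()


def fake_post(path, payload):
    """In-memory stand-in so the sketch runs without a real application."""
    return {"id": f"{path}/{payload['surname']}"}


loader: CustomerLoader = ApiCustomerLoader(fake_post)
customer_id = loader.load(Customer(surname="Example", given_name="Pat"))
```

Because both loaders satisfy the same interface, replacing the GUI route with an API route becomes a one-line change where the loader is constructed, rather than a change to every test.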

As we progressed, filling out the capabilities of these libraries, it became clear that others in the organisation could also find use for them. We refined the design of the library interfaces and improved the quality of their implementation, so that we could publish the libraries for others and support their use more broadly.

While this enabled our data loading to be used within code by others, we also strove to make these tools more accessible to non-developers, including the manual testers. First, we wrapped our libraries with simple command line tools which we could distribute to the more technical of our testers. We then went further, providing simple form-based web applications for all manual testers.

When exposing the libraries in this way, it wasn’t necessary to expose the full functionality of the libraries. We kept it simple, sticking to just what the manual testers needed to do their jobs. Providing a full set of data loading facilities would have approached the complexity of the original core applications we were testing.
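A thin command line wrapper of this kind might look like the sketch below. It is illustrative only: `load_customer`, the `load-customer` program name and the `--policies` option are hypothetical stand-ins, not the tools we actually shipped. The point it shows is that the wrapper offers a deliberately small set of options, even if the underlying library can do more.

```python
import argparse


def load_customer(surname, policies=0):
    """Hypothetical stand-in for the data load library call.

    A real version would talk to the applications under test; this one
    just echoes what it was asked to create.
    """
    return {"surname": surname, "policies": policies}


def build_parser():
    parser = argparse.ArgumentParser(
        prog="load-customer",
        description="Create a customer with a few policies for manual testing.",
    )
    parser.add_argument("surname", help="surname for the new customer")
    # Only the choices testers actually need are offered.
    parser.add_argument(
        "--policies",
        type=int,
        default=0,
        choices=range(0, 4),
        help="number of policies to attach (0 to 3)",
    )
    return parser


def main(argv=None):
    # With argv=None, argparse reads the real command line.
    args = build_parser().parse_args(argv)
    result = load_customer(args.surname, policies=args.policies)
    print(f"Created customer {result['surname']} with {result['policies']} policies")
    return result
```

Calling `main(["Smith", "--policies", "2"])` exercises the same path a tester would trigger from a shell, which also makes the wrapper easy to test.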

Photo by Franki Chamaki on Unsplash

Benefits and Opportunities

As mentioned, there were a number of benefits and opportunities, some of them unexpected, that arose as a result of creating these automated tools and libraries for loading data.

Speed

A key benefit of the tools was that they greatly decreased the time required to set up the system for manual tests. Rather than manually working their way through the UIs of multiple core applications, the testers could run a simple, constrained tool to get their desired setup.
On occasion, the testers would run one of the data load tools to reach a certain point in the setup process, and then customise the data further within the core applications. This was still much quicker than setting up everything themselves.

Creating Data Instead of Searching For It

Rather than creating data from scratch, manual testers would often try to speed up their tests by searching for pre-existing data that might fit their requirements. For some styles of test, this worked reasonably well for them. However, there were issues with this approach:

For a given test, they would generally go searching for the same data they had used in the last testing phase. There was no guarantee that data they had used to test a particular scenario in the previous testing phase would still be available in the current set of production data. The data may have been removed or could have had its attributes modified in a way that no longer suited the given test scenario.
For some tests, there was no straightforward way to search for the relevant data.

By creating and loading the data for them automatically, they could have an assurance that the required data existed in exactly the form they needed.

Parallel Testing

There were often multiple manual testers working in the same system testing environment. If they chose to search for existing data, there was the possibility that two testers could find and use the same data for their test. If either of those tests involved updates, it meant that the tests could interfere with each other.
The data in the systems we were testing was naturally scoped by customer name, which meant that we could use a uniquely generated customer surname to ensure non-overlapping data.
The test data we generated and loaded utilised these uniquely scoped IDs (surnames) such that each test was guaranteed to have its own unique data with which to work. This enabled manual tests to proceed in parallel in the same environment.
As a bonus, this allowed us to run our automated tests in the same environment as the manual testers, at the same time without clashing.
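As a rough illustration of generating such surnames, the sketch below builds a letters-only name from a recognisable prefix and a random suffix. The `Zztest` prefix, the suffix length and the letters-only restriction are assumptions made for the example, and a random suffix makes collisions extremely unlikely rather than strictly impossible.

```python
import secrets
import string


def unique_surname(prefix="Zztest", length=10):
    """Return a surname that is very unlikely to match any other.

    The prefix (an assumption here) makes generated customers easy to
    recognise; the random suffix keeps each test's data separate.
    """
    suffix = "".join(secrets.choice(string.ascii_lowercase) for _ in range(length))
    return prefix + suffix


# Each test, manual or automated, asks for its own surname up front
# and creates all of its data under that name.
surname_for_test_a = unique_surname()
surname_for_test_b = unique_surname()
```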

Repeatability

By either creating new data manually, or searching for existing data which may have been modified since the last testing phase, manual testers ran the risk that the initial data might not exactly match what was used the last time a scenario was run, potentially affecting the system behaviour and the ultimate result of their test.
Using data generated by our tools ensured that it was never unintentionally changed, making manual tests more repeatable. Testers had full control of their test data.
With this repeatability, it was then easier to reproduce test scenarios when required, for investigation by developers.

Increased Data Complexity/Realism

By creating tools which set up data across several core applications at once, while ensuring that the data was correctly related across those systems, it was possible to emulate more realistic customer scenarios. Such scenarios had generally been avoided previously by manual testers, due to the complexity of the setup process involved.

Extensibility

By providing sophisticated, yet realistic, common starting points in the data setup, our tools encouraged the testers to consider creating variations on the data which began from these points. This enabled them to test alternate scenarios which would have taken too much time to set up previously. This opened up a new window into fertile exploratory testing.
The manual testers sometimes needed collections of similar data, perhaps to test searching and filtering functionality in the applications, or to test how the UI scaled when there were more items in a set of results than might fit on a single page. Adding facilities to help them with this proved relatively easy, given that we already had code to generate and load the individual data items.
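To show why this was a small step once single-item loading existed, here is a minimal sketch that derives a collection of similar items from one template. The `Claim` type and its fields are hypothetical, and the loop that would hand each item to the loader is left out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Claim:
    customer_surname: str
    reference: str
    amount: float


def make_similar_claims(template, count):
    """Return `count` claims that differ from the template only slightly.

    Each copy gets its own reference and a slightly different amount,
    which is enough to exercise searching, filtering and paging.
    """
    return [
        replace(
            template,
            reference=f"{template.reference}-{index:03d}",
            amount=template.amount + index,
        )
        for index in range(1, count + 1)
    ]


template = Claim(customer_surname="Zztestexample", reference="CLM", amount=100.0)
# 25 items: more than a typical single page of results.
claims = make_similar_claims(template, 25)
```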

Rolling Back The Mask

Some time back, a decision had been made to roll recent production data back through the staging and test environments. This has been a commonplace practice in many organisations and there were some good reasons for doing so in this particular case:

It ensured that any data migration scripts for moving to new versions of the various applications would work with all existing production data.
It ensured that any code that ran over all or large portions of the data would work with existing production data.
It provided the manual testers with a realistic data set from which to choose interesting and relevant examples.

However, there were several costs associated with this decision:

As some production data was of a sensitive nature, the data needed to be masked before it could be loaded into the non-production environments.
At the time of the original decision, data sizes were small, but they grew over time to the point where the data masking process took a long time to run (several weeks) and required a scaled up server.
The length of the process meant that it required significant nursing from operations people to complete.
Delivery of data back into the non-production environments was significantly delayed, making the production data out of date. As such, its usefulness for checking that various scripts worked with all production data was somewhat reduced.
There were a variety of batch processes that would automatically run in the applications. These processes worked on any prior data which had not previously been processed in that environment. This meant that a large number of batch processes fired up whenever a new set of production data was loaded, and these processes ran for many hours and even days, interfering with capacity required for

Once we provided the manual testers with another means for obtaining the data they needed, we revisited the case for rolling back masked production data, given its costs. It turned out that over the years, very few issues had been caught relating to scripts not working with realistic production data. What issues there were of this sort could generally be caught with our generated test data.

So the decision was made to abandon the existing process of rolling back production data and all of its associated costs. This is a good example of how you can revisit your assumptions about existing processes once you have automatic loading of test data in place. The end result was not something we could have predicted when we first started down this path.

Photo by rawpixel on Unsplash

Conclusion

So loading data automatically, as generally required for test automation, can have a set of benefits across a broader testing effort and elsewhere in your organisation. In our case, for example, it simplified operations, which was a surprising result.

When assessing the value of test automation, you should be ready to weigh up the possibility of these sorts of opportunities on the plus side of any business case, in addition to the more obvious benefits you will reap.