Coding Libetartian

Steve Moyer's thoughts on creating software for remuneration and finding the freedom to do it.

Working With Legacy Databases: A Case Study


I recently blogged about a process for Regaining Agility in the Face of Legacy Databases. A general process is nice but I find that a real example is more helpful.

The client’s primary application was a domain-specific content management system. The project charter was to help implement iterative development and automated testing practices while delivering new features. Upon arriving, my ThoughtWorks colleagues and I realized that the legacy code and database would make any significant change extremely risky. This, of course, was no surprise to the team of client developers. While many development and testing related activities were going on at the same time, this post will focus on the databases and automated testing.

Given the state and size of the code, unit testing all (or even most) of the application was not a feasible way to reduce regressions in the code base at large. To reduce the risk, we needed to add automated tests. The decision was made to begin adding browser-based system tests. For these tests to be at all consistent and maintainable, we needed to get the database under control and enable team members to work and build tests in local isolation.

The Situation:

  • Three primary SQL Server databases
    • Total size of the databases without audit and historical data: ~75GB
    • Circular references between databases
  • 100-150 tables
    • Significant data movement back and forth between sets of similar tables
    • Significant data which was not valid under the schema constraints
  • ~1000 stored procedures
    • Many unable to be compiled due to missing references
    • DBAs/developers already removed many to get to ~1000
  • No automated tests
  • Many significant legacy issues in the codebase
    • Three+ active data access layers (large amounts of overlap)
    • Wide scale copy and paste with slight divergence
    • Duplicated code in C# and SQL
  • Developers primarily using a shared set of databases on a development server
    • Months out of date/rarely updated
  • DBAs
    • Had already flagged many out of date stored procedures for recompile with plans to delete if not accessed within a few months
    • Had consolidated ~20 databases down into a much smaller number
    • Were resistant to automation
    • Were resistant to using version control
    • Were resistant to participating in the development process
  • Releases
    • Multi-week manual QA process
    • Multiple consecutive troubled releases

Prior attempts at wrangling the databases

  • Visual Studio Database Projects
    • Didn’t deal well with multiple (troubled) databases
    • Not incremental; change scripts were generated at deployment time from database snapshots
    • Implicit changes were created based on the current state of the complete database
  • Visual Studio Data Generation Plans
    • Doesn’t produce meaningful data
  • Jailer
    • More suited to a one-time extraction

The Process

Primary tools: nDump, Sahi, ThoughtWorks Twist

Phase 1: Generating a baseline schema (One person over a few days)

Schema scripts were generated for each database using Microsoft SQL Server Management Studio, initially one script per database. Additional scripts were created to work around issues related to the ordering of database creation. Bodies of non-compilable stored procedures were emptied but not completely removed; leaving the empty stored procedures allowed using the migration process to delete them from production. Creation of the empty databases was fully automated by applying the scripts in a specific order using batch files. Each workaround created along the way was a spotlight on a problem to be addressed in the future.
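The "apply scripts in a specific order" step can be sketched as a small driver. The script names, ordering, and sqlcmd invocation below are illustrative assumptions, not the client's actual batch files:

```python
import subprocess
from pathlib import Path

# Hypothetical script names: one schema script per database, plus extra
# scripts added to work around cross-database creation-ordering problems.
SCRIPT_ORDER = [
    "01_create_empty_databases.sql",
    "02_content_schema.sql",
    "03_users_schema.sql",
    "04_cross_database_objects.sql",
    "05_stored_procedures.sql",
]

def scripts_to_apply(script_dir):
    """Return the schema scripts in their fixed order, failing fast if one is missing."""
    paths = []
    for name in SCRIPT_ORDER:
        path = Path(script_dir) / name
        if not path.exists():
            raise FileNotFoundError(name)
        paths.append(path)
    return paths

def apply_scripts(script_dir, server="localhost"):
    """Apply each script with sqlcmd; -b makes sqlcmd exit non-zero on SQL errors."""
    for path in scripts_to_apply(script_dir):
        subprocess.run(["sqlcmd", "-S", server, "-b", "-i", str(path)], check=True)
```

Failing fast on a missing or misnamed script keeps the build honest: every workaround is visible in the ordered list rather than buried in someone's manual process.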

Phase 2: Generating an initial dataset (One person over ~3-4 weeks)

Data generation started with the tables necessary to use the login screen of the content management system. Once logging in was possible, data generation progressed to a few other core features such as the content editor. Data was extracted using nDump, which exports a set of tables in a manually configured order. As with script generation in phase 1, the problems which came to light during this phase pointed out issues to be corrected in the future.

Multiple data extraction techniques were used to populate the dataset:

Full table extraction from production was used to populate lookup tables using nDump.

Partial table extraction from production was used to selectively export small subsets of data using nDump’s optional per table filtering.

The target application itself was also used to create data.

Small amounts of data were typed manually into sql tables or into the CSV export files.

After each extraction, the databases were rebuilt and the data was reloaded using nDump. This ensured that the data would pass constraints and go into the databases successfully. Often, foreign keys were violated, indicating that additional tables needed to be added to the export process.
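That feedback loop can be sketched as a simple dependency check. The table names and the foreign-key map here are hypothetical, standing in for metadata you would read from the real schema:

```python
# Hypothetical foreign-key map: table -> set of tables it references.
FOREIGN_KEYS = {
    "Content": {"Users", "ContentTypes"},
    "Users": {"Roles"},
    "ContentTypes": set(),
    "Roles": set(),
}

def missing_dependencies(exported, fk_map=FOREIGN_KEYS):
    """Tables referenced by the export set but not themselves exported.

    Loading the CSVs would violate foreign keys on each table named here,
    so each one needs to be added to the export configuration.
    """
    exported = set(exported)
    missing = set()
    for table in exported:
        missing |= fk_map.get(table, set()) - exported
    return missing
```

Running a check like this before a load turns a surprise constraint violation into an explicit to-do list of tables to add.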

Phase 3: Initial browser based testing and continuous integration (Two people over ~1 week)

Multiple test framework options were considered independently of this database effort. Sahi was selected as the tool of choice for browser automation and Twist was selected for test suite management. Once a small set of features was up and running, an initial automated test was created. The database build and initial automated test were run as a job in the Jenkins CI server.

Phase 4: Extending the team and moving forward (Ongoing effort across multiple distributed teams)

With the build and test groundwork in place, responsibility for writing tests and creating test data was extended to the teams creating new features, as well as to an additional team set up specifically to create a regression suite. All development changes to the database schemas and stored procedures were made through migration scripts. Picking a number for a migration script and generating new test data were each limited to one person at a time. This avoided merge conflicts, which were particularly difficult given database-generated ids that often collided. To ensure only one person was engaging in these activities at a time, the Pass The Puppy - Distributed Token Application was used. If all team members are in a single location, it is easier to use a physical token such as a stuffed animal.
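To see why numbering needed serializing, here is a sketch assuming a hypothetical NNNN_description.sql naming scheme. A scan like this can hand out the next number and detect collisions, but only after both colliding scripts already exist, which is why passing a token to one person at a time worked better than detection after the fact:

```python
import re
from pathlib import Path

# Hypothetical naming convention: 0001_create_widgets.sql, 0002_add_index.sql, ...
MIGRATION_PATTERN = re.compile(r"^(\d{4})_.+\.sql$")

def next_migration_number(migration_dir):
    """Scan existing migration scripts and hand out the next sequential number."""
    numbers = []
    for path in Path(migration_dir).glob("*.sql"):
        match = MIGRATION_PATTERN.match(path.name)
        if match:
            numbers.append(int(match.group(1)))
    duplicates = {n for n in numbers if numbers.count(n) > 1}
    if duplicates:
        # Two developers picked the same number on different branches.
        raise ValueError(f"colliding migration numbers: {sorted(duplicates)}")
    return max(numbers, default=0) + 1
```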

On each commit to source control, the databases were rebuilt from scratch. The current set of migration scripts was then applied. In addition, the migrations were separately applied to production backups. This ensured that the changes, which worked with the test databases, would also work when applied to production. Once the databases were built, the browser tests of the active development teams were run against them. A few times each day, the entire regression suite was run against the current state of the code and databases.


Baseline scripts with migrations

  • Pros:
    • Transparent and customizable process
    • Easy to follow the evolution of the schema and stored procedures
    • Changes are explicit
    • Repeatable
  • Cons
    • A few hours of effort were required to update and audit the baseline after a major release
      • Releases with only a few changes may be applied directly to the existing baseline
    • Developers are forced to script database changes

CSV based data imported and exported with nDump

  • Pros:
    • Repeatable
    • Transparent
    • Incremental
      • Continually add data as needed
    • Easy to diff
    • Very flexible, use multiple strategies for data generation
  • Cons:
    • CSV files have no native differentiation between empty and null strings
    • Data modifications must be serialized, significant merges within an overlapping set of tables are not reasonable
    • The data load process can be quite slow depending on the method used. Generally there is a tradeoff between data validation and speed.
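The null-versus-empty ambiguity is easy to demonstrate with Python's csv module. The sentinel token below is one illustrative convention for working around it, not nDump's actual behavior:

```python
import csv
import io

NULL_TOKEN = "{NULL}"  # sentinel convention; any token that never appears in real data works

def write_row(values):
    """Write one CSV row, encoding Python None as the sentinel so NULLs survive."""
    buf = io.StringIO()
    csv.writer(buf).writerow(NULL_TOKEN if v is None else v for v in values)
    return buf.getvalue()

def read_row(line):
    """Read one CSV row, decoding the sentinel back to None."""
    row = next(csv.reader(io.StringIO(line)))
    return [None if v == NULL_TOKEN else v for v in row]
```

Without a convention like this, a NULL column and an empty-string column produce the same empty CSV field, and the distinction is silently lost on reload.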

The Results

Progress made

  • 3 primary SQL Server databases (Total size of test databases: < 150MB)
  • ~100 CI builds per day
  • ~3000 browser based regression tests (tested in CI)
  • Over 500 database migrations to date (tested in CI)
  • Releases
    • A few days of manual QA prior to release
    • New features visible to quality analysts and business users well before release
    • Much less stressful
  • Developers and Quality Engineers work in local isolation (internet connection optional)
    • Updated on each pull from version control
  • DBAs
    • A bit more involved in the development process
    • Still resistant to using version control


Over the next couple of years, multiple projects were successfully delivered. Some of these projects required wide-ranging changes to the databases. With the added automated test coverage and improved development practices, releases were less stressful and more frequent. The relatively basic toolset (nDump, scripting) allowed for customization at each step of the process. Other tools we tried offered a nice user interface but left few options when faced with challenges their developers had not accounted for. The test data and database development process was not without pain and effort, and it alone is not responsible for most of the progress that has been made, but I do think it was crucial in allowing the dedicated test and development teams to significantly improve the situation.