Coding Libertarian

Steve Moyer's thoughts on creating software for remuneration and finding the freedom to do it.

Regaining Agility in the Face of Legacy Databases


Problem Statement:

As organizations embrace iterative software development they often find themselves weighed down by their legacy databases. These databases seem nearly impossible to understand. They have multiple levels of tangled stored procedure calls, missing relational integrity, obsolete and duplicated data, circular references between databases, etc. Some employees understand portions of the database. Nobody wants to make any significant changes for fear of unintended consequences.

Developers (henceforth devs) and Quality Engineers (QEs) typically work off of a shared development environment which hosts backups from production. The development environment often diverges from production as devs make their changes directly to the shared database. QEs add data to support their testing. Temporary changes often persist and the data is in a constant state of flux. The devs and QEs are reluctant to restore new backups as they may lose the schema changes and data they have built up.

Database Administrators (DBAs) tend to host their own production copies outside of the development environment. They’re less interested in what is going on in development. They are concerned with production. They deploy changes to production on their own cycle and often do not apply the changes to the development environment.

When it’s time to deploy a release, a barrage of development changes is dropped on the DBAs. They most likely do not have context for the development changes and don’t have much time to review them. They are not happy that someone sent a script that would overwrite the performance optimizations they have already deployed to production. Eventually the changes are applied and verified in a staging environment. There are times when releases are smooth. More often, they are followed by a mad scramble to fix the problems.

Unfortunately, this situation is more the rule than the exception. Within the software development process, the lowest-cost next step appears to be adding another table, procedure, or band-aid to the picture. Priorities, deadlines, and fear rarely afford the opportunity to reduce the technical debt and improve the process in a meaningful way. How can an organization regain flexibility and start to chip away at the mounting cost of change?

Goals and Guidelines For a Solution:

If all of your database applications were well-understood, had robust automated test coverage, or were not under development, it’s likely you would have stopped reading. To improve the situation, what needs to be accomplished?

Build Knowledge about the Databases

Legacy databases contain a large amount of latent knowledge. Stored procedures are often the source of truth when it comes to portions of business logic. How the tables relate to each other may describe requirements of the system. Quirks in the design may tell parts of the development story. The people who originally held this knowledge may be long gone. It would be a shame to spend significant time untangling the web only to start over the next time someone moves on.

Allow for Iterative Progress

To solve a problem of this size it is necessary to take it one piece at a time. Legacy databases are usually too complex for any one person to hold in memory. When one section is understood, the knowledge can be captured and the process can move to the next section. It should be easy to run the automated process over and over.

Generate a Minimum Viable Database

Copies of production databases are great for troubleshooting and reporting. They’re terrible for development and automated testing. For development and testing, the dataset should be small enough to understand what is in it. The data it contains should be meaningful. It also needs to allow a functional application to run on top of it.

Keep the Schema Close to Production

Once the Minimum Viable Database is reasonably under control, you’re going to want to start making schema and procedure changes, eventually propagating them to production. Maintaining a list of things that are different in development and production is annoying and error-prone. Keep the list as short as possible.

Version Changes

The schema, data, and other products of the process are important and valuable. They should be kept in source control alongside the application code. While database backups are useful for many purposes, they should be avoided as products of the process.

Solving the Problem:

Many tools on the market seem to go a long way towards unraveling the problem. Some compare two database structures and generate differential scripts to sync them. Others generate test data. None of them are open, robust, iterative, and able to be fully automated. Keep them in your arsenal. They will come in handy. Through exercising the following process you can accomplish the goals stated above.

Create a Schema Baseline

1. Generate schema and stored procedure scripts

Generate scripts capable of creating your databases complete with schema and stored procedures but without data. Modify the scripts until they run through successfully without intervention.
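As a rough illustration of what a schema-only baseline looks like, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in, since the process above does not depend on a particular database product, and SQLite has no stored procedures, so only tables, indexes, views, and triggers are covered. The function names dump_schema and build_from_baseline are hypothetical. On a real server you would use that product's own scripting tool instead.

```python
import sqlite3


def dump_schema(conn: sqlite3.Connection) -> str:
    """Return the CREATE statements for all user objects, with no row data."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE sql IS NOT NULL AND name NOT LIKE 'sqlite_%' "
        # Tables first, then indexes, then views and triggers.
        "ORDER BY CASE type WHEN 'table' THEN 0 WHEN 'index' THEN 1 ELSE 2 END, name"
    ).fetchall()
    return "".join(sql + ";\n" for (sql,) in rows)


def build_from_baseline(schema_sql: str) -> sqlite3.Connection:
    """Create an empty in-memory database from the baseline script.

    executescript raises on the first failing statement, which is the
    signal to go back and fix the script.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)
    return conn
```

The output of dump_schema is plain text, so it can be committed to version control and rebuilt repeatedly until it runs clean.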

Create an Initial Dataset

2.1 Pick a Target Feature

Select a feature or area of the application to work on. A good starting place may be logging a user into the application and viewing the start screen.

2.2 Automate a test to perform the steps (Optional but recommended)

The feature will be tested over and over as additional data is added. To save time and effort, an automated test should be created for it. As the application may not yet be functioning against the new database, a working deployment of the application can be used to construct your test.

2.3 Run the Selected Feature

Exercise the test(s) of the selected feature. Debug the application or review errors to determine which missing data caused the failures. If the feature is working as expected, return to step 2.1.

2.4 Select a Group of Tables

Select a small number of tables related to the feature being worked on.

2.5 Create Data to Populate the Selected Tables

There are many strategies to get data for a given table:

  • Create it manually in the database tables
  • Create it manually in storage files (XML, CSV, tab delimited, etc.)
  • Copy a filtered subset from production (if appropriate). Recommended for reference tables
  • Use the working portions of the application to create data
  • Use a data generator
  • Copy and modify other data you have already created
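To make the "filtered subset" strategy concrete, here is a minimal sketch, again using Python's sqlite3 module as a stand-in for whatever database you actually run. The helper name copy_subset, its parameters, and the assumption that the table has the same columns in the same order on both sides are all hypothetical.

```python
import sqlite3


def copy_subset(src, dst, table, where="1=1", params=()):
    """Copy the rows of `table` matching `where` from src to dst.

    Assumes both databases define the table with identical column order.
    `table` and `where` are interpolated into the SQL, so they must come
    from a trusted script, never from user input.
    """
    cur = src.execute(f'SELECT * FROM "{table}" WHERE {where}', params)
    marks = ", ".join("?" for _ in cur.description)
    rows = cur.fetchall()
    dst.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
    dst.commit()
    return len(rows)
```

For a reference table, a call such as copy_subset(prod, dev, "countries", "active = ?", (1,)) would bring over only the rows still in use.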

2.6 Export the data to storage files, if necessary

Storage files should be human readable, version-able, and diff-able (CSV, tab delimited, XML, etc.). Unless the data was entered directly into the storage files it will need to be exported to them. There are many tools that can export data to the above mentioned file types. Select a suitable tool as needed. Exporting will be done repeatedly and must be completely automated.
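One way an automated export might look, sketched with Python's csv and sqlite3 modules. The one-file-per-table layout, the header row, and the function name export_table are assumptions made for illustration. Sorting by the first column keeps row order stable between runs, so diffs show only real changes.

```python
import csv
import sqlite3
from pathlib import Path


def export_table(conn, table, out_dir):
    """Write one table to <out_dir>/<table>.csv with a header row."""
    # ORDER BY 1 sorts on the first column (assumed to be the key) so that
    # repeated exports of unchanged data produce identical files.
    cur = conn.execute(f'SELECT * FROM "{table}" ORDER BY 1')
    headers = [col[0] for col in cur.description]
    path = Path(out_dir) / f"{table}.csv"
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(headers)
        writer.writerows(cur)  # note: NULL values are written as empty fields
    return path
```

Because NULLs and empty strings both end up as empty fields here, a real export would need a convention for telling them apart.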

2.7 Re-create the databases and load the data

Re-create the database schemas from scratch and load all of the data from the storage files. There are many options for loading the data from the storage files. Most database tools have bulk load facilities. The re-creation and load should be transparent on failure and must be completely automated. If the load fails, additional data may need to be created for dependent tables. When the load is successful, return to step 2.3 and continue until the feature is working.
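A minimal sketch of the rebuild-and-load step, assuming one CSV file per table with a header row and an explicit load order that puts parent tables before their dependents. SQLite (via Python's sqlite3 module) stands in for your real database, and rebuild_and_load is a hypothetical name. A production-sized load would normally use the database's own bulk loader instead.

```python
import csv
import sqlite3
from pathlib import Path


def rebuild_and_load(schema_sql, data_dir, load_order):
    """Create a fresh database from schema_sql, then load each table's CSV."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(schema_sql)
    for table in load_order:
        csv_path = Path(data_dir) / f"{table}.csv"
        with csv_path.open(newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            headers = next(reader)
            cols = ", ".join(f'"{h}"' for h in headers)
            marks = ", ".join("?" for _ in headers)
            sql = f'INSERT INTO "{table}" ({cols}) VALUES ({marks})'
            try:
                conn.executemany(sql, reader)
            except sqlite3.Error as exc:
                # Name the table so the failure is easy to trace.
                raise RuntimeError(f"loading {table} failed: {exc}") from exc
    conn.commit()
    return conn
```

A foreign key failure here is usually the hint that a dependent table still needs data.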

2.8 Commit changes

Working in small cycles is highly recommended. Commit to version control frequently. If desired, return to step 2.1 and select another feature.

Ongoing Development

Changing Schema and/or Stored Procedures

Once the baseline structure is established, version control should be treated as the beginning of the path to production for schema and stored procedure changes. To introduce a change, create a numbered or time-stamped migration script. Store the migration scripts in a single location within your version control.

Adding or Changing Data

The process of adding and changing data is similar to the initial data creation process. Load the current data locally, or in an isolated environment. Change the data as desired. Care should be taken to limit the changes to only those that are desired. Export the data again and commit the changes.

Application Releases

Migration scripts can be packaged with application builds and deployed to a given environment at the same time as the application. Good migration systems track which migrations have been applied and run only migrations which have not already been applied to the target environment.
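To show the tracking idea in miniature, here is a sketch of a migration runner using Python's sqlite3 module. It is not any particular migration tool. The schema_migrations table, the .sql file naming, and the function name apply_migrations are assumptions made for illustration.

```python
import sqlite3
from pathlib import Path


def apply_migrations(conn, migrations_dir):
    """Run every not-yet-applied *.sql script in name order.

    Names are assumed to sort correctly as text, for example zero-padded
    numbers (0001_...) or timestamps (20240131_...).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    newly_applied = []
    for path in sorted(Path(migrations_dir).glob("*.sql")):
        if path.name in applied:
            continue  # already run against this environment
        conn.executescript(path.read_text(encoding="utf-8"))
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (path.name,))
        conn.commit()
        newly_applied.append(path.name)
    return newly_applied
```

Running it twice against the same database applies nothing the second time, which is what makes it safe to call on every deployment.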


Reduced Risk of Releases

With the database build, migration, and load process completely automated, each member of the team can operate on a local copy of the database. DBAs can easily follow the proposed changes to the databases and expose their own to developers. All internal releases (Continuous Integration, test, staging) can employ the automation as well. Before hitting production, all changes will have been executed many times. Breaking changes will be visible earlier when intervention is less costly.

Increased Consistency of Automated Tests

Employing an explicitly created dataset, which does not rely on consistency of data refreshed from production, will decrease false negative test results.

Increased Developer Productivity

Working with smaller datasets in a local environment allows devs to cut down cycle times and suffer fewer unintended collisions. Working locally also significantly decreases the impact of network or shared environment downtime.