Coding Libetartian

Steve Moyer's thoughts on creating software for remuneration and finding the freedom to do it.

Fluent Java Collection Builders With Lambdas and VarArgs

| Comments

Lambdas have finally come to Java with the release of Java 8. Varargs have been around. I think that the combination of varargs and lambdas will be helpful in two different areas: test-data set generation and builders. For Example:

public class CollectionBuilder<T> extends ArrayList<T> {
    public CollectionBuiler<T> split(UnaryOperator<CollectionBuilder<T>> operations...) {
        return this;

We’ve taken advantage of Java’s UnaryOperator lambda signature for a function that takes one parameter and returns the same type. Why is this function called split? We’ll get to that in a minute.

Contrived Scenario: Testing filter criteria Given a function which takes a List of Orders and filters them based on criteria

List<Order> complexFilter(List<Order> orders) {


To test the above method, we’d like to use the collection builder to continually build up similar objects. At some point we’d like them to have unique point. After a unique point we’d like to be able to continue configuring them as a group. When created, the builder should contain one object. As we add details we can duplicate the objects and add individual values and then continue configuring.

With only var args we can do something like:

CollectionBuilder<Order> builder = new CollectionBuilder<Order>();

Each time we pass more than one value as var args the builder will duplicate the the existing rows and set the property value on a single copy. In this case we end up with:

[{customer: acme,
    duedate: somedate,
    ordertype: sometype,
    ordervalue: 115.00,
{customer: acme,
    duedate: somedate,
    ordertype: sometype,

What if we want two properties to differ between the elements:

[{customer: acme,
    duedate: anotherDate,
    ordertype: sometype,
    ordervalue: 115.00,
{customer: acme,
    duedate: somedate,
    ordertype: sometype,

We can try that with only varargs:

CollectionBuilder<Order> builder = new CollectionBuilder<Order>();

In this case we would end up with four objects instead of two. Fortunately lambdas can give us the ability to control this granularity by setting multiple properties on a subset.

CollectionBuilder<Order> builder = new CollectionBuilder<Order>();

This gets us the two objects that we wanted. What if we want the list to have distinct groups of objects? We can apply multiple function to the builder, perhaps using named functions rather than lambdas.

CollectionBuilder<Order> builder = new CollectionBuilder<Order>();

Notice after combining the orders we can still set any properties that we want to have the same value across the collection.

Since the builder only splits the collection when it gets multiple values(or functions) we can also chain multiple named configurations to a single object/set of objects.

CollectionBuilder<Order> builder = new CollectionBuilder<Order>();

Each of these functions could be a simple setting of a property or create complex child object(s). For small numbers of simple objects it’s easy to just create them or use basic fixtures but as the complexity and number of objects grows, this vararg + lambda strategy can be pretty powerful.

Working With Legacy Databases: A Case Study

| Comments

I recently blogged about a process for Regaining Agility in the Face of Legacy Databases. A general process is nice but I find that a real example is more helpful.

The client’s primary application was a domain-specific content management system. The project charter was to help implement iterative development and automated testing practices while delivering new features. Upon arriving, my ThoughtWorks colleagues and I realized that the legacy code and database would make it extremely risky to make any significant changes. This, of course, was no surprise to the team of client developers. While many development and testing related activities were going on at the same time, this post will focus on the databases and automated testings.

Given the state and size of the code, unit testing all (or even most) of the application was not a feasable way to reduce regression in the code base at large. In order to reduce the risk, we needed to add automated tests. The decision was made to begin adding browser-based system tests. For these tests to be at all consistent and maintainable, we needed to get the database under control and enable team members to work and build tests in local isolation.

The Situation:

  • Three primary Sql Server databases
    • Total size of the databases without audit and historical data: ~75GB
    • Circular references between databases
  • 100-150 tables
    • Significant data movement back and forth between sets of similar tables
    • Significant data which was not valid under the schema constraints
  • ~1000 stored procedures
    • Many unable to be compiled due to missing references
    • DBAs/developers already removed many to get to ~1000
  • No automated tests
  • Many significant legacy issues in the codebase
    • Three+ active data access layers (large amounts of overlap)
    • Wide scale copy and paste with slight divergence
    • Duplicated code in C# and
  • Developers primarily using a shared set of databases on a development server
    • months out of date/rarely updated
  • DBAs
    • Had already flagged many out of date stored procedures for recompile with plans to delete if not accessed within a few months
    • Had consolidated ~20 databases down into a much smaller number
    • Were resistant to automation
    • Were resistant to using version control
    • Were resistant to participating in the development process
  • Releases
    • Multi-week manual QA process
    • Multiple consecutive troubled releases

Prior attempts at wrangling the databases

  • Visual Studio Database Projects
    • Didn’t deal well with multiple (troubled) databases
    • Not incremental, change scripts generated at time of deployment based on database snapshots
    • Implicit changes created based of the current state of the complete database
  • Visual Studio Data Generation Plans
    • Doesn’t produce meaningful data
  • Jailer
    • More suited to a one time extraction

The Process

Primary Tools:, nDump, Sahi, ThoughtWorks Twist

Phase 1: Generating a baseline schema (One person over a few days)

Schema scripts were generated for each database using Microsoft Sql Server Management Studio. Initially one script per database. Additional scripts were created to work around issues related to ordering of database creation. Bodies of non-compilable stored procedures were emptied but were not completely removed. Leaving the empty stored procedures allowed using the migration process to delete them from production. Creation of the empty databases was fully automated by applying the scripts in a specific order using batch files. Each work around that was created was a spotlight on a problem to be addressed in the future.

Phase 2: Generating an initial dataset (One person over ~3-4 Weeks)

Data generation started with tables necessary to use the log in screen of the content management system. Once logging in was possible, data generation progressed to a few other core features such as the content editor. Data was extracted using nDump which exports a set of tables in a manually configured order. As with script generation in phase 1, the problems which came to light during this phase pointed out problems to be corrected in the future.

Multiple data extraction techniques were used to populate the dataset:

Full table extraction from production was used to populate lookup tables using nDump.

Partial table extraction from production was used to selectively export small subsets of data using nDump’s optional per table filtering.

The target application itself was also used to create data.

Small amounts of data were typed manually into sql tables or into the CSV export files.

After each extraction, the databases were rebuilt and the data was reloaded using nDump. This ensured that the data would pass constraints and go into the databases successfully. Often, foreign keys were violated indicating addition tables needed to be added to the export process.

Phase 3: Initial browser based testing and continuous integration (Two people over ~1 week)

Multiple test franework options were considered independent of this of this database effort. Sahi was selected as the tool of choice for browser automation and Twist was selected for test suite management. Once a small set of features were up and running, an initial automated test was created. The database build and initial automated test were run as a job in the Jenkins CI server.

Phase 4: Extending the team and moving forward (Ongoing effort across multiple distributed teams)

With the build and test groundwork in place, the responsibility for writing tests and creating test data were extended to the teams creating new features as well as an additional team set up specifically to create a regression suite. All development changes to the database schemas and stored procedures were done through migration scripts run with Picking a number for a migration script and generating new test data were each limited to one person at a time. This was done to avoid merge conflicts which were particularly difficult given database generated ids which often collided. To ensure only one person was engaging in these activities at a time, the Pass The Puppy - Distributed Token Application was used. If all team members are in a single location, it is easier to use a physical token such as a stuffed animal.

On each commit to source control, the databases were rebuilt from scratch. The current set of migration scripts was then applied. In addition, the migrations were separately applied to production backups. This ensured that the changes, which worked with the test databases, would also work when applied to production. Once the databases were built, the browser tests of the active development teams were run against them. A few times each day, the entire regression suite was run against the current state of the code and databases.


Baseline scripts with migrations run by

  • Pros:
    • Transparent and customizable process
    • Easy to follow the evolution of the schema and stored procedures
    • Changes are explicit
    • Repeatable
  • Cons
    • A few hours of effort were required to update and audit the baseline after a major release
      • Releases with only a few changes may be applied directly to the existing baseline
    • Developers are forced to script database changes

CSV based data imported and exported with nDump

  • Pros:
    • Repeatable
    • Transparent
    • Incremental
      • Continually add data as needed
    • Easy to diff
    • Very flexible, use multiple strategies for data generation
  • Cons:
    • CSV files have no native differentiation between empty and null strings
    • Data modifications must be serialized, significant merges within an overlapping set of tables are not reasonable
    • The data load process can be quite slow depending on the method used. Generally there is a tradeoff between data validation and speed.

The Results

Progress made

  • 3 primary Sql Server databases (Total size of test databases: < 150MB)
  • ~100 CI builds per day
  • ~3000 browser based regression tests (tested in CI)
  • Over 500 database migrations to date (tested in CI)
  • Releases
    • A few days of manual QA prior to release
    • New features visible to quality analysts and business users well before release
    • Much less stressful
  • Developers and Quality Engineers work in local isolation (internet connection optional)
    • Updated on each pull from version control
  • DBAs
    • A bit more involved in the development process
    • Still resistant to using version control


Over the next couple years, multiple projects were successfully delivered. Some of these projects required wide-ranging changes to the databases. With the added automated test coverage and improved development practices, releases were less stressful and more frequent. The relatively basic toolset (, nDump, scripting) allowed for customization at each step of the process. Other tools attempted offered a nice user interface but left few options when faced with challenges the tool developer had not accounted for. The test data and database development process was not without pain and effort. It alone is not responsible for too much of the progress that has been made. I do think it was crucial in allowing the dedicated test and development teams to significantly improve the situation.

Regaining Agility in the Face of Legacy Databases

| Comments

Problem Statement:

As organizations embrace iterative software development they often find themselves weighed down by their legacy databases. These databases seem nearly impossible to understand. They have multiple levels of tangled stored procedure calls, missing relational integrity, obsolete and duplicated data, circular references between databases, etc. Some employees understand portions of the database. Nobody wants to make any significant changes for fear of unintended consequences.

Developers(henceforth devs) and Quality Engineers(QEs) typically work off of a shared development environment which hosts backups from production. The development environment often diverges from production as devs make their changes directly to the shared database. QEs add data to support their testing. Temporary changes often persist and the data is in a constant state of flux. The devs and QEs are reluctant to restore new backups as they may lose the schema changes and data they have built up.

Database Administrators(DBAs) tend to host their own production copies outside of the development environment. They’re less interested in what is going on in development. They are concerned with Production.They deploy changes to production on their own cycle and often do not apply the changes to the development environment.

When it’s time to deploy a release, a barrage of development changes are dropped on the DBAs. They most likely do not have context for the development changes and don’t have much time to review them They are not happy that someone sent a script that would overwrite the performance optimizations they have already deployed to production. Eventually the changes are applied and verified in a staging environment. There are times when releases are smooth. More often, they are followed by a mad scramble to fix the problems.

Unfortunately, this situation is more the rule than the exception. Within the software development process, the lowest cost next step appears to be adding another table/procedure/band aid to the picture. Priorities, deadlines, and fear rarely afford the opportunity to reduce the technical debt and improve the process in a meaningful way. How can an organization regain flexibility and start to chip away at the mounting cost of change?

Goals and Guidelines For a Solution:

If all of your database applications were well-understood, had robust automated test coverage, or were not under development, it’s likely you would have stopped reading. To improve the situation, what needs to be accomplished?

Build Knowledge about the Databases

Legacy databases contain a large amount of latent knowledge. Stored procedures are often the source of truth when it comes to portions of business logic. How the tables relate to each other may describe requirements of the system. Quirks in the design may tell parts of the development story. The people who originally held this knowledge may be long gone. It would be a shame to spend significant time untangling the web only to start over the next time someone moves on.

Allow for an Iterative Progress

To solve a problem of this size it is necessary to take it one piece at a time. Legacy databases are usually too complex for any one person to hold in memory. When one section is understood, the knowledge can be captured and the process can move to the next section. It should be easy run the automated process over and over.

Generate a Minimum Viable Database

Copies of production databases are great for troubleshooting and reporting. They’re terrible for development and automated testing. For development and testing, the dataset should be small enough to understand what is in it. The data it contains should be meaningful. It also needs to allow a functional application to run on top of it.

Keep the Schema Close to Production

Once the Minimum Viable Database is reasonably under control, you’re going to want to start making schema and procedure changes, eventually propagating them to production. Maintaining a list of things that are different in development and production is annoying and error prone. Keep the list as short as possible.

Version Changes

The schema, data, and other products of the process are important and valuable. They should be kept in source control alongside the application code. While database backups are useful for many purposes, they should be avoided as products of the process.

Solving the Problem:

Many tools on the market seem to go a long way towards unraveling the problem. Some compare two database structures and generate differential scripts to sync them. Others generate test data. None of them are open, robust, iterative, and able to be fully automated. Keep them in your arsenal. They will come in handy. Through exercising the following process you can accomplish the goals stated above.

Create a Schema Baseline

1. Generate schema and stored procedure scripts

Generate scripts capable of creating your databases complete with schema and stored procedures but without data. Modify the scripts until they run successfully through without intervention.

Create an Initial Dataset

2.1 Pick a Target Feature

Select a feature or area of the application to work on. A good starting place my be log a user into the application and viewing the start screen.

2.2 Automate a test to perform the steps(Optional but recommended)

The feature will be tested over and over as additional data is added. To save time and effort, an automated test should be created for. As the application my not be functioning, a working deployment of the application can be used to construct your test.

2.3 Run the Selected Feature

Exercise the test(s) of the selected feature. Debug the application or review errors to determine which missing data caused the failures. If the feature is working as expected, return to step 2.1.

2.4 Select a Group of Tables

Select a small numbers of tables related to the feature being worked on.

2.5 Create Data to Populate the Selected Tables

There are many strategies to get data for a given table:

  • Create it manually in the database tables
  • Create it manually in storage files(xml,csv,tab delimited, etc.)
  • Copy a filtered subset from production(if appropriate). Recommended for reference tables
  • Use the working portions of the application to create data
  • Use a data generator
  • Copy and modify other data you have already created

2.6 Export the data to storage files, if necessary

Storage files should be human readable, version-able, and diff-able (CSV,tab delimited,XML, etc.). Unless the data was entered directly into the storage files it will need to be exported to them. There are many tools that can export data to the above mentioned file types. Select a tool that suitable tool as needed. Exporting will be done repeatedly and must be completely automated.

2.7 Re-create the databases and load the data

Re-create the database schemas from scratch and load all of the data from the storage files. There are many options for loading the data from the storage files. Most database tools have bulk load facilities. The re-creation and load should be transparent on failure and must be completely automated. If the load fails, addition data may need be created for dependent tables. When the load is successful, return to step 2.3 and continue until the feature is working.

2.8 Commit changes

Working in small cycles is highly recommended. Commit to version control frequently. If desired, return to step 2.1 and select another feature.

Ongoing Development

Changing Schema and/or Stored Procedures

Once the baseline structure is established, version control should be treated as the beginning of the path to production for schema and stored procedure changes. To introduce a change, create a numbered or time-stamped migration script. Store the migration scripts in a single location within your version control.

Adding or Changing Data

The process of adding and changing data is similar to the initial data creation process. Load the current the data locally, or in an isolated environment. Change the data as desired. Care should be take to minimize the changes to only those that are desired. Export the data again and commit the changes.

Application Releases

Migration scripts can be packaged with application builds and deployed to a given environment at the same time as the application. Good migration systems track which migrations have been applied and run only migrations which have not already been applied to the target environment.


Reduced Risk of Releases

With the database build, migration, and load process completely automated, each member of the team can operate on a local copy of the database. DBAs can easily follow the proposed changes to the databases and expose their own to developers. All internal releases (Continuous Integration, test, staging) can employ the automation as well. Before hitting production, all changes will have been executed many times. Breaking changes will be visible earlier when intervention is less costly.

Increased Consistency of Automated Tests

Employing an explicitly created dataset, which does not rely on consistency of data refreshed from production, will decrease false negative test results.

Increased developer productivity

Working with smaller datasets in a local environment allows devs to cut down cycle times and suffer fewer unintended collisions. Working locally also significantly decreases the impact of network or shared environment downtime.

The Keyboard Is for Everyone

| Comments

Many jobs today require heavy computer and internet use.  How fluently you perform these activities affects how much of your time and focus you can dedicate to what you are trying to accomplish. Most of the time, switching between the keyboard and mouse slows you down. In addition, for some people, using the mouse is painful.

Most people can use CTRL-C and CTRL-V to copy and paste. Some people have invested a fair amount of time in learning keyboard shortcuts(and avoiding the mouse). Unfortunately many other people, including developers, don’t think keyboard shortcuts are relevant for them. Some simply don’t know what they’re missing. I think this is a tragedy. Most applications support keyboard shortcuts and if you haven’t already, take 15 or 30 minutes as you go about some task and try an learn the shortcuts.  It can be rough at first but will pay dividends pretty quickly. Here are some activities where I think sticking to the keyboard is valuable for all computer users.

Switching Between Applications and Tabs (primarily Windows):

Alt-Tab = Switch to the next running application(Alt-Shift-Tab to go to previous)
Ctrl-Tab = Go to the next tabbed window(Ctrl-Shift-Tab to go to previous)
Alt-F4 = Close the currently selected application
Ctrl-F4 = Close the currently selected tab


If you use gmail and haven’t yet enabled keyboard shortcuts, you’re missing out. Managing your inbox is so much easier with the keyboard. Once you turn on keyboard shortcuts (assuming your cursor isn’t in a textbox) you can type “?” to see what shortcuts are available.

A few basics:

gi = go to inbox
gl = go to label
c = compose
r = reply
/ = activate search box

Actions on messages in your inbox:

After you turn on keyboard shortcuts you may notice that there is now a small shaded cursor next to one message in the list.
j = move to next message
k = move to previous message in the list
x = select or de-select the message
e = archive selected messages
# = delete selected messages

Surfing the Web

For most people, surfing the web is a whole lot of clicking. It doesn’t have to be. There are a number of browser add-ins that help you avoid the mouse.

Vimium(for Chrome)

After installing Vimium you can hit the “?” key to see what keyboard shortcuts are available

A few basics:
j = scroll down(notice a pattern here?)
k = scroll up
f = display a key combination to follow links, select textboxes, click buttons, etc.
H = go back in history
Watch the video to see it in action.

Vimperator(for Firefox) is similar to vimium
Pentadactyl (for Firefox) is also good but not for the faint of heart.

What are your favorite tools/shortcuts that help you stick to the keyboard?

Asynchronous Pair Programming(Duet Programming?)

| Comments

Over the last few weeks of 2011 I’ve been playing around with Vim and Rails. I know many people are pairing using Vim. I also know people are using Vim and screen along with some sort of voice chat to pair remotely. One of the limitations(and strengths) of pairing is that both users are working in the same session. Only one person can be typing/browsing/whatever at a time. This can be good when you’re writing code but tends to suck when you’re doing research (on the web or in the code base).

Introducing screen to the mix has the ability to break this dependency on a single session. Each member of the pair can have their own instance of vim in half of a split terminal. Everything that is happening in the screen terminal is still visible to each member of the pair but they have the ability to be active at the same time in separate “windows” of the screen session. Can you write the next failing test while I extract this method? Can you look up what parameters the method I’m about to call expects?

It’s quite possible this practice is widespread and I just haven’t seen it in my little corner of the .Net world. Are people already using screen this way? If so, how is it working out?

Basic A/B Testing in .net (With nToggle)

| Comments

In order to add the most basic support for A/B testing to nToggle I have added the ability to specify your own custom repository for the status of a toggle.  By implementing a custom repository we can write code to determine which version of a feature a given user of the applicaiton should see.

Example:(If you aren’t seeing the embedded code, click through to the full post.)
  In my application I have a drop down list for selecting models of cars.  The number of items in this list is large and I am going to replace it with an auto-complete textbox which I think will create a better user experience. I think this change could have a large impact on users of the application so I’m only going to release it to power users to get feedback. Everyone else will continue to see the old version of the application.

First I wrap the new and old code in a web feature toggle:

Second I add the new toggle to my web.config: Notice that my new toggle has a property on it called “repository”. This tells nToggle where to look for the status of this toggle.

The last step is implementing a repository which will determine if the current user is a power user:

That’s all there is to it. In just a few lines of configuration and code we have created a very simple A/B test of a new feature. Disclaimer: This method is intended for short term use during development and transition. Feature toggles should be short lived and I recommend removing them(and the branches you have created in your code) as soon as reasonable.

Distributed Check-in Tokens: Pass-The-Puppy

| Comments

On many software development teams there is a section of code or some development artifacts which will collide if two people/pairs check in at the same time.  In order to stop this from happening many teams adopt a check in token.  Only the people who have the token can check in the given resource.

On my current team we have two such resources:  our numbered database migration scripts (which will fail if any two have the same number) and our functional test data csv files.  

Co-located teams can use a stuffed animal(like ours above) or some other physical token.   For distributed teams a physical token is not an option.

To enable all of our team members to use the tokens I created a rails app I call pass-the-puppy. This app is simple and low volume, so a free Heroku account is sufficient to host it for the team.  Any team which needs  a set token can clone the app from github and create their own Heroku app.  Sample application

As a rails novice I was quite happy to discover how easy it was to deploy an app to Heroku.  Warning, it’s not aesthetically pleasing.

Numbing the Pain Isn’t Necessarily Good

| Comments

A few months ago I bought a pair of Vibram Five Fingers shoes.  When I first started wearing them I realized my normal tennis/running shoes had been hiding a few things from me.  The standard way I run/walk is on my heel.  This means that I’m limiting the ability of my foot to absorb some of the impact of taking a step. This causes more of the impact to go directly onto my heel and up my leg.  Also my Plantar Fascia were very very tight.  This is important feedback that I wasn’t getting.

When I had pain from running, I’d take Aleve to temper the pain and swelling, and continue to run the same way.  Eventually the stress lead to more significant problems that weren’t so easy to ignore.   By changing the way I step I’m able to eliminate a major contributor to the pain.  At first the change is hard.  I was sore for a while from using a new set of muscles.  It took focus to not fall back into the old step pattern.

Now that I’m adjusted, my calf muscles and plantar fascia are much more flexible. My calves haven’t cramped in months.  I’m able to comfortably walk barefoot on previously painful surfaces(stepping down on a rock in the street with your heel is not fun).

As usual, I’m only mentioning my Five Fingers experience as a metaphor for software development.

If you are developing an application and bugs keep popping up all over the place, you can just just fix the bug and keep on going adding and changing code.  Eventually you’ll spend the vast majority or your time poking and proding and trying to get the system almost stable.  You’ll be afraid to make any changes. Another alternative is that you can start writing tests.  Start with a test for that bug.  Writing tests will be painful at first.  You’ll be exposing all of your bad habits.  Your code will be difficult to test.  As you write more and more tests you’ll be able to start to clean up that code.  You’ll begin to gain the confidence that you can make changes and add features without introducing a bunch of new bugs.

If your release cycle is six months and the releases never seem to go as planned, you can add more time for manual regression or release only every year.  “It’s a painful 3 months but it only happens once a year”, you’ll say.  Alternatively you can decide you want to release every month or every week or every commit.  You can automate all of the steps to release.  You can write system tests that give you confidence that you haven’t missed deploying a stored procedure or changing permissions on that directory. You’ll have some of the same pain but if you’re writing tests it’ll be pain in smaller and smaller amounts. When you test a release after every commit and run tests against it, you’ll know immediately if you forgot to add that new step to the deployment

Usually, when there is pain, there are causes of the pain.  While you might need to numb the pain you’d better make sure you remove the cause of the pain as well.  Start fixing the cause first and numb only if you still need to. Stop numbing the pain as soon as possible.

Usage Is Not a Substitute for Tests

| Comments

Some months ago I encountered a large Sql Server database which was not particularly well understood and didn’t have a test data set. In order to develop a test data set I decided to use a tool(the precursor to nDump) which I had slapped together in about an hour in late 2009. The tool, though primitive, was already working reasonably well at importing CSV. It had no tests, unit or otherwise. All it had in the way of testing was my manual inspection of the application which consumed the database to verify that it was working correctly.

As I used the tool to understand and generate data for the Sql Server database I realized that the tool was buggy and lacking features. I decided to refactor and extend it. Because I was focused not on the tool but on the data set for the application, I continued to make changes and add features without adding tests. I did make an attempt to reasonably factor the new code in a similar manner to how I would if I had been writing tests. Not writing tests at that time was a mistake.

Later in the development I realized this and went back to add tests. How well had I done at writing testable code? Not all that well. Much of the code was easily testable but there were more than a few places where files were being created directly or concrete instances were being new-ed up in the middle of a method doing other work. It certainly takes much more effort to tease the dependencies out and add the tests after the fact. Hopefully this post will serve as a reminder to keep me from making the same mistake again.

Breaking Down Unruly Databases: The Problem

| Comments

In my last post on nDump I talked about a tool I’ve created to import and export data from MS SQL Server databases.  In this post I want to lay out some of my motivation for doing so.

Often when starting at a new client we arrive to find a legacy database/databases which seem nearly incomprehensible. Development and CI often run against a shared databases which are backups of production. The state of the data is in a constant flux.  If someone desires to run a local database, restoring at least a few gigs of database backups is required. One reason this is necessary is that the database(s) can’t easily be recreated from scripts. This is relatively easy to get around by breaking creation scripts into parts and employing tools such as DBDeploy to run them in order.

Once the problem of creating an empty database is tackled, nobody seems to have the requisite knowledge to populate it with the basic data necessary to run the application. There are many problems that spring from a lack of the ability to create minimal data sets:
  1. Returning to a known data set is time consuming, requiring restoring a large backup.
  2. The development and functional test data sets can’t be easily version-ed. While you can check in a backup, there is no reasonable way to diff it against a previous version.
  3. Functional tests are often brittle as they rely upon the state of the data which tends to be rather inconsistent.
  4. Finding data requires searching through a large data set.
  5. Deciphering incremental changes to the state of the data is no small challenge.
On past projects, the teams have stored development and test data sets as groups of csv or XML files. Both of these files can be easily version-ed, diff-ed, and modified using excel or open office. I prefer csv files as they are more concise and contain a lot less noise.  Once you have these file sets and the requisite knowledge of the database structure it’s relatively pleasant to work with them to maintain known state(s) of the data. By running the database build and load in a CI build we can continually verify that we know how to build a database. Using the populated database we can confirm that our functional tests of the application(s) are still working against a consistent data set.