Configuration Driven Development

You’ve heard of Test Driven Development and Feature Driven Development, maybe even Behavior Driven Development. But have you heard of Configuration Driven Development? If not, read on!

Introduction

While working on some recent code projects (both personal and professional), I realized there were going to be many instances of similar but slightly different use cases.

I naturally developed a style I enjoyed using where for each new feature, IO format, or other peculiarity, I wrapped the code in a set of configuration file options. Then adding a new use case that shares this same feature was as simple as specifying those options in the configuration file. Eventually I would approach feature completeness where new use cases could be added with only a few lines in the config file and no code.

I wanted to write about this style, but after some Googling I found that others have already given this approach an apt name: Configuration Driven Development (CDD).

Introduction
What is CDD?
- Definitions
- Real World Analogy
When Should I Use CDD?
When Should I Not Use CDD?
Why Should I Use CDD?
How Do I Do It?
Example - COVID-19 Case Data Web Scraper

What is CDD?

This is a great definition of CDD from Andrew Evans’ post, which I can’t top so I’ll just quote here.

Traditionally we build applications like this:

Lead architects design around business requirements

Application is built and deployed

Changes are done through additional components (SOLID Principles) or painful refactor

With CDD we build applications like this:

Independent components are built first, starting at the atomic level

An interface (usually JSON) is defined to compose the higher level [application]

Combination of reusable components and JSON blueprint allows developers to easily build up and out

At its core, CDD is a way of using modularity to build a loosely coupled set of components that are then composed together using a common interface.

Definitions

Here are some key term definitions, which I’ll try to use consistently throughout.

Problem Space: A problem or set of problems to be solved with code, for which one is considering employing CDD. This could be within a broader code structure, or standalone.
Use Case: An instance of the problem space which should be solved. Use cases normally map 1-1 to applications.
Application: A composition of configured components that is runnable by the application runner in order to fulfill the needs of a particular use case. The nature of an application depends on the problem space. It could be a full-fledged “application” in the traditional sense (à la App Store), like an interactive dashboard. Or it could be a workflow, pipeline, or other loop iteration within a larger program.
Application Runner: Code that instantiates, parameterizes, and executes applications from their configuration file definitions.
Stage: A collection of components that exhibit categorically similar behavior. Stages can be acted on in linear sequence by the application runner, in order to enforce a logical flow. Stage organization is not required for CDD; its utility depends on the use case.
Configuration: A file defining how components are customized via parameters and composed together into applications.
Component: A code module that encapsulates a particular well-defined behavior. Often fits neatly into a behavior stage.
Parameter: A small customization to a component’s behavior that doesn’t rise to the level of becoming its own behavioral component. A smart default is usually provided.
Framework: The whole system of CDD pieces: components, configurations, and how they should be used together to compose applications.

In service of solving Use Cases in a Problem Space, we decide to use Configuration Driven Development. A Configuration file defines multiple compositions of Applications, from individual Components within Stages that are customized with Parameters. These applications are executed by the Application Runner.

Real World Analogy

To make these terms clearer, let’s bring them to life with an analogy: a made-up Build-A-Home company.

Problem Space: People need houses but they like to be able to customize many aspects of the build for their wants and needs. However, most customers will not know the specifics of building procedure, codes, proper materials, etc.; they only want to specify form and function and have the details filled in for them.
Use Case: A person or family’s particular wants and needs when they are in the market for a customizable home.
Application: The completed home after it’s been built.
Application Runner: The construction crew that builds the house according to specifications.
Stage: A step of the building process such as: foundation, structure, plumbing/electrical, finish, interior, etc.
Configuration: The blueprints and plans detailing exactly how the house should be built, what features/components it should have, and how those should be customized.
Component: Features of the home that can be customized or specified. For example, within the structure stage, there might be layout options such as split-level, colonial, or row home. We can also optionally add a garage component or change the default interior layout to add bedrooms or bathrooms.
Parameter: Customizable details of a single component; for example, the overhead-lighting electrical component could have options for light style, number, and position.
Framework: The entire Build-A-Home company and its offerings.

When Should I Use CDD?

There are many cases where CDD will work great, and many cases where it will not. So it’s important to recognize the difference. Here’s when you should consider using CDD for a problem space.

There will be an unknown number of multiple iterations of similar use cases within a large or unbounded problem space. For example:
1. A user-customizable widget-based dashboard
2. An Extract, Transform, Load (ETL) pipeline handling multiple similar data sources
Applications can differ in certain behaviors which can be extracted into common components. They should resemble Lego blocks that are built into bigger forms. For example:
1. Polling a data source every 30 minutes vs. accepting data change notifications
2. Running a subset of available widgets
3. Accepting input from command line vs. web form
These common components can be reused with small configuration parameter changes. For example:
1. I/O file format
2. Website URL
3. Access credentials
4. Data aggregation method
It’s desirable to allow non-developers to update and maintain the applications as use cases form and change.

When Should I Not Use CDD?

Mirroring the points in the section above, this is when you would not want to use CDD:

If the number of use cases is known and small, or if there is an insignificant overlap between use cases, CDD likely will be more effort than it’s worth.
If the use cases do not have significant behavioral differences, you may be able to write one implementation with normal configurable parameters.
If there are no configurable parameter differences in the components, that may be OK. Or it may mean that the behaviors aren’t complex enough to warrant use of CDD.
If you don’t intend for non-developers to be able to update the applications, that is by no means a showstopper. But you would want to ask yourself whether CDD is worth it.

Why Should I Use CDD?

If the problem space meets the criteria above, CDD can be an excellent choice as opposed to the alternatives such as copy-paste development or messy/complicated OOP design. These are some of the benefits.

There should be almost no repetition of code if organized efficiently.
New applications can be added very easily to solve new use cases.
There’s no need to worry too much about future-proofing, as new components and parameters can always be added on later.
It’s easy to allow non-developers to create and maintain their own applications without any code.

How Do I Do It?

You’ve decided to try CDD, now here’s the answer to the next obvious question: how do I do it?

Configuration File

The configuration file is the centerpiece, and the namesake, of Configuration Driven Development. The specific layout of the file is up to the developer, but in general I find it nice to group all the parameters relevant to a component in a sub-configuration that can then be passed to that component. You should set a format and stick to it; otherwise the file will quickly become unreadable.

Language Options

JSON - A popular choice for file format, due to its ubiquity and usefulness for configuration updates over the web
XML - I strongly advise against it, because it’s bad. Seriously. Please don’t use XML unless you have to!
YAML - My current go-to format, mostly due to its conciseness over JSON and ability to have comments
TOML - Another modern configuration markup language that seems to be growing in popularity

Example Configuration File in YAML

Use Case 1:
   process_stages:
      - extract:
           method: extractMethodA
           options:
              extractMethodAOption1: false
              extractMethodAOption2: 3.14
      - transform: ~ # etc., etc...
      - load: ~ # da da da...
Use Case 2:
  ~ # and so on, and so on...
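
For a sense of what the application runner sees, here is a sketch of that structure after parsing. It assumes a YAML loader (PyYAML's yaml.safe_load, for instance) would produce the dict shown; the dict is written out literally so the snippet needs no third-party package. The helper name iter_stages is hypothetical.

```python
# What the YAML above would look like once parsed (e.g. by yaml.safe_load);
# written as a literal so the sketch runs without third-party packages.
config = {
    "Use Case 1": {
        "process_stages": [
            {"extract": {
                "method": "extractMethodA",
                "options": {
                    "extractMethodAOption1": False,
                    "extractMethodAOption2": 3.14,
                },
            }},
            {"transform": None},  # `~` in YAML parses to None
            {"load": None},
        ],
    },
    "Use Case 2": None,
}


def iter_stages(config):
    """Yield (use_case, stage, subconfig) for every configured stage.

    Hypothetical helper: use cases and stages left empty (`~`) are skipped.
    """
    for use_case, app in config.items():
        if not app:
            continue
        for entry in app.get("process_stages", []):
            # Each list entry is a single-key mapping: {stage_name: subconfig}
            for stage, subconfig in entry.items():
                if subconfig is not None:
                    yield use_case, stage, subconfig


configured = list(iter_stages(config))
for use_case, stage, subconfig in configured:
    print(use_case, stage, subconfig["method"], subconfig["options"])
```

Keeping each stage's sub-configuration self-contained like this means the runner can hand it to a component as-is, without the component knowing anything about the rest of the file.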

Core Components

The core components are the most important aspect to get right, following the Goldilocks Principle. Too specific and they end up not being reusable; too general and they become hard to maintain and configure. As with other architectural decisions, you will have to use discretion on a case-by-case basis to determine the appropriate size and scope of components.

How should these components be organized? Well, here’s the hacky spaghetti-code way which, if I’m being honest, is usually my first approach in the spirit of rapid prototyping.

if subconfig['method'] == 'extractMethodA':
    # do stuff, method A style
    pass
elif subconfig['method'] == 'extractMethodB':
    # do other stuff, method B style
    pass

A first round of refactoring might lead to something a little cleaner, with actual functions:

if subconfig['method'] == 'extractMethodA':
    performExtractMethodA(subconfig['options'])
elif subconfig['method'] == 'extractMethodB':
    performExtractMethodB(subconfig['options'])

If the code is going to reach production or at least stick around for a while, it might be worthwhile to organize things a bit better with the design pattern of your choice. This can allow us to achieve even greater reusability in our code modules.

class ExtractMethodA(BaseExtractor):
    # class implementation
    pass

...

if subconfig['method'] == 'extractMethodA':
    extractor = ExtractMethodA(subconfig['options'])    
elif subconfig['method'] == 'extractMethodB':
    extractor = ExtractMethodB(subconfig['options'])
extractor.execute()
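
If the if/elif chain keeps growing, one option is a lookup table keyed by method name. This is only a sketch: BaseExtractor is not shown in this post, so the abstract base class, the EXTRACTORS table, and buildExtractor below are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class BaseExtractor(ABC):
    """Assumed shape of the base class: stores options, exposes execute()."""

    def __init__(self, options=None):
        self.options = options or {}

    @abstractmethod
    def execute(self):
        ...


class ExtractMethodA(BaseExtractor):
    def execute(self):
        return f"A with {self.options}"


class ExtractMethodB(BaseExtractor):
    def execute(self):
        return f"B with {self.options}"


# Maps the config's `method` string to the class implementing it.
EXTRACTORS = {
    "extractMethodA": ExtractMethodA,
    "extractMethodB": ExtractMethodB,
}


def buildExtractor(subconfig):
    """Instantiate the extractor named in subconfig, failing loudly on typos."""
    method = subconfig["method"]
    try:
        extractorClass = EXTRACTORS[method]
    except KeyError:
        raise ValueError(f"Unknown extract method: {method!r}") from None
    return extractorClass(subconfig.get("options"))


subconfig = {"method": "extractMethodA", "options": {"extractMethodAOption1": False}}
extractor = buildExtractor(subconfig)
print(extractor.execute())
```

With a table like this, adding a component means writing one class and one table entry; the dispatch code itself never changes, and a misspelled method name in the config produces a clear error instead of an unbound variable.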

Iterative Approach

The key to this philosophy is to not overdesign the components at the start - it must be an iterative approach or else you lose many of the benefits of using CDD. Here’s an example workflow of iterating on a CDD project over time:

1. Receive a few initial use cases within the problem space.
   1.1. Break down the use cases into processing stages, and then further into behavioral components.
   1.2. Build behavioral components, utilizing configurable parameters liberally.
   1.3. Write unit tests for each component.
   1.4. Build a simple application runner that runs through processing stages and assembles behavioral components.
   1.5. Compose the initial applications in the configuration file.
   1.6. Write/perform functional tests of the whole application flows.
   1.7. Document stages, components and their accepted configuration parameters.
2. Receive additional use cases or changes to existing ones.
   2.1. Reuse existing behavioral components with additional configuration parameters if needed. Refactor as you go.
   2.2. Determine completely new behavioral components required; repeat steps 1.2 and 1.3.
   2.3. Make changes/additions to the configuration file to compose the applications.
   2.4. Repeat steps 1.6 and 1.7.
3. Repeat step 2 ad nauseam!

Testing

A note on testing: it appears several times in the iterative approach above, and for good reason. For components to be generally useful, they need to work as advertised with any valid configuration, not just the initial application composition that drove their development.
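
As a sketch of what that could look like, the test below runs one component against several configurations, including empty ones that rely on the default. The aggregate component and its method parameter are made up for illustration; only the standard library's unittest module is used.

```python
import unittest


def aggregate(values, options=None):
    """Hypothetical component: combine values using the configured method."""
    method = (options or {}).get("method", "sum")  # smart default
    if method == "sum":
        return sum(values)
    if method == "max":
        return max(values)
    if method == "mean":
        return sum(values) / len(values)
    raise ValueError(f"Unknown aggregation method: {method!r}")


class AggregateTest(unittest.TestCase):
    def test_every_supported_configuration(self):
        cases = [
            (None, 6),                  # no options at all: default applies
            ({}, 6),                    # empty options: default applies
            ({"method": "sum"}, 6),
            ({"method": "max"}, 3),
            ({"method": "mean"}, 2.0),
        ]
        for options, expected in cases:
            with self.subTest(options=options):
                self.assertEqual(aggregate([1, 2, 3], options), expected)

    def test_unknown_method_is_rejected(self):
        with self.assertRaises(ValueError):
            aggregate([1, 2, 3], {"method": "median"})


# Run the suite programmatically so the snippet works when pasted anywhere.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AggregateTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Listing the configurations as data keeps the test in the same spirit as the framework: supporting a new parameter value means adding one row, not one test function.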

Example - COVID-19 Case Data Web Scraper

Finally, I want to share a real example of a personal project in which I employed this technique (writeup to come in a future post).

At the very beginning of the COVID-19 pandemic, case and death data per state and county was hard to come by - there was no definitive source and each county/state had its own collection and reporting methods. I decided to write a web scraper that searched state health department websites for case data (Problem Space).

COVID-19 Scraper

I began with a few states of interest, but quickly realized that having 57 sites (US states + territories) to scrape would mean many iterations of the same few scraping techniques, with slight tweaks each time (Use Cases). In addition, states often made subtle or even drastic changes to how they delivered their data. It was great to be able to repair a broken scraper with a simple configuration parameter change, with a code addition needed only on the rare occasion that a state did something completely new.

Behavioral Component Breakdown

Of course, this list built up over some time, but here is the breakdown of the Components I ended up with, organized by Stages of “Content Extraction” and “Scrape Method”.

Site Content Extraction Stage
- Normal GET request
- Pre-render JavaScript on a page
- Establish and insert session ID (used for Tableau - a future blog post in the works!)
Scrape Methodology Stage
- File (CSV) download
- HTML <table> element parse
- HTML page text parse
- API call
- PDF text scrape
- Image text using optical character recognition (OCR)

Code Sample

Below is a pared-down version of the Application Runner. Note that scrape and extract methods were registered to be easily called given the corresponding method name. Also note how you can clearly see the two stages of “extract” and “scrape” being performed in sequence.

import yaml

with open('stateConfig.yml') as configFile:
    configs = yaml.safe_load(configFile)

# getExtractFunc and getScrapeFunc look up registered functions by name (not shown)
results = {}
for state, stateConfig in configs['states'].items():
    # Extract stage: fetch the raw page content
    extractFunc = getExtractFunc(stateConfig['extract'])
    pagecontent = extractFunc(stateConfig['url'], stateConfig['extractParams'])

    # Scrape stage: parse the content into case data for this state
    scrapeFunc = getScrapeFunc(stateConfig['type'])
    results[state] = scrapeFunc(stateConfig['scrapeParams'], state, pagecontent)
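
The registration mechanism is not shown in this pared-down version. One way it might work, offered purely as an assumption, is a decorator that files each function under its config name. The names SCRAPE_FUNCS and registerScrape, the body of getScrapeFunc, and the stand-in scrapeText component below are all hypothetical.

```python
SCRAPE_FUNCS = {}


def registerScrape(name):
    """Decorator that files a scrape function under its config `type` name."""
    def decorator(func):
        if name in SCRAPE_FUNCS:
            raise ValueError(f"Scrape type {name!r} registered twice")
        SCRAPE_FUNCS[name] = func
        return func
    return decorator


def getScrapeFunc(name):
    """Look up the scrape function for a config `type`, failing loudly."""
    try:
        return SCRAPE_FUNCS[name]
    except KeyError:
        known = ", ".join(sorted(SCRAPE_FUNCS))
        raise KeyError(f"Unknown scrape type {name!r}; known types: {known}") from None


@registerScrape("text")
def scrapeText(scrapeParams, state, pagecontent):
    # Stand-in component: report how many characters were fetched.
    return {"state": state, "length": len(pagecontent)}


scrapeFunc = getScrapeFunc("text")
print(scrapeFunc({}, "California", "<html>...</html>"))
```

The appeal of a decorator here is that each component announces its own config name where it is defined, so there is no central list to forget to update.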

One of the simplest scraping functions (Components) was reading the content as a CSV into a pandas dataframe. One Parameter option used was the number of rows to skip in the footer of the file.

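from io import StringIO

import pandas as pd

def scrapeCsv(scrapeParams, state, pagecontent):
    # Optional parameter: trailing rows (totals, footnotes) to ignore
    footerRowsToSkip = getOrDefault(scrapeParams, 'footerRowsToSkip', 0)
    # skipfooter is not supported by the default C parser, so use the Python engine
    df = pd.read_csv(StringIO(pagecontent), skipfooter=footerRowsToSkip, engine='python')
    # Map the state's column names onto the standard 'County' and 'Cases'
    countyCol = getOrDefault(scrapeParams, 'countyCol', 'County')
    casesCol = getOrDefault(scrapeParams, 'casesCol', 'Cases')
    columnRename = dict(zip((countyCol, casesCol), ('County', 'Cases')))
    df.rename(columns=columnRename, inplace=True)
    return df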
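
getOrDefault is not defined in the sample. A plausible stand-in, assuming it simply falls back when a key is missing or the parameter block is absent, is sketched below together with the rename mapping it feeds. The parameter values are taken from the California configuration in this post; the helper body is a guess, not the original.

```python
def getOrDefault(params, key, default):
    """Return params[key] if present, else default (tolerates params=None)."""
    if not params:
        return default
    return params.get(key, default)


scrapeParams = {
    "countyCol": "County Name",
    "casesCol": "Total Count Confirmed",
}

# Not set in the config, so the smart default of 0 applies.
footerRowsToSkip = getOrDefault(scrapeParams, "footerRowsToSkip", 0)
countyCol = getOrDefault(scrapeParams, "countyCol", "County")
casesCol = getOrDefault(scrapeParams, "casesCol", "Cases")

# Maps source column names to the standard names used downstream.
columnRename = dict(zip((countyCol, casesCol), ("County", "Cases")))
print(footerRowsToSkip, columnRename)
```

Because every parameter has a default, a state whose CSV already uses the standard column names needs no scrapeParams entries for them at all.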

Configuration File Sample

Here is a subsection of the Configuration file for the state of California which was providing a CSV file for case data at the time (site no longer operational).

---
states:
  California:
    type: 'csv'
    url: 'https://data.chhs.ca.gov/dataset/download/covid-19-data.csv'
    scrapeParams:
      countyCol: 'County Name'
      casesCol: 'Total Count Confirmed'

Conclusion

Configuration Driven Development (CDD) can be a powerful tool to promote reusability and rapid development, when employed correctly and in an appropriate problem space.

I have used it successfully more than a few times, and continue to look for problem spaces in my work where I can apply the principles. I highly recommend doing the same!