I had occasion, recently, to look at some code one of my younger colleagues
had produced. The code works. It's clear. Future maintenance will be
straight-forward if a little tedious. It provides a very useful and non-trivial
feature. But it is really—extravagantly, even—wordy. The engineer
who wrote it knew that: he left a comment to the effect that it was ugly but he
didn't know a better way to accomplish the task.
I do know a better way, so this post is intended to put the
solution out there for those who need it.
Case study
The feature is simple:
Detect what columns of a CSV file contain specific
data by searching the header for pre-defined strings, and subsequently look up
that specific data for use while processing the remaining rows.
With which we can ingest the same data from files written by different producers
that disagree on what extra columns should be present and what order any of them
should be in. At least the headers fields are consistent across all the
writers.
To do this we need some storage that will hold the indices (or indicate that
columns are not present, so we must support a not-applicable value), and we need
some code that walks the list of header fields examining each string and setting
the appropriate index (or skipping with a diagnostic if the header is
unrecognized). Some of the indexes are required to be set, but we check that
after we've set all the ones we find.
A naive approach would be (a) define a named variable for each header index
we want to capture, and (b) use a a big if-else if
chain (with the terminal else handing the unknown header case) for
the "set the appropriate indices" bit.1 That is something like this
//...
int statusIdx = -1; // -1 for "not found"
int thingIdx = -1;
std::vector<std::string> headerFields; // Setup somehow
for (size_t i=0; i<headersFields.size(); ++i)
{
headerText == headerFields[i];
//..
else if (headerText = "Status")
{
statusIdx = i;
}
else if (headerText == "Thing")
{
thingIdx = i;
}
//...
}
As I said above this works, but it's not ideal. Let's start by talking about
what is probably the smallest issue: it's long. with that big chain of
conditionals living inside the loop over fields (possibly inside two loops if
we've built the "iterate over lines" behavior into the same function). And
because each index has it's own named variable running "extract-function" on
the code would result in an unreasonable argument list for the new
routine.2
The bigger problem is that the association of the string names with the
related indices only exists in the form of the comparison statements in that big
chain of conditionals (that is, far from the variable declarations), so you end up
scrolling around to understand what's going on. Good naming can help here,
but...
To simplify our life we make the connection between desired header string
and the index storage explicit. In the simplest form that looks like
std::map<std::string, int> indices {
//...
{"Status", -1},
{"Thing", -1},
//...
}; though we could elaborate this in various ways. Having done that, our index setting code becomes a loop3 for (size_t i=0; i<headerFields.size(); ++i)
{
const auto it = indices.find(headerFields[i]);
if (it == indices.end())
{
LOG_INFO("Skipping unknown CSV header field " + headerField[i]);
continue;
}
it->second = i;
}
which is easily extracted to a named function and new supported fields can be
added by extending the indices map.
This is much better.
Mind you, it still isn't great and depending on how long the rest of the file processing loop is we may want to elaborate on the system a little. We only use the indices to look up fields in subsequent lines, so rather than writing dataField[indices.at("Thing")] every time, we might want to provide a lookup-field-in-line routine that hides the icky syntax. But that's just gravy at this point.
Elaborations
Extending this is easy, and up to a point is a good idea. Though, you can
over-do it. Bring your common sense.
In the case study above the first thing I would note is that our read
requires certain columns to be present or the data can't be used. So I want to
associate that info with the rest of the data. And my typical approach to that
would look something like this:
struct columnData
{
int index;
const bool required;
};
std::map<std::string, columnData> indices = {
//...
{"Status", {-1, false}},
{"Thing", {-1, true}},
//...
};
int columnIndex(const std::string & key, const std::map<std::string, columnData> &headerData);
bool columnRequired(const std::string & key, const std::map<std::string, columnData> &headerData);
if I wanted to use free functions or if I preferred a class it might be
class HeaderInfo
{
struct columnData
{
int index;
const bool required;
};
std::map<const std::string, columnData> indicies;
public:
columnData(std::initializer_list<std::map<std::string, columnData>> l) : indices(l) {}
bool has(const std::string &key) const;
int index(const std::string &key) const;
bool required(const std::string &key) const;
};
with the initialization data provided in the read-CSV routine by way of that
initializer list constructor.
Critically I don't use two separate data stores that can get out of
sync. All the information about our header handling is collected in one place
where it can be edited without confusion if needed. In practice many of my data
stores in this form go a long time between edits.
Nomenclature
I waffled a bit on the title for this post because I'm uncertain what other
people would want to call this approach. There are several obvious candidates
out there: "Data-driven design", "data oriented design", and "declarative".
In my mind, the first two are used for the idea of paying attention to
access patterns and cache behavior in the way you lay data out in
memory. They're a group of optimization techniques, which is not what we're
about here.
I went with "declarative" in the title because I think it get much closer to
my intent here (the programmer says what they want and counts on the
infrastructure to make it happen), but I have reservations in the sense that
C++ has very little declarative nature: we have to engineer to the relationship
between declaration and behavior each and every time. In this case, that's
pretty easy,4 but it can add up to a lot of code if you push the idea
hard.
A tooling issue
In this case we want the indices to reset every time we call the "read a
csv file" routine, so the data map is scoped to that routine, but I often use
some variant of this pattern for static data, with the map placed in global
storage. Which valgrind's memcheck tool doesn't appreciate.
You see, various containers in the C++ standard library (definitely
including std::map) store some of their data outside the class
proper. With any static data store you make this way, those allocations are
created on the heap at runtime (though before main is invoked) and
they persist until the program ends. But "dynamic allocation still extant at
the time the program ends" is one of the things that memcheck detects as a
memory leak.
In light of my on-going goal of (not to say obsession with) seeing
completely clean reports from compilation, static analysis, leak checkers and so
on, this is an annoyance.
You can work around this. Rather than std::map use an
array (either c-style or the C++ container) of structured data, and provide your
own search abstractions. Rather than std::string use string
literals. And so on.5 Or just live with it; by keeping a list of
"known okay" reports you can minimize the time wasted on this kind of false
positive.
1 In a language with match or a more
flexible switch-case construct you could use that
instead of the if-else if chain, but C++ is not our
friend on that front.
2 The "hard to refactor" bit could be solved by collecting the
indices in a single named object struct indices { /*...*/ int statusIndex;
int thingIndex; /*...*/}; but we're going to end up doing better than
that.
3 Written here in a "modern" but pre-C++20 form because (a) I'm
still using C++17 at work so for me maps still don't have contains
and (b) because if you did use contains you'd have the
search the map twice the in (common!) case where the field is a known.
4 When I tried a full implementation of the elaborated class
(just because I wanted to get it right) it came to about 40 lines of code.
5 Or try living in the present: newer standards are
supporting constexpr for much more of the standard library,
including strings and containers. However that works under the hood, the
allocations are not runtime heap entries, so memcheck should be
happy.