My main projects at work have been on minimal-spend for a few months, which means I've been shifted to some feature adds for our biggest product. This thing goes back to the mid-eighties and is coded in C (updated to ANSI syntax, at least), Fortran (updated to f90, at least), C++ (with the standard containers, at least, but lots of it predates the "modern" era), and python (recently ported to python3, at least). So, yeah, it has all the issues you'd expect in a legacy codebase. Some of them in spades.1
We're basically a contract shop, so we don't do "Let's fix this entire module because it's grotty enough to be a pain", because who would pay for that? On the other hand, we really would like to have nice code, so we have a "You can fix issues with the bits you touch" policy.
Not complaining about that. My last feature add actually removed net lines of code because I replaced some really wordy, low-level stuff with calls to newer library features and factored some shared behavior into utility code to reduce repetition. So the policy makes me a happy (and perhaps even productive) programmer.
But it has its limits. The short version is that some legacy issues span a lot of code and you can't fix them locally.
Case Study
The bit of code I'm working on right now has a peculiar feature:
in several places I find a std::vector<SomeStruct> paired with a count variable.2 They appear in an effectively-global3 state object and they're passed to multiple different routines as a pair. This is very much not what you'd expect in code originally written in C++.
Sometime in the past this was almost certainly a dynamic array coded in plain ol' C. And not even a struct darray {unsigned count; SomeStruct *data;} paired with some management functions, but a bare, manage-it-yourself-you-wimp pairing of a count and a pointer.4 But why wasn't the count discarded when transitioning to std::vector?
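For illustration, here's roughly the shape of the before and after as I picture it; the type and names are made-up stand-ins for what's actually in the code:

    #include <vector>

    struct SomeStruct { double x, y; };    // stand-in for the real payload

    /* What the original C almost certainly had: a bare pointer plus a
       separately managed count, updated by hand at every site.

           unsigned count;
           SomeStruct *data;
    */

    // What the port to C++ left behind: the vector manages its own storage,
    // but the old count came along for the ride.
    struct State {
        std::vector<SomeStruct> entries;
        unsigned count;                    // redundant... except it isn't (see below)
    };

    // ...and the routines still take the pair:
    void consume(const std::vector<SomeStruct>& entries, unsigned count);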
Finding out requires a lot of tedious, close reading of code where the pairs are used. And, of course, seeing the old C code behind the current C++.
There are several places where the vector is resized (meaning multiple extra entries are added to the end in one fell swoop) to some "bigger than we need" value. Those extra entries are filled with default data, which the code then overwrites one at a time with freshly calculated "actual" data. This is (a) a performance optimization insofar as it prevents the possibility of multiple re-sizes and copies that exists if you add entries incrementally, (b) a fairly faithful transliteration of what you would have done with the dynamic arrays in C, and (c) the wrong way to perform the trick with std::vector.5
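In sketch form (same made-up names as above, and simplified so it compiles on its own), the pattern looks something like this:

    #include <vector>

    struct SomeStruct { double x, y; };             // stand-in for the real payload

    // Trivial stand-ins for the real calculations, just so the sketch compiles.
    static bool wanted(unsigned i)              { return i % 2 == 0; }
    static SomeStruct compute_entry(unsigned i) { return { double(i), 2.0 * i }; }

    void add_entries(std::vector<SomeStruct>& entries, unsigned& count, unsigned max_expected)
    {
        // Resize once to a "bigger than we need" value, much as the old C code
        // would have realloc'd to a generous upper bound in one go (the sketch
        // assumes the vector holds exactly 'count' good entries coming in)...
        entries.resize(count + max_expected);       // new slots are default-constructed

        // ...then overwrite the default-valued slots one at a time with freshly
        // calculated data, bumping the manual count for each slot actually filled.
        for (unsigned i = 0; i < max_expected; ++i) {
            if (!wanted(i)) continue;               // maybe fewer than max_expected get used
            entries[count] = compute_entry(i);
            ++count;                                // count can end up < entries.size()
        }
    }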
When this was originally done in C, the "count" variable would have tracked how many entries had "good" data, while the allocated size would have been known because you knew what the maximum expected size was. But the container doesn't know that you want to do that, and its size() method will always return the number of objects it has (including the default-valued ones). So the manual count was still needed in places, and they kept the (effectively) global version because it was easier.
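On the consumer side (continuing the same sketch) that means loops can't simply run over the whole container:

    // size() counts everything, including the default-valued padding at the
    // end, so callers have to stop at the manual count instead.
    double total_x(const std::vector<SomeStruct>& entries, unsigned count)
    {
        double sum = 0.0;
        for (unsigned i = 0; i < count; ++i)        // NOT i < entries.size()
            sum += entries[i].x;
        return sum;
    }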
Result: getting rid of the extraneous count variable means fixing half a dozen routines elsewhere in the project, and I end up touching scores of lines in a dozen files. That's not "fix what you are touching anyway".
Takeaway
Some legacy maintenance is too big for purely local fixes.
Today was actually the second one of these I've looked at in the last couple of months. I was able to fix the first one mostly with global search-and-replace and only touched five files; I felt that was "local" enough for the payoff in terms of making the code more comprehensible. So I was optimistic about this one, too, but it quickly grew out of hand. Nonetheless, I may be finished, and if the regression tests are clean I'm going to commit it.
1 But, honestly, I worked with worse in my physicist days.
2 For those who don't do C++, the standard vector container is an extensible array-like data structure. It maintains its own count.
3 Possibly a subject for another day, and another example of something you can't re-factor in the small.
4 In the Bad Ol' Days, a significant number of programmers would begrudge the cycles lost to function calls for that kind of thing when it was "easy" to do it inline at each site. Of course, they could have used a macro DSL for the purpose, but those are tricky and (even then) had a mixed reputation.
5 It's wrong for two reasons in general. The less important one here is that it default-constructs the new values, which can take cycles (in a C dynamic array using realloc means you just get whatever garbage was in the memory occupied by the new spaces, so you don't pay for that). The more important issue here is that vector has reserve, which, unlike resize, just makes sure you'll have room for the new stuff if you want to use it, meaning you can then emplace_back for the best of both worlds.
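For what it's worth, a sketch of that alternative with the same made-up names as above; once it's done everywhere, the vector's own size() is the count and the separate variable can finally go away:

    // reserve() allocates capacity without constructing anything, so there is
    // no default data to overwrite and no reallocation inside the loop; only
    // the real entries get appended.
    entries.reserve(entries.size() + max_expected);
    for (unsigned i = 0; i < max_expected; ++i) {
        if (!wanted(i)) continue;
        entries.emplace_back(compute_entry(i));     // entries.size() tracks the "good" count
    }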