2024-04-14

Neal Stephenson has it wrong

For no reason I can identify I suddenly noticed something today:

They're called "emojis", not "mediaglyphics".

Of course, he pretty much nailed everything except the name.

2024-04-04

Yeah, well, you know, that’s just, like, your workflow, man.

I caught some flak at work this week: I circulated an early draft of a document that I was struggling with in plain text1 and my boss was very clear that he wanted me to use Word in the future so there would be change tracking and out-of-band comments. On the plus side those remarks came packaged up with some useful suggestions for the piece.

Once I tamped down my reflexive defensiveness and the basic anxiety that comes with screwing up at work, I pulled up my big kid underwear and moved on. Then, having decided to be an adult about this, I ran smack dab into a counterexample for $BOSS's point. I received a second set of highly useful changes in the same document. Conflicting changes. I'm not aware of any good tooling to handle conflicting changes in Word, but it was no problem for me to handle the conflicts in my text document: I just opened the files in my favorite visual merge tool2 and got on with it.

Caveat time. To take the "plain text means we can use good tools" thing seriously we'd want to put all our draft work in VC repositories, and when that occurred to me my first reaction was "Who'd want to do that?" I mean, yeah that makes sense for major pieces of writing, but it's not obvious that you want to maintain a full history on every minor document you bang out day in and day out.

But then I had another thought...

Caveat on the caveat. Which was "Hey, how do people who are really committed to Word deal with the possibility of conflicting changes, anyway?" A little poking around the web suggests that my employer's answer is completely mainstream. At the management tier we put everything in SharePoint and let it enforce serialized editing, so they're already putting all their work in a repository. Maybe the whole idea isn't so silly after all.


1 Now, I would never send plain text to the clients, but I often do my initial composition in text because the sense of informality helps me feel safe trying out different formulations in search of a natural arc through complex subjects.

2 Meld as it happens. But not because I've tried all the options: it was just the first one I spent any time with and it's been consistently available.

2024-03-21

Algorithms header knowledge check

Note some late additions marked with *.

* After posting this I began to feel it needed a little bit more detail. And then that it needed quite a bit more detail.


The C++ standard library's algorithms (in this case declared in the numeric header rather than algorithm) include a routine suitable for counting the number of places of disagreement between two equal-sized collections of elements. What's it called? Hint: it's not called "count_differences". *Nor is it called count_if (unless you happen to be using C++23), because count_if doesn't have an overload for parallel containers.0 My work projects are in C++17 and my home projects in '17 or '20.

Answer
inner_product

*One approach is to write a "sum" function that adds just as in the usual inner product, but then make the one that would do element multiplication in the standard inner product return zero when the two values are equal and one if they differ.
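Here is a minimal sketch of that approach. The helper name count_differences is my own invention for illustration, and I'm assuming both containers are vectors of the same length. The standard function objects std::plus and std::not_equal_to already do the two jobs described above, so no hand-written lambdas are needed.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: counts the positions where a[i] != b[i].
// Assumes both vectors have the same length.
template <typename T>
int count_differences(const std::vector<T> &a, const std::vector<T> &b)
{
    assert(a.size() == b.size());
    return std::inner_product(a.begin(), a.end(), b.begin(),
                              0,                      // running count starts at zero
                              std::plus<>{},          // "sum": ordinary addition
                              std::not_equal_to<>{}); // "product": true (1) if different, false (0) if equal
}
```

The bool returned by the comparison is promoted to int when it is added to the running count, which is what makes the trick work.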

For that matter, what educational backgrounds would prepare you to recognize that as the routine you want?1 How does this compare to Kate Gregory's story about partial_sort_copy and how it would be better called top_n?


*0 C++23 doesn't introduce a parallel overload either, but it does introduce zip_view and zip which will allow you to efficiently produce a single container of apparent pairs from the parallel containers. Then you can use the single-container version of count_if. Obvious. Right?

1 My combination of physics and prior experience with the algorithm header's love affair with having a user-supplied-predicate-to-change-the-behavior overload meant that I spotted it as soon as I read the name, but ... that's a rather esoteric requirement for users to know what they're seeing.

2024-03-05

Voice-assistant fails

Accumulated over the years, but I got another one the other day that triggered me.

Me:
[Navigating to a business in the US southwest]
Creepy voice assistant (CVA):
In five-hundred feet, turn left on El Camino Real Street1.
Me:
[::Sighs::]


CVA:
[Interrupts a conversation in the car]
Me:
Hold your horse, [CVA].
CVA:
[Starts reading the Wikipedia article on the idiom]


Me:
[CVA], play 2112.
CVA:
Now playing two-thousand-one-hundred-twelve.
Me:
[::Fumes:: until the music sweeps me away]

It's all about context sensitivity. Or the lack thereof.


1 With "Real" pronounced as a single syllable. Of course.

2024-01-28

Career opportunity

Desperately seeking a licensed professional to tell us that we're making good parenting choices.

2024-01-25

Bringing C Structs into the C++ Lifetime Model

In addition to legacy code in our own projects, I sometimes "get" to work against libraries (legacy or modern) written in plain C. Which is OK. I learned C a long time ago and I'm not intimidated by it, though it can take a while to get back into the right mindset. Of course, there are things I miss. Static polymorphism and namespaces, for instance, are pretty small conceptual changes with significant convenience factor for the programmer.1

Now, C++ has a reputation as being a dangerous language where it is easy to write really broken code. That impression is not wrong, but it is incomplete: the language also offers features that support writing code that has enforced safety in some aspects. It's not trivial and it takes both discipline and some understanding of how the features work, but in my opinion it takes less discipline to write memory-safe C++ code than memory-safe C code.2

This article covers one way to bring a C struct into the C++ lifetime model to leverage the better (or at least more automatic) memory safety of C++ library primitives.

We start with a highly artificial example struct designed to be a pain memory-wise:3

struct thing {
    int i;
    double d;
    char *s;
    int *ary;
};

Each of the pointers poses us some (interrelated) questions:

  • Where do the objects that will be pointed to live? Heap? Stack? Data segment? Global memory? Memory mapped file? Something really exotic?
  • How do we ensure that the pointer is not used after the objects go away (if they go away)?
  • If they exist on the free-store, how do we control deallocation?
The questions aren't unique to C; they are the same ones that must always be dealt with. But C code deals with them all every time, while other languages may have built-in answers to some of them.4

Nor can you necessarily answer the questions by static examination of the code, but in the case I faced at work, both pointers were consistently pointing at dynamically allocated objects. Moreover the number we needed could not be determined at compile time, so we were storing the thing *'s in a vector.

We had an existing C function thing *newThing(size_t array_size, const char *label) which would create a new struct thing on the heap (with an alloc-family function), set default values of i and d, set the string and allocate (but not populate) the array and return the pointer to the thing. This is analogous to a C++ constructor, but for some reason (history, no doubt) we were handling the three calls to free manually each time we needed to reap one of these things.

Then we did something roughly like this:


{
    std::vector<const thing*> thing_list;
    for (const auto &input : inputs)
        // newThing takes the array size first, then the label
        thing_list.push_back(newThing(input.size(), input.name));
    process_list(thing_list);
}

Which, of course, loses three heap allocated objects for every item in the inputs container.

Replicating a proper, but C-like, approach to memory management here would mean writing a destructor-analog (perhaps void reapThing(thing *p)) as a free function and inserting std::for_each(thing_list.begin(), thing_list.end(), reapThing); before the closing brace. That works and I wouldn't be displeased to see it in a legacy project like the one I'm working on, but I think we can do a little better.
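A rough sketch of that C-like clean-up follows. Everything here is a stand-in: the newThing body is my guess at what the function described above does (the real one lives in the C library), and reapAll is a hypothetical helper name. I've made reapThing take a const thing * so that it can be handed to std::for_each over the vector of const pointers used in the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Stand-in for the struct from the C header.
struct thing {
    int i;
    double d;
    char *s;
    int *ary;
};

// Hypothetical stand-in for the existing C factory function.
thing *newThing(std::size_t array_size, const char *label)
{
    assert(label != nullptr);
    thing *p = static_cast<thing *>(std::malloc(sizeof(thing)));
    if (!p) return nullptr;
    p->i = 0;
    p->d = 0.0;
    p->s = static_cast<char *>(std::malloc(std::strlen(label) + 1));
    if (p->s) std::strcpy(p->s, label);
    p->ary = static_cast<int *>(std::malloc(array_size * sizeof(int)));
    return p;
}

// Destructor analog: all three frees gathered in one place.
void reapThing(const thing *p)
{
    if (!p) return;
    std::free(p->s);
    std::free(p->ary);
    std::free(const_cast<thing *>(p));
}

// Hypothetical helper: what would run just before the closing brace.
void reapAll(std::vector<const thing *> &things)
{
    std::for_each(things.begin(), things.end(), reapThing);
    things.clear(); // don't leave dangling pointers behind
}
```

The weakness is the usual one for manual schemes: every exit path from the block has to remember to make that call.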

The "doing better" interface is actually quite simple:5


#include "thing.h"

#include <cstddef>      // size_t
#include <string_view>  // std::string_view

struct thing_wrapper : public thing
{
    thing_wrapper();
    thing_wrapper(size_t array_size, std::string_view label);
    thing_wrapper(const thing_wrapper &);
    ~thing_wrapper();

    thing_wrapper &operator=(const thing_wrapper &);
};

The wrapper has the same data, but manages the sub-allocations for you. What complexity there is lies in ensuring that the constructors, assignment operators and destructor all agree on memory management of the sub-allocations.6 You might also want to add a constructor and assignment operator taking a const thing &, but this is a leap of faith insofar as nothing will enforce a consistent allocation strategy on those inputs. Similarly you can consider supporting move operations if you have a particular use for them.
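To make "all agree on memory management" concrete, here is one way the bodies might look. Treat it as a sketch under assumptions, not the code from work: the struct definition stands in for thing.h, the private acquire helper is hypothetical, and I've added an ary_len member because the bare C struct carries no element count and a copy needs one. Allocation uses malloc/calloc/free directly (calloc zero-fills the array, which is harmless); in real code you might defer to the library's own set-up and tear-down functions instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <string_view>
#include <utility>

// Stand-in for "thing.h".
struct thing {
    int i;
    double d;
    char *s;
    int *ary;
};

struct thing_wrapper : public thing
{
    std::size_t ary_len = 0; // assumption: not in the C struct, needed for copying

    thing_wrapper() : thing_wrapper(0, std::string_view{}) {}

    thing_wrapper(std::size_t array_size, std::string_view label)
        : thing{0, 0.0, nullptr, nullptr}
    {
        acquire(label.data(), label.size(), array_size);
    }

    thing_wrapper(const thing_wrapper &other)
        : thing{other.i, other.d, nullptr, nullptr}
    {
        assert(other.s != nullptr && other.ary != nullptr);
        acquire(other.s, std::strlen(other.s), other.ary_len);
        if (ary_len != 0)
            std::memcpy(ary, other.ary, ary_len * sizeof(int));
    }

    ~thing_wrapper()
    {
        std::free(s);
        std::free(ary);
    }

    // Copy, then swap: the old allocations end up in tmp and are freed
    // when tmp goes out of scope. Self-assignment is safe.
    thing_wrapper &operator=(const thing_wrapper &other)
    {
        thing_wrapper tmp(other);
        std::swap(static_cast<thing &>(*this), static_cast<thing &>(tmp));
        std::swap(ary_len, tmp.ary_len);
        return *this;
    }

private:
    // Hypothetical helper: allocates both sub-objects or neither.
    void acquire(const char *text, std::size_t text_len, std::size_t array_size)
    {
        char *new_s = static_cast<char *>(std::malloc(text_len + 1));
        int *new_ary = static_cast<int *>(
            std::calloc(array_size != 0 ? array_size : 1, sizeof(int)));
        if (!new_s || !new_ary) {
            std::free(new_s);
            std::free(new_ary);
            throw std::bad_alloc{};
        }
        if (text_len != 0)
            std::memcpy(new_s, text, text_len);
        new_s[text_len] = '\0';
        s = new_s;
        ary = new_ary;
        ary_len = array_size;
    }
};
```

Every path that allocates goes through acquire and every path that releases goes through the destructor, so there is exactly one place to change if the allocation strategy changes.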

With the wrapper in place we can change the original code to something like:


{
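    // Own the wrappers by their real type: thing has no virtual destructor,
    // so deleting through a thing* would never run ~thing_wrapper().
    std::vector<std::unique_ptr<const thing_wrapper>> thing_list;
    for (const auto &input : inputs)
        thing_list.emplace_back(std::make_unique<thing_wrapper>(input.size(), input.name));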
    process_list(thing_list);
}

With no need, now, for explicit clean-up code.


1 Oddly, neither of these is trivial to add because they imperil the universal linkability of C (which depends on not needing a vendor-dependent name-mangling scheme).

2 At the foundational level, it is the object lifetime model that supports this, and at the practical level it is exploited in the standard library which offers a more powerful set of primitives than the C standard library. Step one for writing a robust C program at scale is to get a more robust library (which you might be able to get off the shelf or might want to write yourself).

3 It is, however, analogous to the problem I faced at work today.

4 Many "managed" languages have everything on the free-store, and use a garbage collector to resolve the lifetime question.

5 I've chosen to make this a struct rather than a class for two reasons. First because the whole interface we want to derive is public: we're not going to extend thing in any way beyond supporting the C++ lifetime model. Second because of Core Guideline C.2: the C code enforces no invariant so we don't add one.

6 The safe thing to do is to use the facilities used by the code that provides the underlying structure, which in the case of pure C libraries usually means *alloc/free or some wrapper around the same. You may be able to defer to any pre-existing C functions that perform the set-up and tear-down.

2023-12-29

The limits of "Fix it when you touch it."

My main projects at work have been on minimal-spend for a few months which means I've been shifted to some feature adds for our biggest product. This thing goes back to the mid-eighties and is coded in C (updated to ANSI syntax, at least), Fortran (updated to f90, at least), C++ (with the standard containers, at least, but lots of it predates the "modern" era), and python (recently ported to python3, at least). So, yeah, it has all the issues you'd expect in a legacy codebase. Some of them in spades.1

We're basically a contract shop, so we don't do "Let's fix this entire module because it's grotty enough to be a pain", because who would pay for that? On the other hand, we really would like to have nice code, so we have a "You can fix issues with the bits you touch." policy.

Not complaining about that. My last feature add actually removed net lines of code because I replaced some really wordy, low-level stuff with calls to newer library features and factored some shared behavior into utility code to reduce repetition. So the policy makes me a happy (and perhaps even productive) programmer.

But it has its limits. The short version is some legacy issues span a lot of code and you can't fix them locally.

Case Study

The bit of code I'm working on right now has a peculiar feature: in several places I find a std::vector<SomeStruct> paired with a count variable.2 They appear in an effectively-global3 state object and they're passed to multiple different routines as pairs. This is very much not what you'd expect in code originally written in C++.

Sometime in the past this was almost certainly a dynamic array coded in plain ol' C. And not even a struct darray {unsigned count; SomeStruct *data;}; one paired with some management functions, but a bare, manage-it-yourself-you-wimp pairing of a count and a pointer.4 But why wasn't the count discarded when transitioning to std::vector?

Finding out requires a lot of tedious, close reading of code where the pairs are used. And, of course, seeing the old C code behind the current C++.

There are several places where the vector is resized (meaning multiple extra entries are added to the end in one fell swoop) to some "bigger than we need" value. Those extra entries are filled with default data, which the code then overwrites one at a time with freshly calculated "actual" data. This is (a) a performance optimization insofar as it prevents the possibility of multiple re-sizes and copies that exists if you added entries incrementally, (b) a fairly faithful transliteration of what would have been done with the dynamic arrays in C, and (c) the wrong way to perform the trick with std::vector.5

When this was originally done in C, the "count" variable would have tracked how many entries had "good" data while the allocated size would have been known because you knew what the maximum expected size was. But the container doesn't know that you want to do that and its size() method will always return the number of objects it has (including the default-valued ones). So the manual count was still needed in places, and they kept the (effectively) global version because it was easier.
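The difference is easy to see side by side. This is a toy sketch, not the code from the product: SomeStruct here is a made-up two-field type and both function names are hypothetical. The first function mirrors the transliterated pattern, the second uses reserve so that the vector's own size() is the count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Made-up element type standing in for the real one.
struct SomeStruct {
    int id;
    double value;
    SomeStruct() : id(0), value(0.0) {}
    SomeStruct(int id_, double value_) : id(id_), value(value_) {}
};

// Transliterated-from-C style: resize up front, overwrite the slots, and
// track separately how many of them hold real data.
std::vector<SomeStruct> build_with_resize(std::size_t max_expected,
                                          std::size_t actual,
                                          std::size_t &count)
{
    assert(actual <= max_expected);
    std::vector<SomeStruct> v;
    v.resize(max_expected); // default-constructs every slot
    count = 0;
    for (std::size_t k = 0; k < actual; ++k)
        v[count++] = SomeStruct(static_cast<int>(k), 1.5 * k);
    return v; // v.size() == max_expected, so callers still need count
}

// Vector-native style: reserve only sets aside capacity, so size() stays
// equal to the number of real entries and no separate count is needed.
std::vector<SomeStruct> build_with_reserve(std::size_t max_expected,
                                           std::size_t actual)
{
    std::vector<SomeStruct> v;
    v.reserve(max_expected);
    for (std::size_t k = 0; k < actual; ++k)
        v.emplace_back(static_cast<int>(k), 1.5 * k);
    return v;
}
```

With the second form the paired count variable has nothing left to do, which is exactly why removing it means visiting every routine that still reads it.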

Result: getting rid of the extraneous count variable means fixing half a dozen routines elsewhere in the project and I end up touching scores of lines in a dozen files. That's not "fix what you are touching anyway".

Takeaway

Some legacy maintenance is too big for purely local fixes.

Today was actually the second one of these I've looked at in the last couple of months. I was able to fix the first one mostly with global search-and-replace and only touched five files; I felt that was "local" enough for the payoff in terms of making the code more comprehensible. So I was optimistic on this one, too, but it quickly grew out of hand. Nonetheless, I may be finished and if the regression tests are clean I'm going to commit it.


1 But, honestly, I worked with worse in my physicist days.

2 For those who don't do C++, the standard vector container is an extensible array-like data-structure. It maintains its own count.

3 Possibly a subject for another day, and another example of something you can't re-factor in the small.

4 In the Bad Ol' Days, a significant number of programmers would begrudge the cycles lost to function calls for that kind of thing when it was "easy" to do it inline at each site. Of course, they could have used a macro DSL for the purpose, but those are tricky and (even then) had a mixed reputation.

5 It's wrong for two reasons in general. The less important one here is that it default constructs the new values, which can take cycles (in a C dynamic array, using realloc means you just get whatever garbage was in the memory occupied by the new spaces, so you don't pay for that). The more important issue here is that vector has reserve, which unlike resize just makes sure you'll have room for the new stuff if you want to use it, meaning you can then emplace_back for the best of both worlds.