We've flown out to stay with my in-laws for the best part of a week. Because
reasons.
I brought my work computer in case I need to respond to some crises or just
find some time to do useful work rather than burning PTO (which is always in
short supply). Unfortunately, my work computer is currently an eight-pound,
desktop-replacement beast, for which I really want to bring a spare display (I
have a fifteen-inch USB HD panel for that purpose). With all the weight and
volume devoted to work computing I opted for the barest minimum of personal
hardware: my Raspberry Pi 400, its mouse, and the cabling needed to talk to that
same panel.
Now, the
400 is a surprisingly good computer for what I paid for it, but it's also
quite limited. In particular it has only 4GB of RAM, and while the CPU supports
NEON SIMD there is no dedicated graphics processor. It's completely
unsuited to running LLMs without add-on hardware. But I got bored, so I decided
to try anyway.
I was looking for relatively modern models that would fit in RAM and
found Llama3.2:1b (7 months
old) and codegemma:2B (9
months old). One for conversation and one for code support. Nice.
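To put "fit in RAM" in perspective, a bit of back-of-the-envelope arithmetic.
This is a rough sketch only: the bytes-per-weight figures are approximations and
ignore the KV cache and other runtime overhead.

```python
# Rough estimate of model memory footprint at different quantization levels.
# Approximations only: real quantized files carry per-block scales, embeddings,
# and KV-cache overhead on top of the raw weights.
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8_0": 1.0, "q4_0": 0.5}

def approx_gb(params_billions: float, quant: str) -> float:
    # params (in billions) * bytes per weight ~= GB of raw weights
    return params_billions * BYTES_PER_WEIGHT[quant]

# Roughly: Llama3.2:1b, codegemma:2B, and a hypothetical "slightly bigger" 3B model.
for params in (1.2, 2.0, 3.0):
    for quant in ("fp16", "q8_0", "q4_0"):
        print(f"{params:.1f}B @ {quant}: ~{approx_gb(params, quant):.1f} GB")
```

At four-bit quantization even a 3B model comes in well under the Pi's 4GB of
RAM; at fp16 it clearly doesn't fit. That's what makes the down-quantizing idea
below at least plausible.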
I've been speculating about down-quantizing some slightly bigger models, but
in the meantime I started playing around with the baby llama. I didn't want to
challenge it with my usual questions for probing the weaknesses of larger and more
powerful models, so I started by just asking it to tell me about the band
Rush.
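The whole experiment amounts to a single prompt. A minimal sketch of that
query, assuming the model is being served through Ollama's local API (the tag
names above are Ollama's naming convention):

```python
# Minimal sketch: ask the 1B model about Rush via a local Ollama server.
# Assumes Ollama is running on its default port (11434) on the Pi.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3.2:1b",
    "prompt": "Tell me about the band Rush.",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```

The equivalent at a terminal is just "ollama run llama3.2:1b" and typing the
question interactively.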
As models do, it then produced a bunch of plausible-sounding text. Some of
it was incomplete and some was simply wrong. For some reason it thinks Rush won
more major awards than they did. It seems to have selected exclusively praising
text to reproduce (which is fine by me: I like the band and don't find most
complaints directed at them to be at all convincing), but there is a non-trivial
amount of criticism out there and none of it is represented, or even acknowledged,
in the answer.
Now, let's be frank: there is a limit to how much factual detail you can
expect to be encoded in what is, after all, a somewhat niche cultural subject
when you have only a billion or two parameters available to try to represent a full
cross-section of English-language knowledge. No general-purpose model of this
size is going to get everything right on a query of that kind. So
I'm not saying that the fact of errors is surprising or even
interesting. But I am wondering what we can learn from the nature of
the errors.
So let's look more closely at some of the errors. In its "Discography" section
the model lists
- Start with School Days (1974)
- Caress of Steel (1975)
- Fly by Night (1975)
- 2112 (1976)
- Two-Side Streets (1978)
- Permanent Waves (1980)
- Moving Pictures (1981)
- Hold Your Fire (1987)
- Presto (1997)
- Test for Echo (2004)
- Snakes & Arrows (2007)
- Clockwork Angels (2012)
which has both omissions and errors.
I haven't the faintest clue what to make of the title mistake for the first
album (the year, 1974, is right, but the album was self-titled). It's not the
title of any Rush song or album that I'm aware of, and a casual search of the
web doesn't turn up any song or album by that name at all. The search does turn
up a reasonable number of hits on Stanley Clarke's 1976 album School Days, and
a number of suggestions that people just beginning to explore Mr. Clarke's music
should "Start with" that very same album. Interesting, but not terribly
enlightening.
Nor do I have any theories about why those particular albums were omitted. It
doesn't look to me like the list is either the commercial hits or the fan
favorites, and beyond that I don't know what patterns to look for.
But I do want to talk about the 1978 entry in the model's list. The proper
album title for that year is Hemispheres. The thing that strikes me here
is that the proper title and the substitute text share a conceptual relationship
to the number two (two halves; two ways). My (admittedly wild) guess is that we're
seeing a side effect of the model's attention system identifying "two" as an
important concept to use when trying to infer tokens.
If true, that would be interesting, because the attention system is one of
the significant ways in
which LLMs
differ from Markov Chain generators. But it may also be responsible for the
models' difficulty in knowing what is a quotation and what is commentary, which
I've already
discussed in the context of scientific papers.
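To make that contrast concrete, here is a toy word-level Markov chain
generator (purely an illustration, nothing to do with how the llama was trained
or run): each step conditions only on the immediately preceding word, so a
concept like "two" that appeared many tokens back can never reach forward and
steer the output the way attention can.

```python
import random
from collections import defaultdict

def build_bigram_model(text: str) -> dict:
    """Word-level bigram table: each word maps to the words that followed it."""
    words = text.split()
    table = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        table[prev].append(nxt)
    return table

def generate(table: dict, start: str, length: int = 12) -> str:
    """Walk the chain; each step sees only the single previous word."""
    out = [start]
    for _ in range(length):
        followers = table.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

corpus = "the band released the album and the band toured after the album"
model = build_bigram_model(corpus)
print(generate(model, "the"))
```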