Not so long ago DeepSeek-R1 dropped, to much wittering and gnashing of
teeth.
In my desultory way, I eventually got around to downloading several of the
medium-scale models to run locally. The 70B-parameter model achieves a token or
two a second on my Framework laptop, which is good enough for testing even if I
wouldn't try to use it in a productive workflow. My comments here are based
mostly on local interactions with that model, though I have poked at its
slightly smaller cousins. I haven't interacted with the 600B+ parameter variant at
all.
My methodology (if you will accept such a formal-sounding word)
continues to be pretty ad hoc: I repeatedly fire up the model in ollama,
present the model with a single prompt or a short series of related prompts,
read the answers, then shut down the model before going on to the next
prompt or prompt set (a rough sketch of the loop appears after the list below).
As before, my prompts are mostly intended to probe the edges of what
I expect these models to do well on by choosing subjects where I expect one of
the following to apply:
- The topic is sparsely represented on the internet.
- The topic is a combination of two or more well-covered ideas, but the
combination is sparsely represented.
- The topic is a highly specific example taken from a very broad
field.
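For the curious, here is roughly what that loop looks like in practice. This is a minimal sketch using the ollama Python client rather than a transcript of my actual sessions; the model tag and the keep_alive detail are assumptions, so check them against what your own install reports.

```python
# A rough sketch of my test loop, using the ollama Python client
# ("pip install ollama"). The model tag and prompts are illustrative.
import ollama

MODEL = "deepseek-r1:70b"  # assumption: use whatever tag `ollama list` shows

prompts = [
    "Explain how the rooftop confrontation in Blade Runner makes "
    "Roy Batty a sympathetic and human figure.",
    # ...one entry per probe prompt; a short series of related prompts
    # would instead share one chat history.
]

for prompt in prompts:
    response = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        # keep_alive=0 asks ollama to unload the model right after
        # answering, mirroring my shut-down-between-prompts habit.
        keep_alive=0,
    )
    print(response["message"]["content"])
```

In my actual sessions this was just the ollama command line, but the loop is the same idea: one fresh, independent run per prompt or prompt set.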
I have added another movie-related query that I expect models to do well on. I
ask them to explain how the rooftop confrontation in Blade Runner makes Roy
Batty a sympathetic and human figure; this is ground that many a fawning essay
has covered in detail, and large enough models often write very well on the
subject.
But the idea is mostly to stress the generation process in one way or
another.
Something I'd noticed even before I began playing with DeepSeek, but hadn't
mentioned yet, is that these models seem to be very bad at keeping track of what
was in specific sources (say, a single paper I asked about) and what they only
found in the halo of secondary-source words written about the thing I
asked after. This shows up consistently in how they handle my questions about a
physics paper, with most models drawing in material that was probably written to
explain the paper to less technical audiences, then attributing it to the paper
itself.
Thoughts on DeepSeek-R1
At least in the models I've been working with, it's good (even for the size),
but it's not great. It's really, pointedly not great.
It is a reasoning model, which gets it partly over a hurdle I saw
in my
earlier post. It was able to actually do something with the weight and
volume limits I suggested for my day-hike survival kit. It can math. Yeah!
Mind you, the inputs it was using for the math were still values it picked
out of its training set without any obvious understanding of the context. Like
several other models, it keeps insisting that the first-aid kit I should carry
will be just a few ounces, which isn't a bad number for something to put in your
purse or satchel, but the one I actually pack for a hike is the best part
of a pound because it is meant to cover a wider and more serious set of
problems.
Its writing style is clunky and repetitive, and it takes great pains to
show off that reasoning, often sounding like an
under-prepared student trying desperately to stretch a cursory understanding
to fill an assigned page count.1 This stands in contrast to the slow
but notable progress I've been seeing as new models come out from the
established players. Llama 3.3, for instance, produces much more fluent and
readable text for a lot of prompts.
Is the alarmed response justified?
Well, OpenAI's annoyance is mostly amusing. To the extent that it
is not amusing, it's a warning that [ spunky up-n-comers | tendrils of
evil, acquisitive foreign powers ] don't always play by the rules that [ their
staid elders | the defenders of truth, freedom, and unrestrained capitalism ]
proclaim.2
Leaving that aside, the claimed cost and speed of this work is impressive
and should probably worry the big players. I mean, they still have a better
product, but the price-performance situation means that the new guy probably
looks really attractive for a lot of applications.
1 Particularly annoying given the slow generation speed. I keep
wanting to shout "You said that already; think of something new or give it a
rest!"
2 As a minor irony, I'll note that the young United States was
criticized by various European powers for failing to respect even the weak
international intellectual property regime then extant. This is a thing with
historical precedents even if we don't like it.