2025-02-08

Deepseek-R1 and a few more thoughts on LLMs

Not so long ago Deepseek-R1 dropped to much wittering and gnashing of teeth.

In my desultory way, I eventually got around to downloading several of the medium-scale models to run locally. The 70B parameter model achieves a token or two a second on my Framework laptop, which is good enough for testing even if I wouldn't try to use it in a productive workflow. My comments here are based mostly on local interactions with that model, though I have poked at its slightly smaller cousins. I haven't interacted with the 600B+ parameter variant at all.

My methodology (if you will accept such a formal-sounding word) continues to be pretty ad hoc: I repeatedly fire up the model in ollama, present the model with a single prompt or a short series of related prompts, read the answers, then shut down the model before going on to the next prompt (or set). As before, my prompts are mostly intended to probe the edges of what I expect these models to do well on by choosing subjects where I expect one of the following to apply:

  • The topic is sparsely represented on the internet.
  • The topic is a combination of two or more well covered ideas, but the combination is sparsely represented.
  • The topic is a highly specific example taken from a very broad field.

I have added another movie-related query that I expect models to do well on: I ask them to explain how the rooftop confrontation in Blade Runner makes Roy Batty a sympathetic and human figure. This is ground that many a fawning essay has covered in detail, and large enough models often write very well on the subject.

But the idea is mostly to stress the generation process in one way or another.

Something I'd noticed even before I began playing with Deepseek, but hadn't mentioned yet, is that these models seem to be very bad at keeping track of what was in specific sources (say, a single paper I asked about) and what they only found in the halo of secondary-source words written about the thing I asked after. This shows up consistently in how they handle the questions about physics papers, with most models drawing in material that was probably written to explain the papers to less technical audiences and attributing it to the papers themselves.

Thoughts on Deepseek-R1

At least in the models I've been working with, it's good (even for the size), but it's not great. It's really, pointedly not great.

It has a reasoning model, which gets it partly over a hurdle I saw in my earlier post. It was able to actually do something with the weight and volume limits I suggested for my day-hike survival kit. It can math. Yeah!

Mind you, the inputs it was using for the math were still values it picked out of its training set without any obvious understanding of the context. Like several other models, it keeps insisting that the first-aid kit I should carry will be just a few ounces, which isn't a bad number for something to put in your purse or satchel, but the one I actually pack for a hike is the best part of a pound because it is meant to cover a wider and more serious set of problems.

Its writing style is clunky and repetitive, and it takes great pains to show off the presence of that reasoning model, often sounding like an under-prepared student trying desperately to stretch a cursory understanding into an assigned page target.1 This stands in contrast to the slow but notable progress I've been seeing as new models come out from the established players. Llama 3.3, for instance, produces much more fluent and readable text for a lot of prompts.

Is the alarmed response justified?

Well, OpenAI's annoyance is mostly amusing. To the extent that it is not amusing, it's a warning that [ spunky up-n-comers | tendrils of evil, acquisitive foreign powers ] don't always play by the rules that [ their staid elders | the defenders of truth, freedom, and unrestrained capitalism ] proclaim.2

Leaving that aside, the claimed cost and speed of this work is impressive and should probably worry the big players. I mean, they still have a better product, but the price-performance situation means that the new guy probably looks really attractive for a lot of applications.


1 Particularly annoying given the slow generation speed. I keep wanting to shout, "You said that already, think of something new or give it a rest!"

2 As a minor irony, I'll note that the young United States was criticized by various European powers for failing to respect even the weak international intellectual property regime then extant. This is a thing with historical precedents even if we don't like it.
