It seems like LLMs made really big strides for a while but don't seem to be getting better recently, and in some ways recent models feel a bit worse. I'm seeing some good results generating test code, and some really bad results when people go too far with LLM use on new feature work. Based on what I've seen, spinning up new projects and very basic features for web apps works really well, but that doesn't seem to generalize to refactoring or adding new features to big/old code bases.
I've seen Claude and ChatGPT happily hallucinate whole APIs for D3 on multiple occasions, which should be really well represented in the training sets.
> hallucinate whole APIs for D3 on multiple occasions, which should be really well represented in the training sets
With many existing systems, you can pull documentation into context pretty quickly to prevent the hallucination of APIs, and it's obvious how that could be done automatically in the near future. I put my engine on the ground and ran it, and it didn't even go anywhere; Ford will never beat horses.
It's true that manually constraining an LLM with contextual data increases its performance on that data (and reduces performance elsewhere), but that conflicts with the promise of AI as an everything machine. If we have to not only provide it the proper context, but also already know what constitutes the proper context, then it is not in any way an everything machine.
Which means it's back to being a very useful tool, but not the earth-shattering disruptor we hoped (or worried) it would be.
Depends on how good they get at realizing they need more context and tool use to look it up for you.
How would they reliably recognize the context needed without the necessary context?
In the case of hallucinating a library, give it access to an IDE's autocomplete or type checker so it can check whether the functions it thinks exist actually do; if they don't, start feeding it documentation or type info about the library until it spits out something that type checks (see the sketch below).
For other stuff this is obviously harder.
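For the library-hallucination case, the loop can be pretty mechanical. A minimal sketch in TypeScript, assuming a hypothetical callModel wrapper around whatever LLM API you use and a hypothetical loadLibraryDocs helper; the only real tool invoked is tsc --noEmit:

```ts
// Generate code, type-check it, and only pull docs into context when it fails.
// callModel() and loadLibraryDocs() are hypothetical stand-ins, not real APIs.

import { execFileSync } from "node:child_process";
import { writeFileSync } from "node:fs";

declare function callModel(prompt: string): Promise<string>;   // hypothetical LLM call
declare function loadLibraryDocs(lib: string): string;         // hypothetical docs loader

function typeCheck(code: string): string | null {
  writeFileSync("candidate.ts", code);
  try {
    // tsc exits non-zero and prints diagnostics when the code doesn't type check
    execFileSync("npx", ["tsc", "--noEmit", "--strict", "candidate.ts"], { encoding: "utf8" });
    return null; // no errors
  } catch (err: any) {
    return String(err.stdout ?? err.message); // compiler error output
  }
}

async function generateWithFeedback(task: string, lib: string): Promise<string> {
  let prompt = task;
  for (let attempt = 0; attempt < 3; attempt++) {
    const code = await callModel(prompt);
    const errors = typeCheck(code);
    if (errors === null) return code; // type checks; good enough to hand to a human reviewer
    // Only now feed it the docs, along with the concrete compiler errors
    prompt = `${task}\n\nYour previous attempt failed to type check:\n${errors}\n\n` +
             `Relevant documentation for ${lib}:\n${loadLibraryDocs(lib)}`;
  }
  throw new Error("Gave up: model kept hallucinating the API");
}
```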
The LLMs themselves are making marginal gains, but the tools for using LLMs productively are getting so much better.
This. MCP/tool usage in agentic mode is insanely powerful. Let the agent ingest a Gitlab issue, tell it how it can run commands, tests etc. in the local environment and half of the time it can just iterate towards a solution all by itself (but watching and intervening when it starts going the wrong way is still advisable).
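As a rough illustration of the shape of that loop (not any particular MCP implementation; the model call here is a hypothetical stand-in):

```ts
// Agentic loop sketch: the model picks a tool (run a command, read a file),
// we execute it locally, feed the output back, and repeat until it says it's done.
// requestToolCall() is hypothetical, not a real MCP/SDK function.

import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

type ToolCall =
  | { tool: "run_command"; command: string }
  | { tool: "read_file"; path: string }
  | { tool: "done"; summary: string };

declare function requestToolCall(history: string[]): Promise<ToolCall>; // hypothetical

async function solveIssue(issueText: string): Promise<string> {
  const history: string[] = [`Issue: ${issueText}`];
  for (let step = 0; step < 25; step++) {
    const call = await requestToolCall(history);
    if (call.tool === "done") return call.summary;

    let result: string;
    if (call.tool === "run_command") {
      try {
        // e.g. run the test suite or a build; output goes back into context
        result = execSync(call.command, { encoding: "utf8" });
      } catch (err: any) {
        // failing tests are the useful part: their output drives the next iteration
        result = `command failed:\n${err.stdout ?? ""}${err.stderr ?? ""}`;
      }
    } else {
      result = readFileSync(call.path, "utf8");
    }
    history.push(`Tool ${call.tool} -> ${result.slice(0, 4000)}`);
  }
  return "Stopped after 25 steps; needs human intervention.";
}
```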
Recently I converted all the (Google Docs) documentation of a project to markdown files and added those to the workspace. It now indexes it with RAG and can easily find relevant bits of documentation, especially in agent mode.
It really stresses the importance of getting your documentation and processes in order as well as making sure the tasks at hand are well-specified. It soon might be the main thing that requires human input or action.
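For reference, the retrieval half of that can be very simple: chunk the markdown by heading, score chunks against the query, and prepend the top few to the task. The sketch below uses naive keyword overlap in place of real embeddings and is just the general shape, not what any particular IDE or agent does internally:

```ts
// Chunk markdown docs by heading, score them against a query, return the top K
// as extra context for the agent. Keyword overlap stands in for embeddings.

import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

interface Chunk { file: string; heading: string; text: string; }

function loadChunks(dir: string): Chunk[] {
  const chunks: Chunk[] = [];
  for (const file of readdirSync(dir).filter(f => f.endsWith(".md"))) {
    const sections = readFileSync(join(dir, file), "utf8").split(/^#+ /m);
    for (const section of sections) {
      const [heading, ...body] = section.split("\n");
      if (body.join("\n").trim()) {
        chunks.push({ file, heading: heading.trim(), text: body.join("\n") });
      }
    }
  }
  return chunks;
}

function retrieve(query: string, chunks: Chunk[], topK = 3): Chunk[] {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  const score = (c: Chunk) =>
    terms.filter(t => c.text.toLowerCase().includes(t)).length;
  return [...chunks].sort((a, b) => score(b) - score(a)).slice(0, topK);
}

// Usage: prepend the top chunks to the task description before handing it to the agent.
const context = retrieve("how do we deploy the staging environment?", loadChunks("./docs"));
```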
Every time I’ve tried to do that, it takes longer than it would take me and comes up with fairly obtuse solutions. The Cursor agent seems incapable of putting code in the appropriate files in a functional language.
I 100% agree that documenting requirements will be the main human input to software development in the near future.
In fact, I built an entirely headless coding agent for that reason: you put tasks in, you get PRs out, and you get journals of each run for debugging. It discourages micro-management, so you stay in planning/documenting/architecting mode.
> don't seem to be getting better recently
o3 came out just one month ago. Have you been using it? Subjectively, the gap between o3 and everything before it feels like the biggest gap I've seen since ChatGPT originally came out.
I haven't used it extensively, but I toyed around with it for Elixir code and wasn't particularly impressed.