It seems like LLMs made really big strides for a while but don't seem to be getting better recently, and in some ways recent models feel a bit worse. I'm seeing some good results generating test code, and some really bad results when people go too far with LLM use on new feature work. Based on what I've seen, spinning up new projects and very basic features for web apps works really well, but that doesn't seem to generalize to refactoring or adding new features to big/old code bases.
I've seen Claude and ChatGPT happily hallucinate whole APIs for D3 on multiple occasions, which should be really well represented in the training sets.
> hallucinate whole APIs for D3 on multiple occasions, which should be really well represented in the training sets
With many existing systems, you can pull documentation into context pretty quickly to prevent the hallucination of APIs, and it's obvious how that could be done automatically in the near future. I put my engine on the ground and ran it, and it didn't even go anywhere; Ford will never beat horses.
It's true that manually constraining an LLM with contextual data increases its performance on that data (and reduces performance elsewhere), but that conflicts with the promise of AI as an everything machine. If we have to not only provide it the proper context, but also already know what constitutes the proper context, then it is not in any way an everything machine.
Which means it's back to being a very useful tool, but not the earth-shattering disruptor we hoped (or worried) it would be.
Depends on how good they get at realizing they need more context and tool use to look it up for you.
How would they reliably recognize the context needed without the necessary context?
In the case of hallucinating a library, give it access to an IDE's autocomplete or type checker so it can check whether the functions it thinks exist actually do; if they don't, start feeding it documentation or type info about the library until it spits out something that type checks (see the sketch below).
For other stuff this is obviously harder.
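For the library-hallucination case, the loop can be pretty mechanical. A minimal sketch in TypeScript, assuming a hypothetical callModel wrapper around whatever LLM API you use and a hypothetical loadLibraryDocs helper; the only real tool invoked is tsc --noEmit:

```ts
// Generate code, type-check it, and only pull docs into context when it fails.
// callModel() and loadLibraryDocs() are hypothetical stand-ins, not real APIs.

import { execFileSync } from "node:child_process";
import { writeFileSync } from "node:fs";

declare function callModel(prompt: string): Promise<string>;   // hypothetical LLM call
declare function loadLibraryDocs(lib: string): string;         // hypothetical docs loader

function typeCheck(code: string): string | null {
  writeFileSync("candidate.ts", code);
  try {
    // tsc exits non-zero and prints diagnostics when the code doesn't type check
    execFileSync("npx", ["tsc", "--noEmit", "--strict", "candidate.ts"], { encoding: "utf8" });
    return null; // no errors
  } catch (err: any) {
    return String(err.stdout ?? err.message); // compiler error output
  }
}

async function generateWithFeedback(task: string, lib: string): Promise<string> {
  let prompt = task;
  for (let attempt = 0; attempt < 3; attempt++) {
    const code = await callModel(prompt);
    const errors = typeCheck(code);
    if (errors === null) return code; // type checks; good enough to hand to a human reviewer
    // Only now feed it the docs, along with the concrete compiler errors
    prompt = `${task}\n\nYour previous attempt failed to type check:\n${errors}\n\n` +
             `Relevant documentation for ${lib}:\n${loadLibraryDocs(lib)}`;
  }
  throw new Error("Gave up: model kept hallucinating the API");
}
```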
The LLMs themselves are making marginal gains, but the tools for using LLMs productively are getting so much better.
This. MCP/tool usage in agentic mode is insanely powerful. Let the agent ingest a Gitlab issue, tell it how it can run commands, tests etc. in the local environment and half of the time it can just iterate towards a solution all by itself (but watching and intervening when it starts going the wrong way is still advisable).
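As a rough illustration of the shape of that loop (not any particular MCP implementation; the model call here is a hypothetical stand-in):

```ts
// Agentic loop sketch: the model picks a tool (run a command, read a file),
// we execute it locally, feed the output back, and repeat until it says it's done.
// requestToolCall() is hypothetical, not a real MCP/SDK function.

import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

type ToolCall =
  | { tool: "run_command"; command: string }
  | { tool: "read_file"; path: string }
  | { tool: "done"; summary: string };

declare function requestToolCall(history: string[]): Promise<ToolCall>; // hypothetical

async function solveIssue(issueText: string): Promise<string> {
  const history: string[] = [`Issue: ${issueText}`];
  for (let step = 0; step < 25; step++) {
    const call = await requestToolCall(history);
    if (call.tool === "done") return call.summary;

    let result: string;
    if (call.tool === "run_command") {
      try {
        // e.g. run the test suite or a build; output goes back into context
        result = execSync(call.command, { encoding: "utf8" });
      } catch (err: any) {
        // failing tests are the useful part: their output drives the next iteration
        result = `command failed:\n${err.stdout ?? ""}${err.stderr ?? ""}`;
      }
    } else {
      result = readFileSync(call.path, "utf8");
    }
    history.push(`Tool ${call.tool} -> ${result.slice(0, 4000)}`);
  }
  return "Stopped after 25 steps; needs human intervention.";
}
```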
Recently I converted all the (Google Docs) documentation of a project to markdown files and added those to the workspace. It now indexes it with RAG and can easily find relevant bits of documentation, especially in agent mode.
It really stresses the importance of getting your documentation and processes in order as well as making sure the tasks at hand are well-specified. It soon might be the main thing that requires human input or action.
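For reference, the retrieval half of that can be very simple: chunk the markdown by heading, score chunks against the query, and prepend the top few to the task. The sketch below uses naive keyword overlap in place of real embeddings and is just the general shape, not what any particular IDE or agent does internally:

```ts
// Chunk markdown docs by heading, score them against a query, return the top K
// as extra context for the agent. Keyword overlap stands in for embeddings.

import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

interface Chunk { file: string; heading: string; text: string; }

function loadChunks(dir: string): Chunk[] {
  const chunks: Chunk[] = [];
  for (const file of readdirSync(dir).filter(f => f.endsWith(".md"))) {
    const sections = readFileSync(join(dir, file), "utf8").split(/^#+ /m);
    for (const section of sections) {
      const [heading, ...body] = section.split("\n");
      if (body.join("\n").trim()) {
        chunks.push({ file, heading: heading.trim(), text: body.join("\n") });
      }
    }
  }
  return chunks;
}

function retrieve(query: string, chunks: Chunk[], topK = 3): Chunk[] {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  const score = (c: Chunk) =>
    terms.filter(t => c.text.toLowerCase().includes(t)).length;
  return [...chunks].sort((a, b) => score(b) - score(a)).slice(0, topK);
}

// Usage: prepend the top chunks to the task description before handing it to the agent.
const context = retrieve("how do we deploy the staging environment?", loadChunks("./docs"));
```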
Every time I’ve tried to do that, it takes longer than it would take me and comes up with fairly obtuse solutions. The Cursor agent seems incapable of putting code in the appropriate files in a functional language.
I 100% agree that documenting requirements will be the main human input to software development in the near future.
In fact, I built an entirely headless coding agent for that reason: you put tasks in, you get PRs out, and you get journals of each run for debugging. It discourages micro-management, so you stay in planning/documenting/architecting mode.
> don't seem to be getting better recently
o3 came out just one month ago. Have you been using it? Subjectively, the gap between o3 and everything before it feels like the biggest gap I've seen since ChatGPT originally came out.
I haven't used it extensively, but I toyed around with it for Elixir code and wasn't particularly impressed.