bryanlarsen 7 days ago

By documentation I assumed you meant internal documentation, like on a company Wiki.

External documentation is presumably already in the LLM's training data, so pulling it into context should be unnecessary. Obviously there's a huge difference between "should be" and "is", otherwise you wouldn't be putting in the work to pull it in.

electroly 7 days ago

I'd guess the breakdown is about:

- 80%: Information about databases. Schemas, sample rows, sample SQL usages (including buried inside string literals and obscured by ORMs), comments, hand-written docs. I collect everything I can find about each table/view/procedure and stick it in a file named after it.

- 10%: Swagger JSONs for internal APIs I have access to, plus sample responses.

- 10%: Public API documentation that it should know but doesn't.
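For the database piece, the "file named after each table" idea can be sketched roughly like this (a minimal SQLite sketch; the function name `dump_table_docs` and the choice to return doc strings rather than write files are my own, and a real setup would also pull in ORM usages, comments, and hand-written docs):

```python
import sqlite3

def dump_table_docs(db_path: str, sample_rows: int = 3) -> dict[str, str]:
    """Collect schema plus a few sample rows for each table.

    Returns one doc string per table, keyed by table name; each string
    is what would go into the per-table context file.
    """
    conn = sqlite3.connect(db_path)
    docs = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        # The CREATE TABLE statement as stored by SQLite.
        schema = conn.execute(
            "SELECT sql FROM sqlite_master WHERE name=?", (table,)).fetchone()[0]
        rows = conn.execute(
            f"SELECT * FROM {table} LIMIT {sample_rows}").fetchall()
        docs[table] = (f"-- Schema\n{schema}\n\n-- Sample rows\n"
                       + "\n".join(repr(r) for r in rows))
    conn.close()
    return docs
```

Each returned string would then be written to a file named after the table, ready to be indexed or dropped into context.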

The last 10% isn't nothing; I shouldn't have to do that, and it's as you say. I've had particular problems with Apple's documentation: a higher-than-expected hallucination rate in Swift when I don't provide the docs explicitly. Their docs require JavaScript (and don't work with Cursor's documentation indexing feature), which gives me a hunch about what might have happened. Scraping them was a pain in the neck. I expect this part to go away as tooling gets better.

The first 90% I expect to be replaced by better MCP tools over time, integrating vector indexing alongside traditional indexing/exploration techniques. I've got one written that lets the AI interactively poke around the database, but I've found it's not as effective as the vector index.
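The vector-index side of that can be illustrated with a toy retrieval sketch over the per-table doc files. Everything here is an assumption for illustration: `embed` is a bag-of-words stand-in for a real embedding model, and `top_tables` is a hypothetical name for the ranking step an MCP tool would expose:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. A real index would call an
    # embedding model and store dense vectors instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_tables(query: str, table_docs: dict[str, str], k: int = 3) -> list[str]:
    """Rank per-table doc strings by similarity to the query, return top k."""
    q = embed(query)
    ranked = sorted(table_docs,
                    key=lambda t: cosine(q, embed(table_docs[t])),
                    reverse=True)
    return ranked[:k]
```

The point of the comparison: an interactive "poke around the database" tool makes the model issue many exploratory queries, while the vector index jumps straight to the handful of relevant table docs in one step.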