I build these systems for a living, and I just made a post about why code is different from natural text: https://news.ycombinator.com/item?id=44107379
RAG is useful for natural text because there is no innate logic to how it's structured. Chunking on punctuation doesn't work well for natural language because people use punctuation pretty inconsistently, and the embedding models are too small to learn sensible boundaries on their own.
Source code, unlike natural text, comes with a grammar that must be followed for it to even run. Between being able to find a definition deterministically and having explicit code blocks, you've eliminated 90% of the reason you need chunking and ranking in RAG systems.
Just by using etags with a rule that captures the full scope of a function, I've gotten much better than SOTA results when working with large existing code bases. Of course, the fact that I was working in Lisp made dealing with code blocks and context essentially trivial. If you want to handle blub languages like Python and JavaScript, you need a whole team of engineers to deal with all the syntactic cancer.
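To make the "chunk on grammar, not punctuation" point concrete: here's a minimal sketch in Python (not the etags setup described above, just an analogue) that uses the language's own parser to emit one chunk per function, with boundaries taken from AST node positions rather than heuristics. The sample source and function names are made up for illustration.

```python
import ast

# Toy source file to chunk; any parseable Python module would do.
SOURCE = '''
def add(a, b):
    """Return the sum."""
    return a + b

class Greeter:
    def greet(self, name):
        return f"hello {name}"
'''

def chunk_functions(source):
    """Split source into one chunk per function/method, using the
    parser's node boundaries instead of punctuation heuristics."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno/end_lineno delimit the full scope of the definition
            # (end_lineno is available since Python 3.8).
            chunks[node.name] = "\n".join(lines[node.lineno - 1:node.end_lineno])
    return chunks

chunks = chunk_functions(SOURCE)
print(chunks["add"])
```

Each chunk is a complete, syntactically valid definition, so nothing is ever split mid-scope, which is exactly what punctuation-based chunkers get wrong.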
RAG does not just mean similarity search. It means retrieving all relevant content, including the AST dependencies: whatever you would want to know if you were to answer the query yourself.
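A rough sketch of what "including the AST dependencies" could look like, assuming a single-file Python module (the source and function names below are invented for the example): given a target function, walk its call sites and pull in every in-file definition it depends on, transitively.

```python
import ast

# Toy module: cosine depends on dot and norm.
SOURCE = '''
def norm(v):
    return sum(x * x for x in v) ** 0.5

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
'''

def retrieve(source, name):
    """Return the target function plus every in-file function it calls,
    transitively -- the context you'd want to answer a query about it."""
    tree = ast.parse(source)
    defs = {n.name: n for n in ast.walk(tree)
            if isinstance(n, ast.FunctionDef)}
    wanted, stack = set(), [name]
    while stack:
        fn = stack.pop()
        if fn in wanted or fn not in defs:
            continue  # already collected, or a builtin/external name
        wanted.add(fn)
        for node in ast.walk(defs[fn]):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                stack.append(node.func.id)
    return sorted(wanted)

print(retrieve(SOURCE, "cosine"))  # ['cosine', 'dot', 'norm']
```

A real system would resolve imports, methods, and cross-file references, but the shape is the same: retrieval driven by the dependency graph, not just embedding similarity.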
Then it must be able to search every book and paper ever written, because when it comes to deciding whether an algorithm is correct, I need to read the original paper that defined it and any updates in the literature since.
Since that RAG system doesn't, and probably never will, exist, we are stuck with vector embeddings as the common definition that everyone working in the field uses and understands.
If you were to do this by hand, would you search every book and paper ever written? That is not feasible, so you have to make a trade-off.
For alternatives to vector search, see GraphRAG and AST parsing; e.g., https://vxrl.medium.com/enhancing-llm-code-generation-with-r... or https://github.com/sankalp1999/code_qa
That's what Google Scholar is for. Use it to find the meta-analysis papers and go from there.
Which incidentally shows why RAG just means vector store + embedding model: your definition means different things to different people, and an implementation of it can't exist until we figure out AGI.