I generally agree with the article and the approach given practical constraints; however, it's all a stopgap anyway.
Using Gemini 2.5’s 1M-token context window to work with large systems of code at once immediately feels far superior to any other approach. It makes an LLM usable for things that simply aren't possible otherwise.
Of course it’s damn expensive, and so hard to do well that it's a rare luxury, for now…
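To make "the whole system at once" concrete, here's a rough sketch of what I mean, just stdlib plus whatever client you prefer; the 1M budget, the ~4 chars/token ratio, the file extensions, and the repo path are all assumptions/placeholders, not anything Gemini-specific:

    import pathlib

    MAX_TOKENS = 1_000_000       # assumed context budget
    CHARS_PER_TOKEN = 4          # crude heuristic, not exact

    def pack_repo(root: str, exts=(".py", ".ts", ".go")) -> str:
        """Concatenate source files into one giant prompt, up to an approximate token budget."""
        parts, budget = [], MAX_TOKENS * CHARS_PER_TOKEN
        for path in sorted(pathlib.Path(root).rglob("*")):
            if not path.is_file() or path.suffix not in exts:
                continue
            blob = f"\n--- {path} ---\n" + path.read_text(errors="ignore")
            if len(blob) > budget:
                break  # stop once the (approximate) context budget is spent
            parts.append(blob)
            budget -= len(blob)
        return "".join(parts)

    prompt = pack_repo("path/to/repo") + "\n\nQuestion: where is request auth handled?"
    # send `prompt` to whichever long-context model/client you use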
It's always a tradeoff, and most of the time chunking and keeping the context short performs better.
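By chunking I mean something like this map-reduce pattern, where `ask` is a hypothetical stand-in for whatever chat-completion call you use and the chunk size is an arbitrary guess; every individual call stays short, which is usually what wins:

    from typing import Callable, Dict, List

    CHUNK_CHARS = 20_000  # roughly 5k tokens per call (assumed), well below where quality drops

    def chunk_text(text: str, size: int = CHUNK_CHARS) -> List[str]:
        """Split a file body into fixed-size character chunks."""
        return [text[i:i + size] for i in range(0, len(text), size)]

    def map_reduce_answer(files: Dict[str, str], ask: Callable[[str], str], question: str) -> str:
        # map: ask about each small chunk in isolation, keeping every call's context short
        notes = []
        for name, body in files.items():
            for i, chunk in enumerate(chunk_text(body)):
                notes.append(f"{name}#{i}: " + ask(f"{question}\n\n{chunk}"))
        # reduce: condense the per-chunk notes into one final answer
        return ask(question + "\n\n" + "\n".join(notes))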
I feed long-context tasks to each new model and snapshot just to test the performance improvements, and every time it's immediately obvious that no current model can handle its own max context. I don't believe any of the benchmarks, because contrary to many of their results, no matter what the (coding) task is, quality starts degrading after just a few tens of thousands of tokens, and past a hundred thousand the accuracy becomes unacceptable. Lost-in-the-middle is still a big issue as well, at least for reasoning if not for direct recall, despite benchmarks suggesting otherwise. LLMs are still pretty unreliable at one-shotting big things, and everything around that is still alchemy.
1 million tokens is still not enough for real-life codebases (hundreds of thousands to millions of LOC).
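Back-of-envelope, assuming a rough ~10 tokens per line of code: a 500k-LOC codebase is already ~5M tokens, five times the window before you've asked a single question.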