How do you benchmark different ways to interact with employees? Neural networks are somewhere between opaque and translucent to inspection, and your only interface with them is language.
Quantitative benchmarks are not necessary anyway. A method either gets results or it doesn't.
I think we agree. Interacting with employees is not an engineering discipline, and neither is prompting.
I'm not objecting to the incantations or the vibes per se. I'm happy to use AI and try different methods to get the results I want. I just don't understand the claim that prompting is a type of engineering. If it were, you would need benchmarks.