You’re confusing engineering with maths. You engineer your prompting to maximize the chance that the LLM does what you need - in your example, producing the true answer - to get you closer to solving your problem. It doesn’t matter what the LLM does internally as long as the problem is being solved correctly.
(As an engineer it’s part of your job to know if the problem is being solved correctly.)
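To make "maximize the chance" concrete, here is a minimal sketch of what that can look like in practice: define the behaviour you need, then measure how often each candidate prompt produces it. Everything here - the eval set, the prompt templates, and ask_llm - is a hypothetical stand-in for whatever model API and task you actually have:

    import random

    def ask_llm(prompt: str) -> str:
        # Placeholder for a real model call; here it just guesses,
        # so the harness runs end to end without an API key.
        return random.choice(["4", "5"])

    EVAL_SET = [("What is 2 + 2?", "4")]  # (question, expected answer) pairs

    PROMPTS = {
        "bare": "{q}",
        "careful": "Think step by step, then give only the final answer: {q}",
    }

    def success_rate(template: str, trials: int = 20) -> float:
        # Fraction of trials in which the model returned the expected answer.
        hits = 0
        for question, expected in EVAL_SET:
            for _ in range(trials):
                if ask_llm(template.format(q=question)).strip() == expected:
                    hits += 1
        return hits / (trials * len(EVAL_SET))

    for name, template in PROMPTS.items():
        print(f"{name}: {success_rate(template):.0%} correct")

The metric here is plain success rate on a fixed eval set, and rerunning the harness is the repeatable part; with a real model you'd pin the temperature (and seed, where the provider supports one) to tighten that up.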
Maybe very, very soft "engineering". Do you have metrics on which prompt is best? What units are you measuring this in? Can you follow a repeatable process to obtain a repeatable result?