Interesting, but I'm a bit lost. You are optimising, but how do you know the ground truth of "good" and "bad"? Do you run the workflow manually and then judge the result against a predefined metric?
Or do you rely on generic benchmarks?
https://github.com/datarobot/syftr/blob/main/docs/datasets.m...
You need custom QA pairs for custom scenarios.
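To make that concrete, here is a rough sketch of what I mean by ground truth for a custom scenario. Everything in it is made up for illustration: the QA pairs, the `toy_workflow` stand-in, and the substring-match metric are my assumptions, not syftr's API or its actual evaluation method.

```python
from typing import Callable

# Hypothetical custom QA pairs for one domain-specific scenario.
QA_PAIRS = [
    {"question": "What is the refund window?", "answer": "30 days"},
    {"question": "Which plan includes SSO?", "answer": "Enterprise"},
]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count as misses."""
    return " ".join(text.lower().split())


def accuracy(workflow: Callable[[str], str], qa_pairs: list[dict]) -> float:
    """Fraction of questions whose reference answer appears in the workflow's output."""
    hits = sum(
        normalize(pair["answer"]) in normalize(workflow(pair["question"]))
        for pair in qa_pairs
    )
    return hits / len(qa_pairs)


def toy_workflow(question: str) -> str:
    # Stand-in for a real workflow; it only knows one answer.
    canned = {"What is the refund window?": "Refunds are accepted within 30 days."}
    return canned.get(question, "I don't know.")


score = accuracy(toy_workflow, QA_PAIRS)
print(f"accuracy: {score:.2f}")
```

An optimiser can only rank one configuration above another if a number like this exists for the scenario you actually care about, which is why generic benchmarks alone would worry me.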