I am a member of the syftr team. Please feel free to ask questions.
This looks super cool! Seems like a stronger statistics-based optimization strategy than https://docs.auto-rag.com/optimization/optimization.html.
I've got a few questions:
Could I ignore all flow runs whose estimated cost is above a certain threshold, so that the overall cost of optimization stays lower? Suppose I choose an acceptable level of accuracy and skip the costlier exploration runs. Is there a risk it misses optimal configurations that are still under my cost limit?
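To make the idea concrete, here's a rough sketch of the pruning I'm imagining, written against plain Optuna (which I understand syftr builds on). The helpers build_flow, estimate_flow_cost, and evaluate_accuracy are hypothetical stand-ins, not syftr APIs:

```python
import optuna

COST_THRESHOLD = 0.05  # hypothetical cap on estimated $ per flow evaluation

def objective(trial: optuna.Trial) -> float:
    flow = build_flow(trial)        # hypothetical: sample one flow config
    if estimate_flow_cost(flow) > COST_THRESHOLD:
        # Skip the expensive evaluation entirely for over-budget flows.
        raise optuna.TrialPruned()
    return evaluate_accuracy(flow)  # hypothetical: score the flow on an eval set

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)
```

My worry is that pruning like this biases the sampler away from regions it never gets to observe.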
How does the system deal with long documents that fit in some models' context windows but not others'? And does this approach work with custom models?
How can I create and optimize for my own LLM-as-a-judge metrics, like those listed at https://mastra.ai/en/docs/evals/textual-evals#available-metr...?
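For reference, the sort of judge metric I mean is something like this minimal sketch using the OpenAI client. The prompt and scoring scheme are just placeholders, and I don't know what syftr's actual evaluator interface looks like:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an answer for faithfulness to its context.
Question: {question}
Context: {context}
Answer: {answer}
Reply with only a number from 0 (unfaithful) to 1 (fully faithful)."""

def faithfulness_judge(question: str, context: str, answer: str) -> float:
    # One judge call per example; a real metric would aggregate over a dataset.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
    )
    return float(resp.choices[0].message.content.strip())
```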
Are you going to flesh out the docs? Looks like the folder only has two markdown files right now.
Any recommendations for creating the initial QA dataset for benchmarking? Maybe build a basic RAG system, use its search results and generations as the baseline, and then have humans check and edit the pairs for comprehensiveness and accuracy. Any chance something like that is on the roadmap?
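Roughly what I have in mind, as a sketch (the prompt, model choice, and chunking are all placeholders; humans would review every pair afterward):

```python
import json
import random
from openai import OpenAI

client = OpenAI()

GEN_PROMPT = """Read the passage and write one factual question it answers,
plus a concise ground-truth answer.
Return JSON: {{"question": "...", "answer": "..."}}
Passage:
{passage}"""

def draft_qa_pairs(chunks: list[str], n: int = 50) -> list[dict]:
    pairs = []
    for passage in random.sample(chunks, n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": GEN_PROMPT.format(passage=passage)}],
            response_format={"type": "json_object"},
        )
        pair = json.loads(resp.choices[0].message.content)
        pair["source_passage"] = passage  # keep provenance for the human pass
        pairs.append(pair)
    return pairs
```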
Cool stuff, I'm hoping this approach is more widely adopted!
Given section A7 in your paper: https://arxiv.org/pdf/2505.20266
...would it be accurate to say that syftr finds Pareto-optimal choices across cost, accuracy, and latency, where accuracy is judged by an LLM whose assessments are 90% correlated with those of human labelers?
Are there three objectives (cost, accuracy, and latency), or two (cost and accuracy)?
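To pin down what I mean by Pareto-optimal, here's the toy definition I'm working from (purely illustrative, not syftr code). With the latency objective included the front is three-dimensional; without it, two:

```python
def pareto_front(flows, objectives=(("accuracy", max), ("cost", min), ("latency", min))):
    """Return the flows not dominated on every objective by another flow."""
    def dominates(a, b):
        no_worse = all((a[k] >= b[k]) if best is max else (a[k] <= b[k])
                       for k, best in objectives)
        better = any((a[k] > b[k]) if best is max else (a[k] < b[k])
                     for k, best in objectives)
        return no_worse and better
    return [f for f in flows if not any(dominates(g, f) for g in flows)]

flows = [
    {"accuracy": 0.82, "cost": 0.020, "latency": 3.1},
    {"accuracy": 0.78, "cost": 0.004, "latency": 1.2},
    {"accuracy": 0.75, "cost": 0.006, "latency": 2.0},  # dominated by the second
]
print(pareto_front(flows))  # keeps the first two flows
```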