airylizard 2 days ago

Why I came up with TSCE (Two-Step Contextual Enrichment).

+30pp uplift when using GPT-3.5-turbo on a mix of 300 tasks.

Free, open framework. Check out the repo and try it yourself:

https://github.com/AutomationOptimization/tsce_demo

I tested this another 300 times with gpt-4.1 on removing those obtrusive "em-dashes" everyone hates. I tested a single-pass baseline vs TSCE, with the same exact instructions and prompt: "Remove the em-dashes from my linkedin post. . .".

Out of the 300 tests, baseline failed to remove the em-dashes 149/300 times. TSCE failed to remove the em-dashes 18/300 times.

It works; all the data, as well as the entire script used for testing, is in the repo.
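If you just want the shape of the harness, here is a rough sketch using the standard OpenAI Python client. The names and the enrichment prompt below are illustrative stand-ins only; the real two-step prompts and the full benchmark live in tsce_chat.py in the repo:

    # Illustrative sketch only; the actual benchmark script is in the repo.
    from openai import OpenAI

    client = OpenAI()
    INSTRUCTIONS = "Remove the em-dashes from my linkedin post. . ."  # truncated, as above

    def baseline(post: str) -> str:
        # Single pass: instructions and text go out in one call.
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": f"{INSTRUCTIONS}\n\n{post}"}],
        )
        return resp.choices[0].message.content

    def tsce(post: str) -> str:
        # Two steps: an assumed first pass distills the task, and its output
        # is fed to the second pass as system context. Real prompts differ.
        anchor = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user",
                       "content": f"Restate this task and its constraints:\n{INSTRUCTIONS}"}],
        ).choices[0].message.content
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "system", "content": anchor},
                      {"role": "user", "content": f"{INSTRUCTIONS}\n\n{post}"}],
        )
        return resp.choices[0].message.content

    posts = ["Example post — with an em-dash — to clean."]
    fails = sum("—" in baseline(p) for p in posts)  # a pass leaves no em-dash behind
    print(f"baseline failures: {fails}/{len(posts)}")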

arnaudsm 2 days ago

That's a lot of kilowatt-hours wasted on a find-and-replace operation.

Have you heard of text.replace("—", "-")?

airylizard 2 days ago

The test isn't for how well an LLM can find or replace a string; it's for how well it can carry out the given instructions... Is that not obvious?

thegeomaster 2 days ago

I slightly tweaked your baseline em dash example and got 100% success rate with GPT-4.1 without any additional calls, token spend, or technobabble.

System prompt: "Remove every em-dash (—) from the following text while leaving other characters unchanged.\n\nReturn only the cleaned text."

User prompt: <prompt from tsce_chat.py filled with em dashes>

Temperature: 0.0
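In code, that's the whole setup; a minimal sketch with the standard OpenAI Python client, with the user-prompt placeholder left as-is:

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0.0,
        messages=[
            {"role": "system",
             "content": ("Remove every em-dash (—) from the following text while "
                         "leaving other characters unchanged.\n\n"
                         "Return only the cleaned text.")},
            {"role": "user", "content": "<prompt from tsce_chat.py filled with em dashes>"},
        ],
    )
    print(resp.choices[0].message.content)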

airylizard 2 days ago

Hey, thanks for kicking the tires! The run you’re describing was done in mid-April, right after GPT-4.1 went live. Since then OpenAI has refreshed the weights behind the “gpt-4.1” alias a couple of times, and one of those updates fixed the em-dash miss.

If you reran today, you’d see the same improved pass rate I’m getting now. That’s the downside of benchmarking against floating model aliases: behaviour changes quietly unless you pin to a dated snapshot.
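Concretely, pinning just means passing the dated model name instead of the alias:

    MODEL_ALIAS = "gpt-4.1"              # floating; can be re-pointed server-side
    MODEL_PINNED = "gpt-4.1-2025-04-14"  # dated snapshot; reruns stay comparable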

For bigger, noisier prompts (or on GPT-3.5-turbo, which hasn’t changed), TSCE still gives a solid uplift, so the framework’s value stands. Appreciate you checking it out!

thegeomaster 2 days ago

> Since then OpenAI has refreshed the weights behind the “gpt-4.1” alias a couple of times, and one of those updates fixed the em-dash miss.

I don't know where you are getting this information from... The only snapshot of gpt-4.1 is gpt-4.1-2025-04-14 (mid-April), and the gpt-4.1 alias still points to it [1].

Just to be sure, I re-ran my test specifying that particular snapshot and am still getting a 100% pass rate.

[1]: https://platform.openai.com/docs/models/gpt-4.1

airylizard 1 day ago

Right, the 4.1 training checkpoint hasn’t moved. What has moved is the glue on top: decoder heuristics / safety filters / logit-bias rules that OpenAI can hot-swap without re-training the model. Those “serving-layer” tweaks are what stomped the obvious em-dash miss for short, clean prompts. So the April-14 weights are unchanged, but the pipeline that samples from those weights is stricter about “don’t output X” than it was on day one.

By all means, keep trying to poke holes! I’ve got nothing to sell; just sharing insights and happy to stress-test them.
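For a feel of the kind of knob I mean, the public API exposes one itself: logit_bias can effectively ban a token without touching the weights. A rough sketch, assuming gpt-4.1 uses the o200k_base encoding (and noting that merged tokens containing an em-dash could still slip through):

    import tiktoken
    from openai import OpenAI

    enc = tiktoken.get_encoding("o200k_base")              # assumed gpt-4.1 tokenizer
    ban = {str(tid): -100 for tid in enc.encode("—")}      # -100 ~ never sample these

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4.1-2025-04-14",
        logit_bias=ban,
        messages=[{"role": "user", "content": "Rewrite this, keeping punctuation: a—b"}],
    )
    print(resp.choices[0].message.content)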