kweingar 6 days ago

How do we benchmark these different methodologies?

It all seems like vibes-based incantations. "You are an expert at finding vulnerabilities." "Please report only real vulnerabilities, not any false positives." Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?

nindalf 6 days ago

The author is up front about the limitations of their prompt. They say

> In fact my entire system prompt is speculative in that I haven’t ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I’ll let you know.

0points 5 days ago

The author seems to downplay their own expertise and attribute it to the LLM, while at the same time admitting they're vibe-prompting it, dismissing the wrong results, and hyping the ones that happen to work out.

This seems more like wishful thinking and fringe stuff than CS.

pixl97 5 days ago

Science starts at the fringe with a "that's interesting."

The interesting thing here is that the LLM can come to very complex correct answers some of the time. The problem space of understanding and finding bugs is so large that this isn't just chance; it's not like flipping a coin.

The issue for any particular user is that the amount of testing required to turn this into science is really massive.

mrlongroots 6 days ago

I think there are two aspects to LLM usage:

1. Having workflows to be able to provide meaningful context quickly. Very helpful.

2. Arbitrary incantations.

I think No. 2 may provide some random amount of value with one model and not another, but as a practitioner you shouldn't need to worry about it long-term. The patterns models pay attention to will change over time, especially as they become more capable. No. 1 is where the value is.

As an example: as a systems grad student, I find it a lot more useful to maintain a project wiki with LLMs in the picture. It makes coordinating with human collaborators easier too, and I just copy-paste the entire wiki before beginning a conversation. Any time I have a back-and-forth with an LLM about design discussions that I want archived, I ask it to emit markdown, which I then paste into the wiki. It's not perfectly organized, but it keeps the key bits there and makes generating papers etc. that much easier.
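The "paste the entire wiki" step is trivial to script, too. A rough sketch (the wiki/ layout and the final model call are placeholders, not anything standard):

  import pathlib

  # Concatenate every wiki page into one preamble so the model
  # starts each conversation with the full project context.
  pages = sorted(pathlib.Path("wiki").glob("*.md"))
  preamble = "\n\n".join(p.read_text() for p in pages)

  question = "Where should the new replication logic live?"
  prompt = preamble + "\n\n" + question
  # ...send `prompt` through whatever chat client you use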

TrapLord_Rhodo 5 days ago

> ksmbd has too much code for it all to fit in your context window in one go. Therefore you are going to audit each SMB command in turn. Commands are handled by the __process_request function from server.c, which selects a command from the conn->cmds list and calls it. We are currently auditing the smb2_sess_setup command. The code context you have been given includes all of the work setup code up to the __process_request function, the smb2_sess_setup function and a breadth first expansion of smb2_sess_setup up to a depth of 3 function calls.

The author deserves more credit here than just "vibing".
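That "breadth first expansion ... up to a depth of 3" is a concrete, reproducible context-construction step, not an incantation. A minimal sketch of the idea (the call graph itself is assumed to come from elsewhere, e.g. cscope or a parser):

  from collections import deque

  def expand_context(call_graph, root, max_depth=3):
      """Return `root` plus every function reachable within
      `max_depth` calls, breadth-first, without repeats."""
      seen, order = {root}, [root]
      queue = deque([(root, 0)])
      while queue:
          fn, depth = queue.popleft()
          if depth == max_depth:
              continue
          for callee in call_graph.get(fn, ()):
              if callee not in seen:
                  seen.add(callee)
                  order.append(callee)
                  queue.append((callee, depth + 1))
      return order

  # e.g. expand_context(graph, "smb2_sess_setup") lists the
  # functions whose source gets pasted into the audit prompt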

kristopolous 6 days ago

I usually like fear, shame and guilt based prompting: "You are a frightened and nervous engineer that is very weary about doing incorrect things so you tread cautiously and carefully, making sure everything is coherent and justifiable. You enjoy going over your previous work and checking it repeatedly for accuracy, especially after discovering new information. You are self-effacing and responsible and feel no shame in correcting yourself. Only after you've come up with a thorough plan ... "

I use these prompts everywhere. I get significantly better results, mostly because it encourages backtracking and, if I were to guess, enforces a higher confidence threshold before acting.

The expert engineering ones usually end up creating mountains of slop, refactoring things, and touching a bunch of code it has no business messing with.

I also have used lazy prompts: "You are positively allergic to rewriting anything that already exists. You have multiple mcps at your disposal to look for existing solutions and thoroughly read their documentation, bug reports, and git history. You really strongly prefer finding appropriate libraries instead of maintaining your own code"
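For what it's worth, these personas all just end up in the system message, so it's easy to keep them as named presets. A sketch assuming an OpenAI-style chat API (model name chosen arbitrarily, prompts abbreviated):

  from openai import OpenAI

  PERSONAS = {
      "nervous": "You are a frightened and nervous engineer ...",
      "lazy": "You are positively allergic to rewriting anything ...",
  }

  client = OpenAI()
  reply = client.chat.completions.create(
      model="gpt-4o",
      messages=[
          {"role": "system", "content": PERSONAS["nervous"]},
          {"role": "user", "content": "Review this diff for correctness."},
      ],
  )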

hollerith 6 days ago

Should be "wary".

kristopolous 6 days ago

oh interesting, I somehow survived 42 years and didn't know there were 2 words there. I'll check my prompts and give it a go. Thanks.

ValentineC 6 days ago

I'd be weary of the model doing incorrect things too. Nice prompt though! I'll try it out in Roo soon.

Now I wonder how the model reasons between the two words in that black box of theirs.

kristopolous 6 days ago

I was coding a chat bot with an agent like everyone else at https://github.com/day50-dev/llmehelp and I called the agent "DUI" mode because it's funny.

However, as I was testing it, it would do reckless and irresponsible things. After I changed the name, as far as what the bot sees, to "Do-Ur-Inspection" mode, it became radically better.

None of the words you give it are free from consequences. It didn't just discard the "DUI" name as a mere title and move on. Fascinating lesson.

naasking 5 days ago

> Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?

You just described one critical aspect of engineering: discovering a property of a system and feeding that knowledge back into a systematic, iterative process of refinement.

kweingar 5 days ago

I can't think of many engineering disciplines that do things this way. "This seems to work, I don't know how or why it works, I don't even know if it's possible to know how or why it works, but I will just apply this moving forward, crossing my fingers that in future situations it will work by analogy."

If the act of discovery and iterative refinement makes prompting an engineering discipline, then is raising a baby also an engineering discipline?

naasking 5 days ago

Lots of engineering disciplines work this way. For instance, materials science is still crude: we don't have complete theories for why some materials have the properties they do (like concrete or superconductors), so we simply quantify what those properties are under a wide range of conditions and then make use of those materials under suitable conditions.

> then is raising a baby also an engineering discipline?

The key to science and engineering is repeatability. Raising a baby is an N=1 trial, no guarantees of repeatability.

limflick 5 days ago

I think the point is that it's more about trial and error, and less about blindly winging it. When you don't know how a system works, you latch on to whatever initially seems to work and proceed from there to find patterns. It's not an entire approach to engineering, just a small part of the process.

p0w3n3d 6 days ago

Listen to the video Karpathy made about LLMs; he explains why made-up HTML tags work. It's to help the tokenizer.

dotancohen 5 days ago

I recall this even being in the Anthropic documentation.

dotancohen 5 days ago

Here, found it:

> Use XML tags to structure your prompts

> There are no canonical “best” XML tags that Claude has been trained with in particular, although we recommend that your tag names make sense with the information they surround.

https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
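In practice you just invent tag names that label each part of the prompt. A made-up example (the specific tag names mean nothing special to the model):

  prompt = """<code>
  {source}
  </code>

  <instructions>
  Audit the function above for memory-safety bugs only.
  Report a finding only if you can point to the exact line.
  </instructions>"""

  # fill in with prompt.format(source=...) before sending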

justsomehnguy 5 days ago

My guess would be that there's enough training material that merely tagging something is enough to get a better SNR.

victor106 5 days ago

Could not find it. Can you please provide a link?

p0w3n3d 5 days ago

https://youtu.be/7xTGNNLPyMI?si=eaqVjx8maPtl1STJ

He shows how the prompt is parsed, etc. Very nice and eye-opening. It also dispels some superstition.

stingraycharles 5 days ago

It’s not that difficult to benchmark these things, e.g. have an expected result and a few variants of templates.

But yeah prompt engineering is a field for a reason, as it takes time and experience to get it right.

A problem with LLMs as well is that they’re inherently probabilistic, so sometimes they’ll just choose an answer with a super low probability. We’ll probably get better at this in the next few years.
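A minimal harness for the first point, with repeated trials to smooth out the sampling noise (`run_model`, the templates, and the substring check are all illustrative, not any particular API):

  TEMPLATES = {
      "plain": "Find the bug in:\n{code}",
      "expert": "You are an expert auditor. Find the bug in:\n{code}",
  }
  TRIALS = 10  # repeat because sampling is stochastic

  def score(template, cases, run_model):
      """cases: list of (code, expected) pairs; returns hit rate."""
      hits = 0
      for code, expected in cases:
          for _ in range(TRIALS):
              answer = run_model(template.format(code=code))
              hits += expected in answer  # crude stand-in for real grading
      return hits / (len(cases) * TRIALS)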

ptdnxyz 5 days ago

How do you benchmark different ways to interact with employees? Neural networks are somewhere between opaque and translucent to inspection, and your only interface with them is language.

Quantitative benchmarks are not necessary anyway. A method either gets results or it doesn't.

kweingar 5 days ago

I think we agree. Interacting with employees is not an engineering discipline, and neither is prompting.

I'm not objecting to the incantations or the vibes per se. I'm happy to use AI and try different methods to get the results I want. I just don't understand the claims that prompting is a type of engineering. If it were, then you would need benchmarks.