mobilejdral 2 days ago

I have several complex genetics problems that I give to LLMs to see how well they do. They have to reason through them to solve them. Last September they started getting close, and November was the first time an LLM was able to solve one. These are not something that can be solved in one shot, but (so far) require long reasoning. Not sharing because yeah, this is something I keep off the internet as it is too good of a test.

But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that.

tlb 1 day ago

There are plenty of articles online (and surely in OpenAI's training set) on this topic, like https://earthsky.org/space/planet-nine-orbit-map/.

Answer quality is a fair test of regurgitation and whether it's trained on serious articles or the Daily Mail clickbait rewrite. But it's not a good test of reasoning.

TZubiri 2 days ago

Recursive challenges are probably ones where the difficulty is not really representative of real challenges.

Could you answer a question of the type "what would you answer if I asked you this question?"

What I'm getting at is that you might find questions that are impossible to resolve.

That said, if the only unanswerables you can find are recursive, that's a signal the AI is smarter than you?

mopierotti 1 day ago

The recursive one that I have actually been really liking recently, and that I think is a real enough challenge, is: "Answer the question 'What do you get when you cross a joke with a rhetorical question?'".

I append my own version of a chain-of-thought prompt, and I've gotten some responses that are quite satisfying and frankly enjoyable to read.

mopierotti 1 day ago

Here is an example of one such response in image form: https://imgur.com/a/Kgy1koi

econ 1 day ago

It needs a bit more reasoning, as it does find the answer but doesn't notice that it found it.

The answer is: A trick question.

mopierotti 1 day ago

Yeah. In the example I shared, my charitable interpretation would be that it's identifying the trick question as "a setup" where the punch line is the confusion the audience experiences. And in a meta sense, that would also describe the form of the entire chat.

econ 1 day ago

To state the obvious in case it wasn't: a trick question can be both a joke and a rhetorical question.

acrooks 1 day ago

Claude responded “Nothing.”

genewitch 1 day ago

"That look on your face, apparently"

latentsea 2 days ago

> what would you answer if I asked you this question?

I don't know.

namaria 2 days ago

If you have been giving the LLMs these problems, there is a non-zero chance that they have already been used in training.

rovr138 1 day ago

This depends heavily on how you use them and how you have things configured: whether you're using the API or the web UIs, and which plan you're on. For Team or Enterprise plans, training on your data is disabled by default. On personal plans it can be disabled.

Here's openai and anthropic,

https://help.openai.com/en/articles/5722486-how-your-data-is...

https://privacy.anthropic.com/en/articles/10023580-is-my-dat...

https://privacy.anthropic.com/en/articles/7996868-is-my-data...

and obviously, that doesn't include self-hosted models.

namaria 1 day ago

How do you know they adhere to this in all cases?

Do you just completely trust them to comply with self-imposed rules when there is no way to verify, let alone enforce, compliance?

blagie 1 day ago

They probably don't, but it's still a good protection if you treat it as a more limited one. If you assume the

[ ] Don't use

checkbox doesn't mean "don't use" but "don't get caught," it still limits a lot of types of use and sharing (any with externalities sufficient that they might get caught). For example, if personal data were being sold by a data broker and used by hedge funds to trade, there would be a pretty solid legal case.

namaria 1 day ago

> it still limits a lot of types of uses and sharing (any with externalities sufficient they might get caught)

I don't understand what you mean

> For example, if personal data was being sold by a data broker and being used by hedge funds to trade

It's pretty easy to buy data from data brokers. I routinely get spam on many channels. I assume that my personal data is being commercialized often. Don't you think that already happens frequently?

I honestly would not put in a textbox on the internet anything I don't assume is becoming public information.

A few months ago some guy found discarded storage devices full of medical data for sale in Belgium. No data that is recorded on media you do not control is safe.

gvhst 1 day ago

SOC 2 auditing, which both Anthropic and OpenAI have done, does provide some verification.

diggan 1 day ago

That's interesting, how do I get access to those audits/reports given I'm just an end-user?

rovr138 1 day ago

You can fill out the form here: https://trust.openai.com/

namaria 1 day ago

The audit performed by a private entity called "Insight Assurance"?

Why do you trust it?

rovr138 1 day ago

Oh, so now EVERYTHING is fake unless personally verified by you in a bunker with a Faraday cage and a microscope?

You're free to distrust everything. However, the idea that "I don't trust it so it must be invalid" isn't a solid argument; it's just your personal incredulity. You asked if there's any verification, and SOC 2 is one. You might not like it, but it's right there.

Insight Assurance is a firm that performs these standardized audits, and the audits carry actual legal and contractual risk.

So, yes, be cautious. But being cautious is different from "everything is false, they're all lying". In that scenario, NOTHING can be true unless *you* personally have done it.

namaria 1 day ago

No, you're imposing a false dichotomy.

I merely said I don't trust the big corporation with a data based business to not profit from the data I provide it with in any way they can, even if they hire some other corporation - whose business is to be paid to provide such assurances on behalf of those who pay them - to say that they pinky promise to follow some set of rules.

rovr138 1 day ago

Not a false dichotomy. I'm just calling out the rhetorical gymnastics.

You said you "don’t trust the big corporation" even if they go through independent audits and legal contracts. That’s skepticism. Now, you wave it off as if the audit itself is meaningless because a company did it. What would be valid then? A random Twitter thread? A hacker zine?

You can be skeptical, but you can't reject every form of verification. SOC 2 isn't a pinky promise; it's a compliance framework. It's especially required when your clients are enterprise, legal, and government entities who will absolutely sue your ass off if something comes to light.

So sure, keep your guard up. Just don’t pretend it’s irrational for other people to see a difference between "totally unchecked" and "audited under liability".

If your position is "no trust unless I control the hardware," that's fine. Go self-host, roll your own LLM, and use that in your air-gapped world.

namaria 1 day ago

If anyone is performing "rhetorical gymnastics" here, it's you. I've explained my position in very straightforward words.

I have worked with big audit. I have an informed opinion on what I find trustworthy in that domain.

This ain't it. There's no need to pretend I have said anything other than "personal data is not safe in the hands of corporations that profit from personal data".

I don't feel compelled to respond any further to fallacies and attacks.

rovr138 1 day ago

You’re not the only one that’s worked with audits.

I get I won’t get a reply, and that’s fine. But let’s be clear,

> I've explained my position in very straightforward words.

You never explained what would be enough proof which is how this all started. Your original post had,

> Do you just completely trust them to comply with self imposed rules when there is no way to verify, let alone enforce compliance?

And no. Someone mentioned they go through SOC 2 audits, and you then shifted the questioning to the organization doing the audit itself.

You now said

> I have an informed opinion on what I find trustworthy in that domain.

Which again, you failed to expand on.

So you see, you just keep shifting the blame without explaining anything. Your argument boils down to "you're wrong because I'm right". I also don't have any idea who you are, so I can't say "this person has the credentials; I should shut up."

So all I see is the goalposts being moved and no information given.

I’m out too. Good luck.

golergka 2 days ago

What area is this problem from? What areas in general did you find useful for creating such benchmarks?

Maybe instead of sharing (and leaking) these prompts, we can share methods for creating them.

mobilejdral 2 days ago

Think questions where there is a ton of existing medical research but no clear answer yet. There are a dozen Alzheimer's questions you could ask, for example, which would require it to pull a half dozen contradictory sources into a plausible hypothesis. If you have studied Alzheimer's extensively, it is trivial to evaluate the responses. One question around Alzheimer's is one of my go-to questions. I am testing its ability to reason.

henryway 2 days ago

Can God create something so heavy that he can’t lift it?

viraptor 2 days ago

There's so much text on this already that it's unlikely to even engage any reasoning. Or more specifically, if it mashed together a few existing answers from philosophy, you wouldn't be able to tell that apart from reasoning anyway.