The most interesting and significant bit of this article for me was that the author ran this search for vulnerabilities 100 times for each of the models. That's significantly more computation than I've historically been willing to expend on most of the problems that I try with large language models, but maybe I should let the models go brrrrr!
I realised I didn't mention it in the article, so in case you're curious it cost about $116 to run the 100k token version 100 times.
So, half that for batch processing [1], which presumably would be just fine for this task?
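If you want to sanity-check the arithmetic, here's a minimal back-of-envelope sketch (the even per-run split and the ~50% batch discount are assumptions based on the figures quoted above):

    # Back-of-envelope: cost per run and a hypothetical batch total.
    total_cost_usd = 116.0              # quoted cost for 100 runs (100k-token version)
    runs = 100
    per_run = total_cost_usd / runs     # ~$1.16 per run
    batch_total = total_cost_usd * 0.5  # assuming a ~50% batch-processing discount
    print(f"per run: ${per_run:.2f}, batched total: ${batch_total:.2f}")

That works out to roughly a dollar per attempt, or about $58 for the whole hundred if batching really halves it.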
Thank you, I was going to ask about this. It's not a crazy amount...
Do we know how that relates to actual operating cost? My understanding is that this is below cost price, because we're still in the investor-hype part of the cycle, where they're trying to capture market share by pumping many millions into these companies and projects.
Does this really reflect the resource cost of finding this vulnerability?
It sounds like a crazy amount to me. I can run code analyzers/sanitizers/fuzzers on every commit to my repo at virtually no cost. Would they have caught a problem like this? Maybe not, and certainly not without some false positives. Still, this LLM approach costs many millions of times more than previous tooling, and might still have brought up nothing (we just don't read the blog posts about those attempts).
Zero days can go for $$$, or you can go down the bug bounty route and also get $$. The cost of the LLM would be a drop in the bucket.
When the cost of inference gets near zero, I have no idea what the world of cyber security will look like, but it's going to be a very different space from today.
Except in this case the LLM was pointed at a known-to-exist vulnerability. That's $116 per handler per vulnerability type, and it's unknown how many vulnerabilities exist.
o3 discovered a new zero-day vulnerability; it wasn't previously known, and it's not the same one the author had already found.
A lot of money is all you need~
A lot of burned coal, is what.
The "don't blame the victim" trope is valid in many contexts. This one application might be "hackers are attacking vital infrastructure, so we need to fund vulnerabilities first". And hackers use AI now, likely hacked into and for free, to discover vulnerabilities. So we must use AI!
Therefore, the hackers are contributing to global warming. We, dear reader, are innocent.
So basically running a microwave for about 800 seconds, or a bit more than 13 minutes per model?
Oh my god, the world is gonna end. Too bad we panicked over exaggerated energy-consumption numbers for individual LLM use.
Yes, when a lot of people do a lot of prompting, this one tenth of a second to 8 seconds of running the microwave per prompt adds up. But I strongly suggest that we could all drop our energy consumption significantly by other means, instead of blaming the blog post's author for his energy consumption.
The "lot of burned coal" is probably not that much in this blog post's case, given that 1 kWh is about 0.12 kg coal equivalent (and yes, I know that we need to burn more than that for 1 kWh). Still not that much, compared to quite a few other human activities.
If you want to read up on it, James O'Donnell and Casey Crownhart try to pull together a detailed account of AI energy usage for MIT Technology Review.[1] I found that quite enlightening.
[1]: https://www.technologyreview.com/2025/05/20/1116327/ai-energ...
The better answer is just "I don't care".
Because I definitely don't care. Energy-expenditure numbers are always used in isolation, lest anyone have to deal with anything real about them, and they're always content to ignore the abstraction that electricity is: namely, electricity is not coal. It's electricity. Unlike, say, driving my petrol-powered car, the power for my computers might come from solar panels, coal, nuclear power stations, geothermal, or hydro...
Which is to say, if people want to worry about electricity usage: go worry about it by either building more clean energy, or campaigning to raise electricity prices.
Funny, I actually care. But I try to direct my care towards the real culprits.
About 50% of CO2 emissions in Germany come from just 20 sources. Campaigns like the personal carbon footprint (invented by BP) exist to shift the blame onto consumers, away from those with the biggest impact and the most options for action.
So yes, I f**ng don't care if a security researcher leaves his microwave equivalent running for a few minutes. But I do care, I campaign in the bigger sense, and I orient my own consumption towards cleaner options wherever possible.
Knowing full well that, even while being mostly reasonable in my consumption, I definitely belong to the 5-10% of earth's population who drive the problem. Because more than half of the population in the so-called first world lives according to the Paris Climate Agreement, and it's not the upper half that does.
Between $3k and $30k to solve a single ARC-AGI problem [1]. Not sure if "100 runs" makes this comparable.
[1] https://techcrunch.com/2025/04/02/openais-o3-model-might-be-...
I think it gave up trying to solve Pokemon. :) Seriously, aren't these ARC-AGI problems easy for most people? They usually involve some sort of pattern recognition and visual reasoning.
And how do you know what the purely-human-driven energy expenditure would have been?
How much longer would OP have needed to find the same vulnerability without LLM help? Then multiply that by the energy used to produce 2000 kcal/day of food, as well as the electricity for running their computer.
Usually LLMs come out far ahead in those types of calculations. Compared to humans, they are quite energy-efficient.
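A sketch of that comparison, where every input is an assumption (a hypothetical five days of human effort, food energy at 1 kcal = 1.163 Wh, a ~50 W laptop, and the ~0.24 kWh microwave-equivalent figure from upthread):

    # Hypothetical human-vs-LLM energy comparison; all inputs are assumptions.
    days = 5                               # assumed time to find the bug by hand
    food_kwh = days * 2000 * 1.163 / 1000  # 2000 kcal/day; 1 kcal = 1.163 Wh
    laptop_kwh = days * 8 * 0.05           # 8 h/day at an assumed 50 W draw
    human_kwh = food_kwh + laptop_kwh      # ~11.6 + 2.0 = ~13.6 kWh
    llm_kwh = 0.24                         # microwave-equivalent estimate above
    print(f"human: {human_kwh:.1f} kWh vs LLM: {llm_kwh:.2f} kWh")

Under those (very contestable) assumptions the LLM comes out one to two orders of magnitude ahead, though of course the human doesn't stop metabolising when they're not hunting bugs.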
Those types of calculation are extremely disingenuous.
What exactly is disingenuous about it?
It reduces the value of a human life to the incremental rate at which they produce some concrete product. It is absurd.
Or, it measures the tasks artificial intelligence performs against their actual difficulty: the human effort they would otherwise require.
You're not thinking this through. Your human life (with its associated 2000 Cal/day) does so much more than find bugs in obscure codebases. Or at least, one would hope.
"100 times for each of the models" represents a significant amount of energy burned. The achievement of finding the most common vulnerability in C based codebases becomes less of an achievement. And more of a celebration of decadence and waste.
We are facing global climate change event, yet continue to burn resources for trivial shit like it’s 1950.