kgeist 1 day ago

Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually - just feeding compilation errors, warnings, and suggestions in a loop via the canvas interface. The file was small, around 150 lines.

It didn't go well. I started with 4o:

- It used a deprecated package.

- After I pointed that out, it didn't update all usages - so I had to fix them manually.

- When I suggested a small logic change, it completely broke the syntax (we're talking "foo() } return )))" kind of broken) and never recovered. I gave it the raw compilation errors over and over again, but it didn't even register the syntax was off - just rewrote random parts of the code instead.

- Then I thought, "maybe 4.1 will be better at coding" (as advertised). But 4.1 refused to use the canvas at all. It just explained what I could change - as in, you go make the edits.

- After some pushing, I got it to use the canvas and return the full code. Except it didn't - it gave me a truncated version of the code with comments like "// omitted for brevity".

That's when I gave up.

Do agents somehow fix this? Because as it stands, the experience feels completely broken. I can't imagine giving this access to bash, sounds way too dangerous.

simonw 1 day ago

"It used a deprecated package"

That's because models have training cut-off dates. It's important to take those into account when working with them: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#a...

I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

You can tell it "look up the most recent version of library X and use that" and it will often work!

I even used it for a frustrating upgrade recently - I pasted in some previous code and prompted this:

This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.

It did exactly what I asked: https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

kgeist 1 day ago

>That's because models have training cut-off dates

When I pointed out that it used a deprecated package, it agreed and even cited the correct version after which it was deprecated (way back in 2021). So it knows it's deprecated, but the next-token prediction (without reasoning or tools) still can't connect the dots when much of the training data (before 2021) uses that package as if it's still acceptable.

>I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

Thanks for the tip!

jmcpheron 1 day ago

>I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

That is such a useful distinction. I like to think I'm keeping up with this stuff, but the '4o' versus 'o4' still throws me.

tptacek 1 day ago

Model naming is absolutely maddening.

fragmede 1 day ago

There's still skill involved in using LLMs for coding. In this case, o4-mini-high might do the trick, but the easier answer that works with other models is to include the high-level library documentation yourself as context, and it'll use that API.
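A minimal sketch of what "include the documentation as context" can look like (the prompt wording and the doc snippet are just illustrative; you'd send the result through whatever client you use):

```python
# Sketch: pin the model to a specific library version by pasting its
# docs into the prompt. The wording and the doc snippet below are
# illustrative only; the actual API client call is omitted.
def build_prompt(task: str, library_docs: str) -> str:
    return (
        "Use ONLY the API described in the documentation below; "
        "do not rely on memorized (possibly outdated) versions.\n\n"
        f"--- DOCS ---\n{library_docs}\n--- END DOCS ---\n\n"
        f"Task: {task}"
    )

prompt = build_prompt(
    "parse a TOML config",
    "tomllib.load(fp) -> dict (Python 3.11+)",
)
print(prompt.splitlines()[0])
```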

th0ma5 1 day ago

What besides anecdote makes you think a different model will be anything more than marginally better?

mbesto 22 hours ago

> That's because models have training cut-off dates.

Which is precisely the issue with the idea of LLMs completely replacing human engineers. It doesn't understand this context unless a human tells it to.

sagarpatil 1 day ago

Context7 MCP solves this. Use it with Cursor/Windsurf.

thorum 1 day ago

GPT 4.1 and 4o score very low on the Aider coding benchmark. You only start to get acceptable results with models that score 70%+ in my experience. Even then, don't expect it to do anything complex without a lot of hand-holding. You start to get a sense for what works and what doesn't.

https://aider.chat/docs/leaderboards/

bjt12345 1 day ago

That being said, Claude Sonnet 3.7 seems to do very well at a recursive approach to writing a program, whereas other models don't fare as well.

k__ 1 day ago

Sonnet 3.7 was SOTA for quite some time. I built some nice charts with it. It's a rather simple task, but quite LoC-intensive.

ebiester 1 day ago

I get that it's frustrating to be told "skill issue," but using an LLM is absolutely a skill and there's a combination of understanding the strengths of various tools, experimenting with them to understand the techniques, and just pure practice.

I think if I were giving access to bash, though, it would definitely be in a docker container for me as well.
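A minimal sketch of that kind of sandbox (image name and mount paths are my assumptions; this just builds and prints the command rather than running it):

```python
import os
import shlex

# Sketch: construct a docker invocation that confines an agent's bash
# access to the current project directory, with no network. The image
# (python:3.12-slim) is an arbitrary choice for illustration.
cmd = [
    "docker", "run", "--rm", "-it",
    "--network", "none",           # no outbound access from the agent
    "-v", f"{os.getcwd()}:/work",  # expose only this project directory
    "-w", "/work",
    "python:3.12-slim", "bash",
]
print(shlex.join(cmd))
```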

wtetzner 1 day ago

Sure, you can probably get better at it, but is it really worth the effort over just getting better at programming?

cheema33 1 day ago

If you are going to race a fighter jet, and you are on a bicycle, exercising more and eating right will not help. You have to use a better tool.

A good programmer with AI tools will run circles around a good programmer without AI tools.

jsight 1 day ago

To be fair, that's also what a lot of us used to say about IDEs. In reality, plenty of folks just turned vim into a fighter jet and did just as well without super-heavyweight llms.

I'm not totally convinced that we won't see a similar effect here, with some really competitive coders 100% eschewing LLMs and still doing as well as the best that use them.

TeMPOraL 1 day ago

> In reality, plenty of folks just turned vim into a fighter jet and did just as well without super-heavyweight llms.

No, they didn't.

You can get vim and Emacs on par with IDEs[0] somewhat easily thanks to Language Server Protocol. You can't turn them into "fighter jets" without "super-heavyweight LLMs" because that's literally what, per GP, makes an editor/IDE a fighter jet. Yes, Emacs has packages for LLM integration, and presumably so does Vim, but the whole "fighter jet vs. bicycle" is entirely about SOTA LLMs being involved or not.

--

[0] - On par wrt. project-level features IDEs excel at; both editors of course have other aspects that none of the IDEs ever come close to.

jsight 15 hours ago

Honestly, that is a really fair counterpoint. I've been playing with neovim lately and it really feels a lot like some of the earlier IDEs that I used to use but with more modern power and tremendous speed.

Maybe we will all use LLMs one day in neovim too. :)

candiddevmike 1 day ago

What does that even mean? How do you even quantify that?

Groxx 1 day ago

With vibes, mostly

TeMPOraL 1 day ago

Like everything in software engineering. It's not like there's much science in any of the issues of practice programmers routinely debate.

HDThoreaun 23 hours ago

My team's velocity before and after adding AI coding to our stack.

mattbuilds 1 day ago

Got any evidence on that or is it just “vibes”? I have my doubts that AI tools are helping good programmers much at all, forget about “running circles” around others.

hdjrudni 1 day ago

I don't know about "running circles" but they seem to help with mundane/repetitive tasks. As in, LLMs provide greater than zero benefit, even to experienced programmers.

My success ratio still isn't very high, but for certain easy tasks, I'll let an LLM take a crack at it.

goatlover 1 day ago

Citation needed for your second sentence. This is the problem with AI hype cycles. Lots of outstanding claims, a lot less actual evidence supporting those claims. Lot of anecdotes though. Maybe the LLMs are in a loop recursively promoting themselves for that sweet venture funding.

ebiester 23 hours ago

Studies take time. https://www.microsoft.com/en-us/research/wp-content/uploads/... is the first one from Microsoft. But it goes back to gains coming as people become more skilled.

ebiester 23 hours ago

Yes, not because you will be able to solve harder problems, but because you will be able to more quickly solve easier problems which will free up more time to get better at programming, as well as get better at the domain in which you're programming. (That is, talking with your users.)

drittich 1 day ago

Perhaps that's a false dichotomy?

wtetzner 11 hours ago

Kinda, but there's always an opportunity cost.

cyral 1 day ago

You can do both

th0ma5 1 day ago

Except the skill involved is believing random people's advice that a different model will surely be better, with no fundamental reason or justification as to why. The benchmarks are not applicable when applying the models to new work, and benchmarks by their nature do not describe suitability for any particular problem.

codethief 1 day ago

The other day I used the Cline plugin for VSCode with Claude to create an Android app prototype from "scratch", i.e. starting from the usual template given to you by Android Studio. It produced several thousand lines of code, there was not a single compilation error, and the app ended up doing exactly what I wanted – modulo a bug or two, which were caused not by the LLM's stupidity but by weird undocumented behavior of the rather arcane Android API in question. (Which is exactly why I wanted a quick prototype.)

After pointing out the bugs to the LLM, it successfully debugged them (with my help/feedback, i.e. I provided the output of the debug messages it had added to the code) and ultimately fixed them. The only downside was that I wasn't quite happy with the quality of the fixes – they were more like dirty hacks –, but oh well, after another round or two of feedback we got there, too. I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code reviewing agent.

cheema33 1 day ago

> I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code reviewing agent.

This x 100. I get so much better quality code if I have LLMs review each other's code and apply corrections. It is ridiculously effective.

lftl 1 day ago

Can you elaborate a little more on your setup? Are you manually copyong and pasting code from one LLM to another, or do you have some automated workflow for this?

htsh 21 hours ago

I have been doing this with claude code and openai codex and/or cline. One of the three takes the first pass (usually claude code, sometimes codex), then I will have cline / gemini 2.5 do a "code review" and offer suggestions for fixes before it applies them.
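In sketch form, an automated version of that writer/reviewer loop might look like this (`call_model` is a stand-in for real API clients, stubbed here so the control flow actually runs):

```python
# Sketch of a writer/reviewer loop. call_model stands in for any LLM
# API call (the real clients, models, and prompts are up to you);
# it's stubbed with canned responses so the loop itself is runnable.
def call_model(role: str, prompt: str) -> str:
    stub = {
        "writer": "def add(a, b):\n    return a + b",
        "reviewer": "LGTM",
    }
    return stub[role]

def write_with_review(task: str, max_rounds: int = 3) -> str:
    code = call_model("writer", f"Write code for: {task}")
    for _ in range(max_rounds):
        review = call_model("reviewer", f"Review this code:\n{code}")
        if "LGTM" in review:  # reviewer is satisfied, stop iterating
            break
        code = call_model("writer", f"Apply this review:\n{review}\n\n{code}")
    return code

print(write_with_review("add two numbers"))
```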

suddenlybananas 1 day ago

What was the app? It could plausibly be something that has an open source equivalent already in the training data.

nico 1 day ago

4o and 4.1 are not very good at coding

My best results are usually with o4-mini-high; o3 is sometimes pretty good

I personally don’t like the canvas. I prefer the output on the chat

And a lot of times I say: provide full code for this file, or provide drop-in replacement (when I don’t want to deal with all the diffs). But usually at around 300-400 lines of code, it starts getting bad and then I need to refactor to break stuff up into multiple files (unless I can focus on just one method inside a file)

manmal 1 day ago

o3 is shockingly good actually. I can’t use it often due to rate limiting, so I save it for the odd occasion. Today I asked it how I could integrate a tree of Swift binary packages within an SDK, and detect internal version clashes, and it gave a very well researched and sensible overview. And gave me a new idea that I‘ll try.

kenjackson 1 day ago

I use o3 for anything math or coding related. 4o is good for things like, "my knee hurts when I do this and that -- what might it be?"

TeMPOraL 1 day ago

In ChatGPT, at this point I use 4o pretty much only for image generation; it's the one feature that's unique to it and is mind-blowingly good. For everything else, I default to o3.

For coding, I stick to Claude 3.5 / 3.7 and recently Gemini 2.5 Pro. I sometimes use o3 in ChatGPT when I can't be arsed to fire up Aider, or really need to use its search features to figure out how to do something (e.g. pinouts for some old TFT screens for ESP32 and Raspberry Pi, most recently).

hnhn34 1 day ago

Just in case you didn't know, they raised the rate limit from ~50/week to ~50/day a while ago

manmal 1 day ago

Thank you, that’s really nice actually!

johnsmith1840 1 day ago

Drop-in replacement files per update should be done with the heavy test-time-compute models.

o1-pro and o1-preview can generate updated full-file responses into the 1k LOC range.

It's something about their internal verification methods that makes it an actually viable development method.

nico 1 day ago

True. Also, the APIs don't care too much about restricting output length, they might actually be more verbose to charge more

It's interesting how the same model being served through different interfaces (chat vs api), can behave differently based on the economic incentives of the providers

danbmil99 1 day ago

As others have noted, you sound about 3 months behind the leading edge. What you describe is like my experience from February.

Switch to Claude (IMSHO, I think Gemini is considered on par). Use a proper coding tool, cutting & pasting from the chat window is so last week.

candiddevmike 1 day ago

Instead of churning on frontend frameworks while procrastinating about building things we've moved onto churning dev setups for micro gains.

latentsea 1 day ago

The amount of time spent churning on workflows and setups will offset the gains.

It's somewhat ironic the more behind the leading edge you are, the more efficient it is to make the gains eventually because you don't waste time on the micro-gain churn, and a bigger set of upgrades arrives when you get back on the leading edge.

I watched this dynamic play out so many times in the image generation space with people spending hundreds of hours crafting workflows to get around deficiencies in models, posting tutorials about it, other people spending all the time to learn those workflows. New model comes out and boom, all nullified and the churn started all over again. I eventually got sick of the churn. Batching the gains worked better.

TeMPOraL 1 day ago

Missing in your description is that at least some of that work of "people spending hundreds of hours crafting workflows to get around deficiencies in models, posting tutorials about it, other people spending all the time to learn those workflows" is exactly what informed model developers about the major problems and what solutions seem most promising. All these workarounds are organically crowd-sourcing R&D, which is arguably one of the most impressive things about whole image generation space. The community around ComfyUI is pretty much a shapeless distributed research organization.

mycall 1 day ago

> churning dev setups for micro gains.

Devs have been doing micro changes to their setup for 50 years. It is the nature of their beast.

zahlman 1 day ago

Where do people on HN meet these devs who are willing to do this sort of thing, and get anxious about being 3 months behind the latest and greatest?

In my world, they were given 9 years to switch to Python 3 even if you write off 3.0 and 3.1 as premature, and they still missed by years, and loudly complained afterwards.

And they still can't be bothered to learn what a `pyproject.toml` is, let alone actually use it for its intended purpose. One of the most popular third-party Python libraries (Requests), which is under stewardship by the PSF, which uses only Python code, had its "build" (no compilation - purely a matter of writing metadata, shuffling some files around and zipping it up) broken by the removal of years-old functionality in Setuptools that they weren't even actually remotely reliant upon. Twice, in the last year.

guappa 1 day ago

You just need to be a frontend dev on a very overstaffed team (like where I work); then you fill up your day doing that, creating a task for every couple of lines changed, and requiring multiple approvals to merge anything.

It takes me ~1 week to merge small fixes to their build system (which they don't understand anyway so they just approve whatever).

fsndz 1 day ago

It can be frustrating at times, but my experience is that the more you try, the better you become at knowing what to ask and what to expect. But I guess you understand now why some people say vibe coding is a bit overrated: https://www.lycee.ai/blog/why-vibe-coding-is-overrated

the_af 1 day ago

"Overrated" is one way to call it.

Giving sharp knives to monkeys would be another.

lnenad 1 day ago

Why do people keep thinking they're intellectually superior when negatively evaluating something that is OBVIOUSLY working for a very large percentage of people?

80hd 1 day ago

I've been asking myself this since AI started to become useful.

Most people would guess it threatens their identity. Sensitive intellectuals who found a way to feel safe by acquiring deep domain-specific expertise suddenly feel vulnerable.

In addition, a programmer's job, on the whole, has always been something like modelling the world in a predictable way so as to minimise surprise.

When things change at this rate/scale, it also goes against deep rooted feelings about the way things should work (they shouldn't change!)

Change forces all of us to continually adapt and to not rest on our laurels. Laziness is totally understandable, as is the resulting anger, but there's no running away from entropy :}

the_af 23 hours ago

> I've been asking myself this since AI started to become useful.

For context: we're specifically discussing vibe coding, not AI or LLMs.

With that in mind, do you think any of the rest of your comment is on-topic?

guappa 1 day ago

Because that "very large percentage of people" is a few people doing hello worlds or things of similar difficulty.

Not every software developer is hired to do trivial frontend work.

FeepingCreature 1 day ago

The large percentage of software development is people doing hello world or similar difficulty. "CRUD apps," remember?

the_af 23 hours ago

Hopefully they are not vibe-coding that crap though. Do you want to make those apps even more unreliable than they already are, and encourage devs not to learn any lessons (as vibe coding prescribes)?

lnenad 21 hours ago

Sure, you keep telling that to yourself.

hackable_sand 17 hours ago

It's not obvious that it's "working" for a "very large" percentage of people. Probably because this very large group of people keep refusing to provide metrics.

I've vibe-coded completely functional mobile apps, and used a handful LLMs to augment my development process in desktop applications.

From that experience, I understand why parsing metrics from this practice is difficult. Really, all I can say is that codegen LLMs are too slow and inefficient for my workflow.

the_af 23 hours ago

> Why do people keep thinking they're intellectually superior when negatively evaluating something that is OBVIOUSLY working for a very large percentage of people?

I'm not talking about LLMs, which I use and consider useful, I'm specifically talking about vibe coding, which involves purposefully not understanding any of it, just copying and pasting LLM responses and error codes back at it, without inspecting them. That's the description of vibe coding.

The analogy with "monkeys with knives" is apt. A sharp knife is a useful tool, but you wouldn't hand it to an inexperienced person (a monkey) incapable of understanding the implications of how knives cut.

Likewise, LLMs are useful tools, but "vibe coding" is the dumbest thing ever to be invented in tech.

> OBVIOUSLY working

"Obviously working" how? Do you mean prototypes and toy examples? How will these people put something robust and reliable in production, ever?

If you meant for fun & experimentation, I can agree. Though I'd say vibe coding is not even good for learning, because it actively encourages you not to understand any of it (or it stops being vibe coding and turns into something else). Is that what you're advocating as "obviously working"?

lnenad 20 hours ago

> The analogy with "monkeys with knives" is apt. A sharp knife is a useful tool, but you wouldn't hand it to an inexperienced person (a monkey) incapable of understanding the implications of how knives cut.

Could an experienced person/dev vibe code?

> "Obviously working" how? Do you mean prototypes and toy examples? How will these people put something robust and reliable in production, ever?

You really don't think AI could generate a working CRUD app which is the financial backbone of the web right now?

> If you meant for fun & experimentation, I can agree. Though I'd say vibe coding is not even good for learning because it actively encourages you not to understand any of it (or it stops being vibe coding, and turns into something else). It's that what you're advocating as "obviously working"?

I think you're purposefully reducing the scope of what vibe coding means to imply it's a fire and forget system.

the_af 19 hours ago

> Could an experienced person/dev vibe code?

Sure, but why? They already paid the price in time/effort of becoming experienced, why throw it all away?

> You really don't think AI could generate a working CRUD app which is the financial backbone of the web right now?

A CRUD? Maybe. With bugs and corner cases and scalability problems. A robust system in other conditions? Nope.

> I think you're purposefully reducing the scope of what vibe coding means to imply it's a fire and forget system.

It's been pretty much described like that. I'm using the standard definition. I'm not arguing against LLM-assisted coding, which is a different thing. The "vibe" of vibe coding is the key criticism.

lnenad 18 hours ago

> Sure, but why? They already paid the price in time/effort of becoming experienced, why throw it all away?

You spend 1/10 of the time doing something, and you have the other 9/10 to yourself.

> A CRUD? Maybe. With bugs and corner cases and scalability problems. A robust system in other conditions? Nope.

Now you're just inventing stuff. "scalability problems" for a CRUD app. You obviously haven't used it. If you know how to prompt the AI it's very good at building basic stuff, and more advanced stuff with a few back and forth messages.

> It's been pretty much described like that. I'm using the standard definition. I'm not arguing against LLM-assisted coding, which is a different thing. The "vibe" of vibe coding is the key criticism.

By whom? Wikipedia says

> Vibe coding (or vibecoding) is an approach to producing software by depending on artificial intelligence (AI), where a person describes a problem in a few sentences as a prompt to a large language model (LLM) tuned for coding. The LLM generates software based on the description, shifting the programmer's role from manual coding to guiding, testing, and refining the AI-generated source code.[1][2][3] Vibe coding is claimed by its advocates to allow even amateur programmers to produce software without the extensive training and skills required for software engineering.[4] The term was introduced by Andrej Karpathy in February 2025[5][2][4][1] and listed in the Merriam-Webster Dictionary the following month as a "slang & trending" noun.[6]

Emphasis on "shifting the programmer's role from manual coding to guiding, testing, and refining the AI-generated source code" which means you don't blindly dump code into the world.

the_af 14 hours ago

Doing something badly in 1/10 of the time isn't going to save you that much time, unless it's something you don't truly care about.

I have used AI/LLMs; in fact I use them daily and they've proven helpful. I'm talking specifically about vibe coding, which is dumb.

> By whom? [...] Emphasis on "shifting the programmer's role from manual coding to guiding, testing, and refining the AI-generated source code" which means you don't blindly dump code into the world.

By Andrej Karpathy, who popularized the term and describes it as mostly blindly dumping code into the world:

> There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.

He even claims "it's not too bad for throwaway weekend projects", not for actual production-ready and robust software... which was my point!

Also see Merriam-Webster's definition, mentioned in the same Wikipedia article you quoted: https://www.merriam-webster.com/slang/vibe-coding

> Writing computer code in a somewhat careless fashion, with AI assistance

and

> In vibe coding the coder does not need to understand how or why the code works, and often will have to accept that a certain number of bugs and glitches will be present.

and, M-W quoting the NYT:

> You don’t have to know how to code to vibecode — just having an idea, and a little patience, is usually enough.

and, quoting from Ars Technica

> Even so, the risk-reward calculation for vibe coding becomes far more complex in professional settings. While a solo developer might accept the trade-offs of vibe coding for personal projects, enterprise environments typically require code maintainability and reliability standards that vibe-coded solutions may struggle to meet.

I must point out this is more or less the claim I made and which you mocked with your CRUD remarks.

lnenad 2 hours ago

> Doing something badly in 1/10 of the time isn't going to save you that much time, unless it's something you don't truly care about.

You're adding "badly" like it's a fact when it is not. Again, in my experience, in the experience of people around me and many experiences of people online AI is more than capable of doing "simpler" stuff on its own.

> By Andrej Karpathy, who popularized the term

Nowhere in your quoted definitions does it say you don't *ever* look at the code. MW says non-programmers can vibe code, and that it's done "in a somewhat careless fashion"; neither of those implies you CANNOT look at the code for it to be vibe coding. If Andrej didn't look at it, that doesn't mean the definition is that you are not to look at it.

> which you mocked with your CRUD remarks

I mocked nothing, I just disagree with you, since as a dev with over 10 years of experience I've been using AI for both my job and personal projects with great success. People who complain about AI expect it to parse "Make an iOS app with stuff" successfully, and I am sure it will at some point, but for now it requires finer-grained instructions to ensure its success.

baq 1 day ago

Vibe coding has a vibe component and a coding component. Take away the coding and you’re only left with vibe. Don’t confuse the two.

I say that as someone who has vibe-coded React internal tooling used in production without issues; it easily saved days of work.

the_af 23 hours ago

> Don’t confuse the two.

Vibe coding as was explained by the popularizer of the term involves no coding. You just paste error messages, paste the response of the LLM, paste the error messages back, paste the response, and pray that after several iterations the thing converges to a result.

It involves NOT looking at either the LLM output or the error messages.

Maybe you're using a different definition?

baq 23 hours ago

A case can be made that vibe coding presumes an experienced coder, as the author of the term most definitely is, and I feel this context is at the very least being conveniently omitted at times. Whether he was truly not doing anything at all, or glanced at 1% of the generated code to check whether the model was getting lost, is important, as is being able to know what to ask the model for.

Horror stories from newbies launching businesses and getting their data stolen because they trust models are to be expected, but I would not call them vibe coding horror stories, since there is no coding involved even by proxy, it's copy pasting on steroids. Blind copy pasting from stack overflow was not coding for me back then either. (A minute of silence for SO here. RIP.)

the_af 19 hours ago

The problem with this discussion is that different interlocutors have different opinions of what vibe coding really means.

For example, another person in this thread argues:

> I'd rather give my green or clueless or junior or inexperienced devs said knives than having them throw spaghetti on a wall for days on end, only to have them still ask a senior to help or do the work for them anyways.

So they are clearly not talking about experienced coders. They are also completely disregarding the learning experience any junior coder must go through in order to become an experienced coder.

This is clearly not what you're arguing though. So which "vibe coding" are we discussing? I know which one I meant when I spoke of monkeys and sharp knives...

baq 19 hours ago

I mean it very literally, taking the what he said together with who is the person who said it - an experienced professional sculpting a solution using a very complex set of tools, with a clear idea in his head, but with unusual and slightly uncomfortable disinterest in the exact details of how the final product looks from the inside.

the_af 14 hours ago

I'm mostly going by what he said: https://x.com/karpathy/status/1886192184808149383

He seems to think it barely involves coding ("I don't read the diffs anymore, I Accept All [...] It's not really coding"), and that it's only good for goofing and throwaway code...

zo1 1 day ago

I'd rather give my green or clueless or junior or inexperienced devs said knives than have them throw spaghetti at a wall for days on end, only to still ask a senior to help or do the work for them anyway.

jrh3 23 hours ago

It's somewhere in between. Said struggle is where they learn. Guidance from seniors is important, but they need to figure it out to grow.

guappa 1 day ago

I'm sure you'd think differently after constant production outages.

the_af 23 hours ago

How will they ever learn if all they do is copy-paste things without any real understanding, as prescribed by vibe coding?

zo1 18 hours ago

I'm not advocating for vibe coding, that's new-age hipster talk. But just using the AI for help, assistance, and doing grunt work is where we have to go as an industry.

abiraja 1 day ago

GPT4o and 4.1 are definitely not the best models to use here. Use Claude 3.5/3.7, Gemini Pro 2.5 or o3. All of them work really well for small files.

linsomniac 22 hours ago

What are people using to interface with Gemini Pro 2.5? I'm using Claude Code with Claude Sonnet 3.7, and Codex with OpenAI, but Codex with Gemini didn't seem to work very well last week, kept telling me to go make this or that change in the code rather than doing it itself.

visarga 1 day ago

You should try Cursor or Windsurf, with Claude or Gemini model. Create a documentation file first. Generate tests for everything. The more the better. Then let it cycle 100 times until tests pass.

Normal programming is like walking, deliberate and sure. Vibe coding is like surfing, you can't control everything, just hit yes on auto. Trust the process, let it make mistakes and recover on its own.
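In skeleton form the "cycle until tests pass" loop is just the following (`run_tests` and `ask_agent_to_fix` are stand-ins for your real test runner and coding tool; here `run_tests` is rigged to fail twice so the sketch runs on its own):

```python
# Sketch of "let it cycle until tests pass": run the suite, feed the
# failure log back to the agent, repeat, stop when green. Both helpers
# below are stubs standing in for real tools.
attempts = {"n": 0}

def run_tests() -> tuple[bool, str]:
    attempts["n"] += 1
    if attempts["n"] < 3:  # stub: fail the first two runs
        return False, f"FAIL: assertion error (run {attempts['n']})"
    return True, "OK"

def ask_agent_to_fix(failure_log: str) -> None:
    pass  # replace with a real call handing the log to your coding tool

runs_used = 0
for i in range(100):
    passed, log = run_tests()
    if passed:
        runs_used = i + 1
        break
    ask_agent_to_fix(log)

print(f"green after {runs_used} runs")
```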

tqwhite 1 day ago

I find that writing a thorough design spec is really worth it. Also, asking for its reaction. "What's missing?" "Should I do X or Y" does good things for its thought process, like engaging a younger programmer in the process.

Definitely, I ask for a plan and then, even if it's obvious, I ask questions and discuss it. I also point it at samples of code that I like, with instructions for what is good about it.

Once we have settled on a plan, I ask it to break it into phases that can be tested (I am not one for unit testing) to lock in progress. Claude LOVES that. It organizes a new plan and, at the end of each phase, tells me how to test (curl, command line, whatever is appropriate) and what I should see that represents success.

The most important thing I have figured out is that Claude is a collaborator, not a minion. I agree with visarga, it's much more like surfing than walking. Also, Trust... but Verify.

This is a great time to be a programmer.

prisenco 1 day ago

Given that analogy, surely you could understand why someone would much rather walk than surf to their destination? Especially people who are experienced marathon runners.

fragmede 1 day ago

If I tried standing up on the waves without a surfboard and complained that it wasn't working, would you blame the water or surfing, or the person trying to defy physics? It doesn't matter how much I want to run, or whether I'm Kelvin Kiptum - I'm gonna have a bad time.

prisenco 1 day ago

That only makes sense when surfing is the only way to get to the destination and that's not the case.

fragmede 1 day ago

Say there are two ways to get to your destination. You still need the appropriate vehicle/surfboard for the route you've chosen. Even if there's a bridge you can run or walk across, if you pick the water route without a surfboard and try to walk it, you're gonna have a bad time.

prisenco 1 day ago

Analogy feels a bit tortured at this point.

fragmede 1 day ago

What a coincidence that now's the point it's tortured and not any earlier!

Sharlin 1 day ago

It was incredibly tortured from the get go and is screaming that it be put out of its misery.

derwiki 1 day ago

Look, my lad, I know a dead parrot when I see one, and I'm looking at one right now.

latentsea 1 day ago

I'm sorry, is this the full half hour argument or only the five minute one?

Jarwain 1 day ago

Aider's benchmarks show 4.1 (and 4o) work better in its architect mode, for planning the changes, and o3 for making the actual edits

SparkyMcUnicorn 1 day ago

You have that backwards. The leaderboard results have the thinking model as the architect.

In this case, o3 is the architect and 4.1 is the editor.

drewnick 1 day ago

I see o3 (high) + gpt-4.1 at 82.7% -- the highest on the benchmark currently.

zachrip 1 day ago

People are using tools like Cursor for "vibe coding" - I've found the canvas in ChatGPT to be very buggy; it often breaks its own code and I have to babysit it a lot. But in Cursor the same model performs just fine. So it's not necessarily just the model that matters - it's how it's used. One thing people conflate a lot is ChatGPT the product vs. the GPT models themselves.

seunosewa 1 day ago

The ability to write a lot of code with OpenAI models is broken right now, especially in the app. Gemini 2.5 Pro on Google AI Studio does that well. Claude 3.7 is also better at it.

I've had limited success by prompting the latest OpenAI models to disregard every previous instruction they had about limiting their output, and to keep writing until the code is completed. They quickly forget, so you have to keep repeating the instruction.

If you're a copilot user, try Claude.
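The "keep repeating the instruction" trick can be mechanized: re-attach the reminder to every user message so it never drifts out of the effective context. A minimal sketch (the reminder wording is just an example; `send` would be whatever chat API you're using):

```python
REMINDER = (
    "Return the COMPLETE file. Do not truncate, summarize, "
    "or write '// omitted for brevity'."
)

def with_reminder(history, user_msg):
    """Append the user's message with the anti-truncation reminder attached."""
    history = list(history)  # don't mutate the caller's list
    history.append({"role": "user", "content": f"{user_msg}\n\n{REMINDER}"})
    return history

# Usage: messages = with_reminder(messages, "Now add error handling to save()")
#        reply = send(messages)  # hypothetical chat-completion call
```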

Kiro 1 day ago

That's not vibe coding. You need to use something where it applies to code changes automatically or you're not fast enough to actually be vibing. Oneshotting it like that just means you get stunlocked when running into errors or dead ends. Vibe coding is all about making backtracking, restarting and throwing out solutions frictionless. You need tooling for that.

cheema33 1 day ago

> Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually

You set yourself up to fail from the get go. But understandable. If you don't have a lot of experience in this space, you will struggle with low quality tools and incorrect processes. But, if you stick with it, you will discover better tools and better processes.

smcleod 1 day ago

GPT 4o and 4.1 are both pretty terrible for coding to be honest, try Sonnet 3.7 in Cline (VSCode extension).

LLMs don't have up-to-date knowledge of packages by themselves - that's a bit like buying a book and expecting it to have up-to-date world knowledge. You need to supplement it / connect it to a data source (e.g. web search, documentation, package version search, etc.).
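At its simplest, "connecting a data source" just means fetching the current docs and putting them in front of your question. A rough sketch (the character-budget truncation is a crude assumption; real tools do retrieval much more carefully):

```python
import urllib.request

def fetch_docs(url):
    """Pull the latest documentation page as plain text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def build_prompt(question, docs, budget=8000):
    """Prepend (truncated) current documentation so the model works from
    today's API instead of whatever it memorized before its cutoff."""
    return (
        "Use ONLY the documentation below; ignore any older API you remember.\n\n"
        f"--- DOCS ---\n{docs[:budget]}\n--- END DOCS ---\n\n"
        f"{question}"
    )
```

The explicit "ignore any older API you remember" line matters: without it, models happily blend stale training data into fresh docs.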

85392_school 1 day ago

Agents definitely fix this. When you can run commands and edit files, the agent can test its code by itself and fix any issues.
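Concretely, the agent's inner step is: run the build, capture the error output, and hand it straight back to the model. A minimal sketch of that step (`propose_edit` below is a hypothetical stand-in for the model call that edits files):

```python
import subprocess

def check_step(build_cmd):
    """One agent iteration: run the build/test command.

    Returns None on success, or the compiler/runtime output that the
    agent should feed into its next edit request.
    """
    result = subprocess.run(
        build_cmd, shell=True, capture_output=True, text=True, timeout=120
    )
    if result.returncode == 0:
        return None
    return result.stderr or result.stdout

# An agent wraps this in a loop:
#   while (errors := check_step("go build ./...")):
#       propose_edit(errors)  # hypothetical model call that edits files
```

This closes the loop the original poster was running by hand - the agent sees the same compiler errors, but automatically and immediately.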

sagarpatil 1 day ago

No one codes like this. Use Claude Code, Windsurf, Amazon Q CLI, Augment Code with Context7, and exa web search.

It should one-shot this. I’ve run complex workflows and the time I save is astonishing.

I only run agents locally in a sandbox, not in production.

skeeter2020 1 day ago

I had an even better experience. I asked to produce a small web app with a new-to-me framework: success! I asked to make some CSS changes to the UI; the app no longer builds.

theropost 1 day ago

150 lines? I find it can quickly scale to around 1500 lines, and then I get more precise about the classes and functions I am looking to modify.

jokethrowaway 1 day ago

It's completely broken for me over 400 lines (Claude 3.7, paid Cursor)

The worst is when I ask something complex, the model generates 300 lines of good code, and then it times out or crashes. If I ask it to continue, it will mess up the code for good, e.g. start generating duplicated code or functions that don't match the rest of the code.

johnsmith1840 1 day ago

It's a new skill that takes time to learn. When I started on gpt3.5 it took me easily 6 months of daily use before I was making real progress with it.

I regularly generate and run in the 600-1000LOC range.

Not sure you would call it "vibe coding" though as the details and info you provide it and how you provide it is not simple.

I'd say realistically it speeds me up 10x on fresh greenfield projects and maybe 2x on mature systems.

You should be reading the code coming out. The real way to prevent errors is to read the reasoning and logic. The moment you see a misstep, go back and try the prompt again. If that fails, try a new session entirely.

Test time compute models like o1-pro or the older o1-preview are massively better at not putting errors in your code.

Not sure about the new Claude method, but true, slow test-time compute models are MASSIVELY better at coding.

derwiki 1 day ago

The “go back and try the prompt again” is the workflow I’d like to see a UX improvement on. Outside of the vibe coding “accept all” path, reverse traversing is a fairly manual process.

baq 1 day ago

Cursor has checkpoints for this but I feel I’ve never used them properly; easier to reject all and re-prompt. I keep chats short.

tqwhite 1 day ago

Definitely a new skill to learn. Everyone I know who is having problems is just telling it what to do, not coaching it. It is not an automaton... instructions in, code out. Treat it like a team member who will do the work if you teach it right, and you will have much more success.

But it is definitely a learning process for you.

koakuma-chan 1 day ago

Sounds like a Cursor issue

fragmede 1 day ago

what language?

koonsolo 1 day ago

I code with Aider and Claude, and here is my experience:

- It's very good at writing new code

- Once it goes wrong, there is no point in trying to give it more context or corrections. It will go wrong again or at another point.

- It might help you fix an issue. But again, either it finds the issue the first time, or not at all.

I treat my LLM as a super quick junior coder with a vast knowledge base stored inside its brain. But it's very stubborn, and it can't be helped to figure out a problem it wasn't able to solve on the first try.

koakuma-chan 1 day ago

You gotta use a reasoning model.

exe34 18 hours ago

> After I pointed that out, it didn't update all usages

I find it's more useful if you start with a fresh chat and use the knowledge you have gained: "Use package foo>=1.2 with the FooBar directive" is more useful than "no, I told you to stop using that!"

It's like repeatedly telling you to stop thinking about a pink elephant.

vFunct 1 day ago

Use Claude Sonnet with an IDE.

hollownobody 1 day ago

Try o3 please. Via UI.

fragmede 1 day ago

In this case, sorry to say but it sounds like there's a tooling issue, possibly also a skill issue. Of course you can just use the raw ChatGPT web interface but unless you seriously tune its system/user prompt, it's not going to match what good tooling (which sets custom prompts) will get you. Which is kind of counter-intuitive. A paragraph or three fed in as the system prompt is enough to influence behavior/performance so significantly? It turns out with LLMs the answer is yes.

voidspark 1 day ago

The default chat interface is the wrong tool for the job.

The LLM needs context.

https://github.com/marv1nnnnn/llm-min.txt

The LLM is a problem solver but not a repository of documentation. Neural networks are not designed for that. They model at a conceptual level. It still needs to look up specific API documentation like human developers.

You could use o3 and ask it to search the web for documentation and read that first, but it's not efficient. The professional LLM coding assistant tools manage the context properly.

Sharlin 1 day ago

Eh, given how much about anything these models know without googling, they are certainly knowledge repositories, designed for it or not. How deep and up-to-date their knowledge of some obscure subject is, is another question.

voidspark 1 day ago

I meant a verbatim exact copy of all documentation they have ever been trained on - which they are not. Neural networks are not designed for that. That's not how they encode information.

Sharlin 1 day ago

That’s fair.

LewisVerstappen 1 day ago

skill issue.

The fact that you're using 4o and 4.1 rather than claude is already a huge mistake in itself.

> Because as it stands, the experience feels completely broken

Broken for you. Not for everyone else.