"Does AI finally enable truly humane interfaces?"
I think it does; LLMs in particular. AI also enables a ton of other things, many of them inhumane, which can make it very hard to discuss these things as people fixate on the inhumane. (Which is fair... but if you are BUILDING something, I think it's best to fixate on the humane so that you conjure THAT into being.)
I think Jef Raskin's goal with a lot of what he proposed was to connect the computer interface more directly with the user's intent. An application-oriented model really focuses so much of the organization around the software company's intent and position, something that follows us fully into (most of) today's interfaces.
A magical aspect of LLMs is that they can actually fully vertically integrate with intent. It doesn't mean every LLM interface exposes this or takes advantage of this (quite the contrary!), but it's _possible_, and it simply wasn't possible in the past.
For instance: you can create an LLM-powered piece of software that collects (and allows revision of) some overriding intent. Just literally take the user's stated intent and put it in a slot in all following prompts. This alone will have a substantial effect on the LLM's behavior! And importantly, you can ask for their intent, not just their specific goal. Maybe I want to build a shed, and I'm looking up some materials... the underlying goal can inform all kinds of things, like whether I'm looking for used or new materials, aesthetic or functional, etc.
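A rough sketch of the mechanism I mean (the class, its names, and the `llm` callable are mine, purely for illustration):

```python
# Illustrative sketch: keep the user's stated intent in a "slot" that is
# prepended to every subsequent prompt, so the model can shape each answer
# around the larger goal, not just the immediate request.

class IntentfulAssistant:
    def __init__(self, llm):
        self.llm = llm      # any callable that takes a prompt string and returns text
        self.intent = ""    # the user's overriding intent, revisable at any time

    def set_intent(self, intent: str) -> None:
        self.intent = intent

    def ask(self, question: str) -> str:
        prompt = (
            f"The user's overriding intent: {self.intent}\n"
            "Keep every suggestion consistent with that intent.\n\n"
            f"User request: {question}"
        )
        return self.llm(prompt)

# assistant.set_intent("I'm building a backyard shed on a budget; used materials are fine.")
# assistant.ask("What should I use for the roof?")
```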
To accomplish something with a computer we often thread together many different tools. Each of them is generally defined by its function (photo album, email client, browser-that-contains-other-things, and so on). It's up to the human to figure out how to assemble these, and at each step it's easy to become distracted or confused, to lose track of context. And again an LLM can engage with the larger task in a way that wasn't possible before.
Tell me, how does doing any of the things you've suggested help with the huge range of computer-driven tasks that have nothing to do with language? Video editing, audio editing, music composition, architectural and mechanical design, the list is vast and nearly endless.
LLMs have no role to play in any of that, because their job is text generation. At best, they could generate excerpts from a half-imagined user manual ...
Because some LLMs are now multimodal—they can process and generate not just text, but also sound and visuals. In other words, they’re beginning to handle a broader range of human inputs and outputs, much like we do.
Those are not LLMs. They use the same foundational technology (pick what you like, but I'd say transformers) to accomplish tasks that require entirely different training data and architectures.
I was specifically asking about LLMs because the comment I replied to only talked about LLMs - Large Language Models.
At this point in time calling a multimodal LLM an LLM is pretty uncontroversial. Most of the differences lie in the encoders and embedding projections. If anything I'd think MoE models are actually more different from a basic LLM than a multimodal LLM is from a regular LLM.
Bottom line is that when folks are talking about LLM applications, multimodal LLMs, MoE LLMs, and even agents all fall under the same general umbrella.
Everything has to do with language! Language is a way of stating intention, of expressing something before it exists, of talking about goals and criteria. Every example you give can be described in language. You are caught up in the mechanisms of these tools, not the underlying intention.
You can describe your intention in any of these tools. And it can be whatever you want... maybe your intention in an audio editor is "I need to finish this before the deadline in the morning but I have no idea what the client wants" and that's valid, that's something an LLM can actually work with.
HOW the LLM is involved is an open question, something that hasn't been done very well, and may not work well when applied to existing applications. But an LLM can make sense of events and images in addition to natural language text. You can give an LLM a timestamped list of UI events and it can infer quite a bit about what the user is actually doing. What does it do with that understanding? We're going to have to figure that out! These are exciting times!
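To make that concrete, here's a rough sketch (the event log, its format, and the `llm` callable are invented for the example, not any real product's API):

```python
# Rough sketch: turn a timestamped UI event log into a prompt and ask the
# model what the user appears to be doing. Purely illustrative.

ui_events = [
    ("09:14:02", "opened project 'client_mix_v3.wav'"),
    ("09:14:40", "zoomed to 02:10-02:25"),
    ("09:15:05", "applied noise reduction to selection"),
    ("09:15:30", "undo"),
    ("09:16:12", "applied noise reduction with lower threshold"),
]

def describe_activity(llm, events):
    log = "\n".join(f"{t}  {action}" for t, action in events)
    prompt = (
        "Here is a timestamped log of UI events from an audio editor:\n"
        f"{log}\n\n"
        "In one or two sentences, what does the user appear to be working on, "
        "and what are they likely to need next?"
    )
    return llm(prompt)
```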
What if you could pilot your video editing tool through voice? Have a multimodal LLM convert your instructions into some structured data instruction that gets used by the editor to perform actions.
Compare pinch zoom to the tedious scene in Bladerunner where Deckard is asking the computer to zoom in to a picture.
Zooming is a bad example (because pinch zoom is just so much better than that scene hah.) Instead "go back 5 frames, and change the color grading. Make the mood more pensive and bring out blues and magentas and fewer yellows and oranges." That's a lot faster than fiddling with 2-3 different sliders IMO.
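For illustration only, here's one plausible shape for the structured instruction an LLM could emit from a spoken request like that. The field names and the editor's command format are made up; a real editor would define its own schema:

```python
# Hypothetical structured command an LLM might produce from the spoken
# instruction above. The editor validates it against its own schema and
# applies it, just as it would apply the same change made through sliders.

color_grade_command = {
    "action": "adjust_color_grading",
    "scope": {"relative_frames": -5},   # "go back 5 frames"
    "adjustments": {
        "blues": +0.3,
        "magentas": +0.2,
        "yellows": -0.3,
        "oranges": -0.2,
    },
    "notes": "make the mood more pensive",
}
```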
> Zooming is a bad example (because pinch zoom is just so much better than that scene hah.) Instead "go back 5 frames, and change the color grading. Make the mood more pensive and bring out blues and magentas and fewer yellows and oranges." That's a lot faster than fiddling with 2-3 different sliders IMO.
Eh. That's not as good as being skilled enough to know exactly what you want and have the tools to make that happen.
There's something to be said for tools that give you the power to manipulate something efficiently, rather than systems that do the manipulation for you.
> Eh. That's not as good as being skilled enough to know exactly what you want and have the tools to make that happen.
I mean, do you know that? A tool that offers this fluent, audible experience needs to exist before you can make that assessment, right? Or are vibes alone a strong enough way to make this judgement? (There's also some strong "Less space than a Nomad. Lame" energy in this post lol.)
Moreover why can't you just have both? When I fire up Lightroom, sure I have easy mode sliders to affect "warmth" but then I have detailed panels that let me control the hue and saturation of midtones. And if those panels aren't enough I can fire up Photoshop and edit to my heart's content.
Nothing is stopping you from taking your mouse in hand at any point and saying "let me do it" and pausing the LLM to let you handle the hard bits. The same way programmers rely on compilers to generate most machine or VM code and only write machine code when the compiler isn't doing what the programmer wants.
So again, why not?
> So again, why not?
Because at my heart I'm a humanist, and I want tools that allow and encourage humans to have and express mastery themselves.
> Nothing is stopping you from taking your mouse in hand at any point and saying "let me do it" and pausing the LLM to let you handle the hard bits. The same way programmers rely on compilers to generate most machine or VM code and only write machine code when the compiler isn't doing what the programmer wants.
IMHO, good tools are deterministic, so a compiler (to use your example) is a good tool, because you can learn how it functions and gain mastery over it.
I think an AI easy-button is a bad tool. It may get the job done (after a fashion), but there's no possibility of mastery. It's making subjective decisions and is too unpredictable, because it's taking the task on itself.
And I don't think bad tools should be built, because of the weaknesses of human psychology. Something is stopping you "from taking your mouse in hand at any point and saying 'let me do it'," and it's those weaknesses. You either take the shortcut or have to exercise continuous willpower to decline it, which can be really hard and stressful. I don't think we should build bad tools that put people in that situation.
And you're not going to make any progress with me by arguing based on precedent of some widely-used bad tool. Those tools were likely a mistake too. For a long time, our society has been putting technology for its own sake ahead of people.
> And you're not going to make any progress with me by arguing based on precedent of some widely-used bad tool. Those tools were likely a mistake too. For a long time, our society has been putting technology for its own sake ahead of people.
Your comment is pretty frustrating. HN has definitely become more of a "random internet comments" forum over the years, drifting from its more grounded focus. But even when "random internet comments" talk to each other, you expect a certain forthrightness, a willingness to discuss. My reading of your comment is that you have a strong opinion, you're injecting that opinion, but you're not open to discussion on your opinion. This statement makes me feel like my time spent replying to you was a waste.
Moreover I feel like an attitude of posting but not listening when using internet forums is corrosive. In fact, when you call yourself a humanist, this confuses and frustrates me even more because I feel it's human to engage with an argument or just stop discussing when engagement is fruitless. Stating your opinion constantly without room for discussion seems profoundly inhuman to me, but I also suspect we're not going to have a productive discussion from here so I will heed my own feelings and disengage. Have a nice day.
> My reading of your comment is that you have a strong opinion, you're injecting that opinion, but you're not open to discussion on your opinion. This statement makes me feel like my time spent replying to you was a waste.
Eh, whatever. I was just trying to prevent the possibility of a particularly tiresome cookie-cutter "argument" I've seen a million times around here. I don't know if you were actually going to make it, but we're in the context where it's likely to pop up, and it'd just waste everyone's time.
Also this isn't really opinion territory, it's more values territory.
Training LLMs to generate some internal command structure for a tool is conceptually similar to what we've done with them already, but the training data for it is essentially non-existent, and would be hard to generate.
My experience has been that generating structured output with zero-, one-, and few-shot prompts works quite well. We've used it at $WORK for zero-shot stuff and it's been good enough, and I've done few-shot prompting for some personal projects where it's been solid. JSON-Schema-based enforcement of responses at temperature 0 works quite well. LLMs do sometimes hallucinate their responses, but keeping the output format tightly constrained (e.g. structured dicts of booleans) reduces hallucinations, and even when they do occur, at temperature 0 they seem to stay within < 0.1% of responses, even with zero-shot prompting. (At least with the datasets and prompts I've considered.)
(Though yes, keep in mind that 0.1% hallucination = 99.9% correctness which is really not that high when we're talking about high reliability things. With zero-shot that far exceeded my expectations though.)
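For anyone curious, a minimal sketch of the kind of setup I mean, assuming an OpenAI-style chat completions API with JSON Schema response enforcement; the model name, schema, and prompt are just placeholders, not what we actually run:

```python
# Minimal sketch of zero-shot structured output: constrained boolean-only
# schema, temperature 0, JSON Schema enforcement. Assumes an OpenAI-style
# chat completions API; model name and schema are placeholders.
import json
from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "is_action_request": {"type": "boolean"},
        "needs_clarification": {"type": "boolean"},
    },
    "required": ["is_action_request", "needs_clarification"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    temperature=0,
    messages=[
        {"role": "system", "content": "Classify the user's message."},
        {"role": "user", "content": "Go back 5 frames and bring out the blues."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "classification", "schema": schema, "strict": True},
    },
)

result = json.loads(response.choices[0].message.content)
```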