seveibar 7 days ago

This solves a major problem that I built an npm package called "pgstrap"[1] for. It generates a "database structure" directory so that my database schema is available to LLMs (it also makes code review easier because you can see the changes to various tables). So I have a SQL file for each table in my database, neatly organized into directories for each schema. Rails has a similar idea with schema.rb

I'm not sure whether it's better to have your editor database-aware or to have your codebase have the appropriate context committed. On one hand, less generated code/artifacts makes for a cleaner codebase. On the other hand, not everyone uses VS Code or will know how to use this integration. Database browser GUIs have never really had a single winner. That said, VS Code does have enough dominance to potentially make itself "the standard way to view a database in development"

[1] https://github.com/seveibar/pgstrap

semiquaver 7 days ago

I’m confused. Isn’t including the canonical state of the database schema in version control along with all the migrations that brought it to that point a completely standard part of every web framework?

xyzzy123 7 days ago

It often is, but the schema might be written against a language-specific ORM.

That code might in turn have plugins or feature flags that mean you don't know the concrete sql schema until runtime.

Same for seed data and migrations.

So it depends on the use-case how useful this format is for tooling and discovery vs an actual connection to the database.

tough 6 days ago

Sane ORMs will still use SQL for migration files

cortesoft 6 days ago

There are a lot of advantages to using a DSL for migration files rather than pure sql. For one, many times you will get both forward and reverse migration ability automatically, to allow rollbacks (with the ability to mark certain migrations as unable to rollback). You also can have the DSL utilize different features depending on which database you are using without having to change your migration. You can also use special syntax to designate relations between tables easily without having to write the boilerplate SQL every time.

If you are using an ORM, using the same language and mental model to describe your database structure as you do to interact with the database makes a lot of sense.
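To make the auto-rollback point concrete, here's a toy sketch in Python (not any real ORM's API, just an illustration): each DSL operation carries enough information to derive its own reverse, so the framework can generate rollbacks mechanically instead of making you write the reverse SQL by hand.

```python
# Toy migration DSL sketch: each operation knows how to emit both the
# forward SQL and its automatically derived reverse. Not any real ORM's
# API -- just an illustration of why a DSL can offer rollbacks for free.

class CreateTable:
    def __init__(self, name, columns):
        self.name, self.columns = name, columns

    def up(self):
        cols = ", ".join(f"{n} {t}" for n, t in self.columns.items())
        return f"CREATE TABLE {self.name} ({cols})"

    def down(self):
        # The reverse of creating a table is known mechanically.
        return f"DROP TABLE {self.name}"

class AddColumn:
    def __init__(self, table, column, col_type):
        self.table, self.column, self.col_type = table, column, col_type

    def up(self):
        return f"ALTER TABLE {self.table} ADD COLUMN {self.column} {self.col_type}"

    def down(self):
        return f"ALTER TABLE {self.table} DROP COLUMN {self.column}"

def migrate(ops):
    return [op.up() for op in ops]

def rollback(ops):
    # Reverse order, reversed operations -- derived automatically.
    return [op.down() for op in reversed(ops)]

ops = [
    CreateTable("users", {"id": "serial PRIMARY KEY", "email": "text"}),
    AddColumn("users", "name", "text"),
]
```

A real DSL layers on top of this: marking irreversible operations, targeting different database dialects, and so on, but the core trick is the same.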

OJFord 6 days ago

I agree it's much better to, but the biggest ones are probably Django & Rails (?), and they don't.

tough 6 days ago

yeah tbh i was mostly thinking about the newer crop of ts/js ones (kysely/drizzle) vs the earlier mess that was TypeORM and the others before them, so at least it's not so bad.

jen20 6 days ago

The _canonical_ state is what is in the production database(s). What's in version control is hopefully able to recreate it, with the obvious caveat of being highly unlikely to be able to repopulate the data.

dotancohen 6 days ago

For development, the actual data itself is not so important, but the features of the data are extremely important. Such as which fields have higher or lower cardinality, which fields are accessed often and which are barely touched.

Oftentimes the indexes will reflect this. Oftentimes, not.

cerved 7 days ago

I'm more confused why the version control of the thing using the database is including the entire schema of the database in its repository

schrodinger 7 days ago

The database schema of an app is tightly coupled enough to essentially be code, and migrations are also probably checked in. This lets you see how the db schema changes over time -- likely along with the queries using the schema.

int_19h 6 days ago

This is only true when your app is the only thing that is using the database. In enterprise environments, databases are frequently shared between many different apps.

mnahkies 6 days ago

Sure, and in those cases you'd typically have a dedicated repository storing said schema and migrations.

It's important to manage your schema in code for various reasons from change control to standing up development databases, etc.

koolba 7 days ago

It makes it trivial to have pretty diffs of the net result of migrations.

wredcoll 7 days ago

That "works" for about as long as you have <10 employees and <3 customers or so. After that the Rails app doesn't get to be the sole owner of the db.

bigfatkitten 6 days ago

Providing access to other services is what APIs are for.

Jeff Bezos famously said[1] that anyone who does otherwise should be fired, and I agree.

[1] https://news.ycombinator.com/item?id=18916406

bsaul 6 days ago

the reason he did that was for business reasons. He wanted to be able to expose any part of the stack as a public service to external customers, and vice versa, to let his internal service compete against the publicly available ones.

But this is only valid when you're trying to build AWS. Not everyone does that.

Relational databases have extensive permission systems for a reason.

cowsandmilk 6 days ago

The distributed computing manifesto for Amazon is from 1998. Jeff was not thinking about those things then. And wouldn’t for 5 more years.

int_19h 6 days ago

SQL is also an API.

dragonwriter 6 days ago

> Providing access to other services is what APIs are for.

Yeah, and databases expose extensive APIs with extremely battle-tested security and permissions models for multiple consumers.

bigfatkitten 6 days ago

That’s why your bank lets you connect directly to their database to see your transaction history.

YorickPeterse 7 days ago

It worked fine for GitLab when it had 2000+ employees and god knows how many customers. The same applies to many other large Rails shops.

sbarre 7 days ago

Do you have multiple separate apps that can change a shared DB schema?

How do you keep that all in sync across your apps?

scott_w 6 days ago

So usually one is the “main” but, personally speaking, I’d just say “don’t do it,” for the obvious issues I suspect you’re aware of ;-)

By “don’t do it,” I mean having multiple apps talk to one DB schema.

throwaway7783 6 days ago

"don't do it" is the right answer. Others have pointed it out as well; many large SaaS companies I've worked with have had apps owning their databases. Anyone else who needs anything - use APIs (and ETL if you need everything)

ecb_penguin 6 days ago

It's definitely not the right answer. It's actually the completely wrong answer.

Services are slow, restrictive, and don't enjoy the benefits of an actual DBMS, like transactions. You also add additional dependencies and failure points.

scott_w 5 days ago

This isn’t a discussion of whether you should use services or not, only what to do in the case where you do have separate services interacting with a schema. And, in this case, trying to maintain consistency with multiple apps accessing the DB directly can quickly turn into a nightmare. Better to maintain separate services interacting over some contract to insulate multiple codebases from the internals of each other.

ecb_penguin 3 days ago

> Better to maintain separate services interacting over some contract to insulate multiple codebases from the internals of each other.

All you've done is shifted the dependency and made extra work when schemas change. Now we have schema updates, _and_ service updates.

As I said, you give up transactions, performance, and flexibility. You also increase the workload, and increase failures. There is almost no good reason for what you're proposing.

throwaway7783 3 days ago

I am not sure I understand. The context was multiple applications managing/accessing the same database. No matter how you slice it, a schema update is accompanied by application updates. If anything, in a service oriented approach, you can potentially isolate the impact by versioning APIs and writing compatibility logic in a single place.

ecb_penguin 3 days ago

> a schema update is accompanied by application updates

1. DB -> App = Two updates, the schema and the app

2. DB -> Service -> App = Three updates, the schema, the service, and the app

In both cases, obviously the DB changes.

In both cases, obviously the App changes.

In the second case, you also get to update your services.

scott_w 2 days ago

No, it’s not obvious the DB changes because the team depending on that schema doesn’t necessarily talk to your team much, and you might not even know that team exists.

The same is true for API contracts but the culture of changing an API is much more understood as something that must be communicated and processes are more rehearsed.

ecb_penguin 2 days ago

> No, it’s not obvious the DB changes because the team depending on that schema doesn’t necessarily talk to your team much, and you might not even know that team exists.

Literally everything you just said applies to services and API contracts as well. You haven't solved anything.

> the culture of changing an API is much more understood as something that must be communicated and processes are more rehearsed

You're just making this up. There are thousands of incidents every day because someone changed an API without communicating to every stakeholder.

"Services are better than databases because service owners are better at communicating changes than database owners" isn't a very compelling argument.

I'd work to fix the communication problem before I adopted a terrible tech solution.

scott_w 2 days ago

>> the culture of changing an API is much more understood as something that must be communicated and processes are more rehearsed

> You're just making this up. There are thousands of incidents every day because someone changed an API without communicating to every stakeholder.

What you wrote isn't a contradiction of what I wrote. I wrote "more understood" not "perfectly solved problem."

> "Services are better than databases because service owners are better at communicating changes than database owners" isn't a very compelling argument.

Except they are because the tooling is better. I can check calls to an API endpoint using the same tooling that I would for checking end user API calls. I can push out comms to use the new API, I can confirm calls go to 0 before deleting the old endpoint.

For checking that a table or column is being accessed before changing / deleting it? Fuck, I don't actually know how I'd orchestrate that. And once I've made that change (vs when I make an API change), well, I hope I got it 100% correct because reverting that change is harder.

throwaway7783 2 days ago

It is more that services allow more backward compatibility than direct DB changes, because you can also code additional logic to maintain that.

Of course, in a narrow case where all apps that need the same DB can be upgraded/fixed in lockstep, maybe you can get away with directly updating the database. But with more than 2 teams, it will quickly become a nightmare of conflicting priorities across teams and the amount of deployment coordination required.

"Oh you cannot change this schema right now because 3 other apps depend on it and they dont have time to make the necessary changes in their app to accommodate this right now, for the next 6 months"
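As a sketch of what that compatibility logic can look like (hypothetical field names, not a real codebase): if a column gets renamed, the service keeps the old contract alive in one place while callers migrate on their own schedule. Direct-DB consumers would all break at once instead.

```python
# Hypothetical sketch: the `users` table renamed `fullname` -> `display_name`.
# A versioned service layer absorbs the rename in one place, so old callers
# keep working while new callers adopt the new field.

def row_from_db(user_id):
    # Stand-in for a real query against the *new* schema.
    return {"id": user_id, "display_name": "Ada Lovelace"}

def get_user_v1(user_id):
    """Old contract: still exposes `fullname` to legacy callers."""
    row = row_from_db(user_id)
    return {"id": row["id"], "fullname": row["display_name"]}

def get_user_v2(user_id):
    """New contract: exposes the renamed field directly."""
    return row_from_db(user_id)
```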

scott_w 2 days ago

Except if your schema changes in a way that forces your API contract to change, you have a big red alarm to tell you to investigate what’s using that contract and update it too.

When your apps connect to a DB schema, the change is silent. You just run ALTER TABLE and… whoops, you just brought down a service you didn’t know existed because the contract was implicit.

int_19h 6 days ago

In situations like these, the database admins are the ones responsible for the schema; the apps are mere users.

netghost 7 days ago

That seems like a really pragmatic tool, thanks for sharing it!

I'm curious, do you output triggers, store procedures, and such? Many tools seem to stop after you've defined tables, columns, and indices, but I'd love some better tooling to make use of the rest of the DB's features.

seveibar 7 days ago

Yep! It basically runs pg_dump and categorizes all of the output into different files so it should be comprehensive. I think there's `functions/function_name.sql`, `misc.sql`, `triggers.sql` etc.
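Roughly, the categorization works like this simplified sketch (not the actual pgstrap code): pg_dump labels every object with a comment header like `-- Name: users; Type: TABLE; Schema: public; Owner: -`, so the dump can be split into per-object buckets by scanning for those headers.

```python
import re

# Simplified illustration of splitting pg_dump output by object.
# pg_dump precedes each object with a header comment of the form:
#   -- Name: users; Type: TABLE; Schema: public; Owner: -
HEADER = re.compile(r"^-- Name: (?P<name>.+?); Type: (?P<type>.+?); Schema: (?P<schema>.+?);")

def categorize(dump_text):
    buckets = {}          # (schema, type, name) -> SQL text
    current = None
    for line in dump_text.splitlines():
        m = HEADER.match(line)
        if m:
            current = (m["schema"], m["type"], m["name"])
            buckets[current] = []
        elif current is not None:
            buckets[current].append(line)
    return {k: "\n".join(v).strip() for k, v in buckets.items()}

dump = """\
-- Name: users; Type: TABLE; Schema: public; Owner: -
CREATE TABLE public.users (id integer);

-- Name: touch_updated; Type: FUNCTION; Schema: public; Owner: -
CREATE FUNCTION public.touch_updated() RETURNS trigger AS $$ BEGIN RETURN NEW; END $$ LANGUAGE plpgsql;
"""
```

From there each bucket gets written to `schema/type/name.sql` (or lumped into `misc.sql`), which is what makes the output diff-friendly in code review.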

netghost 2 days ago

Thanks!

And curses. Now I'm going down another rabbit hole rehashing ways to interface with my database in a side project that is doomed to never get past the "there has to be a better way to…" phase ;)

what 6 days ago

You built it but you don’t know what it outputs?

zX41ZdbW 6 days ago

Wow, this is precisely how ClickHouse stores table metadata! A set of .sql files in the directories, corresponding to databases.

jsmith99 7 days ago

I just use a MCP server (with copilot or cline) that has a read only login to my database.

BoorishBears 7 days ago

Which is strictly worse than just giving the LLM access to the source of truth for the database.

You're adding a round trip to the database and the LLM and inserting a tool call in the conversation before it even starts generating any code.

And the reference Postgres MCP implementation, one of the most widely used packages, doesn't include Postgres types or materialized views: Zed.dev's MCP server, for example, is seemingly just a port of it and has the same problem.

fwip 7 days ago

MCP also gives the LLM access to your example data, which can add clarity beyond what your schema alone provides.

tempaccount420 7 days ago

I don't see how a round trip of <500ms, which is equivalent to maybe 50 tokens, is worse than including many thousands more extra tokens in the prompt, just in case they might be useful. Not to mention the context fatigue.

If designed well - by suspending generation in memory and inserting a <function_result>, without restarting generation and fetching cache from disk - the round trip/tool call is better (costs the equivalent of 50 tokens for waiting + function_result tokens).

BoorishBears 6 days ago

You're dealing with the full TTFT x2 + the tokens of all the prompts of all your MCPs before you even get to that round trip to the DB.

And you don't have to wonder about "if designed well": the reference implementation that's getting 20k downloads a week and getting embedded in downstream editors is not designed well, will make the round trip every time, and still won't give the LLM the full information about the table.

Most MCP implementations are crappy half-assed implementations in similar fashion because everyone was rushing to post how they added <insert DB/API/Data Source> to MCP.

And if you're worried about "context fatigue" (you mean LLMs getting distracted by relevant information...), you should 100% prefer a well known schema format to N MCP prompt definitions with tool usage instructions that weren't even necessarily tuned for the LLM in question.

LLMs are much more easily derailed by the addition of extra tools and having to reason about when to call them and the results of calling them, than they are a prompt caching friendly block of tokens with easy to follow meaning.

nsonha 7 days ago

The schema in the db should be the source of truth, and an MCP server like that is the most flexible; it can work with any ORM setup

wredcoll 7 days ago

What source of truth? If you have access to the database then you have the actual truth right there.

layoric 7 days ago

Out of interest.. does the resultant data get used by the LLM or just generating SQL, executing and returning separately?

maxluk 7 days ago

PM on the project here - The results from the query are generally not used by the LLM. In agent mode, though, during query planning, the agent may retrieve a sample of the data to improve the precision of the queries. For example, getting distinct values from a dimensional table to resolve a filter condition from a natural language statement.

layoric 7 days ago

Thanks. I worry about these kind of tools connecting to production databases.. Especially considering how easy it is to switch out LLM endpoints, where that data is going, how it is retained, the context etc becomes a bit of a privacy nightmare..

maxluk 7 days ago

Absolutely valid concern. Our extension connects to LLMs through GitHub Copilot. GitHub Copilot is a Microsoft product and offers a variety of enterprise plans, which enable your IT department to approve what can be used with what kind of data. This gives you a clear path towards compliance with your enterprise requirements.

layoric 7 days ago

Makes sense. Appreciate the responses. Honestly though, as a person outside the US, I'm removing my dependence on US company IT tools and infrastructure, GitHub, VSCode, AWS etc, enterprise or otherwise.. Congrats on the project though.