wodenokoto 6 days ago

I’m building a poor man’s data lake at work: basically putting Parquet files in blob storage using deltalake-rs’ Python bindings, with DuckDB for querying.
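For context, the write/read path looks roughly like this (a minimal sketch, not my exact code; the table URI, credentials and dataframe are placeholders, and I’m assuming Azure-style blob storage plus DuckDB’s ability to scan a PyArrow table by variable name):

    import duckdb
    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    TABLE_URI = "az://my-container/events"                    # placeholder path
    STORAGE = {"account_name": "...", "account_key": "..."}   # placeholder creds

    # Cloud function: append the latest API pull to the Delta table.
    batch = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
    write_deltalake(TABLE_URI, batch, mode="append", storage_options=STORAGE)

    # Querying: load the table into Arrow and let DuckDB scan it.
    events = DeltaTable(TABLE_URI, storage_options=STORAGE).to_pyarrow_table()
    print(duckdb.sql("SELECT count(*) FROM events").fetchall())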

However, I constantly run into problems with concurrent writes. I have a cloud function triggered every x minutes to pull data from an API, and that’s fine.

But if I need to run a backfill, I risk that process running at the same time as the timer-triggered function, especially if I load my backfill queue with hundreds of runs that need to be pulled and they start saturating the workers in the cloud function.
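Concretely, when two writers race, the loser’s attempt to commit the next transaction log entry fails, and right now I don’t handle that at all. What I’d probably need is a retry wrapper along these lines (a sketch only; I’m assuming recent deltalake versions surface the conflict as CommitFailedError — check your version):

    import random
    import time
    from deltalake import write_deltalake
    from deltalake.exceptions import CommitFailedError

    def append_with_retry(table_uri, batch, storage_options, attempts=5):
        # Optimistic concurrency: retry when another writer wins the commit race.
        for attempt in range(attempts):
            try:
                write_deltalake(table_uri, batch, mode="append",
                                storage_options=storage_options)
                return
            except CommitFailedError:
                # Exponential backoff with jitter so the timer job and the
                # backfill workers stop colliding on the same table version.
                time.sleep(2 ** attempt + random.random())
        raise RuntimeError(f"still conflicting after {attempts} attempts")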

isoprophlex 5 days ago

Add a randomly chosen suffix to your filenames?

wodenokoto 5 days ago

That doesn’t change the manifest, which keeps track of which rows are current and which are soft-deleted, and is what makes time travel possible.
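In other words, data files only “exist” to readers once the transaction log references them, and versioned reads hang off that same log. Roughly (a sketch with a placeholder table URI):

    from deltalake import DeltaTable

    dt = DeltaTable("az://my-container/events")   # placeholder URI
    print(dt.version())   # latest committed version in _delta_log
    print(dt.files())     # only the data files the log currently references
    old = DeltaTable("az://my-container/events", version=0)   # time travel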

ed_elliott_asc 5 days ago

Take a lease on the JSON file before you attempt the write, and queue writes that way.
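Something like this, using an Azure blob lease as a cross-process mutex (a sketch; the connection string, container and sentinel blob name are placeholders, and the Delta commit itself still happens inside the callback):

    from azure.storage.blob import BlobClient

    def with_commit_lease(conn_str, container, do_commit):
        # A lease on a sentinel blob acts as a mutex around the Delta commit.
        # The sentinel blob ("_delta_lock" here) must be created once up front.
        lock = BlobClient.from_connection_string(
            conn_str, container_name=container, blob_name="_delta_lock")
        lease = lock.acquire_lease(lease_duration=60)   # 15-60 seconds, or -1
        try:
            do_commit()   # e.g. write_deltalake(...) for one queued batch
        finally:
            lease.release()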

wodenokoto 5 days ago

What does the worker that tries to commit do when the JSON manifest is locked? Wait and try again?
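I.e. something like this on the committing side? (A sketch; I’m assuming a lease that’s already held surfaces as an HTTP 409 / HttpResponseError from the SDK.)

    import random
    import time
    from azure.core.exceptions import HttpResponseError

    def acquire_lease_with_backoff(lock_blob, attempts=10):
        # If another worker holds the lease, back off and try again.
        for attempt in range(attempts):
            try:
                return lock_blob.acquire_lease(lease_duration=60)
            except HttpResponseError:   # lease already present (409 Conflict)
                time.sleep(min(30, 2 ** attempt) + random.random())
        raise TimeoutError("could not acquire the commit lease")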