I’m building a poor man’s datalake at work, basically putting Parquet files in blob storage using deltalake-rs’s Python bindings and DuckDB for querying.
However, I constantly run into problems with concurrent writes. I have a cloud function triggered every x minutes to pull data from an API, and that’s fine.
But if I need to run a backfill, I risk that process running at the same time as the timer-triggered function, especially if I load my backfill queue with hundreds of runs that need to be pulled and they start saturating the workers in the cloud function.
Add a randomly chosen suffix to your filenames?
That doesn’t change the manifest (the Delta transaction log), which keeps track of which rows are current and which are soft deleted, and is what makes time travel possible.
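A minimal sketch of that point, using the deltalake Python bindings: it’s the transaction log under `_delta_log/`, not the Parquet filenames, that determines which files belong to the current table version, so unique filenames alone don’t avoid the commit conflict. The table URI and storage options below are placeholders.

```python
from deltalake import DeltaTable

# Placeholder URI and auth; adjust for your storage account.
dt = DeltaTable("az://lake/events", storage_options={"account_name": "myaccount"})

print(dt.version())         # current table version; each version is one JSON commit in _delta_log/
print(dt.files())           # Parquet files the log currently considers part of the table
print(dt.history(limit=5))  # recent commits, which is what time travel reads from
```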
Take a lease on the JSON file before you attempt the write, and queue writes that way.
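A minimal sketch of what that could look like on Azure Blob Storage with the azure-storage-blob SDK, as an external lock around `write_deltalake` (delta-rs doesn’t do this for you). The container, lock-blob name, table URI, and account name are assumptions, and the zero-byte lock blob has to exist before a lease can be taken on it.

```python
import pyarrow as pa
from azure.storage.blob import BlobClient
from deltalake import write_deltalake

# Stand-in for the rows pulled from the API.
batch = pa.table({"id": [1, 2], "value": ["a", "b"]})

# Pre-created zero-byte blob used purely as a lock.
lock = BlobClient.from_connection_string(
    conn_str="...",  # your storage connection string
    container_name="lake",
    blob_name="locks/events.lock",
)

lease = lock.acquire_lease(lease_duration=60)  # exclusive for up to 60 seconds
try:
    write_deltalake(
        "az://lake/events",
        batch,
        mode="append",
        storage_options={"account_name": "myaccount"},
    )
finally:
    lease.release()  # let the next writer in
```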
What does the worker that tries to commit do when the JSON manifest is locked? Wait and try again?
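A sketch of the “wait and try again” option, reusing the `lock` BlobClient from the previous example; the attempt count and backoff numbers are arbitrary. If another writer holds the lease, `acquire_lease` fails with a 409, surfaced by the Azure SDK as an `HttpResponseError`.

```python
import random
import time

from azure.core.exceptions import HttpResponseError


def acquire_with_retry(lock_blob, attempts=10, base_delay=2.0):
    """Poll for the lease, backing off exponentially (capped) with jitter."""
    for attempt in range(attempts):
        try:
            return lock_blob.acquire_lease(lease_duration=60)
        except HttpResponseError:
            # Lease is held by another writer: wait and try again.
            time.sleep(min(base_delay * 2 ** attempt, 60) + random.random())
    raise TimeoutError("gave up waiting for the table lock")


# Usage: lease = acquire_with_retry(lock); write; then lease.release()
```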