You can build a pipeline: GroundingDINO (text description -> object detection) -> SAM (segmentation) -> a Stable Diffusion inpainting model. I mostly work with real photos, so I like to start with realisticVisionV60B1_v51HyperVAE-inpainting and swap in something else for special use cases.
For higher quality at the cost of more VRAM, you can also use Flux.1 Fill for inpainting.
Lastly, Flux.1 Kontext [dev] is going to be released soon, and it promises to replace the entire flow with better prompt understanding. HN thread here: https://news.ycombinator.com/item?id=44128322
Thanks! I do use GroundingDINO + SAM2, but haven't tried realisticVisionV60B1_v51HyperVAE-inpainting; will do! And I'll try Flux Kontext too.