Try this: https://github.com/luca-medeiros/lang-segment-anything
This uses GroundingDINO, a separate model, for the open-vocabulary part. Useful nonetheless, but it means you're running inference through more than one model for a single image.
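
To make the cost concrete, here is a rough sketch of the two-stage shape: a text-conditioned detector turns the prompt into boxes, then a box-prompted segmenter turns the boxes into masks. Everything in it (`fake_detector`, `fake_segmenter`, `text_to_masks`, `CallCounter`) is a hypothetical stand-in I made up for illustration, not the lang-segment-anything API, and the "models" are dummies that only count how often they get called.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive
Mask = List[List[bool]]


@dataclass
class CallCounter:
    """Tracks how many times each (stand-in) model is invoked."""
    detector_calls: int = 0
    segmenter_calls: int = 0


def fake_detector(image, prompt: str, counter: CallCounter) -> List[Box]:
    """Hypothetical stand-in for the open-vocabulary detector stage."""
    counter.detector_calls += 1
    # Pretend the text prompt matched two regions of the image.
    return [(0, 0, 2, 2), (2, 2, 4, 4)]


def fake_segmenter(image, boxes: List[Box], counter: CallCounter) -> List[Mask]:
    """Hypothetical stand-in for the box-prompted segmentation stage."""
    counter.segmenter_calls += 1
    height, width = len(image), len(image[0])
    # A dummy "mask" that simply fills each box.
    return [
        [[x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
         for y in range(height)]
        for (x0, y0, x1, y1) in boxes
    ]


def text_to_masks(image, prompt: str, counter: CallCounter):
    """Stage 1: text -> boxes. Stage 2: boxes -> masks."""
    boxes = fake_detector(image, prompt, counter)
    masks = fake_segmenter(image, boxes, counter)
    return boxes, masks


image = [[0] * 4 for _ in range(4)]  # dummy 4x4 "image"
counter = CallCounter()
boxes, masks = text_to_masks(image, "a dog", counter)
print(counter)
```

The point is only that every image goes through both stages, so you pay for two models rather than one.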