SAM doesn't do open vocabulary i.e. it segments things without knowing the name of the object, so you can't ask it to do "highlight the grapes", you have to give it an example of a grape first.
This uses GroundingDINO for open vocabulary, separate model. Useful nonetheless, but means you're running a lot of model inference for a single image.