1
ipsum2 6 days ago

This uses GroundingDINO for open vocabulary, separate model. Useful nonetheless, but means you're running a lot of model inference for a single image.