Neat. Wonder how this compares to Segment Anything (SAM), which also does zero-shot segmentation and performs pretty well in my experience.
YOLO is way faster. We used to run both, with YOLO finding candidate bounding boxes and SAM segmenting just those.
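Rough sketch of what that cascade looks like in code. The model calls are stand-in functions (any detector/segmenter with this shape would do, not real YOLO/SAM inference); the point is that the heavy segmenter only runs on the handful of boxes the fast detector proposes, never on the whole frame:

```python
# Detect-then-segment cascade: fast detector proposes boxes,
# expensive segmenter runs only on those boxes.

def yolo_detect(image):
    """Stand-in for a YOLO forward pass: returns candidate boxes (x1, y1, x2, y2)."""
    return [(10, 10, 50, 50), (60, 20, 90, 80)]

def sam_segment(image, box):
    """Stand-in for SAM prompted with a box: returns a mask for that region only."""
    x1, y1, x2, y2 = box
    return {"box": box, "area": (x2 - x1) * (y2 - y1)}

def detect_then_segment(image):
    boxes = yolo_detect(image)                      # one fast pass over the full frame
    return [sam_segment(image, b) for b in boxes]   # heavy model only per candidate box

masks = detect_then_segment(image=None)
print(len(masks))  # one mask per candidate box
```

With real models you'd swap in e.g. an Ultralytics YOLO `predict` call and a box-prompted SAM predictor, but the control flow stays exactly this.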
For what it's worth, YOLO has been a standard in image processing for ages at this point, with dozens of variations on the algorithm (YOLOv3, YOLOv5, YOLOv6, etc.), and this is yet another new one. Looks great though.
SAM wouldn't run under 1000ms per frame for most reasonable image sizes.
Just as a quick demo, here is an example of YOLO-World combined with EfficientSAM: https://youtu.be/X7gKBGVz4vs?t=980
We used MobileSAM because of this; it was about 250ms on CPU. Useful for our use case.
SAM doesn't do open vocabulary, i.e. it segments things without knowing the name of the object, so you can't ask it to "highlight the grapes"; you have to give it an example of a grape (a point or box prompt) first.
This uses GroundingDINO, a separate model, for the open-vocabulary part. Useful nonetheless, but it means you're running a lot of model inference for a single image.
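To make the "lot of inference" point concrete, here's a sketch of the open-vocabulary pipeline shape (stand-in functions again, not the real GroundingDINO/SAM APIs): one grounding pass to turn the text query into boxes, then one SAM pass per box.

```python
# Open-vocabulary segmentation as a two-model pipeline:
# text query -> grounding model -> boxes -> box-prompted SAM -> masks.

def grounding_detect(image, text):
    """Stand-in for GroundingDINO: text query -> boxes for matching objects."""
    return [(5, 5, 25, 25), (40, 40, 70, 70)] if text == "grape" else []

def sam_segment(image, box):
    """Stand-in for box-prompted SAM."""
    return {"box": box}

def open_vocab_segment(image, text):
    inference_calls = 1                  # one grounding forward pass
    boxes = grounding_detect(image, text)
    masks = []
    for b in boxes:
        masks.append(sam_segment(image, b))
        inference_calls += 1             # plus one SAM pass per detected box
    return masks, inference_calls
```

So "highlight the grapes" on an image with N grape detections costs 1 + N forward passes across two separate models, which is where the per-image latency adds up.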