I'm struggling to see how this could give decent compression performance.
Compression works by exploiting patterns or tendencies in the input, and one of the biggest and simplest patterns in video data is that pixel changes tend to be localised: a person waving their arm against a static background changes a clump of pixels in one small region only. By hashing positions, you immediately throw away this locality information -- under the proposed scheme, changing 400 pixels confined to a 20x20 rectangle takes almost exactly as much space as changing 400 pixels scattered randomly around the image, whereas a more efficient representation could record just the top-left coords, width and height of the box, saving about 398 copies of pixel coordinates. And yes, this benefit remains even if we don't record the coords explicitly but instead discover them by querying a Bloom filter.
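To make the arithmetic concrete, here's a back-of-the-envelope sketch in Python (the 16-bits-per-coordinate figure is my own illustrative assumption, not from any particular codec):

```python
# Rough cost comparison for recording the positions of 400 changed
# pixels, ignoring the pixel values themselves. Assumes 16-bit
# coordinates (enough for one axis of a 4K frame) -- an illustrative
# figure only.

BITS_PER_COORD = 16

# Scheme A: one (x, y) pair per changed pixel -- what a scheme that
# keeps no locality information effectively pays for.
scattered_bits = 400 * 2 * BITS_PER_COORD  # 12800 bits

# Scheme B: the changes are confined to a 20x20 box, so record the
# top-left corner plus width and height once.
box_bits = 4 * BITS_PER_COORD              # 64 bits

print(scattered_bits, box_bits, scattered_bits / box_bits)
# 12800 64 200.0 -- a 200x difference in positional overhead
```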
And that's to say nothing of more advanced video compression techniques like motion compensation (make the rectangle here look like the rectangle there from the previous frame), which is an absolute staple of every practical codec.
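For anyone unfamiliar with the idea, here's a toy Python sketch of the block-copy at the heart of motion compensation (frame sizes, positions and function names are made up for illustration; a real encoder searches many candidate source positions and entropy-codes the vector plus a residual):

```python
import numpy as np

def predict_block(prev_frame, src_xy, block_size):
    """Return the block of the previous frame that a motion vector
    points at. A real encoder searches many candidate positions and
    keeps the one minimising the residual."""
    sx, sy = src_xy
    w, h = block_size
    return prev_frame[sy:sy + h, sx:sx + w]

# Toy usage: the whole frame shifts by (+2, +2) between frames, so a
# 16x16 block at (48, 32) in the current frame is perfectly predicted
# from (46, 30) in the previous one. The encoder then stores just the
# vector (2, 2) and a near-zero residual instead of raw pixels.
prev = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
curr = np.roll(prev, shift=(2, 2), axis=(0, 1))

prediction = predict_block(prev, (46, 30), (16, 16))
actual = curr[32:48, 48:64]
residual = actual.astype(np.int16) - prediction
print(np.abs(residual).max())  # 0 here: the prediction is exact
```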
I think you meant video decompression? Compression should theoretically be much more efficient, since the algorithm is far simpler and you could probably use SIMD heavily to compute the filters themselves.
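To illustrate why the per-pixel work is so cheap, here's a minimal Python sketch of the compression side under my reading of OP's scheme -- hashing changed-pixel positions into a Bloom filter. The hash construction, mixing constants and k=4 are my own illustrative choices, not anything from OP's post:

```python
# Each changed pixel costs a couple of integer hashes and a few
# bit-sets -- exactly the kind of branch-light inner loop that
# vectorises well.

def _hashes(x: int, y: int, m: int, k: int):
    h1 = (x * 0x9E3779B1 ^ y) & 0xFFFFFFFF
    h2 = ((y * 0x85EBCA77 ^ x) | 1) & 0xFFFFFFFF  # odd stride
    return [(h1 + i * h2) % m for i in range(k)]

def bloom_insert(bits: bytearray, x: int, y: int) -> None:
    for idx in _hashes(x, y, len(bits) * 8, 4):
        bits[idx // 8] |= 1 << (idx % 8)

def bloom_contains(bits: bytearray, x: int, y: int) -> bool:
    return all(bits[idx // 8] >> (idx % 8) & 1
               for idx in _hashes(x, y, len(bits) * 8, 4))

filt = bytearray(1024)  # 8192-bit filter, size chosen arbitrarily
bloom_insert(filt, 10, 20)
print(bloom_contains(filt, 10, 20), bloom_contains(filt, 11, 20))
# Expected: True False (the second could in principle be a false positive)
```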
In compression, "efficiency" usually means "compression ratio".
If by "efficiency" you mean "speed", then yes, I think OP's approach can be much, much faster than conventional video compression approaches -- but this is true of all compression algorithms that don't compress very well. The fastest such algorithm -- leaving the input unchanged -- is infinitely fast, but achieves no compression at all.