> This is indeed twice as fast as the vectorized implementation, but, disappointingly, the naive implementation with loops is even faster.
On CPU or GPU?
We are discussing NumPy here. It runs on the CPU and doesn't use the GPU.