For all of these reasons, audio snooping is much more likely to be done by wired, stationary devices that have a decent amount of RAM and a fair bit of usually-idle processing capacity (enough to run the transcription model locally and just push the resulting text), and that are expected to draw a decent amount of power and use the Internet at vaguely arbitrary times.
Like a smart TV, for example.
First thing I do is disable that feature on every TV I buy.
Second thing I do is block the TV's access to the internet after I do one firmware update.