Kokoro just proves my point; it's "one guy in a garage", 1000 hours of distilled audio (I think) and ~100M params.
With a budget one tenth that of Stable Diffusion and fewer ethical qualms, you could easily 10x or 100x this.
I'm actually surprised people aren't just using ElevenReader to generate solid content from various books for datasets lol