tyrauber 5 days ago

Hey, do yourself a favor and listen to the fun example:

> [S1] Oh fire! Oh my goodness! What's the procedure? What do we do people? The smoke could be coming through an air duct!

Seriously impressive. Wish I could direct link the audio.

Kudos to the Dia team.

jinay 5 days ago

For anyone who wants to listen, it's on this page: https://yummy-fir-7a4.notion.site/dia

mrandish 5 days ago

Wow. Thanks for posting the direct link to examples. Those sound incredibly good and would be impressive for a frontier lab. For two people over a few months, it's spectacular.

DoctorOW 5 days ago

A little overacted, it reminds me of the voice acting in those flash cartoons you'd see in the early days of YouTube. That's not to say it isn't good work, it still sounds remarkably human. Just silly humans :)

3by7 5 days ago

Overacted and silly humans indeed: https://www.youtube.com/watch?v=gO8N3L_aERg

Cthulhu_ 5 days ago

"flash cartoons in the early days of YouTube": wouldn't those be straight from Newgrounds?

DoctorOW 3 days ago

Thank you! I couldn't remember the name Newgrounds for some reason!!

selimthegrim 5 days ago

Reminded me of the Fenslerfilm G.I. Joe sketch where the kids have something on the stove burning

wisemang 5 days ago

Stop all the downloading!

dostick 5 days ago

This is an instant classic. Sesame comparison examples all sound like clueless rich people from The White Lotus.

intalentive 4 days ago

Sounds great. One of the female examples has convincing uptalk. There must be a way to manipulate the latent space to control uptalk, vocal fry, smoker’s voice, lispiness, etc.

toebee 5 days ago

Thank you!! Indeed, the script was inspired by a scene in The Office.

3abiton 5 days ago

This is oddly reminiscent of The Office. I wonder if TV shows were part of its training data!

nojs 5 days ago

This is so good. Reminds me of The Office. I love how bad the other examples are.

fwip 5 days ago

The text is lifted from a scene in The Office: https://youtu.be/gO8N3L_aERg?si=y7PggNrKlVQm0qyX&t=82

hombre_fatal 4 days ago

Yeah, that example is insane.

Is there some sort of system prompt or hint at how it should be voiced, or does it interpret it from the text?

Because it would be hilarious if it just derived it from the text and it did this sort of voice acting when you didn't want it to, like reading a matter-of-fact warning label.