Hey HN! We’re Toby and Jay, creators of Dia. Dia is a 1.6B-parameter open-weights model that generates dialogue directly from a transcript.
Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.
It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.
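For anyone curious what driving the model looks like, here's a minimal sketch. The `to_dia_transcript` helper is purely illustrative (not part of the project), and the commented-out model call reflects the Python API in the repo README as I understand it (`Dia.from_pretrained` / `generate`); check the repo for the exact names.

```python
def to_dia_transcript(turns):
    """Format alternating dialogue turns with Dia-style [S1]/[S2] speaker tags."""
    return " ".join(f"[S{(i % 2) + 1}] {text}" for i, text in enumerate(turns))

script = to_dia_transcript([
    "Dia generates the whole conversation in one pass.",
    "So the turns actually sound like they belong together?",
    "Exactly.",
])

# Sketch of the model call itself (API names from the README, may drift):
# from dia.model import Dia
# model = Dia.from_pretrained("nari-labs/Dia-1.6B")
# audio = model.generate(script)  # an audio prompt can condition voice/emotion
```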
Demo page comparing it to ElevenLabs and Sesame-1B: https://yummy-fir-7a4.notion.site/dia
We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast feel with APIs, but the results did not sound like human conversation.
So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch, from large-scale training to audio tokenization. It took us a bit over 3 months.
Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.
We’d love to hear what you think! We are a tiny team, so open source contributions are extra welcome. Please feel free to check out the code, and share any thoughts or suggestions with us.
I know it’s taboo to ask, but I must: where’s the dataset from? Very eager to play around with audio models myself, but I find existing datasets limiting
Why would that be a taboo question to ask? It should be the question we always ask when presented with a model, and in some cases we should probably reject the model based on that information.
Because generally the person asking this question is trying to cancel the model maker
Or, by replying, you expose yourself to handing *proof* of the origins of the training dataset to the copyright owner who wants to sue you next.
Well, presumably, since they're individuals and not a business, the legal consequences are much less severe. Public opinion still won't be great, but since when was it, for any new thing?
If I cut up a song or TV show and put it on YouTube (and screech about fair use/parody law), then that's fine, but people will balk at something like this.
AI is here, people.
No. It's for giving credit where credit is due. And yes, that includes the question if the people who generated the training data in the first place have given their consent that this can be used for AI training.
It's quite concerning that the community around here is usually livid about FOSS license violations, which typically use copyright law as leverage, but somehow is perfectly OK with training models on copyrighted work and just labels that as "fair use".
What AI tools have you used recently? Have you verified if they all use models trained on copyrighted material with permission?
Ah, that's a classic. "How can you criticize Big Oil and at the same time drive a car!" and voila, the case is closed.
I am allowed to criticize things without having to live like a hermit. I make moderate use of ChatGPT, yet at the same time I think that its training does not fall under fair use, and that creators should get compensated. If OpenAI's business model does not allow for this, then it should fail, and that's fine by me. I lived without ChatGPT, and I can live without it again.
I suspect podcasts, as you have a huge amount of transcribed data with good diction and mic quality. The voices sound like podcast voices to me.
Amazing that you developed this over the course of three months! Can you drop any insight into how you pulled together the audio data?
+1 to this. Amazing that you managed to deliver this, and if you're willing to share, I'd be most interested in learning what you did in terms of training data!
Could one use case be generating an audiobook from existing books with this? I wonder if I could fine-tune the "characters" that speak these lines, since you said it generates the whole convo in a single pass. I wonder if that's a limitation for this kind of use case (where speed is not imperative).
Yes! But you would need to put together an LLM system that creates scripts from the book content. There is an open source project called OpenNotebookLM (https://github.com/gabrielchua/open-notebooklm) that does something similar. If you hook the Dia model up to that kind of system, it should be very possible :) Thanks for the interest!
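A rough sketch of that glue, assuming nothing beyond "an LLM rewrites prose into a two-speaker script, then Dia voices it." `build_script_prompt` is a hypothetical helper of mine; the LLM call and Dia call in the comments are placeholders, not real APIs.

```python
def build_script_prompt(book_excerpt: str) -> str:
    """Build a prompt asking an LLM to rewrite prose as a [S1]/[S2] dialogue."""
    return (
        "Rewrite the following passage as a natural two-person conversation. "
        "Prefix each turn with [S1] or [S2] and keep turns short:\n\n"
        + book_excerpt
    )

# script = call_your_llm(build_script_prompt(chapter))  # any chat-completion API
# audio  = dia_model.generate(script)                   # see the nari-labs/dia README
```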
Another project, specifically for creating audiobooks: https://github.com/prakharsr/audiobook-creator
Hi! This is awesome for the size and quality. I'd like to see a book-reading example, or to try it myself.
This is a tangent, but it would have been nicer if it weren't a Notion site. You could put the same page on GitHub Pages and it would be much lighter to open, navigate, and link (e.g., for people trying to link to a specific audio sample).
Thanks for the kind words! You can try it now on https://huggingface.co/spaces/nari-labs/Dia-1.6B Also, we'll try to update the Demo Page to something lighter when we have time. Thanks for the feedback :))
This is super awesome. Several questions.
1. What GPU did you use to train the model? I'd love to train a model like this, but currently I only have a 16GB MacBook. Thinking about buying a 5090 if it's worth it.
2. Is it possible to use this for real time audio generation, similar to the demo on the Sesame website?
It's really amazing, can't wait to play with it some. The samples are great... but oddly they all seem... really fast? Like they'd be perfect, but they feel like they're playing at 1.2x speed. Or is that just me?
It’s not just you. The speedup is an artefact of the CFG (classifier-free guidance) the model uses. The other problem is that the speedup isn’t constant: it actually accelerates as the generation progresses. The Parakeet paper [1] (from which OP lifted their model architecture almost directly [2]) gives a fairly robust treatment of the matter:
> When we apply CFG to Parakeet sampling, quality is significantly improved. However, on inspecting generations, there tends to be a dramatic speed-up over the duration of the sample (i.e. the rate of speaking increases significantly over time). Our intuition for this problem is as follows: Say that our model is (at some level) predicting phonemes and the ground truth distribution for the next phoneme occurring is 25% at a given timestep. Our conditional model may predict 20%, but because our unconditional model cannot see the text transcription, its prediction for the correct next phoneme will be much lower, say 5%. With a reasonable level of CFG, because [the logit delta] will be large for the correct next phoneme, we’ll obtain a much higher final probability, say 50%, which biases our generation towards faster speech. [emphasis mine]
Parakeet details a solution to this, though it has not (yet?) been adopted by Dia:
> To address this, we introduce CFG-filter, a modification to CFG that mitigates the speed drift. The idea is to first apply the CFG calculation to obtain a new set of logits as before, but rather than use these logits to sample, we use these logits to obtain a top-k mask to apply to our original conditional logits. Intuitively, this serves to constrict the space of possible “phonemes” to text-aligned phonemes without heavily biasing the relative probabilities of these phonemes (or for example, start next word vs pause more). [emphasis mine]
The paper contains audio samples with ablations you can listen to.
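To make the quoted CFG-filter idea concrete, here's a toy single-token sketch of my own (real implementations operate on batched tensors, and the exact top-k value is a tuning knob): the guided logits are used only to pick a top-k mask, and sampling happens over the conditional logits restricted to that mask.

```python
import math
import random

def cfg_filter_sample(cond_logits, uncond_logits, guidance_scale=3.0, top_k=50, rng=None):
    """Sample a token index using the CFG-filter trick from the Parakeet writeup.

    Plain CFG samples directly from guided[i] = uncond[i] + scale * (cond[i] - uncond[i]),
    which biases generation toward faster speech. CFG-filter instead uses the guided
    logits only to choose a top-k mask, then samples from the *conditional* logits
    restricted to that mask.
    """
    rng = rng or random.Random()
    guided = [u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
    # Indices of the top-k guided logits; only these tokens remain sampleable.
    keep = sorted(range(len(guided)), key=lambda i: guided[i], reverse=True)[:top_k]
    # Softmax over the conditional logits, restricted to the kept indices.
    m = max(cond_logits[i] for i in keep)
    weights = {i: math.exp(cond_logits[i] - m) for i in keep}
    total = sum(weights.values())
    r = rng.random() * total
    for i, w in weights.items():
        r -= w
        if r <= 0:
            return i
    return keep[-1]
```

Note the relative probabilities among the surviving tokens come from the conditional model alone, which is exactly the "don't heavily bias the relative probabilities" property the quote describes.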
[1]: https://jordandarefsky.com/blog/2024/parakeet/#classifier-fr...
In terms of guiding voice and expression, audio prompts are promising, but I believe text instructions would serve different experiences. Will there be support for that as well?
Easily 10 times better than the recent OpenAI voice model. I don't like robotic voices.
The example voices seem overly loud and overexcited, like Andrew Tate, Speed, or an advertisement. They're lacking calm, normal conversation or normal podcast-like interaction.
Thank you! You can add audio prompts of calm voices to make them a bit smoother. You can try it here: https://huggingface.co/spaces/nari-labs/Dia-1.6B
Are there any examples of the audio differences between this and the larger model?