Abstract
Conversational speech synthesis (CSS) aims to synthesize speech that is both contextually appropriate and expressive, and considerable efforts have been made to enhance the understanding of conversational context. However, existing CSS systems are limited to deterministic prediction, overlooking the diversity of potential responses. Moreover, they rarely employ language model (LM)-based TTS backbones, limiting the naturalness and quality of synthesized speech. To address these issues, we propose DiffCSS, an innovative CSS framework that leverages diffusion models and an LM-based TTS backbone to generate diverse, expressive, and contextually coherent speech. A diffusion-based context-aware prosody predictor is proposed to sample diverse prosody embeddings conditioned on multimodal conversational context. Then a prosody-controllable LM-based TTS backbone is developed to synthesize high-quality speech with sampled prosody embeddings. Experimental results demonstrate that the synthesized speech from DiffCSS is more diverse, contextually coherent, and expressive than that of existing CSS systems.

Expressiveness and contextual coherence
We compare speech synthesized by the proposed model with speech from the baseline models to highlight improvements in expressiveness and contextual coherence.
Sample 1
Context
W: Umm I don’t want to.
M: Well, come and talk to me then.
W: Certainly not.
M: May I turn on the radio then?
Current utterance
W: Turn on the radio? What for?
Synthesized speech
GRU-based context modeling | DialogueGCN-based context modeling | Transformer encoder-based context modeling | Proposed |
---|---|---|---|
“Turn on the radio? What for?” | “Turn on the radio? What for?” | “Turn on the radio? What for?” | “Turn on the radio? What for?” |
Sample 2
Context
M: Umm, where did you go yesterday?
W: I went to Croydon.
M: Did you go shopping?
W: No, I went for an interview.
Current utterance
M: Oh, did you get a job?
Synthesized speech
GRU-based context modeling | DialogueGCN-based context modeling | Transformer encoder-based context modeling | Proposed |
---|---|---|---|
“Oh, did you get a job?” | “Oh, did you get a job?” | “Oh, did you get a job?” | “Oh, did you get a job?” |
Sample 3
Context
M: You have a pet lizard? Somehow I never would have imagined that.
W: His name is Grunt. Come closer and I’ll properly introduce you.
M: Does it bite or scratch?
W: No, he’s perfectly harmless. And he’s not afraid of strangers either. Here, hold him.
Current utterance
M: Wow. He’s heavy! And his skin feels really cool.
Synthesized speech
GRU-based context modeling | DialogueGCN-based context modeling | Transformer encoder-based context modeling | Proposed |
---|---|---|---|
“Wow. He’s heavy! And his skin feels really cool.” | “Wow. He’s heavy! And his skin feels really cool.” | “Wow. He’s heavy! And his skin feels really cool.” | “Wow. He’s heavy! And his skin feels really cool.” |
Sample 4
Context
W: That sounds good. Let me see one.
M: Here’s the latest model — Digital Barbie.
W: Oh, she’s nice. How much is she?
M: Why, she’s only twenty nine ninety five dollars.
Current utterance
W: Well, that’s reasonable. I’ll take it.
Synthesized speech
GRU-based context modeling | DialogueGCN-based context modeling | Transformer encoder-based context modeling | Proposed |
---|---|---|---|
“Well, that’s reasonable. I’ll take it.” | “Well, that’s reasonable. I’ll take it.” | “Well, that’s reasonable. I’ll take it.” | “Well, that’s reasonable. I’ll take it.” |
Sample 5
Context
W: No, that won’t do. I’ll take this smoked ham you have here.
M: OK, umm is there anything else?
W: Is this salami and bologna you have here?
M: Yes! It’s very fine meat! Made it myself…
Current utterance
W: Sounds good. OK, that’s it.
Synthesized speech
GRU-based context modeling | DialogueGCN-based context modeling | Transformer encoder-based context modeling | Proposed |
---|---|---|---|
“Sounds good. OK, that’s it.” | “Sounds good. OK, that’s it.” | “Sounds good. OK, that’s it.” | “Sounds good. OK, that’s it.” |
Sample 6
Context
W: Hey, look out!
M: What happened?
W: You’ve just scratched my car. Oh, God, a paint was scratched off.
M: Where? My car?
Current utterance
W: No, mine!
Synthesized speech
GRU-based context modeling | DialogueGCN-based context modeling | Transformer encoder-based context modeling | Proposed |
---|---|---|---|
“No, mine!” | “No, mine!” | “No, mine!” | “No, mine!” |
Diversity
We demonstrate the diversity of the proposed method by sampling prosody embeddings multiple times from the same conversational context. The predicted prosody varies across different runs, resulting in diverse synthesized speech. At the same time, the predicted prosody remains consistent with the conversational context while maintaining strong expressiveness.
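To make the idea of repeated sampling concrete, here is a minimal sketch of how a conditional diffusion sampler can return a different embedding on each run for the same context. It is not the DiffCSS implementation: the function names, the 4-dimensional embedding, the noise schedule, and the analytic stand-in for the learned noise predictor (which assumes the target embeddings are Gaussian around a hypothetical context-dependent mean) are all assumptions chosen for illustration.

```python
import numpy as np

def make_schedule(num_steps=200, beta_start=1e-4, beta_end=0.05):
    """Linear noise schedule (assumed values, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_noise_predictor(x_t, t, context_mean, alpha_bars, data_std=0.5):
    """Stand-in for a trained, context-conditioned denoiser.

    Assumes clean embeddings follow N(context_mean, data_std^2 I), for which
    the expected noise given x_t has this closed form.
    """
    ab = alpha_bars[t]
    var_t = ab * data_std**2 + (1.0 - ab)
    return np.sqrt(1.0 - ab) * (x_t - np.sqrt(ab) * context_mean) / var_t

def sample_prosody_embedding(context_mean, rng, num_steps=200):
    """DDPM-style ancestral sampling, starting from Gaussian noise."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(context_mean.shape)
    for t in reversed(range(num_steps)):
        eps_hat = toy_noise_predictor(x, t, context_mean, alpha_bars)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Fresh noise at every step except the last makes runs differ.
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

# Hypothetical context encoding: the same context is reused for every run.
context_mean = np.array([1.0, -0.5, 0.25, 0.0])
runs = [
    sample_prosody_embedding(context_mean, np.random.default_rng(seed))
    for seed in (0, 1, 2)
]
for i, emb in enumerate(runs, start=1):
    print(f"Run {i}:", np.round(emb, 3))
```

Each run starts from different initial noise and injects different noise during the reverse process, so the three embeddings differ while all being drawn from the same context-conditioned distribution. A deterministic predictor would instead return the same output for every run.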
Sample 1
Context
W: Any day except Tuesday.
M: How about Thursday?
W: Yes, Thursday would be fine. What time shall I come?
M: Oh, about six. Will that be OK?
Current utterance
W: Yes, of course. Thank you very much.
Transcript | Run 1 | Run 2 | Run 3 |
---|---|---|---|
‘Yes, of course. Thank you very much.’ |
Sample 2
Context
W: What’s good today?
M: Umm the salmon is good today, it’s very fresh.
W: How is it done?
M: It’s cooked with lemon and savored with rice.
Current utterance
W: Sounds nice, I will try it.
Transcript | Run 1 | Run 2 | Run 3 |
---|---|---|---|
‘Sounds nice, I will try it.’ |
Sample 3
Context
M: Hi Melissa, are you going home this weekend?
W: No, not this weekend. I have too much work to do.
M: Where do your parents live?
W: My father lives in Washington DC.
Current utterance
W: How about your mother?
Transcript | Run 1 | Run 2 | Run 3 |
---|---|---|---|
‘How about your mother?’ |
Sample 4
Context
W: With pleasure. What color do you like?
M: I like yellow best. How much does it cost?
W: It costs two-seventy-five yuan, Mr.
M: It’s nice, but that’s very steep for a rain coat. Could you give me a twenty percent discount?
Current utterance
W: Sorry, we don’t give discounts.
Transcript | Run 1 | Run 2 | Run 3 |
---|---|---|---|
‘Sorry, we don’t give discounts.’ |