DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models

Submitted to ICASSP 2025

Abstract

Conversational speech synthesis (CSS) aims to synthesize speech that is both contextually appropriate and expressive, and considerable effort has gone into improving the understanding of conversational context. However, existing CSS systems are limited to deterministic prediction, overlooking the diversity of potential responses. Moreover, they rarely employ language model (LM)-based TTS backbones, limiting the naturalness and quality of synthesized speech. To address these issues, we propose DiffCSS, a CSS framework that leverages diffusion models and an LM-based TTS backbone to generate diverse, expressive, and contextually coherent speech. A diffusion-based context-aware prosody predictor samples diverse prosody embeddings conditioned on multimodal conversational context. A prosody-controllable LM-based TTS backbone then synthesizes high-quality speech from the sampled prosody embeddings. Experimental results demonstrate that speech synthesized by DiffCSS is more diverse, contextually coherent, and expressive than that of existing CSS systems.


Fig. 1: Overall architecture of the proposed CSS framework

Expressiveness and contextual coherence

We compare the synthesized speech generated by the proposed model with baseline models to highlight improvements in expressiveness and contextual coherence.

Sample 1

Context

W: Umm I don’t want to.

M: Well, come and talk to me then.

W: Certainly not.

M: May I turn on the radio then?

Current utterance

W: Turn on the radio? What for?

Synthesized speech

GRU-based context modeling | DialogueGCN-based context modeling | Transformer encoder-based context modeling | Proposed
“Turn on the radio? What for?” | “Turn on the radio? What for?” | “Turn on the radio? What for?” | “Turn on the radio? What for?”

Sample 2

Context

M: Umm, where did you go yesterday?

W: I went to Croydon.

M: Did you go shopping?

W: No, I went for an interview.

Current utterance

M: Oh, did you get a job?

Synthesized speech

GRU-based context modeling | DialogueGCN-based context modeling | Transformer encoder-based context modeling | Proposed
“Oh, did you get a job?” | “Oh, did you get a job?” | “Oh, did you get a job?” | “Oh, did you get a job?”

Sample 3

Context

M: You have a pet lizard? Somehow I never would have imagined that.

W: His name is Grunt. Come closer and I’ll properly introduce you.

M: Does it bite or scratch?

W: No, he’s perfectly harmless. And he’s not afraid of strangers either. Here, hold him.

Current utterance

M: Wow. He’s heavy! And his skin feels really cool.

Synthesized speech

GRU-based context modeling | DialogueGCN-based context modeling | Transformer encoder-based context modeling | Proposed
“Wow. He’s heavy! And his skin feels really cool.” | “Wow. He’s heavy! And his skin feels really cool.” | “Wow. He’s heavy! And his skin feels really cool.” | “Wow. He’s heavy! And his skin feels really cool.”

Sample 4

Context

W: That sounds good. Let me see one.

M: Here’s the latest model — Digital Barbie.

W: Oh, she’s nice. How much is she?

M: Why, she’s only twenty nine ninety five dollars.

Current utterance

W: Well, that’s reasonable. I’ll take it.

Synthesized speech

GRU-based context modeling | DialogueGCN-based context modeling | Transformer encoder-based context modeling | Proposed
“Well, that’s reasonable. I’ll take it.” | “Well, that’s reasonable. I’ll take it.” | “Well, that’s reasonable. I’ll take it.” | “Well, that’s reasonable. I’ll take it.”

Sample 5

Context

W: No, that won’t do. I’ll take this smoked ham you have here.

M: OK, umm is there anything else?

W: Is this salami and bologna you have here?

M: Yes! It’s very fine meat! Made it myself…

Current utterance

W: Sounds good. OK, that’s it.

Synthesized speech

GRU-based context modeling | DialogueGCN-based context modeling | Transformer encoder-based context modeling | Proposed
“Sounds good. OK, that’s it.” | “Sounds good. OK, that’s it.” | “Sounds good. OK, that’s it.” | “Sounds good. OK, that’s it.”

Sample 6

Context

W: Hey, look out!

M: What happened?

W: You’ve just scratched my car. Oh, God, the paint was scratched off.

M: Where? My car?

Current utterance

W: No, mine!

Synthesized speech

GRU-based context modeling | DialogueGCN-based context modeling | Transformer encoder-based context modeling | Proposed
“No, mine!” | “No, mine!” | “No, mine!” | “No, mine!”

Diversity

We demonstrate the diversity of the proposed method by sampling prosody embeddings multiple times from the same conversational context. The predicted prosody varies across different runs, resulting in diverse synthesized speech. At the same time, the predicted prosody remains consistent with the conversational context while maintaining strong expressiveness.
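The idea of drawing several prosody embeddings for one fixed context can be illustrated with a minimal sketch. It is not the DiffCSS predictor: it runs generic DDPM-style ancestral sampling in NumPy, and the names `toy_noise_predictor` and `sample_prosody`, the noise schedule, the embedding size, and the closed-form Gaussian denoiser standing in for a trained, context-conditioned network are all assumptions made for illustration.

```python
import numpy as np

def make_schedule(num_steps=200, beta_start=1e-4, beta_end=0.05):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_noise_predictor(x_t, t, context, alpha_bars, spread=0.3):
    """Hypothetical stand-in for a trained denoiser.

    It pretends clean embeddings follow N(tanh(context), spread^2 I) and
    returns the expected noise given x_t under that assumption.
    """
    mu = np.tanh(context)
    ab = alpha_bars[t]
    return np.sqrt(1.0 - ab) * (x_t - np.sqrt(ab) * mu) / (ab * spread**2 + 1.0 - ab)

def sample_prosody(context, seed, num_steps=200):
    """Draw one embedding by DDPM-style ancestral sampling."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(context.shape)  # start from Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        eps = toy_noise_predictor(x, t, context, alpha_bars)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

if __name__ == "__main__":
    context = np.linspace(-1.0, 1.0, 8)  # placeholder context vector
    for run in range(3):                 # three runs, same context
        print(f"Run {run + 1}:", np.round(sample_prosody(context, seed=run), 2))
```

In this sketch the context is held fixed and only the random seed changes, so each run yields a different embedding while all runs are driven toward the same context-dependent target.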

Sample 1

Context

W: Any day except Tuesday.

M: How about Thursday?

W: Yes, Thursday would be fine. What time shall I come?

M: Oh, about six. Will that be OK?

Current utterance

W: Yes, of course. Thank you very much.

Transcript Run 1 Run 2 Run 3
‘Yes, of course. Thank you very much.’

Sample 2

Context

W: What’s good today?

M: Umm the salmon is good today, it’s very fresh.

W: How is it done?

M: It’s cooked with lemon and savored with rice.

Current utterance

W: Sounds nice, I will try it.

Transcript Run 1 Run 2 Run 3
‘Sounds nice, I will try it.’

Sample 3

Context

M: Hi Melissa, are you going home this weekend?

W: No, not this weekend. I have too much work to do.

M: Where do your parents live?

W: My father lives in Washington DC.

Current utterance

W: How about your mother?

Transcript Run 1 Run 2 Run 3
‘How about your mother?’

Sample 4

Context

W: With pleasure. What color do you like?

M: I like yellow best. How much does it cost?

W: It costs two-seventy-five yuan, Mr.

M: It’s nice, but that’s very steep for a rain coat. Could you give me a twenty percent discount?

Current utterance

W: Sorry, we don’t give discounts.

Transcript Run 1 Run 2 Run 3
‘Sorry, we don’t give discounts.’