Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning

0. Contents

  1. Abstract
  2. Demos - multilingual expressive speech with a specific style and emotion for target speakers


1. Abstract

This paper aims to build a Multi-speaker, Multi-emotion, Multi-style, Multilingual (M4) TTS system. We address this challenging task with a semi-supervised, contrastive learning-based TTS framework. To effectively capture and disentangle speaker, style, and emotion representations, contrastive learning at two levels, the utterance level and the category level, is leveraged to extract the desired representations from speech. Furthermore, a semi-supervised training strategy is introduced to improve data utilization by involving multi-domain data, including style-labeled data, emotion-labeled data, and abundant unlabeled data. To synthesize expressive speech with diverse styles and emotions for a target speaker, the learned disentangled representations are integrated into an improved VITS model. Experiments on monolingual and multilingual corpora demonstrate the effectiveness of the proposed method.
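As a rough illustration of the two contrastive levels and of how unlabeled data can take part in training, here is a minimal NumPy sketch. It is not the paper's implementation: the exact losses are not stated on this page, so the sketch assumes an InfoNCE-style loss for the utterance level and a supervised contrastive loss for the category level. The function names, the temperature of 0.1, and the convention of marking unlabeled items with a label of -1 are all assumptions made for the example.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def _log_softmax_rows(logits, mask):
    """Row-wise log-softmax whose denominator only covers entries where mask is True."""
    masked = np.where(mask, logits, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    return logits - log_denom


def utterance_contrastive_loss(view_a, view_b, temperature=0.1):
    """Assumed utterance-level loss (InfoNCE): row i of view_a should match row i of view_b."""
    a, b = l2_normalize(view_a), l2_normalize(view_b)
    logits = a @ b.T / temperature
    log_prob = _log_softmax_rows(logits, np.ones_like(logits, dtype=bool))
    return float(-np.mean(np.diag(log_prob)))


def category_contrastive_loss(emb, labels, temperature=0.1):
    """Assumed category-level loss (supervised contrastive).

    Items that share a label are pulled together. Items with a negative label
    are treated as unlabeled: they are never anchors or positives, but they
    still appear in the softmax denominator as negatives.
    """
    labels = np.asarray(labels)
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    labeled = labels >= 0
    positives = (
        (labels[:, None] == labels[None, :])
        & not_self
        & labeled[:, None]
        & labeled[None, :]
    )
    n_pos = positives.sum(axis=1)
    anchors = n_pos > 0
    if not anchors.any():
        return 0.0  # no labeled pair in this batch, so nothing to contrast
    z = l2_normalize(emb)
    log_prob = _log_softmax_rows(z @ z.T / temperature, not_self)
    pos_sum = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(np.mean(-pos_sum[anchors] / n_pos[anchors]))
```

In a semi-supervised batch, every utterance could contribute to the utterance-level term, while only the style-labeled or emotion-labeled utterances would contribute anchors to the category-level term. How the two terms are weighted against each other is left open here.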

Fig.1. The architecture of speech representation learning module.

Fig.2. The architecture of multi-speaker expressive VITS.



2. Demos - multilingual expressive speech with a specific style and emotion for target speakers

Transfer emotion and style expressions from different source speakers to neutral target speakers who have no emotional or stylistic training data.

Chinese speaker: F1

Emotion | Target style example | Target emotion example | Target speaker example | TSEW | CGCLONE | Proposed
Poet + Happy Text: 长风破浪会有时,直挂云帆济沧海。(English: Someday, with my sail piercing the clouds; I will mount the wind, break the waves, and traverse the vast, rolling sea.)
Poet + Sad Text: 君不见,高堂明镜悲白发,朝如青丝暮成雪。(English: Don't you see the bright mirror in the high hall grieving over white hair, black as silk at dawn and white as snow by dusk?)
Poet + Angry Text: 天生我材必有用,千金散尽还复来。(English: Heaven has made us talents, we're not made in vain. A thousand gold coins spent, more will turn up again.)
Poet + Surprise Text: 从此山河两相阅,缠尽青山尝清河。(English: From then on, the mountains and rivers read each other and wrapped around the green mountains to taste the Qinghe River.)
Poet + Fear Text: 山不厌高,海不厌深。周公吐哺,天下归心。(English: The mountain never tires of being high, nor the sea of being deep. The Duke of Zhou spat out his food to welcome worthy guests, and the hearts of all under heaven turned to him.)
Poet + Neutral Text: 文章本天成,妙手偶得之。(English: The essence of writing is inherently divine, skillful hands occasionally chance upon it.)
Fairytales + Happy Text: 不久以后,王后果然生下了一个可爱的小公主。(English: Before long, the queen gave birth to a lovely little princess.)
Fairytales + Sad Text: 她对恶魔说,求求你,千万不要伤害公主,我什么都可以给你。(English: She said to the devil, Please, don't hurt the princess. I can give you anything.)
Fairytales + Angry Text: 给我杀了她!把她的心和舌头都带回来,做为你杀死她的证据。(English: Kill her! Bring back her heart and tongue as proof that you killed her.)
Fairytales + Surprise Text: 进入小木屋后,里面竟然整齐排列着七张小小的床。(English: Inside the cabin, to her surprise, seven tiny beds were neatly lined up.)
Fairytales + Fear Text: 听到猫头鹰叫声的白雪公主,越走越觉得森林好可怕。(English: Hearing the hoot of an owl, Snow White found the forest more and more frightening as she walked.)
Fairytales + Neutral Text: 春光明媚,碧波荡漾,王后带着公主在湖边愉快地玩耍。(English: The spring is bright and beautiful, with rippling azure waves. The queen, accompanied by the princess, joyfully plays by the lakeside.)
English + Happy Text: I'll build a house out of candy and gingerbread!
English + Sad Text: Hope is the thing with feathers that perches in the soul.
English + Angry Text: Never give up, Never lose hope.
English + Surprise Text: The fairytale is over now, the happy ending we must make.
English + Fear Text: I can't go back to yesterday because I was a different person then.
English + Neutral Text: To be or not to be, that is the question.

English speaker: M1

Emotion | Target style example | Target emotion example | Target speaker example | TSEW | CGCLONE | Proposed
Poet + Happy Text: 长风破浪会有时,直挂云帆济沧海。(English: Someday, with my sail piercing the clouds; I will mount the wind, break the waves, and traverse the vast, rolling sea.)
Poet + Sad Text: 君不见,高堂明镜悲白发,朝如青丝暮成雪。(English: Don't you see the bright mirror in the high hall grieving over white hair, black as silk at dawn and white as snow by dusk?)
Poet + Angry Text: 天生我材必有用,千金散尽还复来。(English: Heaven has made us talents, we're not made in vain. A thousand gold coins spent, more will turn up again.)
Poet + Surprise Text: 从此山河两相阅,缠尽青山尝清河。(English: From then on, the mountains and rivers read each other and wrapped around the green mountains to taste the Qinghe River.)
Poet + Fear Text: 山不厌高,海不厌深。周公吐哺,天下归心。(English: The mountain never tires of being high, nor the sea of being deep. The Duke of Zhou spat out his food to welcome worthy guests, and the hearts of all under heaven turned to him.)
Poet + Neutral Text: 文章本天成,妙手偶得之。(English: The essence of writing is inherently divine, skillful hands occasionally chance upon it.)
Fairytales + Happy Text: 不久以后,王后果然生下了一个可爱的小公主。(English: Before long, the queen gave birth to a lovely little princess.)
Fairytales + Sad Text: 她对恶魔说,求求你,千万不要伤害公主,我什么都可以给你。(English: She said to the devil, Please, don't hurt the princess. I can give you anything.)
Fairytales + Angry Text: 给我杀了她!把她的心和舌头都带回来,做为你杀死她的证据。(English: Kill her! Bring back her heart and tongue as proof that you killed her.)
Fairytales + Surprise Text: 进入小木屋后,里面竟然整齐排列着七张小小的床。(English: Inside the cabin, to her surprise, seven tiny beds were neatly lined up.)
Fairytales + Fear Text: 听到猫头鹰叫声的白雪公主,越走越觉得森林好可怕。(English: Hearing the hoot of an owl, Snow White found the forest more and more frightening as she walked.)
Fairytales + Neutral Text: 春光明媚,碧波荡漾,王后带着公主在湖边愉快地玩耍。(English: The spring is bright and beautiful, with rippling azure waves. The queen, accompanied by the princess, joyfully plays by the lakeside.)
English + Happy Text: I'll build a house out of candy and gingerbread!
English + Sad Text: Hope is the thing with feathers that perches in the soul.
English + Angry Text: Never give up, Never lose hope.
English + Surprise Text: The fairytale is over now, the happy ending we must make.
English + Fear Text: I can't go back to yesterday because I was a different person then.
English + Neutral Text: To be or not to be, that is the question.