UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis

0. Contents

Abstract
Automatic speaking style captioning
Zero-shot stylistic speech synthesis with prompt speech control
Zero-shot stylistic speech synthesis with text description control
Discussion - attribute control

1. Abstract

Understanding the speaking style, such as the emotion of the interlocutor's speech, and responding with speech in an appropriate style is a natural occurrence in human conversations. However, technically, existing research on speech synthesis and speaking style captioning typically proceeds independently. In this work, an innovative framework, referred to as UniStyle, is proposed to incorporate both the capabilities of speaking style captioning and style-controllable speech synthesizing. Specifically, UniStyle consists of a UniConnector and a style prompt-based speech generator. The role of the UniConnector is to bridge the gap between different modalities, namely speech audio and text descriptions. It enables the generation of text descriptions with speech as input and the creation of style representations from text descriptions for speech synthesis with the speech generator. Besides, to overcome the issue of data scarcity, we propose a two-stage and semi-supervised training strategy, which reduces data requirements while boosting performance. Extensive experiments conducted on open-source corpora demonstrate that UniStyle achieves state-of-the-art performance in speaking style captioning and synthesizes expressive speech with various speaker timbres and speaking styles in a zero-shot manner.

Figure 1: Overview of the proposed UniStyle

2. Automatic speaking style captioning

We conduct automatic speaking style captioning that generates style captions for speech inputs. The following captions are generated by StyleCap, SECap and UniStyle-L, respectively.

StyleCap	SECap	UniStyle-L	Ground truth
With a stume at the wrong fives. She spoke slowly, her livid high-pitched voice resonating with energy. The startled dumbfounded girl's speech was slow and deliberate, yet full of life	With a normal pitch, the joyful man speaks at a leisurely pace, exuding low energy.	slowly and emphatically, the woman's high-energy speech held the attention of her happy audience.	The joyful woman's high-pitched voice carried a slow yet energetic speech.
The man with a deep-speffo. A talks in a regular rhythm, speaking at an average pace and displaying normal energy. It's important to maintain a healthy sense of speed while conveying the essence of	The man's low-pitched tone conveys high energy while speaking naturally at a moderate speed.	with a low pitch and customary speaking speed, the man's overall energy level appears subdued.	The man's low-pitched voice maintains an even speaking tempo, evoking a subdued vitality.
A fuming suivaiucal quasiwaura was at a normal. The woman spoke slowly, her revulsed voice carrying a high pitch and low energy. The dismissive girl's words were like a gentle breeze	With a high-pitched voice, the male speaker's unhurried speech conveys low energy. Speaking speed.	despite the low energy, her heartbroken voice sustains a high pitch at a normal pace.	She talks gently, yet her downcast pitch is high.
The man with a deep-speffims to speak in. His customary rush, delivers words with low energy at normal pitch and pace. It's not easy being a guy, as society tends to place high expectations	The man's low-pitched voice and standard speaking tempo contribute to his subdued energy.	the male speaker's low - pitched voice delivers his message with a moderate level of energy and a standard speed of speech.	The male speaker's low-pitched tone communicates naturally, neither too fast nor too slow, with moderate energy.

3. Zero-shot stylistic speech synthesis with prompt speech control

We conduct zero-shot text-to-speech that synthesizes the speech with not only the timbre of a speaker prompt but also the speaking style of a style prompt.

Speaker prompt	Style prompt	SC VALL-E	Vec-Tok Speech	UniStyle-P	UniStyle	UniStyle-L
Text: Receiving this unexpected good news, I can hardly believe it.

Text: Being alone often fills me with a deep sense of loneliness and helplessness.

Text: I feel angry and helpless witnessing such blatant disregard for the environment.

Text: Losing my beloved pet feels like my heart is being torn apart.

Text: Receiving this unexpected good news, I can hardly believe it.

Text: Being alone often fills me with a deep sense of loneliness and helplessness.

Text: I feel angry and helpless witnessing such blatant disregard for the environment.

Text: Losing my beloved pet feels like my heart is being torn apart.

4. Zero-shot stylistic speech synthesis with text description control

We conduct zero-shot text-to-speech that synthesizes the speech with the timbre of a speaker prompt, while the speaking style is controlled by a text description. We synthesize the same text "I am very busy, Cornelia, and Colonel Bethel is waiting; I will talk to you at dinner time." for a more intuitive comparison.

Speaker prompt	PromptStyle	Salle	UniStyle-P	UniStyle	UniStyle-L
Text description: A man with an energetic, high-pitched voice speaks fast.

Text description: He communicated in a reserved tone, his sad speech unfolding slowly and softly.

Text description: With a low-pitched tone, the male speaker's fast speech emanates vibrant energy.

Text description: With a slow speaking speed and a low-pitched voice, the despondent male speaker's audio style conveys a sense of low energy.

Text description: A woman with an energetic, high-pitched voice speaks fast.

Text description: She communicated in a reserved tone, her sad speech unfolding slowly and softly.

Text description: With a low-pitched tone, the female speaker's fast speech emanates vibrant energy.

Text description: With a slow speaking speed and a low-pitched voice, the despondent female speaker's audio style conveys a sense of low energy.

5. Discussion - attribute control

We evaluate the effectiveness of UniStyle-L from the perspective of attribute control. As we can hear, the gender control does not work well. This discrepancy can be attributed to the fact that gender is highly correlated with speaker timbre, which is primarily determined by the acoustic prompt rather than the style prompt. In the context of zero-shot TTS, where the synthetic speaker timbre aligns closely with that of the speaker prompt, such a phenomenon is expected and reasonable.

Attribute	Speaker prompt	Style prompt
Gender	Text: Downs are sitting by the door.
		A man talks
		A woman talks
		A man talks
		A woman talks
Speaking speed	Text: And yesterday things went on just as usual.
		A speaker talks with a slow rate.
		A speaker talks with a standard rate.
		A speaker talks with a fast rate.
		A speaker talks with a slow rate.
		A speaker talks with a standard rate.
		A speaker talks with a fast rate.
Pitch	Text: Receiving this unexpected good news, I can hardly believe it.
		A speaker talks with a low pitch.
		A speaker talks with a standard pitch.
		A speaker talks with a high pitch.
		A speaker talks with a low pitch.
		A speaker talks with a standard pitch.
		A speaker talks with a high pitch.
Energy	Text: Losing my beloved pet feels like my heart is being torn apart.
		Lack of energy, A speaker talks.
		A speaker talks with a normal level of energy.
		Full of energy, A speaker talks.
		Lack of energy, A speaker talks.
		A speaker talks with a normal level of energy.
		Full of energy, A speaker talks.
Emotion	Text: I am very busy, Cornelia, and Colonel Bethel is waiting; I will talk to you at dinner time.
		A enraged speaker talks.
		A speaker talks with terrified tone.
		A speaker talks with a pleasing voice.
		A speaker talks with a sorrowful voice.
		A enraged speaker talks.
		A speaker talks with terrified tone.
		A speaker talks with a pleasing voice.
		A speaker talks with a sorrowful voice.