Autoregressive Speech Synthesis with Next-Distribution Prediction

Xinfa Zhu, Wenjie Tian, and Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science,
Northwestern Polytechnical University, Xi'an, China

0. Contents

  1. Abstract
  2. Zero-shot TTS on the LibriTTS test-clean set
  3. Zero-shot emotion cloning TTS on the ESD corpus
  4. Zero-shot accent cloning TTS on the VCTK corpus
  5. Unconditional TTS
  6. Celebrity imitation

Work in progress. This page is for research demonstration purposes only.

1. Abstract

We introduce KALL-E, a novel autoregressive (AR) language modeling approach with next-distribution prediction for text-to-speech (TTS) synthesis. Unlike existing methods, KALL-E directly models and predicts the continuous speech distribution conditioned on text without relying on VAE- or diffusion-based components. Specifically, we use WaveVAE to extract continuous speech distributions from waveforms instead of using discrete speech tokens. A single AR language model predicts these continuous speech distributions from text, with a Kullback-Leibler divergence loss as the constraint. Experimental results show that KALL-E outperforms open-source implementations of YourTTS, VALL-E, NaturalSpeech 2, and CosyVoice in terms of naturalness and speaker similarity in zero-shot TTS scenarios. Moreover, KALL-E demonstrates exceptional zero-shot capabilities in emotion and accent cloning. Importantly, KALL-E presents a more straightforward and effective paradigm for using continuous speech representations in TTS.
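As a rough illustration of the Kullback-Leibler constraint mentioned above, the following minimal sketch computes the KL divergence between two diagonal Gaussians. It assumes that each speech frame's distribution is parameterized by a mean and a log-variance, as is common for VAE posteriors; this page does not state the parameterization or the direction of the divergence, so both are assumptions, and the function name `gaussian_kl` is hypothetical.

```python
import math


def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """KL(p || q) between two diagonal Gaussians, summed over dimensions.

    Assumption: p plays the role of the target distribution extracted from
    the waveform and q the distribution predicted by the language model.
    """
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        # Closed form per dimension:
        # 0.5 * (log(var_q / var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1)
        total += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return total


# Toy 2-dimensional frame: the loss is zero only when both distributions match.
target_mu, target_logvar = [0.2, -0.1], [0.0, -0.5]
pred_mu, pred_logvar = [0.0, 0.0], [0.0, 0.0]
loss = gaussian_kl(target_mu, target_logvar, pred_mu, pred_logvar)
```

In a training setup along these lines, such a term would be averaged over frames and minimized with respect to the predicted parameters.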


Figure 1: Overview of KALL-E. Unlike language modeling approaches based on discrete tokens, KALL-E generates continuous speech distributions conditioned on input text and acoustic prompts, using a single-stage decoder-only model as its backbone.



2. Zero-shot TTS on the LibriTTS test-clean set

We conduct text-to-speech synthesis on the LibriTTS test-clean set to demonstrate zero-shot voice cloning.

Prompt Speech YourTTS VALL-E NaturalSpeech 2 CosyVoice KALL-E
Text: Again he searched his own thoughts; nor ineffectually as before.
Text: Again he searched his own thoughts; nor ineffectually as before.
Text: Again he searched his own thoughts; nor ineffectually as before.
Text: Again he searched his own thoughts; nor ineffectually as before.
Text: Again he searched his own thoughts; nor ineffectually as before.
Text: We recognize our friend Jones, we know cats and dogs when we see them, and so on.
Text: We recognize our friend Jones, we know cats and dogs when we see them, and so on.
Text: We recognize our friend Jones, we know cats and dogs when we see them, and so on.
Text: We recognize our friend Jones, we know cats and dogs when we see them, and so on.
Text: We recognize our friend Jones, we know cats and dogs when we see them, and so on.


3. Zero-shot emotion cloning TTS on the ESD corpus

We conduct text-to-speech synthesis on the ESD corpus to demonstrate zero-shot emotion cloning.

Prompt Speech YourTTS VALL-E NaturalSpeech 2 CosyVoice KALL-E
Text: Old women always like rascals.
Text: The applause continued so long that the comte had ample leisure to join the king.
Text: He worked me very hard; he wanted to be beating me all the time.
Text: Yes; but perhaps I frightened her.
Text: Old women always like rascals.
Text: The applause continued so long that the comte had ample leisure to join the king.
Text: He worked me very hard; he wanted to be beating me all the time.
Text: Yes; but perhaps I frightened her.


4. Zero-shot accent cloning TTS on the VCTK corpus

We conduct text-to-speech synthesis on the VCTK corpus to demonstrate zero-shot accent cloning.

Prompt Speech YourTTS VALL-E NaturalSpeech 2 CosyVoice KALL-E
Text: They have all grown very smart.
Text: They have all grown very smart.
Text: They have all grown very smart.
Text: They have all grown very smart.
Text: One could easily fix his type; it never, happily, dies out.
Text: One could easily fix his type; it never, happily, dies out.
Text: One could easily fix his type; it never, happily, dies out.
Text: One could easily fix his type; it never, happily, dies out.


5. Unconditional TTS

We conduct unconditional text-to-speech synthesis, that is, synthesis without an acoustic prompt. Because it samples from latent speech distributions, KALL-E can generate diverse speech with a variety of speaker timbres and speaking styles.
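To show why sampling yields diversity, here is a minimal sketch of drawing a latent frame from a predicted distribution. It assumes a diagonal Gaussian given by a mean and a log-variance, which this page does not confirm; `sample_latent` and the toy values are hypothetical and not part of KALL-E.

```python
import math
import random


def sample_latent(mu, logvar, rng):
    """Draw one latent vector as mu + sigma * eps, with eps ~ N(0, 1).

    Assumption: the predicted distribution is a diagonal Gaussian, so each
    dimension is sampled independently.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


# The same predicted distribution gives different latents under different
# random draws, which is one plausible source of variation in the output.
mu, logvar = [0.0, 1.0, -0.5], [0.0, -1.0, 0.5]
sample_a = sample_latent(mu, logvar, random.Random(0))
sample_b = sample_latent(mu, logvar, random.Random(1))
```

Fixing the random seed makes a sample reproducible, while changing it gives a different latent for the same text.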

Text transcripts Synthetic Speech
I didn't know the way to come.
Almost instantly he was forced to the top.




6. Celebrity imitation

KALL-E can imitate the voices of celebrities. We present these examples purely for research purposes.

Text transcripts Prompt Speech Synthetic Speech
You took the thing down.
In that case I am indeed unhappy, and greatly to be pitied.