Work in progress. This page is for research demonstration purposes only.
1. Abstract
We introduce KALL-E, a novel autoregressive (AR) language
modeling approach with next-distribution prediction for text-to-speech (TTS) synthesis. Unlike existing methods,
KALL-E directly models and predicts the continuous speech distribution conditioned on text without relying on VAE-
or diffusion-based components.
Specifically, we use WaveVAE to extract continuous speech distributions from waveforms instead of using discrete
speech tokens. A single AR language model predicts these continuous speech distributions from text, with a
Kullback-Leibler divergence loss as the constraint.
Experimental results show that KALL-E outperforms open-source implementations of YourTTS, VALL-E, NaturalSpeech 2,
and CosyVoice in terms of naturalness and speaker similarity in zero-shot TTS scenarios. Moreover, KALL-E
demonstrates exceptional zero-shot capabilities in emotion and accent cloning. Importantly, KALL-E
presents a more straightforward and effective paradigm for using continuous speech representations in TTS.
Figure 1: Overview of KALL-E. Unlike language modeling approaches based on discrete tokens, KALL-E generates
continuous speech distributions conditioned on input texts and acoustic prompts, using a single-stage
decoder-only model as its foundational structure.
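To make the Kullback-Leibler constraint mentioned in the abstract concrete, the following is a minimal sketch of a KL divergence between two distributions. It assumes, purely for illustration, that both the target distribution extracted from the waveform and the distribution predicted by the AR model are diagonal Gaussians parameterized by a mean and a log-variance; the page does not state the parameterization or the direction of the divergence, and the function name `gaussian_kl` is hypothetical.

```python
import math
from typing import Sequence


def gaussian_kl(
    mu_p: Sequence[float],
    logvar_p: Sequence[float],
    mu_q: Sequence[float],
    logvar_q: Sequence[float],
) -> float:
    """KL(p || q) between two diagonal Gaussians, summed over dimensions.

    Assumption (not from the page): p and q are diagonal Gaussians given
    by per-dimension means and log-variances.
    """
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        # Closed form per dimension:
        # 0.5 * (log(var_q / var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1)
        total += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return total


# Identical distributions have zero divergence.
same = gaussian_kl([0.0, 1.0], [0.0, -1.0], [0.0, 1.0], [0.0, -1.0])

# Unit-variance Gaussians whose means differ by 1 in one dimension.
shifted = gaussian_kl([1.0], [0.0], [0.0], [0.0])
```

Under these assumptions, a per-frame loss of this kind could be averaged over all frames of an utterance; whether KALL-E does exactly this is not specified here.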
2. Zero-shot TTS on the LibriTTS test-clean set
We conduct text-to-speech synthesis on the LibriTTS test-clean set to show the capability of zero-shot voice
cloning.
Prompt Speech
YourTTS
VALL-E
NaturalSpeech 2
CosyVoice
KALL-E
Text: Again he searched his own thoughts; nor ineffectually as before.
Text: Again he searched his own thoughts; nor ineffectually as before.
Text: Again he searched his own thoughts; nor ineffectually as before.
Text: Again he searched his own thoughts; nor ineffectually as before.
Text: Again he searched his own thoughts; nor ineffectually as before.
Text: We recognize our friend Jones, we know cats and dogs when we see them, and so on.
Text: We recognize our friend Jones, we know cats and dogs when we see them, and so on.
Text: We recognize our friend Jones, we know cats and dogs when we see them, and so on.
Text: We recognize our friend Jones, we know cats and dogs when we see them, and so on.
Text: We recognize our friend Jones, we know cats and dogs when we see them, and so on.
3. Zero-shot emotion cloning TTS on the ESD corpus
We conduct text-to-speech synthesis on the ESD corpus to show the capability of zero-shot emotion
cloning.
Prompt Speech
YourTTS
VALL-E
NaturalSpeech 2
CosyVoice
KALL-E
Text: Old women always like rascals.
Text: The applause continued so long that the comte had ample leisure to join the king.
Text: He worked me very hard; he wanted to be beating me all the time.
Text: Yes; but perhaps I frightened her.
Text: Old women always like rascals.
Text: The applause continued so long that the comte had ample leisure to join the king.
Text: He worked me very hard; he wanted to be beating me all the time.
Text: Yes; but perhaps I frightened her.
4. Zero-shot accent cloning TTS on the VCTK corpus
We conduct text-to-speech synthesis on the VCTK corpus to show the capability of zero-shot accent cloning.
Prompt Speech
YourTTS
VALL-E
NaturalSpeech 2
CosyVoice
KALL-E
Text: They have all grown very smart.
Text: They have all grown very smart.
Text: They have all grown very smart.
Text: They have all grown very smart.
Text: One could easily fix his type; it never, happily, dies out.
Text: One could easily fix his type; it never, happily, dies out.
Text: One could easily fix his type; it never, happily, dies out.
Text: One could easily fix his type; it never, happily, dies out.
5. Unconditional TTS
We conduct unconditional text-to-speech synthesis. Because it samples from latent speech distributions, KALL-E can
generate diverse speech with various speaker timbres and speaking styles.
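As a rough illustration of why sampling yields diverse outputs, the sketch below draws a latent vector from a predicted distribution. It assumes a diagonal Gaussian described by a mean and a log-variance, which this page does not confirm; `sample_latent` is a hypothetical helper, not part of any released KALL-E code.

```python
import math
import random
from typing import List, Sequence


def sample_latent(
    mu: Sequence[float],
    logvar: Sequence[float],
    rng: random.Random,
) -> List[float]:
    """Draw one sample z = mu + sigma * eps with eps ~ N(0, 1).

    Assumption (not from the page): the latent distribution is a
    diagonal Gaussian with per-dimension mean and log-variance.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


mu = [0.0, 1.0, -1.0]
logvar = [0.0, 0.0, 0.0]

# Different random draws give different latents for the same text,
# which is the source of variation in timbre and style.
z_a = sample_latent(mu, logvar, random.Random(0))
z_b = sample_latent(mu, logvar, random.Random(1))

# Reusing a seed reproduces the same latent.
z_a_again = sample_latent(mu, logvar, random.Random(0))
```

With a very small variance the samples collapse toward the mean, so in this sketch the predicted variance controls how much the outputs differ from one draw to the next.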
Text transcripts
Synthetic Speech
I didn't know the way to come.
Almost instantly he was forced to the top.
6. Celebrity imitation
KALL-E can imitate the voices of celebrities. We present these examples purely for research purposes.
Text transcripts
Prompt Speech
Synthetic Speech
You took the thing down.
In that case I am indeed unhappy, and greatly to be pitied.