SyncSpeech: Low-Latency and Efficient Text-to-Speech based on
Temporal Masked Transformer
[Paper]
[Code]
Anonymous authors
Abstract: Current text-to-speech (TTS) models face a persistent dilemma: autoregressive (AR) models suffer from low generation efficiency, while modern non-autoregressive (NAR) models incur high latency because their generation is not temporally ordered and thus cannot proceed as text streams in. To bridge this divide, we introduce SyncSpeech, an efficient and low-latency TTS model based on the proposed Temporal Masked Transformer (TMT) paradigm. TMT unifies the temporal ordering of AR generation with the parallel decoding of NAR models within a single paradigm. Specifically, a two-stage training strategy with tailored sequence construction is designed so that TMT can handle streaming text input. During inference, SyncSpeech achieves high efficiency by decoding all speech tokens corresponding to each newly arrived text token in a single step, and low latency by beginning to generate speech as soon as the second text token is received. Evaluations show that SyncSpeech matches the speech quality of state-of-the-art TTS models while reducing first-packet latency by a factor of 5.8 and improving the real-time factor by a factor of 8.8.
Zero-Shot Text-to-Speech on LibriSpeech
FPL: first-packet latency; RTF: real-time factor.

| Text | Speaker Prompt | CosyVoice<br>(FPL-A: 0.22, FPL-L: 0.93, RTF: 0.45) | CosyVoice2<br>(FPL-A: 0.22, FPL-L: 0.35, RTF: 0.45) | SyncSpeech<br>(FPL-A: 0.04, FPL-L: 0.09, RTF: 0.05) |
|---|---|---|---|---|
| I will show you what a good job I did," and she went to a tall cupboard and threw open the doors. | | | | |
| Costly entertainments, such as the potlatch or the ball, are peculiarly adapted to serve this end. | | | | |
| Whoever, therefore, is ambitious of distinction in this way ought to be prepared for disappointment. | | | | |
| To relieve her from both, he laid his hand with force upon his heart, and said, "Do you believe me?" | | | | |
| Patsy and Beth supported their cousin loyally and assisted in the preparations. | | | | |
| Her manner was neither independent nor assertive, but rather one of well bred composure and calm reliance. | | | | |
| "Yes; the character which your royal highness assumed is in perfect harmony with your own." | | | | |
| The military force, partly rabble, partly organized, had meanwhile moved into the town. | | | | |
Zero-Shot Text-to-Speech for Mandarin
| Speech Prompt | Text | CosyVoice2 | SyncSpeech |
|---|---|---|---|
| | 在这种时候, 他能睡一天,把苦恼交给了梦。 | | |
| | 我不敢走进去，感觉那里有东西在等着我，怎么办。 | | |
| | 泰勒维尔是一个位于美国伊利诺伊州克里斯蒂安县的城市。 | | |
| | 今天买了个折叠椅，坐在阳台上晒太阳，风吹过来，忽然觉得自己像是被世界温柔拥抱了一下。 | | |
Inference Process Details
An overview of the inference steps of the proposed TMT framework when n=1 and q=1. Step (1) predicts the duration l1 of the first text token y1 once the second text token y2 has been received. Step (2) then predicts, in parallel, all speech tokens (s1 to s3) corresponding to the first text token y1, together with the duration l2 of the second text token y2. Step (3) and all subsequent steps follow the same procedure as Step (2), except for the final step, which does not predict a duration.
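Below is a minimal Python sketch of this streaming decoding loop for the n=1, q=1 case. The names `syncspeech_decode`, `predict_duration`, and `decode_step` are hypothetical placeholders standing in for the TMT model's forward passes, not the released implementation; in the actual model a single masked step jointly predicts the speech tokens and the next duration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical model interfaces: `predict_duration` stands in for Step (1),
# `decode_step` for Steps (2), (3), ...
DurationFn = Callable[[List[int], List[int]], int]
DecodeFn = Callable[[List[int], List[int], int, bool], Tuple[List[int], int]]


def syncspeech_decode(
    text_stream: Iterable[int],
    predict_duration: DurationFn,
    decode_step: DecodeFn,
) -> Iterator[List[int]]:
    """Streaming inference for n=1, q=1: the speech tokens of text token y_i are
    emitted as soon as y_{i+1} has arrived (or the text stream has ended)."""
    text: List[int] = []    # text tokens received so far
    speech: List[int] = []  # speech tokens generated so far
    dur = None              # predicted duration of the latest text token

    for y in text_stream:
        text.append(y)
        if len(text) < 2:
            continue  # nothing is predicted before the second text token arrives
        if dur is None:
            dur = predict_duration(text, speech)  # Step (1): duration l_1 of y_1
        # Steps (2), (3), ...: one parallel step decodes all `dur` speech tokens of
        # the previous text token plus the duration of the newly arrived text token.
        new_speech, dur = decode_step(text, speech, dur, False)
        speech.extend(new_speech)
        yield new_speech

    if dur is not None:
        # Final step: decode the last text token's speech tokens; no duration output.
        yield decode_step(text, speech, dur, True)[0]


if __name__ == "__main__":
    # Toy stand-ins: every text token lasts three speech tokens (s1..s3 in the figure).
    toy_duration = lambda text, speech: 3
    toy_decode = lambda text, speech, dur, last: (
        [len(speech) + k for k in range(dur)], 0 if last else 3)
    for packet in syncspeech_decode([101, 102, 103], toy_duration, toy_decode):
        print(packet)  # [0, 1, 2] then [3, 4, 5] then [6, 7, 8]
```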
Additionally, we implement a custom design for the KV cache. By employing distinct positional encodings for the text token sequence and the speech token sequence, we ensure that the attention between past text tokens and past speech tokens never needs to be recomputed. Instead, at each step we only compute the attention of the new text tokens with past speech tokens, the new text tokens with new speech tokens, and the past text tokens with new speech tokens.
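The sketch below illustrates this incremental scheme under simplifying assumptions (a hypothetical `IncrementalAttention` class, a single attention head, pre-projected queries/keys/values, and sinusoidal encodings). It only shows how separate per-stream position counters keep the cached keys and values valid while attention rows are computed solely for newly arrived tokens; it is not the paper's implementation.

```python
import math
import torch


def sinusoidal_pe(pos: int, dim: int) -> torch.Tensor:
    """Sinusoidal positional encoding for a single position (dim must be even)."""
    i = torch.arange(dim // 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, 2 * i / dim)
    return torch.cat([torch.sin(angles), torch.cos(angles)])


class IncrementalAttention:
    """Hypothetical single-head attention with a KV cache. Text and speech tokens
    keep separate position counters, so the cached keys of one stream stay valid
    when the other stream grows."""

    def __init__(self, dim: int):
        self.dim = dim
        self.k = torch.empty(0, dim)  # cached keys of all past text/speech tokens
        self.v = torch.empty(0, dim)  # cached values of all past text/speech tokens
        self.text_pos = 0             # next position index in the text stream
        self.speech_pos = 0           # next position index in the speech stream

    def _add_pe(self, x: torch.Tensor, start: int) -> torch.Tensor:
        pe = torch.stack([sinusoidal_pe(start + t, self.dim) for t in range(x.size(0))])
        return x + pe

    def step(self, text_qkv, speech_qkv):
        """One decoding step: attention rows are computed only for the new text and
        speech tokens; all past keys/values are reused from the cache."""
        tq, tk, tv = text_qkv
        sq, sk, sv = speech_qkv
        # Distinct positional encodings per stream: new text tokens never shift the
        # positions of cached speech keys, and vice versa.
        tq, tk = self._add_pe(tq, self.text_pos), self._add_pe(tk, self.text_pos)
        sq, sk = self._add_pe(sq, self.speech_pos), self._add_pe(sk, self.speech_pos)
        self.text_pos += tq.size(0)
        self.speech_pos += sq.size(0)

        q = torch.cat([tq, sq])               # queries: new tokens only
        k = torch.cat([self.k, tk, sk])       # keys: cache + new tokens
        v = torch.cat([self.v, tv, sv])
        out = torch.softmax(q @ k.T / math.sqrt(self.dim), dim=-1) @ v
        self.k, self.v = k, v                 # extend the cache for the next step
        return out
```

With this bookkeeping, the cost of each step scales with the number of newly arrived tokens rather than with the full sequence length.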
Speech Token Accuracy Curves of Two-Stage and Single-Stage Training
Speech token accuracy during training: the green curve corresponds to the two-stage training process and the pink curve to the single-stage process. Two-stage training achieves significantly higher training efficiency and accuracy than single-stage training.