SpeakStream: Streaming Text-to-Speech with Interleaved Data
Source: Apple Machine Learning Research

With the increasing integration of speech front-ends and large language models (LLMs), there is a need to explore architectures that integrate these modalities. While end-to-end models have been explored extensively, cascaded models that stream outputs from LLMs to TTS seem to be oddly under-explored, even though they are potentially much simpler. Using traditional text-to-speech systems to convert LLM outputs to audio, however, poses a technical problem because they need entire utterances to generate stylistic audio. In this paper we present a 'streaming' TTS that can generate audio from streaming text using a novel decoder-only architecture that interleaves text and speech. The model is trained using next-step prediction on interleaved data that is generated from force-alignment of text transcripts to speech. During inference our system processes text incrementally while generating consistent speech output, making it suitable for real-time applications where an LLM can stream text to a TTS system. Evaluations show that our approach matches the quality of batched TTS systems while enabling streaming.
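To make the data-preparation idea concrete, here is a minimal sketch of how text tokens might be interleaved with speech tokens using forced-alignment word end times. This is an illustrative assumption, not the paper's actual tokenizer or frame rate: `interleave`, the 25 frames/s rate, and the tuple-tagged token format are all hypothetical.

```python
# Hedged sketch: build an interleaved text/speech sequence from a forced
# alignment. Each word is followed by the discrete speech tokens (e.g.
# codec codes) whose audio frames the alignment assigns to it.

def interleave(words, word_end_times, speech_tokens, frames_per_sec=25):
    """Interleave text tokens with the speech-token frames that the
    forced alignment covers up to each word's end time (seconds)."""
    seq = []
    cursor = 0  # index of the next unconsumed speech token
    for word, end_t in zip(words, word_end_times):
        seq.append(("text", word))
        frame_end = round(end_t * frames_per_sec)
        seq.extend(("speech", tok) for tok in speech_tokens[cursor:frame_end])
        cursor = frame_end
    # Append any trailing speech frames past the last word boundary.
    seq.extend(("speech", tok) for tok in speech_tokens[cursor:])
    return seq

# Toy example: two words aligned over 0.2 s of audio at 25 speech frames/s.
words = ["hello", "world"]
ends = [0.08, 0.20]          # word end times in seconds, from forced alignment
speech = list(range(5))      # 5 discrete speech tokens
print(interleave(words, ends, speech))
```

A decoder-only model trained with next-step prediction on such sequences learns to emit speech tokens conditioned only on the text seen so far, which is what enables streaming at inference time.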