SpeakStream: Streaming Text-to-Speech with Interleaved Data
Source: Apple Machine Learning Research

With the increasing integration of speech front-ends and large language models (LLMs), there is a need to explore architectures that integrate these modalities. While end-to-end models have been explored extensively, cascaded models that stream outputs from LLMs to TTS seem to be oddly under-explored, even though they are potentially much simpler. Using traditional text-to-speech systems to convert LLM outputs to audio, however, poses a technical problem because they need entire utterances to generate stylistic audio. In this paper we present a 'streaming' TTS that can generate audio from streaming text using a novel decoder-only architecture that interleaves text and speech. The model is trained using next-step prediction on interleaved data that is generated from force-alignment of text transcripts to speech. During inference our system processes text incrementally while generating consistent speech output, making it suitable for real-time applications where an LLM can stream text to a TTS system. Evaluations show that our approach matches the quality of batched TTS systems while enabling streaming.
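To make the data-preparation idea concrete, here is a minimal sketch of how text tokens might be interleaved with speech tokens using forced-alignment word end times. This is an illustrative assumption, not the paper's actual tokenizer or frame rate: `interleave`, the 25 frames/s rate, and the tuple-tagged token format are all hypothetical.

```python
# Hedged sketch: build an interleaved text/speech sequence from a forced
# alignment. Each word is followed by the discrete speech tokens (e.g.
# codec codes) whose audio frames the alignment assigns to it.

def interleave(words, word_end_times, speech_tokens, frames_per_sec=25):
    """Interleave text tokens with the speech-token frames that the
    forced alignment covers up to each word's end time (seconds)."""
    seq = []
    cursor = 0  # index of the next unconsumed speech token
    for word, end_t in zip(words, word_end_times):
        seq.append(("text", word))
        frame_end = round(end_t * frames_per_sec)
        seq.extend(("speech", tok) for tok in speech_tokens[cursor:frame_end])
        cursor = frame_end
    # Append any trailing speech frames past the last word boundary.
    seq.extend(("speech", tok) for tok in speech_tokens[cursor:])
    return seq

# Toy example: two words aligned over 0.2 s of audio at 25 speech frames/s.
words = ["hello", "world"]
ends = [0.08, 0.20]          # word end times in seconds, from forced alignment
speech = list(range(5))      # 5 discrete speech tokens
print(interleave(words, ends, speech))
```

A decoder-only model trained with next-step prediction on such sequences learns to emit speech tokens conditioned only on the text seen so far, which is what enables streaming at inference time.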