
Flag to better support simultaneous text-to-speech and speech-to-text #1356

Open
okharedia opened this issue Jan 10, 2025 · 1 comment

@okharedia

hi,
I have a use case for real-time translation, but I noticed that while the agent is speaking, some of the STT transcripts are missing / not added to the chat context, so the agent will not consider them for the LLM and TTS. I had a look at the VoicePipelineAgent internals and noticed the code below.
(btw, I have allow_interruptions=False and preemptive_synthesis=True configured)

def _validate_reply_if_possible(self) -> None:
    """Check if the new agent speech should be played"""

    if self._playing_speech and not self._playing_speech.interrupted:
        should_ignore_input = False
        if not self._playing_speech.allow_interruptions:
            should_ignore_input = True
            logger.debug(
                "skipping validation, agent is speaking and does not allow interruptions",
                extra={"speech_id": self._playing_speech.id},
            )

and

if should_ignore_input:
    self._transcribed_text = ""
    return

I understand the transcribed text is cleared here to allow a more natural conversation flow and to keep a clean chat history of the agent replying to the correct speech input.
However, this does not quite fit my use case, which cannot tolerate missing speech input. Would it be possible to add a flag to turn this off (i.e., to not clear the transcript while the agent is speaking)?
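
Something like this is what I have in mind (clear_transcript_on_agent_speech is just an illustrative name, not an existing option):

# Hypothetical flag on the agent options (illustrative only, does not exist today)
agent = VoicePipelineAgent(
    vad=vad,
    stt=stt,
    llm=llm,
    tts=tts,
    allow_interruptions=False,
    preemptive_synthesis=True,
    clear_transcript_on_agent_speech=False,  # proposed flag
)

# Inside _validate_reply_if_possible, the clearing would then be guarded,
# assuming the option ends up on the agent's internal options object:
if should_ignore_input:
    if self._opts.clear_transcript_on_agent_speech:
        self._transcribed_text = ""
    return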

okharedia added the question label on Jan 10, 2025
@davidzhao
Member

User speech is always flowing in, and STT is always running. Since the LLM requires the full input to be ready in order to start inference, we wait until the user has completed their turn before starting inference.

Can you describe which transcripts you are missing?
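
One way to see what is being dropped is to log every committed user turn independently of the chat context, e.g. with something like the below (assuming the user_speech_committed event; the exact event name and payload type may vary by version):

from livekit.agents.llm import ChatMessage

# Log each user turn as it is committed to the chat context, so it can be
# compared against the raw STT output to spot anything that went missing.
@agent.on("user_speech_committed")
def _on_user_speech_committed(msg: ChatMessage):
    print("committed user speech:", msg.content)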
