Voice Dictation With Audio Large Language Model

Source: WIPO "tomato"
A method includes receiving audio data characterizing an utterance spoken by a user. The method also includes processing the audio data to generate a transcription of the utterance using a multimodal large language model (LLM). The transcription includes a sequence of terms. The method also includes processing, using the multimodal LLM, the audio data and the transcription in parallel to identify one or more revision terms in the sequence of terms. The one or more revision terms specify a revision action to perform on at least one other term in the sequence of terms. The method also includes modifying the transcription based on the one or more revision terms.
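The final step, modifying the transcription based on identified revision terms, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the source: the `Revision` record, the `delete_previous` action, and the spoken cue "scratch that" are hypothetical. In particular, `identify_revisions` is a rule-based stand-in for the multimodal LLM, which in the described method would use both the audio data and the transcription to find revision terms.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Revision:
    """Hypothetical record of one span of revision terms."""
    start: int      # index of the first revision term in the sequence
    end: int        # index one past the last revision term
    action: str     # hypothetical action name, e.g. "delete_previous"
    count: int = 1  # number of preceding terms the action targets


def identify_revisions(terms: List[str]) -> List[Revision]:
    """Stand-in for the multimodal LLM (assumption, not the source's method).

    Treats the spoken cue "scratch that" as revision terms that delete
    the single term immediately before the cue.
    """
    revisions = []
    i = 0
    while i < len(terms) - 1:
        if terms[i].lower() == "scratch" and terms[i + 1].lower() == "that":
            revisions.append(Revision(i, i + 2, "delete_previous"))
            i += 2
        else:
            i += 1
    return revisions


def apply_revisions(terms: List[str], revisions: List[Revision]) -> List[str]:
    """Apply each revision action and drop the revision terms themselves."""
    out: List[str] = []
    cursor = 0
    for rev in sorted(revisions, key=lambda r: r.start):
        # Copy the terms that precede this revision span.
        out.extend(terms[cursor:rev.start])
        if rev.action == "delete_previous" and rev.count > 0:
            # Remove the targeted terms (at most rev.count of them).
            del out[-rev.count:]
        # Skip over the revision terms so they do not appear in the output.
        cursor = rev.end
    out.extend(terms[cursor:])
    return out


terms = "meet me at five scratch that six".split()
revised = apply_revisions(terms, identify_revisions(terms))
print(" ".join(revised))
```

Under these assumptions, the dictated sequence "meet me at five scratch that six" is reduced to "meet me at six": the revision terms "scratch that" are removed along with the term they target. A keyword rule like this cannot tell a spoken command from the same words dictated as content, a distinction the source's method could plausibly draw by giving the model the audio data alongside the transcription.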