Voice dictation with audio large language model

Source: WIPO "tomato"
A method comprises receiving audio data 102 and generating a transcription 151 comprising a sequence of terms 152, such as “Buy some tomatoes and bananas. Change tomatoes to potatoes”. The audio data and the transcription are processed in parallel using a multimodal large language model (LLM) 150 to identify one or more revision terms 152R, for example “Change”, each specifying a revision action to perform on at least one other term in the sequence, in this instance “tomatoes”. The transcription is then modified accordingly to give a modified transcription 151M: “Buy some potatoes and bananas”. Identifying the revision term(s) may be based on a corresponding user intent 154 determined for each respective term in the sequence; for example, the user 10 does not intend the final transcription to include “tomatoes”. For each term in the sequence, the parallel processing may comprise correlating its speech characteristics 156, such as pitch, tone or prosody information determined from the audio data, with its corresponding linguistic context 158 determined from the transcription. Transcription correction may be based on a revision token inserted into the sequence, the token indicating a number N of terms for replacement and their corresponding replacement terms. User context data 104 may be obtained to tailor the LLM to a particular user. [Figure 1A]
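The revision-token step can be illustrated with a minimal Python sketch. It is not the claimed implementation: the `RevisionToken` class, its `pairs` field and the `apply_revisions` function are hypothetical names introduced here. The sketch further assumes that the model has already replaced the spoken command (“Change tomatoes to potatoes”) with a single token, and that each term to be replaced is found by searching backwards for its most recent occurrence.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class RevisionToken:
    # Hypothetical structure: (term to replace, replacement term) pairs.
    # The number of pairs plays the role of N in the abstract.
    pairs: Tuple[Tuple[str, str], ...]


Item = Union[str, RevisionToken]


def apply_revisions(sequence: List[Item]) -> List[str]:
    """Resolve each revision token against the terms that precede it."""
    output: List[str] = []
    for item in sequence:
        if isinstance(item, str):
            output.append(item)
            continue
        for old, new in item.pairs:
            # Assumption: the most recently spoken match is the one revised.
            for i in range(len(output) - 1, -1, -1):
                if output[i] == old:
                    output[i] = new
                    break
    return output


# Assumed model output: the spoken command has become one revision token.
sequence = ["Buy", "some", "tomatoes", "and", "bananas",
            RevisionToken(pairs=(("tomatoes", "potatoes"),))]
modified = " ".join(apply_revisions(sequence))
print(modified)  # Buy some potatoes and bananas
```

Keeping the token inside the term sequence means the correction can be resolved in one left-to-right pass; a token whose target term is absent leaves the transcription unchanged.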