Publication date:
12/01/2023
Source: WIPO (essential oils OR extracts)
From a sequence of acoustic features of the frames of an acoustic signal, a speaker feature extraction unit (15c) extracts a speaker embedding vector indicating the speaker feature of each frame. A speaker utterance label inference unit (15d) uses the extracted speaker embedding vector to infer a speaker utterance label indicating the speaker of that vector. A training unit (15f) trains and generates a speaker diarization model (14a) for inferring the speaker utterance label of the speaker feature vector of each frame. Training uses loss functions calculated from the extracted speaker embedding vector, the inferred speaker utterance label, and the correct answer label of the speaker utterance label of each frame; these loss functions include one indicating the identity of the speaker of the frames.
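The combination of loss terms described above can be illustrated with a minimal sketch. The abstract does not state the form of any loss function, so everything here is an assumption made for illustration: binary cross-entropy for the label term, a pairwise cosine-similarity penalty for the speaker-identity term, the weighting factor `identity_weight`, and all function names are hypothetical and are not taken from the patent.

```python
import math

def label_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between inferred and correct per-frame labels.

    probs, labels: one list per frame, one entry per speaker (assumed layout).
    """
    total, count = 0.0, 0
    for p_frame, y_frame in zip(probs, labels):
        for p, y in zip(p_frame, y_frame):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def identity_loss(embeddings, labels):
    """Assumed speaker-identity term over pairs of single-speaker frames.

    Same-speaker pairs are penalized by (1 - cosine); different-speaker
    pairs are penalized by any positive cosine similarity. Frames with
    silence or overlapping speech are skipped in this sketch.
    """
    single = [i for i, y in enumerate(labels) if sum(y) == 1]
    total, count = 0.0, 0
    for a in range(len(single)):
        for b in range(a + 1, len(single)):
            i, j = single[a], single[b]
            sim = cosine(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                total += 1.0 - sim
            else:
                total += max(0.0, sim)
            count += 1
    return total / count if count else 0.0

def combined_loss(embeddings, probs, labels, identity_weight=1.0):
    """Weighted sum of the label term and the speaker-identity term."""
    return label_loss(probs, labels) + identity_weight * identity_loss(embeddings, labels)
```

In this sketch, frames whose embeddings point in the same direction for the same speaker and in orthogonal directions for different speakers contribute nothing to the identity term, so the combined loss reduces to the label term alone. A real training unit would compute these terms on differentiable tensors so that gradients can reach the model parameters.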