In recent years, most state-of-the-art approaches to spoofed speech detection have been based on convolutional neural networks (CNNs). Most neural networks, including CNNs, are trained in minibatches, and all inputs in a minibatch must have the same shape. Because utterances are variable-length sequences, they cannot be fed directly into a network in minibatches, so each utterance is first either padded or truncated. However, modeling a padded or truncated utterance instead of the original makes it infeasible to capture the entire context as is: padding can propagate unwanted information, such as artifacts, from the original utterance, and truncation inevitably loses some information. With these information distortions, a model can get stuck in a suboptimal solution. To fill this gap, we propose a method for precise utterance-level modeling that enables minibatch-wise modeling of variable-length utterances while minimizing the information distortions. The proposed method comprises sequence segmentation followed by segment aggregation. Sequence segmentation decomposes each variable-length utterance in a minibatch into fixed-length segments, which enables parallel processing of variable-length utterances without uncertainty in input length. Segment aggregation then aggregates the segment embeddings by utterance so that each utterance embedding encodes the information of the entire utterance. Experimental results on the evaluation trials of ASVspoof 2019 and 2021 indicate that the proposed method achieves relative equal error rate reductions of up to 84.9% and 97.6% in the logical and physical access scenarios, respectively. Furthermore, the proposed method reduces the FLOPs per epoch by 6%.
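The two steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`segment_batch`, `toy_embed`, `aggregate_by_utterance`) are hypothetical, the zero-padding of only the final partial segment and the mean pooling used for aggregation are assumptions made for illustration, and `toy_embed` is a stand-in for the CNN that would produce segment embeddings.

```python
from typing import List, Sequence, Tuple


def segment_batch(
    utterances: Sequence[Sequence[float]], seg_len: int
) -> Tuple[List[List[float]], List[int]]:
    """Split every utterance into fixed-length segments.

    Returns a flat list of segments (all of length seg_len) and, for each
    segment, the index of the utterance it came from.
    """
    segments: List[List[float]] = []
    owners: List[int] = []
    for utt_id, utt in enumerate(utterances):
        # max(..., 1) ensures an empty utterance still yields one segment.
        for start in range(0, max(len(utt), 1), seg_len):
            chunk = list(utt[start:start + seg_len])
            # Assumption: only the final partial segment is zero-padded.
            chunk += [0.0] * (seg_len - len(chunk))
            segments.append(chunk)
            owners.append(utt_id)
    return segments, owners


def toy_embed(segment: Sequence[float]) -> List[float]:
    """Stand-in for a CNN encoder: maps a segment to a 2-dim embedding."""
    return [sum(segment) / len(segment), max(segment)]


def aggregate_by_utterance(
    embeddings: Sequence[Sequence[float]], owners: Sequence[int], num_utts: int
) -> List[List[float]]:
    """Mean-pool segment embeddings per utterance (assumed aggregation)."""
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(num_utts)]
    counts = [0] * num_utts
    for emb, utt_id in zip(embeddings, owners):
        counts[utt_id] += 1
        for d in range(dim):
            sums[utt_id][d] += emb[d]
    return [[s / counts[i] for s in sums[i]] for i in range(num_utts)]


# Two utterances of different lengths share one fixed-shape segment batch.
batch = [[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0]]
segments, owners = segment_batch(batch, seg_len=2)
segment_embeddings = [toy_embed(s) for s in segments]
utterance_embeddings = aggregate_by_utterance(segment_embeddings, owners, len(batch))
```

Because every segment has the same length, the flat segment list can be processed in parallel regardless of the original utterance lengths, and the `owners` index is what lets the segment embeddings be regrouped into one embedding per utterance.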