Biometric authentication verifies whether an entity matches a claimed identity based on biometric data. Despite its advantages, it remains vulnerable to attack, particularly spoofing. Efforts to mitigate these vulnerabilities include multimodal approaches and liveness detection, but these strategies can increase the resource requirements of the authentication process. This paper proposes a multimodal authentication process that combines voice and facial recognition, with liveness detection applied to the voice data through speech recognition. It introduces the Normalized Longest Word Subsequence (NLWS), a combination of Intersection over Union (IoU) and the longest common subsequence, to compare the sentence prompted by the system with the sentence spoken by the user during speech recognition. Unlike the Word Error Rate (WER), which has no fixed upper bound, NLWS is bounded between 0 and 1. The paper also introduces decision-level fusion for the multimodal approach, employing two threshold levels in voice authentication. This design aims to reduce resource requirements while enhancing the overall security of the authentication process. Cosine similarity, Euclidean distance, random forest, and extreme gradient boosting (XGBoost) are used to measure distance or similarity. The results show that the proposed method is more accurate than unimodal approaches, achieving accuracies of 98.44%, 98.83%, 97.46%, and 99.22% with cosine similarity, Euclidean distance, random forest, and XGBoost, respectively. The proposed method also reduces resource usage from 5.19 MB to 0.792 MB, from 7.3294 MB to 1.9437 MB, from 6.6512 MB to 1.3284 MB, and from 7.8632 MB to 2.1517 MB, respectively, for the same four measures.
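The abstract does not state the NLWS formula, so the following is only a minimal sketch of one plausible reading: the word-level longest common subsequence plays the role of the "intersection" and is divided by an IoU-style "union" of the two word sequences. The function names `lcs_length` and `nlws`, the lowercase whitespace tokenization, and the exact normalization are illustrative assumptions, not the definition used in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlws(prompt, spoken):
    """Hypothetical NLWS: LCS length over an IoU-style union of word counts.

    Assumed normalization (not taken from the paper):
        LCS / (len(prompt) + len(spoken) - LCS)
    The score lies in [0, 1]; 1 means the word sequences are identical.
    """
    p = prompt.lower().split()  # assumed tokenization: lowercase, whitespace
    s = spoken.lower().split()
    if not p and not s:
        return 1.0  # two empty sentences are treated as identical
    common = lcs_length(p, s)
    return common / (len(p) + len(s) - common)


if __name__ == "__main__":
    print(nlws("open the blue door", "open the door"))
```

Under this assumed normalization the score cannot leave the interval from 0 to 1, because the LCS length never exceeds the length of either sentence. WER, by contrast, can exceed 1 when the recognized sentence contains many insertions.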