The use of voice recognition systems has grown considerably with advances in technology, which has allowed adversaries to gain unauthorized access to these systems by spoofing the identity of a target speaker. Existing supervised learning (SL)-based countermeasures do not yet provide a complete defense against newly evolving spoofing attacks. To tackle this problem, we explore self-supervised learning (SSL)-based frameworks. First, we implement widely used SSL frameworks for the task of identifying spoofed speech and report a considerable overall performance improvement over the SL state-of-the-art baseline. We then perform an attack-wise comparative analysis between the SL and SSL frameworks. While SSL performs better in most cases, there are certain attacks for which SL outperforms it. We therefore hypothesize that the complementary information learned by the two models can be exploited jointly for better performance. To do so, we first perform conventional weighted score fusion between the SL and best-performing SSL models, which reduces the equal error rate (EER) below that of both the state-of-the-art SL model and the best-performing SSL framework. We then propose an embedding fusion scheme that minimizes the distance between the distributions of the selected SL and SSL representations; the appropriate layers are selected through a comprehensive statistical analysis. The proposed fusion scheme outperforms score fusion, showing that SSL performance can be improved by effectively incorporating knowledge learned by the SL framework. The final EER on the ASVspoof 2019 logical access (LA) database is 0.177%, a significant improvement over our baseline. On the ASVspoof 2021 LA database, used as a blind evaluation set, the proposed embedding fusion scheme reduces the EER to 2.666%.
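The weighted score fusion step mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the grid search, and the convex-combination form of the fusion weight are assumptions, and the EER computation is a standard approximation.

```python
import numpy as np

def compute_eer(scores, labels):
    """Approximate equal error rate: operating point where the false-accept
    rate (FAR) and false-reject rate (FRR) cross.
    labels: 1 = bona fide, 0 = spoofed; higher score = more bona fide."""
    order = np.argsort(scores)
    labels = np.asarray(labels, dtype=float)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # Sweeping the threshold upward: bona fide trials at or below it are
    # false rejects; spoofed trials above it are false accepts.
    frr = np.cumsum(labels) / n_pos
    far = 1.0 - np.cumsum(1.0 - labels) / n_neg
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0

def fuse_scores(sl_scores, ssl_scores, w):
    """Convex combination of the SL and SSL system scores (w is a free weight)."""
    return w * np.asarray(sl_scores) + (1.0 - w) * np.asarray(ssl_scores)

def best_weight(sl_dev, ssl_dev, labels, grid=np.linspace(0.0, 1.0, 101)):
    """Hypothetical weight selection: grid search minimizing dev-set EER.
    The grid contains 0 and 1, so the fused system can never do worse on the
    dev set than the better of the two individual systems."""
    return min(grid, key=lambda w: compute_eer(fuse_scores(sl_dev, ssl_dev, w), labels))
```

In practice the weight would be tuned on a development set and then applied unchanged to the evaluation set, since tuning on evaluation scores would leak label information.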