The Metaverse’s emergence is redefining digital interaction, enabling seamless engagement in immersive virtual realms. Its integration with AI and virtual reality (VR) is gaining momentum, albeit with challenges in acquiring extensive human action datasets. Real-world activities involve intricate behaviors, making accurate capture and annotation difficult. VR compounds this difficulty by requiring meticulous simulation of natural movements and interactions. As the Metaverse bridges the physical and digital realms, the demand for diverse human action data escalates, requiring innovative solutions to enrich AI and VR capabilities. This need is underscored by state-of-the-art models that excel but are hampered by limited real-world data, while the benefits of synthetic data remain largely overlooked. This paper systematically examines both real-world and synthetic datasets for activity detection and recognition in computer vision. Building on Metaverse-enabled advancements, we introduce SynDa, a novel streamlined pipeline that uses photorealistic rendering and AI-driven pose estimation. By fusing real-life video datasets, the pipeline generates large-scale synthetic datasets that augment training and mitigate the scarcity and cost of real data. Our preliminary experiments show promising results: training a model on a combination of real data and synthetic video data generated by this pipeline yields a mean average precision (mAP) of 32.35%, compared with 29.95% for the same model trained on real data alone. This demonstrates the transformative synergy between the Metaverse and AI-driven synthetic data augmentation.
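
To illustrate the training setup summarized above, the following is a minimal sketch of mixing real and pipeline-generated synthetic data before training, assuming a standard PyTorch workflow. The dataset classes, model, paths, and the `evaluate_map` helper are hypothetical placeholders introduced only for illustration; they are not part of SynDa or any specific library.

```python
# Hypothetical sketch: augmenting real action-recognition data with synthetic clips.
# All project-specific imports below are illustrative placeholders.
import torch
from torch.utils.data import ConcatDataset, DataLoader

from my_project.datasets import RealActionDataset, SyntheticActionDataset  # hypothetical
from my_project.models import ActionDetector                               # hypothetical
from my_project.metrics import evaluate_map                                # hypothetical

# Limited real data plus large-scale synthetic data generated by the pipeline.
real_train = RealActionDataset(root="data/real", split="train")
synthetic_train = SyntheticActionDataset(root="data/synthetic", split="train")

# Concatenate the two sources so each training batch can draw from both.
combined_train = ConcatDataset([real_train, synthetic_train])
loader = DataLoader(combined_train, batch_size=8, shuffle=True, num_workers=4)

model = ActionDetector(num_classes=real_train.num_classes)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(10):
    model.train()
    for clips, targets in loader:
        optimizer.zero_grad()
        loss = model(clips, targets)  # assumes the model returns a training loss
        loss.backward()
        optimizer.step()

# Evaluate on a held-out real validation set; the abstract reports mAP.
val_loader = DataLoader(RealActionDataset(root="data/real", split="val"), batch_size=8)
print("mAP:", evaluate_map(model, val_loader))
```

The key design choice this sketch reflects is that synthetic data is used purely to enlarge the training set, while evaluation remains on real data, which is how the reported mAP comparison (real only vs. real plus synthetic) is meaningful.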