Continuous global urbanization, together with rapid and dynamic changes in traffic conditions in densely populated cities, makes data collection and communication difficult. Collecting data from millions of vehicles is hindered by the high cost of energy, time, space, and storage resources. Moreover, heavier data traffic leads to longer delays, lower throughput, excessive bottlenecks, and frequent data repetition. To address these challenges, we propose ML-TDG, a lightweight machine-learning-based data collection protocol that handles high data volumes in a real-time traffic environment while placing the least burden on the network and using less space, time, and energy. ML-TDG is built on Apache Spark, a data processing engine, and indexes the data into two logs: old (frequent or daily) commuters and new (occasional) commuters. The main idea of the protocol is to use real-time traffic, distinguish the indexes in parallel according to the two log criteria to train the network, and collect data from the fewest sources. For energy and time optimization, we introduce dynamic segmentation switching, an intelligent division of the road into segments and switching between them that reduces bottlenecks and replication. ML-TDG is tested and verified on the M50, the busiest motorway in Dublin, Ireland. ML-TDG performs data collection, data sorting, and network training together to decide the next execution, so that each round is better optimized. The experimental results show that the proposed protocol achieves higher performance with lower resource requirements, along with rich, time-efficient, and sustainable data collection clusters, compared with baseline protocols.
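As a rough illustration of the two-log indexing idea only, the following minimal sketch splits vehicle sightings into a frequent-commuter log and an occasional-commuter log. It is not the ML-TDG implementation: it uses plain Python rather than Apache Spark, and the `Sighting` record, its fields, the `split_into_logs` function, and the `min_days` threshold are all assumptions introduced here for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Sighting(NamedTuple):
    """Hypothetical record of one vehicle observation (not from the paper)."""
    vehicle_id: str  # assumed anonymised vehicle identifier
    day: int         # assumed day index of the observation
    segment: str     # assumed road-segment label


def split_into_logs(
    sightings: Iterable[Sighting], min_days: int = 3
) -> Tuple[List[str], List[str]]:
    """Split vehicles into (frequent, occasional) logs.

    A vehicle counts as a frequent commuter when it is seen on at least
    `min_days` distinct days; the threshold is an illustrative assumption.
    """
    days_seen: Dict[str, Set[int]] = {}
    for s in sightings:
        # Track distinct days so repeated sightings on one day count once.
        days_seen.setdefault(s.vehicle_id, set()).add(s.day)

    frequent = sorted(v for v, d in days_seen.items() if len(d) >= min_days)
    occasional = sorted(v for v, d in days_seen.items() if len(d) < min_days)
    return frequent, occasional


# Small made-up sample, used only to exercise the function.
sample = [
    Sighting("car-A", 1, "S1"),
    Sighting("car-A", 2, "S1"),
    Sighting("car-A", 3, "S2"),
    Sighting("car-B", 1, "S1"),
    Sighting("car-C", 2, "S3"),
    Sighting("car-C", 2, "S4"),
]
frequent_log, occasional_log = split_into_logs(sample)
print("frequent:", frequent_log)
print("occasional:", occasional_log)
```

In a Spark setting the same grouping would typically be expressed as a distributed aggregation over vehicle identifiers, which is what would allow the two logs to be processed in parallel.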