Abstract
Incomplete data in my work is defined as the data with extremely limited samples or data observed, which brings big challenges to data mining and real applications. Such extremely limited sample data would obviously give us terrible bias and inaccurate results. In my work, I studied three real applications in smart city and proposed novel algorithms and methodology to tackle the incomplete data analysis problem. Application one: Given one over ten thousand of the whole set of vehicles in a city, how can we still retrieve the vehicle distribution and detect the hot spots or crowded areas in the city? The traditional density-based clustering methods work not well because of the very limited and errorable vehicle density or location information. Hence we need new algorithms to handle such incomplete data, in terms of accuracy and scalability. On the other hand, the vehicle traces are typical spatio-temporal data, which requires efficient approaches. In this paper, we have an interesting observation that the vehicle speed can indicate the crowdedness of a given area. In other words, if a given area is very crowded, then the vehicles' speed in this area is low; while if this area is not crowded, then the vehicles' speed in this area prefers high. As such the mobility of samples is naturally incorporated and a novel non-density-based clustering method is developed, called mobility-based clustering. Several key factors beyond the vehicle crowdedness have been identified and techniques to compensate these effects are proposed. We evaluate the performance of mobility-based clustering based on real traffic situations. Experimental results show that using 0.3 % of vehicles as the samples, mobility-based clustering can accurately identify hot spots which can hardly be obtained by the latest representative algorithm UMicro. Application two: Given one over one hundred thousand of the whole set of population in a city, how can we still retrieve the population distribution and detect the population bursts in the city? To address the difficulties of lacking real population data, we take the advantage of mobile phone networks, which offer enormous spatiotemporal communication data among persons. More importantly, we find the fact that we can utilize these mobile phone data to infer and approximate the population data. Thus, we can study the population burst detection problem taking advantages of the unique features hidden in the mobile phone data. In this work, we present PBD, a system to conduct Population Burst Detection. First, we propose an effective clustering method, correlation-based clustering, to cluster the incomplete location information from the mobile phone data. Then, we design an adaptive parameter-free detection method, R-scan, to capture the distributed dynamic bursts. Finally, we devise an efficient algorithm, BT-miner, to retrieve burst trajectories. The experimental results on real life mobile phone data confirm the effectiveness and efficiency of the proposed algorithms. Application three: We study a very interesting and practical problem, pattern matching in a distributed mobile environment. Pattern matching is a well-known problem and extensive research has been conducted for performing effective and efficient search. However, previous approaches assume that data are centrally stored, which is not the case in a mobile environment (e.g., mobile phone networks), where one person's pattern could be separately stored in a number of different stations, and such a local pattern is incomplete compared with the global pattern. A simple solution to pattern matching over a mobile environment is to collect all the data distributed in base stations to a data center and conduct pattern matching at the data center afterwards. Clearly, such a simple solution will raise huge amount of communication cost, which may cause the traffic congestion brought by the limited wireless bandwidth to be even worse. Therefore, a communication efficient and search effective solution is necessary. In this work, we propose a novel solution based on a well-designed Weighted Bloom Filter (WBF), called, Distributed Incomplete pattern matching (DI-matching ), to find target patterns over a distributed mobile environment. Specifically, to save communication cost and ensure pattern matching in distributed incomplete patterns, we use WBF to encode a query pattern and disseminate the encoded data to each base station. Each base station conducts a local pattern search according to the received WBF. Only qualified IDs and corresponding weights in each base station are sent to the data center for aggregation and verification. Through non-trivial theoretical analysis and extensive empirical experiments on a real city-scale mobile networks dataset, we demonstrate the effectiveness and efficiency of our proposed solutions.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have