Abstract

Big data, crucial to answering the economic, social, and political questions facing our society, tend to be diverse and distributed across many sites on the Internet. The creation of tools to integrate and analyze such data is of paramount interest, yet automating these processes remains a great challenge. Our work rests on the observation that a great number of public data sources in domains ranging from economics to demographics, although complex in structure, often share key similarities, namely the presence of Time and Location dimensions. Our proposed Data Integration through Object Modeling (DIOM) framework tackles the critical problem of automating data integration from a variety of public websites by abstracting key features of multi-dimensional tables and interpreting them in the context of a knowledge-centered Unified Spatial Temporal Model. Our classification-driven extractors are trained to identify and classify entities from both the structured and unstructured parts of spreadsheets. The unstructured parts contained in titles, headers, and footers reveal so-called Implicit Knowledge, critical to the correct interpretation of the data. Our experimental results on real-world datasets from heterogeneous public data sources show a 25% increase in accuracy compared to state-of-the-art approaches.

For example, a lengthy process of collecting and analyzing historical data from different states led to the successful repeal of the Sales and Use Tax on computer and software services introduced in Massachusetts in 2013. In the quest to fight this action, perceived as detrimental to the business growth and economic health of the state, many organizations worked together to create an integrative data source of high-fidelity and talent-competitive metrics that can be used to measure economic competitiveness and influence policy making. Large-scale data integration is crucial for the success of such endeavors. Data from a wide spectrum of diverse websites, from the Tax Policy Center and the Census Bureau to the National Science Foundation and the Bureau of Economic Analysis, had to be extracted, integrated, and warehoused. These web data sources represent valuable public knowledge ready to be leveraged for policy decision making and economic forecasting. The extraction and integration of data proved challenging and time consuming. Yet the appetite for leveraging new data sources appears endless, so automation becomes critical to the success of building and growing rich economic indexes.

The Spreadsheet Integration Problem

One obstacle to capitalizing on this wealth of knowledge is the lack of generalized automated tools for data integration. Unfortunately, while progress has been made on integration (1-3), it remains challenging and labor-intensive to integrate data of the rich variety required to answer complex societal questions. A large amount of information collected from these websites is retrieved in the form of spreadsheets. We demonstrate that actual spreadsheets from domains like tax and economics
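
To make the idea of classification-driven extraction more concrete, the sketch below shows, in Python, how spreadsheet rows might be tagged as title, header, data, or footer. It is an illustrative stand-in only: the labels, the hand-crafted layout features, and the heuristic thresholds are assumptions made here for exposition, not the paper's trained extractors or actual feature set.

```python
# Illustrative sketch of a classification-driven row extractor for spreadsheets.
# The labels (TITLE, HEADER, DATA, FOOTER) and the hand-crafted features are
# assumptions for illustration; DIOM's extractors are trained classifiers over
# richer structured and unstructured features.

from typing import List, Optional


def row_features(row: List[Optional[object]]) -> dict:
    """Compute simple layout features for one spreadsheet row."""
    cells = [c for c in row if c not in (None, "")]
    numeric = [c for c in cells if isinstance(c, (int, float))]
    return {
        "n_filled": len(cells),
        "frac_numeric": len(numeric) / len(cells) if cells else 0.0,
        "is_single_cell": len(cells) == 1,
        "has_long_text": any(isinstance(c, str) and len(c) > 40 for c in cells),
    }


def classify_row(row: List[Optional[object]], position: float) -> str:
    """Heuristic stand-in for a trained classifier.

    `position` is the row's relative position in the sheet
    (0.0 = top, 1.0 = bottom).
    """
    f = row_features(row)
    if f["n_filled"] == 0:
        return "BLANK"
    if f["is_single_cell"] and position < 0.1:
        return "TITLE"    # e.g. "Sales and Use Tax by State, 2013"
    if f["is_single_cell"] and position > 0.9 and f["has_long_text"]:
        return "FOOTER"   # footnotes often carry Implicit Knowledge (units, scope)
    if f["frac_numeric"] < 0.5 and position < 0.3:
        return "HEADER"   # column labels such as State, Year, Rate
    return "DATA"


def classify_sheet(rows: List[List[Optional[object]]]) -> List[str]:
    n = max(len(rows) - 1, 1)
    return [classify_row(r, i / n) for i, r in enumerate(rows)]


if __name__ == "__main__":
    sheet = [
        ["Sales and Use Tax by State, 2013"],
        ["State", "Year", "Rate (%)"],
        ["Massachusetts", 2013, 6.25],
        ["New York", 2013, 4.0],
        ["Note: rates exclude local option surtaxes and exemptions."],
    ]
    for label, row in zip(classify_sheet(sheet), sheet):
        print(f"{label:7s} {row}")
```

In a full pipeline, the rows tagged TITLE, HEADER, and FOOTER would feed the interpretation of the DATA rows, for instance resolving the Time and Location dimensions and units that the Unified Spatial Temporal Model needs.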
