Structuring of Unstructured Data from Heterogeneous Sources

B L Shilpa, B R Shambhavi

doi:10.17485/ijst/v15i41.1566

Abstract

Objectives: To develop a new data-gathering process from a Big Data perspective, and to convert unstructured text data into a structured format without losing any of the available text.

Methods: The unstructured data is preprocessed using tokenization and a modified stemming step. From the stemming output, the proposed Term Frequency-Inverse Document Frequency (TF-IDF) features and N-gram features are derived. The unstructured data is drawn from multiple sources: Twitter, consumer complaints, and news blogs.

Findings: The proposed model with extant TF-IDF features yielded a relatively high Mean Absolute Error (MAE) of 1.4325, compared with 0.5197 for the proposed model without optimization.

Novelty: The novelty of the work lies in the stemming process, to which a dictionary-checking step is added, and in the improved feature extraction, in which an interclass dispersion coefficient is computed for the TF-IDF features.

Keywords: Natural language processing; Structured data; Unstructured data; Big data; Feature extraction
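To make the feature types named in the Methods concrete, the following is a minimal Python sketch of tokenization, N-gram generation, and textbook TF-IDF weighting. It is not the authors' method: the modified stemmer with dictionary checking and the interclass dispersion coefficient are not specified in this abstract, so both are omitted. The function names (`tokenize`, `ngrams`, `tf_idf`) and the three sample texts are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Textbook TF-IDF over a list of term lists.

    tf  = term count / number of terms in the document
    idf = log(number of documents / number of documents containing the term)
    This is the standard formulation, assumed here for illustration only.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))  # count each term once per document

    weights = []
    for terms in docs:
        counts = Counter(terms)
        total = len(terms)
        weights.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t])
             for t, c in counts.items()}
        )
    return weights


# Hypothetical texts standing in for a tweet, a consumer complaint, and a news blog.
corpus = [
    "Flight delayed again, no refund offered",
    "Customer complaint: refund not processed",
    "News blog: airline offers refund after delays",
]

# Each document becomes a list of unigram and bigram features.
features = []
for text in corpus:
    tokens = tokenize(text)
    features.append(tokens + ngrams(tokens, 2))

weights = tf_idf(features)

# One structured row per document: each feature mapped to its weight.
for row in weights:
    top = sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(top)
```

In this sketch a term that occurs in every document, such as "refund", receives a weight of zero, which is the usual behaviour of unsmoothed TF-IDF. Each resulting dictionary can be read as one row of a structured table whose columns are the unigram and bigram features.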

Full Text