Online social networks such as Facebook, Twitter, and LinkedIn have grown exponentially in recent years and now hold very large amounts of information. This data comes in huge volumes and in structured, textual, and unstructured forms, which has often led to cybercrimes such as cyber terrorism and cyberbullying; extracting information from it has therefore become a serious challenge for ensuring data safety. In this work, we propose a new supervised approach to Information Extraction (IE) from Web resources based on remote dynamic editing, called EIDED. The approach belongs to the family of IE approaches based on mask extraction and is built around three algorithms: (i) a labeling algorithm, (ii) a learning and inference algorithm, and (iii) an extended edit distance algorithm. It works even in the presence of anomalies in the tuples (missing attributes, multivalued attributes, and permuted attributes) and in the structure of web pages. Our experimental study, conducted on a standard database of web pages, shows how EIDED performs compared with approaches based on the classic edit distance, measured with the standard metrics of recall, precision, and F1-measure.
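
For orientation, the classic edit distance used by the baseline approaches and the three evaluation metrics can be sketched as follows. This is a minimal illustration under stated assumptions, not the extended edit distance of EIDED, whose details are not given here: the function names `edit_distance` and `precision_recall_f1` are hypothetical, unit costs are assumed for insertion, deletion, and substitution, and extracted tuples are assumed to be compared as sets of hashable values.

```python
from typing import Hashable, Sequence, Set, Tuple


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Classic (Levenshtein) edit distance with unit costs (assumed).

    Works on strings or on token sequences such as lists of HTML tags.
    """
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def precision_recall_f1(extracted: Set, expected: Set) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of an extracted set against a reference set."""
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With unit costs, a tuple that gains one extra attribute token differs from the learned pattern by a distance of 1, which is the kind of anomaly the classic distance penalizes uniformly.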