Abstract
The abundance of information on most domains makes the Internet an unrivalled resource. Despite its usefulness, however, automating information extraction is difficult because online information lacks structure. Web pages delivered over the Hypertext Transfer Protocol (HTTP), the most commonly used information-sharing protocol, can embed a great deal of noise (advertisements, images, headers, menus, etc.) in a document that contains the useful information. Filtering this noise prior to information extraction is therefore necessary. Such filtering has many applications, including browsing on cell phones and Personal Digital Assistants (PDAs), speech rendering for visually impaired or blind people, and open source intelligence. In this paper, we describe a statistical model for filtering such noise from a document. The model analyses the text distribution and link density of every node in the Document Object Model (DOM) tree of an HTML page in order to detect the useful nodes. We validate the model with an experiment conducted during the implementation of an Early Warning System that supports open source intelligence. We also present the general workflow for converting unstructured online text about terrorists into a data structure that can be investigated through social network analysis, and discuss how our model fits into it.
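To make the idea of scoring DOM nodes by text distribution and link density concrete, the following is a minimal sketch in Python using only the standard library. It is not the statistical model of the paper, whose details the abstract does not give. It is a simple threshold heuristic, and the names (`Node`, `DomStats`, `extract_content`), the set of block-level tags, and the thresholds `min_chars` and `max_link_density` are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets; a real system would tune these.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}
BLOCK_TAGS = {"article", "blockquote", "div", "li", "p", "section", "td"}


class Node:
    """One element of a simplified DOM tree with text statistics."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.items = []       # raw text chunks and child nodes, in order
        self.text_chars = 0   # characters of text in this subtree
        self.link_chars = 0   # of those, characters inside <a> elements

    @property
    def children(self):
        return [item for item in self.items if isinstance(item, Node)]

    def raw_text(self):
        return "".join(i if isinstance(i, str) else i.raw_text()
                       for i in self.items)


class DomStats(HTMLParser):
    """Builds the tree and accumulates per-node text and link counts."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.anchor_depth = 0
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never contain text
        node = Node(tag, self.current)
        self.current.items.append(node)
        self.current = node
        self.anchor_depth += tag == "a"
        self.skip_depth += tag in SKIP_TAGS

    def handle_endtag(self, tag):
        # Find the nearest open element with this tag; ignore stray closes.
        target = self.current
        while target is not self.root and target.tag != tag:
            target = target.parent
        if target is self.root:
            return
        # Close every element up to and including the target.
        while True:
            closing = self.current
            self.anchor_depth -= closing.tag == "a"
            self.skip_depth -= closing.tag in SKIP_TAGS
            self.current = closing.parent
            if closing is target:
                break

    def handle_data(self, data):
        if self.skip_depth:
            return  # script and style bodies are not readable text
        self.current.items.append(data)
        # Count whitespace-normalised characters of this chunk.
        n = len(" ".join(data.split()))
        node = self.current
        while n and node is not None:
            node.text_chars += n
            if self.anchor_depth:
                node.link_chars += n
            node = node.parent


def link_density(node):
    return node.link_chars / node.text_chars if node.text_chars else 0.0


def content_nodes(node, min_chars, max_link_density):
    """Return the deepest block-level nodes that pass both thresholds."""
    found = []
    for child in node.children:
        found.extend(content_nodes(child, min_chars, max_link_density))
    if found:
        return found
    if (node.tag in BLOCK_TAGS and node.text_chars >= min_chars
            and link_density(node) <= max_link_density):
        return [node]
    return []


def extract_content(html, min_chars=40, max_link_density=0.3):
    parser = DomStats()
    parser.feed(html)
    parser.close()
    nodes = content_nodes(parser.root, min_chars, max_link_density)
    return [" ".join(n.raw_text().split()) for n in nodes]
```

Calling `extract_content(page_html)` returns the normalised text of each block that is long enough and not dominated by link text. In this sketch a navigation menu is rejected because nearly all of its characters sit inside anchors, while an article paragraph with an occasional link is kept.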