Abstract

Classifying web documents is considered as one of the most important tasks to reveal the terrorism-related documents. Internet provides a lot of valuable information to the users and the amount of web contents is progressively increasing. This makes it very difficult to identify potentially dangerous documents. Simply extracting keywords from documents is not enough to classify the contents. To build automated document classification systems, many techniques have been studied so far, but they are mostly statistical and knowledge-based approaches. These methods, however, do not yield satisfactory results because of complexity of natural languages. To overcome this deficiency, we propose a method to use word similarity based on WordNet hierarchy and n-gram data frequency. This method was tested with the sampled New York Times articles by querying four distinct words from four different areas. Experimental results show our proposed method effectively extracts context words from the text and identifies terrorism-related documents.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.