Combining statistical, structural, and linguistic features for keyword extraction from web pages

Himat Shah, Pasi Fränti

doi:10.3934/aci.2022007

Open Access

https://doi.org/10.3934/aci.2022007

Abstract

Keywords are commonly used to summarize text documents. In this paper, we perform a systematic comparison of methods for automatic keyword extraction from web pages. The methods are based on three different types of features: statistical, structural and linguistic. Statistical features are the most common, but there are other clues in web documents that can also be used. Structural features utilize styling codes like header tags and links, but also the structure of the web page. Linguistic features can be based on detecting synonyms, semantic similarity of the words and part-of-speech tagging, but also concept hierarchy or a concept graph derived from Wikipedia. We compare different types of features to find out the importance of each of them. One of the key results is that stop word removal and other pre-processing steps are the most critical. The most successful linguistic feature was a pre-constructed list of words that had no synonyms in WordNet. A new method called ACI‑rank is also compiled from the best working combination.
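As a rough illustration of how statistical and structural features might be combined after stop word removal, the sketch below scores each word by its term frequency plus a bonus for appearing inside prominent tags. It is not the ACI‑rank method described in the paper; the stop word list, the tag weights, and the names `WordCollector` and `extract_keywords` are all assumptions made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

# Hypothetical bonuses for words found inside structurally prominent tags.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 1.5, "a": 0.5}


class WordCollector(HTMLParser):
    """Collects per-word term frequency and a tag-based structural bonus."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.frequency = Counter()  # statistical feature: term frequency
        self.structure = Counter()  # structural feature: summed tag bonus

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Ignore script and style content, which is not visible page text.
        if "script" in self.open_tags or "style" in self.open_tags:
            return
        bonus = max((TAG_WEIGHTS.get(t, 0.0) for t in self.open_tags), default=0.0)
        for word in re.findall(r"[a-z]+", data.lower()):
            # Pre-processing: drop stop words and very short tokens.
            if word in STOP_WORDS or len(word) < 3:
                continue
            self.frequency[word] += 1
            self.structure[word] += bonus


def extract_keywords(html, top_k=5):
    """Return the top_k words ranked by frequency plus structural bonus."""
    collector = WordCollector()
    collector.feed(html)
    collector.close()
    scores = {w: collector.frequency[w] + collector.structure[w]
              for w in collector.frequency}
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

Linguistic features such as part-of-speech tags or a WordNet-based word list would enter as further terms in the score; they are left out here to keep the sketch free of external dependencies.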

Journal: Applied Computing and Intelligence	Publication Date: Jan 1, 2022
Citations: 2	License type: cc-by-nc-sa
