Abstract

Keyphrase extraction is the process of identifying the set of words or phrases that best describe a document. The phrases may be drawn from the document's own words, or they may be external, taken from an ontology for a given domain. Extracting keyphrases from documents is critical for many applications, such as information retrieval, document summarization, and clustering. Many keyphrase extractors treat the task as a classification problem and therefore need training documents (i.e., documents whose keyphrases are known in advance). Other systems treat keyphrase extraction as a ranking problem: the words or phrases of a document are ranked by importance, and the highest-ranked phrases (those at the top of the list) are recommended as possible keyphrases for the document. This abstract describes Shihab, a system for extracting keyphrases from Arabic documents. Shihab treats keyphrase extraction as a ranking problem. The list of keyphrases is generated by clustering the phrases of a document. Phrases are built from words that appear in the document and consist of one, two, or three words. The idea is to group similar phrases into one cluster. The similarity between two phrases is determined by calculating the Dice coefficient of their contexts, where a phrase's context is the sentence in which the phrase appears. Agglomerative hierarchical clustering is used in the clustering phase. Once the clusters are formed, each cluster nominates one phrase to the set of candidate keyphrases. This phrase is called the cluster representative and is chosen according to a set of heuristics. Shihab's results were compared with those of existing keyphrase extractors such as KP-Miner and Arabic-KEA, and the results were encouraging.
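
The pipeline described in the abstract (phrases of one to three words, Dice similarity over sentence contexts, agglomerative clustering) can be illustrated with a minimal Python sketch. This is not Shihab's implementation. Several details are assumptions made only for illustration: tokens are obtained by whitespace splitting with no Arabic-specific preprocessing, a phrase's context is taken to be the set of words in every sentence that contains it, the Dice coefficient is computed as 2|A ∩ B| / (|A| + |B|) over those word sets, clusters are merged by average linkage, and merging stops at a hypothetical similarity threshold. The heuristics that pick each cluster representative are not stated in the abstract, so that step is left out.

```python
from itertools import combinations


def ngrams(tokens, max_n=3):
    """Yield every phrase of 1 to max_n consecutive words."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def phrase_contexts(sentences, max_n=3):
    """Map each phrase to the set of words in the sentences containing it.

    Assumption: whitespace tokenization, and the context is treated as a
    bag-free word set pooled over all sentences where the phrase occurs.
    """
    contexts = {}
    for sentence in sentences:
        tokens = sentence.split()
        words = set(tokens)
        for phrase in ngrams(tokens, max_n):
            contexts.setdefault(phrase, set()).update(words)
    return contexts


def dice(a, b):
    """Dice coefficient of two sets: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def cluster(contexts, threshold=0.5):
    """Agglomerative clustering of phrases by context similarity.

    Assumptions: average linkage, and merging stops once the most similar
    pair of clusters falls below `threshold` (a hypothetical parameter).
    """
    clusters = [[phrase] for phrase in sorted(contexts)]

    def linkage(c1, c2):
        sims = [dice(contexts[p], contexts[q]) for p in c1 for q in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the most similar pair of clusters (i < j always holds here).
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Calling `cluster(phrase_contexts(sentences))` on a list of sentence strings returns a list of phrase groups. A real system would then apply its own heuristics to nominate one representative phrase per group.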
