Abstract
Automatic key phrase extraction is the task of automatically selecting a set of phrases that describe the content of a simple sentence. A key phrase is said to be extracted when it appears verbatim in the sentence to which it is assigned. Accurate key phrase extraction is fundamental to the success of many recent digital library applications, clustering, and semantic information retrieval techniques. This paper discusses a support vector machines (SVMs) approach to Vietnamese key phrase extraction and presents a number of experiments in which performance is incrementally improved. In general, the Vietnamese key phrase extraction process consists of three steps: word segmentation to identify the lexical units in an input sentence, part-of-speech tagging of the words, and key phrase extraction over the resulting phrases. The performance of Vietnamese key phrase extraction systems is generally measured by the precision attained, which depends strongly on the nature and size of the training set of key phrases. Most results exceed 70.30% with a training set of 9,000 Vietnamese key phrases from 2,000 sentences selected from the corpus of the Vietnamese Lexicography Center (www.vietlex.com.vn).
I. INTRODUCTION
Key phrases, which can be single keywords or multiword key terms, are linguistic descriptors of documents. They are often sufficiently informative to give human readers a feel for the essential topics and main contents of the source documents. Key phrases have also been used as features in many text-related applications such as text clustering, document similarity analysis, and document summarization. Manually extracting key phrases from a large number of documents is expensive. Automatic key phrase extraction is a maturing technology that can serve as an efficient and practical alternative. Key phrase extraction may be viewed as a classification problem.
A document can be seen as a bag of phrases in which each phrase belongs to one of two classes: it is either a key phrase or a non-key phrase. We approach this problem from the perspective of machine learning research and treat it as a problem of supervised learning from examples. We divide our documents into two sets: training documents and testing documents. The training documents are used to tune the key phrase extraction algorithms in order to maximize their performance. That is, the training documents teach the supervised learning algorithms how to distinguish key phrases from non-key phrases. The testing documents are used to evaluate the tuned algorithms.
This work is motivated by the range of applications for key phrases. There are at least five general application areas: text summarization, human-readable indexing, interactive query refinement, machine-readable indexing, and feature extraction as preprocessing for further machine analysis.
SVMs are a notably successful machine learning method. Research applying this method has achieved good results and has proven more effective than research using other learning methods, especially on problems of natural language processing (3, 6, 7), pattern classification, and pattern recognition (8). In this paper, we apply SVMs to build a key phrase extraction system for Vietnamese text. The rest of the paper is organized as follows: Section 2 introduces the Support Vector Machines approach; Section 3 proposes a methodology for the Vietnamese key phrase extraction model; Section 4 evaluates our approach on many Vietnamese query sentences in different text styles; and Section 5 presents the conclusion.
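The classification framing in the introduction (every candidate phrase is either a key phrase or a non-key phrase, and quality is reported as precision) can be illustrated with a minimal sketch. It is not the system described in this paper: no SVM is trained, and the function names, the example sentence, the gold key phrases, and the use of underscores to join the syllables of a segmented word are all assumptions made for illustration. The `predicted` set stands in for the output of some trained classifier.

```python
from typing import List, Set, Tuple


def candidate_phrases(tokens: List[str], max_len: int = 3) -> List[str]:
    """Enumerate every contiguous run of 1..max_len tokens as a candidate phrase."""
    phrases = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            phrases.append(" ".join(tokens[start:end]))
    return phrases


def label_candidates(
    tokens: List[str], gold: Set[str], max_len: int = 3
) -> List[Tuple[str, int]]:
    """Label each candidate 1 (key phrase) or 0 (non-key phrase).

    A candidate counts as a key phrase only if it matches a gold phrase
    verbatim, mirroring the definition of extraction in the abstract.
    """
    return [(p, 1 if p in gold else 0) for p in candidate_phrases(tokens, max_len)]


def precision(predicted: Set[str], gold: Set[str]) -> float:
    """Fraction of predicted key phrases that are true key phrases."""
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)


# Hypothetical, already word-segmented sentence ("Hanoi is the capital of Vietnam").
tokens = ["Hà_Nội", "là", "thủ_đô", "của", "Việt_Nam"]
# Hypothetical gold annotation for this sentence.
gold = {tokens[0], " ".join(tokens[2:5])}

# Labeled examples of the kind a supervised learner would be trained on.
labeled = label_candidates(tokens, gold)

# Stand-in for a classifier's output: two correct phrases and one false positive.
predicted = set(gold) | {tokens[1]}

print(len(labeled), "candidates,", sum(label for _, label in labeled), "key phrases")
print("precision:", precision(predicted, gold))
```

In a full pipeline, each labeled candidate would be turned into a feature vector (for example from its part-of-speech tags) before being passed to the learner, and the training and testing documents would be kept separate so that precision is measured only on unseen sentences.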