Abstract
Text mining is generally used to give structure to texts by extracting words and providing summary information about them. This can be improved by extracting commonly occurring phrases instead, which provide more specific information. For example, knowing that the word ‘valuable’ occurs 5 times in a text gives some information, but knowing that it is actually the phrase ‘not valuable’ that appears 5 times is much more informative. Extracting frequent phrases is not commonly done because of inherent complications, the most significant being double counting. This occurs when words or phrases are counted when they appear inside longer phrases that are themselves also counted, resulting in a large selection of mostly meaningless phrases that are frequent only because they occur inside frequent super-phrases. Several publications describe solutions to this issue; however, they either require a list of so-called quality phrases to be available to the extraction process, or they require human interaction to identify quality phrases during the process. In the context of a set of texts, we define a principal phrase as a phrase that does not cross punctuation marks, does not start with a stop word (with the exception of the stop words ‘not’ and ‘no’), does not end with a stop word, is frequent within those texts without being double counted, and is meaningful to the user. We present here a phrase mining method that eliminates double counting via a unique rectification process that does not require lists of quality phrases, is simple and, like text mining of words, is based entirely on frequency. It requires neither human effort to label phrases nor a general knowledge base, it is efficient, and it extracts principal phrases only. Phrases of any length in words, or any range of lengths, can be extracted, including those consisting of just one word.
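As a rough illustration of the boundary rules and of the double-counting problem described above, the following Python sketch counts candidate phrases naively. It is not the rectification process of the method, which the abstract does not detail. The stop-word list, the function names `candidate_phrases` and `naive_counts`, and the `max_words` parameter are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "it", "of", "and", "to", "was", "not", "no"}
# 'not' and 'no' are stop words that may nevertheless start a phrase.
ALLOWED_STARTS = {"not", "no"}


def candidate_phrases(text, max_words=3):
    """Yield word n-grams that respect the boundary rules stated in the abstract."""
    # Splitting on punctuation ensures that no phrase crosses a punctuation mark.
    for segment in re.split(r"[.,;:!?()\"]+", text.lower()):
        words = segment.split()
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                first, last = gram[0], gram[-1]
                # Must not start with a stop word, except 'not' and 'no'.
                if first in STOP_WORDS and first not in ALLOWED_STARTS:
                    continue
                # Must not end with a stop word (no exceptions).
                if last in STOP_WORDS:
                    continue
                yield " ".join(gram)


def naive_counts(texts, max_words=3):
    """Count every candidate phrase, with no correction for double counting."""
    counts = Counter()
    for text in texts:
        counts.update(candidate_phrases(text, max_words))
    return counts


texts = [
    "The report was not valuable.",
    "The data was not valuable, and the summary was not valuable.",
]
counts = naive_counts(texts)
# 'valuable' is counted once for every occurrence of 'not valuable',
# even though it never appears outside that longer phrase.
print(counts["valuable"], counts["not valuable"])
```

In this toy input the naive count for ‘valuable’ equals the count for ‘not valuable’, because every occurrence of the word sits inside the longer phrase. This is the kind of inflated count that a rectification step would need to remove.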
This is an automated method that does not require human input, thus allowing the extraction of phrases in any language, or from texts with unfamiliar or unusually constructed phrases. The recall of our method will be 100%: as long as a phrase is frequent, it will be mined. Precision, i.e. the proportion of mined phrases that are high-quality phrases, will not be 100%, since some frequent sequences of words may not constitute high-quality phrases. Whether a phrase is high quality is subjective, however, and a small selection of phrases deemed low quality can be excluded, thus improving the precision.
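The effect of excluding a few low-quality phrases on precision can be sketched with a small calculation. The phrases, the quality judgements and the `precision` helper below are hypothetical and serve only to show the arithmetic; they are not results from the method.

```python
def precision(mined, high_quality):
    """Proportion of mined phrases that are judged to be high quality."""
    mined = set(mined)
    if not mined:
        return 0.0
    return len(mined & set(high_quality)) / len(mined)


# Hypothetical output of a phrase miner and a hypothetical user judgement.
mined = {"not valuable", "customer service", "last week", "good morning"}
high_quality = {"not valuable", "customer service"}

before = precision(mined, high_quality)

# The user excludes one phrase deemed low quality.
excluded = {"last week"}
after = precision(mined - excluded, high_quality)

print(before, after)
```

Removing a low-quality phrase shrinks the denominator while leaving the number of high-quality phrases unchanged, so precision rises. Recall is unaffected as long as only low-quality phrases are excluded.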