Abstract

In text mining of documents from a specific area, especially when generating a map or a summary of concepts or terms, the quality of the keywords strongly affects the results of the analysis. A list of technical terms can serve as keyword candidates, and such terms can be recognized in a corpus automatically using a scoring method based on statistics of compound nouns. However, because word fragments and meaningless strings are also included among those term candidates, further selection is necessary. For this further selection, we consider a method that retains the terms common to two groups of terms extracted from two independent corpora of the same area. For the experimental selection of terms, three target areas are specified: livestock raising, fruit farming, and vegetable gardening. For each area, two groups of documents are collected. Term candidates are extracted from each corpus using the scoring method based on statistics of compound nouns, and the terms that appear in both groups are retained. After this selection procedure, the proportion of unsuitable terms is lower. From an efficiency viewpoint, the procedure improves term selection. In addition, the procedure has the advantage of being independent of the subjective decisions involved in manual selection.
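
The overlap-based selection described above can be sketched as a set intersection over two scored candidate lists. This is only a minimal illustration, not the authors' implementation: the function name `overlapping_terms`, the `min_score` parameter, and the sample terms and scores are all hypothetical, and the compound-noun scoring step is assumed to have been run already on each corpus.

```python
def overlapping_terms(scores_a, scores_b, min_score=0.0):
    """Return the terms that are candidates in both corpora.

    scores_a, scores_b: dicts mapping a candidate term to its score,
    assumed to come from some compound-noun scoring step (not shown).
    min_score: hypothetical cutoff applied to each corpus separately.
    """
    kept_a = {term for term, score in scores_a.items() if score >= min_score}
    kept_b = {term for term, score in scores_b.items() if score >= min_score}
    common = kept_a & kept_b
    # Rank by the weaker of the two scores, then alphabetically for stability.
    return sorted(common, key=lambda t: (-min(scores_a[t], scores_b[t]), t))


# Made-up candidates for illustration; "ttle" and "feed effi" stand in for
# the word fragments that a single corpus can produce.
corpus_a = {"dairy cattle": 12.0, "feed efficiency": 8.5, "ttle": 3.1, "milk yield": 6.0}
corpus_b = {"dairy cattle": 9.0, "milk yield": 7.2, "feed effi": 2.4, "grazing": 4.0}

selected = overlapping_terms(corpus_a, corpus_b)
print(selected)
```

In this toy input the fragments occur in only one corpus each, so the intersection drops them without any manual judgment. A valid term that happens to occur in only one corpus ("grazing", "feed efficiency") is dropped as well, which is the trade-off of this kind of filter.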
