Abstract

The C/NC-value method combines linguistic and statistical information for the automatic extraction of multi-word technical terms from machine-readable corpora. C-value enhances the commonly used statistical measure of frequency of occurrence, making it sensitive to a particular type of multi-word term, the nested term. Nested terms are those that also occur as substrings of other terms. Consider the term mitogen-activated protein kinase: valid substrings of the longer term, e.g. mitogen-activated protein and protein kinase, are also extracted. NC-value incorporates contextual information, both statistical (weights) and linguistic, to improve on the performance of C-value. Deeper forms of contextual information (semantic knowledge) have also been used. The measures have been applied to medical corpora in English and have also been adapted for Japanese.

We applied C/NC-value to a collection of 2000 abstracts from MEDLINE in the domain of nuclear receptors. Before applying the first measure (C-value), we tagged the corpus using the MXPOST tagger, freely available from the University of Pennsylvania. The tagger was not trained on our corpus. The word forms were stemmed using Porter’s stemming algorithm; stemming was used to improve the results of the statistical measures. The same corpus was also run through a different tagger based on constraint-based grammars, ENGCG. We used a domain-specific stop list, produced after the first results of C/NC-value, containing single words (e.g. dramatic, data) and multi-word units (e.g. brief review). The following linguistic filters were applied, based on commonly used term formation patterns:
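
As an illustration of the nested-term idea described earlier in the abstract, the following minimal Python sketch scores candidate terms so that a term's frequency is discounted by the occurrences already explained by longer candidates containing it. The abstract does not state the formula; the one used here (log2 of term length times adjusted frequency) follows the commonly cited definition of C-value and should be read as an assumption. The function names `sub_ngrams` and `c_value` are hypothetical, and the counts in `toy_freq` are made up for illustration, not taken from the MEDLINE experiment.

```python
import math
from collections import defaultdict


def sub_ngrams(words):
    """Return every contiguous sub-sequence of 2..len(words)-1 words."""
    n = len(words)
    return {
        tuple(words[start:start + length])
        for length in range(2, n)
        for start in range(n - length + 1)
    }


def c_value(freq):
    """Score candidate terms; freq maps word tuples to corpus frequency."""
    nested_freq = defaultdict(int)   # summed frequency of longer candidates containing a term
    nested_count = defaultdict(int)  # number of distinct longer candidates containing a term
    for longer, f_longer in freq.items():
        for sub in sub_ngrams(longer):
            if sub in freq:
                nested_freq[sub] += f_longer
                nested_count[sub] += 1

    scores = {}
    for term, f in freq.items():
        if nested_count[term]:
            # Nested term: discount occurrences explained by longer candidates.
            f = f - nested_freq[term] / nested_count[term]
        scores[term] = math.log2(len(term)) * f
    return scores


# Made-up counts, for illustration only.
toy_freq = {
    ("mitogen-activated", "protein", "kinase"): 5,
    ("mitogen-activated", "protein"): 6,
    ("protein", "kinase"): 12,
}
scores = c_value(toy_freq)
```

With these toy counts, protein kinase keeps most of its score because it occurs often outside the longer term, while mitogen-activated protein is scored low because nearly all of its occurrences fall inside mitogen-activated protein kinase.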
