Abstract
Patent text is a mixture of legal terms and domain specific terms. In technical English text, a multi-word unit method is often deployed as a word formation strategy in order to expand the working vocabulary, i.e. introducing a new concept without the invention of an entirely new word. In this paper we explore query generation using natural language processing technologies in order to capture domain specific concepts represented as multi-word units. In this paper we examine a range of query generation methods using both linguistic and statistical information. We also propose a new method to identify domain specific terms from other more general phrases. We apply a machine learning approach using domain knowledge and corpus linguistic information in order to learn domain specific terms in relation to phrases' Termhood values. The experiments are conducted on the English part of the CLEF-IP 2013 test collection. The outcome of the experiments shows that the favoured method in terms of PRES and recall is when a language model is used and search terms are extracted with a part-of-speech tagger and a noun phrase chunker. With our proposed methods we improve each evaluation metric significantly compared to the existing state-of-the-art for the CLEP-IP 2013 test collection: for PRES@100 by 26% (0.544 from 0.433), for recall@100 by 17% (0.631 from 0.540) and on document MAP by 57% (0.300 from 0.191).
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.