Abstract

In light of the heightened problems of polysemy, synonymy, and hyponymy in clinical text, we hypothesize that patient cohort identification can be improved by using a large, in-domain clinical corpus for query expansion. We evaluate the utility of four auxiliary collections for the Text REtrieval Conference task of IR-based cohort retrieval, considering the effects of collection size, the inherent difficulty of a query, and the interaction between the collections. Each collection was applied to aid in cohort retrieval from the Pittsburgh NLP Repository by using a mixture of relevance models. Measured by mean average precision, performance using any auxiliary resource (MAP=0.386 and above) is shown to improve over the baseline query likelihood model (MAP=0.373). Considering subsets of the Mayo Clinic collection, we found that after including 2.5 billion term instances, retrieval is not improved by adding more instances. However, adding the Mayo Clinic collection did improve performance significantly over any existing setup, with a system using all four auxiliary collections obtaining the best results (MAP=0.4223). Because optimal results in the mixture of relevance models would require selective sampling of the collections, the common sense approach of “use all available data” is inappropriate. However, we found that it was still beneficial to add the Mayo corpus to any mixture of relevance models. On the task of IR-based cohort identification, query expansion with the Mayo Clinic corpus resulted in consistent and significant improvements. As such, any IR query expansion with access to a large clinical corpus could benefit from the additional resource. Additionally, we have shown that more data is not necessarily better, implying that there is value in collection curation.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.