Abstract
Background: Phenotypic information locked away in unstructured narrative text presents significant barriers to information accessibility, both for clinical practitioners and for computerised applications used for clinical research purposes. Text mining (TM) techniques have previously been applied successfully to extract different types of information from text in the biomedical domain. They have the potential to be extended to allow the extraction of information relating to phenotypes from free text.
Methods: To stimulate the development of TM systems that are able to extract phenotypic information from text, we have created a new corpus (PhenoCHF) that is annotated by domain experts with several types of phenotypic information relating to congestive heart failure. To ensure that systems developed using the corpus are robust to multiple text types, it integrates text from heterogeneous sources, i.e., electronic health records (EHRs) and scientific articles from the literature. We have developed several different phenotype extraction methods to demonstrate the utility of the corpus, and tested these methods on a further corpus, i.e., ShARe/CLEF 2013.
Results: Evaluation of our automated methods showed that PhenoCHF can facilitate the training of reliable phenotype extraction systems, which are robust to variations in text type. These results have been reinforced by evaluating our trained systems on the ShARe/CLEF corpus, which contains clinical records of various types. Like other studies within the biomedical domain, we found that solutions based on conditional random fields produced the best results, when coupled with a rich feature set.
Conclusions: PhenoCHF is the first annotated corpus aimed at encoding detailed phenotypic information. The unique heterogeneous composition of the corpus has been shown to be advantageous in the training of systems that can accurately extract phenotypic information from a range of different text types.
Although the scope of our annotation is currently limited to a single disease, the promising results achieved can stimulate further work into the extraction of phenotypic information for other diseases. The PhenoCHF annotation guidelines and annotations are publicly available at https://code.google.com/p/phenochf-corpus.
Highlights
Phenotypic information locked away in unstructured narrative text presents significant barriers to information accessibility, both for clinical practitioners and for computerised applications used for clinical research purposes.
PhenoCHF is the first annotated corpus aimed at encoding detailed phenotypic information.
In this article, we have described our work towards facilitating the development of robust text mining (TM) systems that can extract information relating to phenotypes from a range of different text types.
Summary
Phenotypic information locked away in unstructured narrative text presents significant barriers to information accessibility, both for clinical practitioners and for computerised applications used for clinical research purposes. The study of disease-phenotype relationships has been hampered by the scarcity of suitable large-scale, machine-readable knowledge bases. Existing resources, such as the Online Mendelian Inheritance in Man (OMIM) [3] and the Human Phenotype Ontology (HPO) [4], are manually constructed, making them difficult to update and maintain. They could be enriched by exploiting the vast amounts of phenotypic information available in various textual sources, including the ever-growing volumes of published biomedical literature and patient EHRs, which have proliferated with the consolidation of modern healthcare systems [5]. EHRs contain both structured/coded and unstructured information pertaining to patient morbidity, treatment and care over time [6], which can enhance understanding of disease etiology and phenotype-genotype associations [7]. There is an urgent need to develop TM methods that can automate the extraction and integration of vital phenotypic information hidden in narrative text, to help derive information about disease correlations and support clinical decisions [10].