Abstract

Biomedical information is available to research and development scientists as unstructured text in the form of scientific manuscripts and reports published in the literature and elsewhere. Scientists focused on specific research programs are burdened with surveying vast numbers of publications and reports to acquire information relevant to their efforts. Employing technology as a research aid provides a mechanism to cope with the information overload that characterizes the R&D environment. Text mining can extract knowledge from large corpora of biomedical text and make it available to support scientific research and knowledge collections [1, 2], and intelligent PDF reader tools able to search content and find related articles [3] are available. However, such reader tools are typically desktop applications limited to specific platforms and data sources, so they cannot easily support the broad-based, integrated scientific search needs of a dispersed R&D organization with a wide variety of content needs. Our team has developed a web-browser-based document reader with a built-in exploration tool and automatic concept extraction from biomedical text content. This provides R&D scientists with a simple tool to aid in finding, reading, and exploring documents relevant to focused research objectives. The tool, Shangri-Docs, combines a document reader with automatic concept extraction and highlighting of relevant terms based on carefully selected ontologies combined with our custom corporate enterprise taxonomy. Shangri-Docs supports a wide variety of document formats (e.g., PDF, Word, PPT, text) and exploits the linked nature of the Web and personal content by searching content from public sites (e.g., Wikipedia, PubMed) and privately cataloged databases simultaneously.
Shangri-Docs incorporates Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and the Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts, such as specific pathology, disease, drug, and biological terms mentioned in the text. cTAKES was originally designed to extract information from clinical medical records. We have extended the cTAKES automatic knowledge extraction process to include the R&D biomedical research domain by improving the ontology-guided information extraction process. Shangri-Docs could be adapted to other science fields and further customized across our R&D scientific community via our open source, cloud-based data management system.

[1] Funk et al., BMC Bioinformatics 2014, 15:59. doi:10.1186/1471-2105-15-59
[2] Kang et al., BMC Bioinformatics 2014, 15:64. doi:10.1186/1471-2105-15-64
[3] Utopia Documents, http://utopiadocs.com
[4] Apache cTAKES, http://ctakes.apache.org

Citation Format: Chris Mattmann, Lauren Intagliata, Selina Chu, Garth McGrath, Giuseppe Totaro, Daniel Civello, David Ballard, Jeffrey Long, Nipurn Doshi, Shivika Thapar, Michael Livstone, Paul Ramirez, Maureen Cronin. Shangri-Docs: a browser based tool for document exploration and automatic knowledge extraction from unstructured biomedical text. [abstract]. In: Proceedings of the 107th Annual Meeting of the American Association for Cancer Research; 2016 Apr 16-20; New Orleans, LA. Philadelphia (PA): AACR; Cancer Res 2016;76(14 Suppl):Abstract nr 5283.