Abstract

Purpose: Pathology reports are the primary source of information concerning the millions of cancer cases across the United States. Cancer registries manually process pathology reports to extract pertinent information, including primary tumor site, behavior, histology, laterality, and grade. Processing a large volume of pathology reports in a timely manner is a continuing challenge for cancer registries. The purpose of this study is to develop an information extraction pipeline that reliably and efficiently extracts reportable information.

Method: We have developed a novel inverse-regression (IR) based information extraction pipeline. Inverse-regression based supervised filters have been successfully applied in many application domains. However, their application to information extraction from unstructured text is hindered primarily by the extremely high dimensionality of n-gram representations of text. In this study, we attempt to overcome these obstacles with a novel bootstrapping strategy. First, we use a mutual-information based filter to discard excessive and redundant n-gram features. This step reduces the size and improves the condition number of the sample covariance matrix, thus reducing the computational cost and improving the numerical stability of the subsequent inverse-regression step. Then we use localized sliced inverse regression (LSIR) to learn a low-dimensional discriminatory subspace for information inference. In particular, we use the k-nearest neighbors of an unlabeled pathology report in the learned representation to infer the desired information from the labeled data in a supervised manner.

Results: The experiments were conducted on a set of de-identified pathology reports with human expert labels as the ground truth. Our pipeline consistently performed better than or comparably to the best-performing state-of-the-art methods while substantially reducing training and inference times.

Conclusion: Our results demonstrate the potential of an inverse-regression based information extraction pipeline for reliable and efficient information extraction from unstructured text. The information extracted from pathology reports can be used along with clinical information, medical imaging, and genomic information to instigate discoveries in cancer research.
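To make the filtering and inference steps of the Method concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It scores word n-grams by mutual information with the label, keeps the top-scoring ones, and labels a new report by a k-nearest-neighbor vote. Several things here are assumptions made for illustration: the function names, the binary presence/absence features, the Hamming distance, and the four toy reports (which are invented, not data from the study). The LSIR projection is omitted, so the neighbors are found directly in the filtered feature space instead of in the learned low-dimensional subspace.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) present in a text."""
    tokens = text.lower().split()
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams

def mutual_information(docs, labels, gram):
    """Mutual information (in nats) between presence of `gram` and the label."""
    n = len(docs)
    joint = Counter((gram in d, y) for d, y in zip(docs, labels))
    px = Counter(gram in d for d in docs)
    py = Counter(labels)
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed (x, y) pairs.
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def select_features(docs, labels, k):
    """Keep the k n-grams with the highest mutual information (ties: alphabetical)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda g: (-mutual_information(docs, labels, g), g))
    return ranked[:k]

def knn_predict(train_docs, train_labels, query, features, k=3):
    """Majority vote among the k training reports closest in Hamming distance."""
    q = [g in query for g in features]
    dists = []
    for d, y in zip(train_docs, train_labels):
        v = [g in d for g in features]
        dists.append((sum(a != b for a, b in zip(q, v)), y))
    dists.sort(key=lambda t: t[0])  # stable sort: ties keep training order
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy reports and primary-site labels, invented for this sketch.
reports = [
    "invasive ductal carcinoma left breast",
    "ductal carcinoma in situ right breast",
    "adenocarcinoma of the left lung upper lobe",
    "squamous cell carcinoma right lung",
]
sites = ["breast", "breast", "lung", "lung"]

train = [ngrams(r) for r in reports]
features = select_features(train, sites, k=4)
query = ngrams("invasive ductal carcinoma of the right breast")
pred = knn_predict(train, sites, query, features, k=3)
print(features)
print(pred)  # breast
```

In this toy setting, n-grams such as "left" that occur equally often in both classes receive a mutual information of zero and are discarded, which is the sense in which the filter removes uninformative features before any covariance-based step.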
