Abstract

The volume of scientific papers published annually in the biomedical domain is continuously increasing. Streamlining the process of identifying the most critical and significant nuggets of information (such as hypotheses, observations, interventions and findings) in a given research publication is a challenging but worthwhile task. This essential information, known as scientific artefacts, underpins the knowledge used by many health professionals in the decision-making process and by researchers in creating systematic reviews; however, most of today’s search engines are unable to identify these artefacts. Evidence Based Medicine (EBM) is a framework for decision-making in the healthcare domain, based on providing medical practitioners with the best available evidence so they can choose the optimum treatment for individual patients. In order to provide patients with the best treatment, health professionals need access to current, timely and reliable evidence retrieved from relevant published medical research or previously synthesised evidence. Hence, devising mechanisms that can automatically identify, retrieve, consolidate and present scientific artefacts, based on a given query, has the potential to greatly facilitate collating related evidence and ultimately streamline medical decision-making. This thesis represents an attempt to define a comprehensive framework for acquiring and managing scientific artefacts in the EBM domain by transforming unstructured publications into structured, consolidated, pertinent knowledge. There have been previous attempts to model such information (e.g., supporting and contradicting statements); however, these approaches have primarily focused on providing users with conceptual high-level frameworks and associated manual annotation services.
The approach proposed in this thesis employs novel sets of low-level features to uniquely identify key scientific information in EBM and to enable knowledge extraction and retrieval. This will also lead to the automatic creation of networks of scientific artefacts, and eventually to the detection of effects across diverse artefacts (e.g., new potential drug treatments). This goal will be attained by first modelling and extracting scientific artefacts from publications (more specifically, abstracts) and then consolidating and linking them using Linked Data approaches. The first step in pinpointing the best evidence in the published research is to formulate clinical queries and their answers. Hence, a comprehensive and fine-grained model is essential to formulate the key factors of evidence-based decision-making according to various medical cases. The Problem/Population, Intervention, Comparison, and Outcome (PICO) framework is a specialised model for framing and answering a clinical or healthcare-related question. An extension of PICO formalises this fundamental information by classifying it into six classes: Population, Intervention, Background, Outcome, Study Design, and Other (called the PIBOSO model). The PIBOSO model is used as the underlying model throughout this thesis for defining the scientific artefacts in publications. Once modelled, the challenge then shifts towards automatically recognising such scientific artefacts within a published abstract and detecting similar occurrences across multiple abstracts. Machine Learning techniques have been widely applied in this context, especially since the recognition task can be formulated as a sentence classification task and can therefore be addressed using classification techniques. This thesis presents a scientific artefact classifier that is trained on a novel set of discriminative features. The results indicate that this approach represents a marked improvement compared to the state of the art.
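
To make the sentence-classification framing concrete, the following is a minimal sketch only. It assumes a toy bag-of-words Naive Bayes model with a coarse sentence-position feature; these are illustrative stand-ins, not the thesis's feature set or classifier. The names `features`, `NaiveBayes`, `TOY_ABSTRACT` and `train_toy_classifier` are hypothetical, and the labelled sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# The six PIBOSO classes named in the abstract.
PIBOSO_LABELS = ["Population", "Intervention", "Background",
                 "Outcome", "Study Design", "Other"]


def features(sentence, index, total):
    """Toy features (an assumption): lowercase tokens plus a position bucket."""
    tokens = [t.strip(".,;:()").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]
    bucket = min(2, int(3 * index / max(total, 1)))  # 0=start, 1=middle, 2=end
    return tokens + [f"__pos{bucket}"]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (a stand-in model)."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        for feats, label in examples:
            self.label_counts[label] += 1
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for f in feats:
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented, hand-labelled sentences standing in for one annotated abstract.
TOY_ABSTRACT = [
    ("Hypertension is a major risk factor for stroke.", "Background"),
    ("We conducted a randomised controlled trial.", "Study Design"),
    ("Participants were 120 adults with hypertension.", "Population"),
    ("Patients received a daily dose of the study drug.", "Intervention"),
    ("Blood pressure was significantly reduced at 12 weeks.", "Outcome"),
]


def train_toy_classifier():
    total = len(TOY_ABSTRACT)
    examples = [(features(s, i, total), label)
                for i, (s, label) in enumerate(TOY_ABSTRACT)]
    return NaiveBayes().fit(examples)
```

Calling `train_toy_classifier().predict(features(sentence, index, total))` returns one PIBOSO label for a sentence at a given position in an abstract. A realistic system would need a large annotated corpus and far richer features than this sketch uses.
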
In order to find related scientific artefacts (or evidence) extracted from a large number of published abstracts and then consolidate those that are conceptually similar, this thesis proposes an improved semantic similarity quantification approach. A unique set of similarity measures, which examine the similarity of sentences in terms of different syntactic, structural, and semantic features, is presented and then used to train an ensemble regressor. This ensemble can accurately predict the semantic similarity of both generic and domain-specific English sentences. The quantified similarities of scientific artefacts are then employed to consolidate and link those that are highly similar. The resulting Knowledge Base comprises a network of semantically related scientific artefacts, abstracts and publications. The holistic framework described in this thesis has the potential to transform large corpora of unstructured text into an enriched, consolidated and linked network of scientific artefacts. The resulting Knowledge Base (which will evolve, improve and expand over time) enables one to quickly gain a broad and deep understanding of the current state of evidence (PIBOSO scientific artefacts) related to a medical topic.
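
The similarity-then-linking idea can be sketched as follows, under loud assumptions: the three measures (token overlap, character-level ratio, length ratio) are simple stand-ins for the thesis's syntactic, structural and semantic measures; an unweighted mean replaces the trained ensemble regressor; and the `0.7` threshold is arbitrary. All function names here are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def tokens(sentence):
    """Lowercased token set with surrounding punctuation stripped."""
    return {t.strip(".,;:()").lower() for t in sentence.split()} - {""}


def jaccard(a, b):
    """Token-overlap measure."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def char_ratio(a, b):
    """Character-level measure from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def length_ratio(a, b):
    """Crude structural measure: ratio of word counts."""
    la, lb = len(a.split()), len(b.split())
    return min(la, lb) / max(la, lb) if max(la, lb) else 0.0


def similarity(a, b):
    """Unweighted mean of the measures (stand-in for a trained ensemble)."""
    scores = [jaccard(a, b), char_ratio(a, b), length_ratio(a, b)]
    return sum(scores) / len(scores)


def link_similar(sentences, threshold=0.7):
    """Group sentence indices whose pairwise similarity meets the threshold."""
    parent = list(range(len(sentences)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(sentences)), 2):
        if similarity(sentences[i], sentences[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(sentences)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each group returned by `link_similar` is a candidate set of artefacts to consolidate and link. In the approach the abstract describes, the score would come from a regressor trained on the similarity measures rather than from a fixed average.
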

