Concept-based Representation Research Articles

In this thesis, we present models for semantic search: Information Retrieval (IR) models that elicit the meaning behind the words found in documents and queries rather than simply matching keywords. This is achieved by integrating structured domain knowledge with data-driven information retrieval methods. The research is set within health informatics to tackle the unique challenges of this domain; specifically, how to bridge the 'semantic gap', that is, how to overcome the mismatch between raw medical data and the way human beings interpret it. Bridging the semantic gap involves addressing two issues: semantics, that is, aligning the meaning or concepts behind words found in documents and queries; and inference, which utilises semantics to infer relevant information. Three semantic search models, all utilising concept-based rather than term-based representations, are developed: the Bag-of-concepts model, which utilises concepts from the SNOMED CT medical ontology as its underlying representation; the Graph-based Concept Weighting model, which captures concept dependence and importance in a novel weighting function; and the core contribution of the thesis, the Graph INference model (GIN), a unified theoretical model of semantic search as inference, achieved by integrating structured domain knowledge (ontologies) with statistical information retrieval methods. It is the GIN that provides the necessary mechanism for inference to bridge the semantic gap. All three models are empirically evaluated using clinical queries and a real-world collection of clinical records taken from the TREC Medical Records Track (MedTrack). Our evaluation shows that the use of concept-based representations in the Bag-of-concepts model leads to improved retrieval effectiveness. When concepts are combined within the Graph-based Concept Weighting model, further improvements are possible.
The evaluation of the GIN highlighted that its inference mechanism is suited to hard queries, namely those that perform poorly on a term-based system. In-depth analysis also revealed that the GIN returned many new documents that were not retrieved by term-based systems and were therefore never judged for relevance as part of the TREC MedTrack. This indicates that current IR test collections, to whose pools semantic search systems did not contribute, may underestimate the effectiveness of semantic search systems. This work represents a significant step forward in the integration of structured domain knowledge and data-driven information retrieval methods. Furthermore, the thesis provides an understanding of inference: when and how it should be applied for effective semantic search. It shows that queries with certain characteristics benefit from inference, while others do not. The detailed investigation into the evaluation of semantic search systems shows how current IR test collections may underestimate the effectiveness of such systems, and new techniques for evaluation are suggested. The Graph Inference model, although developed within the medical domain, is generally defined and has implications in other areas, including web search, where an emerging research trend is to utilise structured knowledge resources for more effective semantic search.
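
To make the difference between term-based and concept-based representations concrete, here is a minimal sketch of the bag-of-concepts idea. It is not the thesis's implementation: the lexicon, the concept IDs (which are invented and are not real SNOMED CT identifiers), and the helper names `to_terms`, `to_concepts`, and `overlap` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical phrase-to-concept lexicon. The IDs are invented for this
# sketch and are NOT real SNOMED CT identifiers.
TOY_LEXICON = {
    "heart attack": "C001",
    "myocardial infarction": "C001",
    "high blood pressure": "C002",
    "hypertension": "C002",
}


def to_terms(text):
    """Term-based representation: a bag of lower-cased whitespace tokens."""
    return Counter(text.lower().split())


def to_concepts(text, lexicon=TOY_LEXICON):
    """Concept-based representation: a bag of concept IDs.

    Uses naive substring lookup; a real system would use a proper
    concept extraction tool instead.
    """
    lowered = text.lower()
    bag = Counter()
    for phrase, concept_id in lexicon.items():
        occurrences = lowered.count(phrase)
        if occurrences:
            bag[concept_id] += occurrences
    return bag


def overlap(query_bag, doc_bag):
    """Count shared items between two bags (multiset intersection size)."""
    return sum((query_bag & doc_bag).values())


query = "heart attack"
document = "Patient admitted with myocardial infarction"

term_score = overlap(to_terms(query), to_terms(document))
concept_score = overlap(to_concepts(query), to_concepts(document))
```

In this toy case the query and the document share no terms, but both map to the same concept, so only the concept-based representation finds a match.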

In this thesis we investigate the possibility of integrating domain-specific knowledge into biomedical information retrieval (IR). Recent decades have shown a fast-growing interest in biomedical research, reflected by an exponential growth in scientific literature. An important problem for biomedical IR is dealing with the complex and inconsistent terminology encountered in biomedical publications. Dealing with the terminology problem requires domain knowledge stored in terminological resources: controlled indexing vocabularies and thesauri. The integration of this knowledge is, however, far from trivial. The first research theme investigates heuristics for obtaining word-based representations from biomedical text for robust retrieval. We investigated the effect of choices in document preprocessing heuristics on retrieval effectiveness. Document preprocessing heuristics such as stop word removal, stemming, and breakpoint identification and normalization were shown to strongly affect retrieval performance. An effective combination of heuristics was identified to obtain a word-based representation from text for the remainder of this thesis. The second research theme deals with concept-based retrieval. We compared a word-based to a concept-based representation and determined to what extent a manual concept-based representation can be automatically obtained from text. Retrieval based on concepts alone was demonstrated to be significantly less effective than word-based retrieval. This deteriorated performance could be explained by errors in the classification process, limitations of the concept vocabularies and limited exhaustiveness of the concept-based document representations. Retrieval based on a combination of word-based and automatically obtained concept-based query representations significantly improved on word-only retrieval. In the third and last research theme we propose a cross-lingual framework for monolingual biomedical IR.
In this framework, the integration of a concept-based representation is viewed as a cross-lingual matching problem involving a word-based and a concept-based representation language. This framework gives us the opportunity to adopt a large set of established cross-lingual information retrieval (CLIR) methods and techniques for this domain. Experiments with basic term-to-term translation models demonstrate that this approach can significantly improve word-based retrieval. Directions for future work are using these concepts for communication between user and retrieval system, extending the translation models, and extending CLIR-enhanced concept-based retrieval outside the biomedical domain. Available online from http://purl.utwente.nl/publications/72481.
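
The cross-lingual view described above can be sketched as translating a word-based query into a weighted concept-based query. This is an illustrative sketch only, not the thesis's model: the translation table, its probabilities, the concept labels, and the function names `translate_query` and `score` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical term-to-term translation table: for each query word, the
# assumed probability of each concept. All values are invented.
TRANSLATION = {
    "tumor": {"CONCEPT_NEOPLASM": 0.9, "CONCEPT_SWELLING": 0.1},
    "cancer": {"CONCEPT_NEOPLASM": 1.0},
}


def translate_query(words, table=TRANSLATION):
    """Turn a word query into a weighted concept query.

    Each word contributes its translation probabilities; words missing
    from the table contribute nothing.
    """
    weights = defaultdict(float)
    for word in words:
        for concept, probability in table.get(word, {}).items():
            weights[concept] += probability
    return dict(weights)


def score(concept_query, doc_concepts):
    """Sum the query weights of the concepts assigned to a document."""
    return sum(
        weight
        for concept, weight in concept_query.items()
        if concept in doc_concepts
    )


concept_query = translate_query(["tumor", "cancer", "unknownword"])
doc_score = score(concept_query, {"CONCEPT_NEOPLASM"})
```

A document indexed only with concepts can then be matched against a query written in words, which is the sense in which the two representations behave like two languages.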

Articles published on Concept-based Representation

Mapping the Landscape of Impostor Phenomenon Research in Organizational Behavior: A Bibliometric Study between 2003 and 2022

Semantic Recovery of Traceability Links between System Artifacts

Predicting software defect type using concept-based classification

Unreported links between trial registrations and published articles were identified using document similarity measures in a cross-sectional analysis of ClinicalTrials.gov

Counting trees in Random Forests: Predicting symptom severity in psychiatric intake reports

Wikipedia-based cross-language text classification

Concept-based item representations for a cross-lingual content-based recommendation process

Graph-based Methods for Significant Concept Selection

Semantic Search as Inference

Semantic grounding of social annotations for enhancing resource classification in folksonomies

Mining Conceptual Relations from Textual Web Content Using Leximancer

The uncertain representation ranking framework for concept-based video retrieval

Proof of concept

Opinion comparison between internet forums and customer reviews

A pilot study of contextual UMLS indexing to improve the precision of concept-based representation in XML-structured clinical radiology reports

Consistency-driven knowledge elicitation: using a learning-oriented knowledge representation that supports knowledge elicitation in NeoDISCIPLE
