This paper focuses on indexing mechanisms for unstructured clinical data held in large integrated data repositories. Clinical data is unstructured and heterogeneous, and it comes in many different files and formats, so accessing it efficiently and effectively is a critical challenge. Traditional indexing mechanisms are difficult to apply to unstructured data, especially when the index must capture correlations between clinical data elements. In this work, we developed a correlation-aware, relevance-based index that retrieves clinical data by efficiently fetching the most relevant cases. In our previous work, we designed a methodology that categorizes medical data based on the semantics of its data elements and merges the data into an integrated repository. We also developed a data integration system that combines heterogeneous medical data sources and gives different users access to knowledge-based database repositories. Building on that work, we designed an indexing system that uses semantic tags extracted from clinical data sources and medical ontologies to retrieve relevant data from these repositories and to speed up data retrieval. Our objective is to provide an integrated biomedical database repository that can be used by radiologists as a reference, for patient care, or by researchers. Specifically, we designed a technique that processes data for integration, learns the semantic properties of data elements, and builds a correlation-aware topic index that facilitates efficient data retrieval. We generated semantic tags by identifying key elements in integrated clinical cases using topic modeling techniques. We investigated a technique that identifies tags for merged categories and provides an index for fetching data from the integrated database repository. We also developed a topic coherence matrix that shows how well a topic is supported by a corpus drawn from clinical cases and medical ontologies.
Using the annotation index on the integrated database repository, we retrieved more relevant results, with a 61% increase in recall. We evaluated the results with the help of experts and compared them against a naive index (an index built from all terms in the corpus). Our approach improved retrieval quality by returning the most relevant results and reduced retrieval time, because the correlation-aware index is applied to an integrated data repository. The proposed topic indexing approach identifies tags based on correlations between data elements, improves data retrieval time, and returns the most relevant cases.
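The contrast between a naive term index and a tag-based index can be illustrated with a minimal sketch. This is not the system described above: the three cases, the hand-assigned tags, and the helper names `build_index`, `lookup`, and `recall` are all hypothetical, and the tags that the paper derives through topic modeling and medical ontologies are simply written out by hand here.

```python
from collections import defaultdict


def build_index(keyed_cases):
    """Map each key (a raw term or a semantic tag) to the case ids carrying it."""
    index = defaultdict(set)
    for case_id, keys in keyed_cases.items():
        for key in keys:
            index[key].add(case_id)
    return index


def lookup(index, query_keys):
    """Return every case id that matches at least one query key."""
    hits = set()
    for key in query_keys:
        hits |= index.get(key, set())
    return hits


def recall(retrieved, relevant):
    """Fraction of the relevant cases that were actually retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical toy corpus (not data from the paper).
cases = {
    "case1": "mri shows lesion in left lung",
    "case2": "ct scan reveals pulmonary nodule",
    "case3": "fracture of the left femur",
}

# Hypothetical tags, assigned by hand for illustration only.
tags = {
    "case1": {"lung", "lesion"},
    "case2": {"lung", "nodule"},
    "case3": {"femur", "fracture"},
}

# Naive index: every whitespace-separated term in the corpus is a key.
naive_index = build_index({cid: text.split() for cid, text in cases.items()})
# Tag index: only the semantic tags are keys.
tag_index = build_index(tags)

relevant = {"case1", "case2"}  # cases an expert would judge lung-related
naive_hits = lookup(naive_index, ["lung"])
tag_hits = lookup(tag_index, ["lung"])

print("naive recall:", recall(naive_hits, relevant))
print("tag recall:", recall(tag_hits, relevant))
```

In this toy setup the naive index misses the case that says "pulmonary" rather than "lung", while the tag index retrieves both lung-related cases. The sketch only shows the mechanism by which semantic tags can raise recall; it does not reproduce the reported evaluation.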