Abstract

Retrieving information from large document databases has been a focus of scientific research in recent years. In this paper, a parallel algorithm for searching biomedical documents based on the MapReduce technique is presented. The algorithm consists of three phases: a preprocessing phase, a document representation phase, and a searching phase. In the first phase, lemmatization and elimination of stop words are performed. In the second phase, each document is represented as a list of pairs (word, tf-idf index of the word). The third phase is the main search procedure. It uses a specially designed ranking criterion, which is based on a combination of the term frequency-inverse document frequency (tf-idf) index and the indicator function for each query word. Four different versions of the ranking criterion are proposed and analyzed. The performance of the algorithm is tested on different subsets of the large and well-known PubMed biomedical document database. The experimental results indicate that the proposed parallel algorithm succeeds in finding high-quality results in a reasonable time. Compared with the sequential variant of the algorithm, the parallel algorithm is more efficient, since it finds solutions of high quality in significantly less time.
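The second and third phases can be illustrated with a minimal Python sketch. The abstract does not state the exact form of any of the four ranking criteria, so the scoring rule below is an assumption: it adds the tf-idf weight of each query word found in a document to a bonus `alpha` contributed by the indicator function. The names `tf_idf_representation`, `score`, `search`, and `alpha` are hypothetical, the tf-idf variant (relative term frequency times the natural logarithm of the inverse document frequency) is one common choice rather than the paper's confirmed definition, and the per-document scoring followed by a global sort only mimics the map and reduce steps on a single machine.

```python
import math
from collections import Counter


def tf_idf_representation(docs):
    """Represent each tokenized document as a list of (word, tf-idf) pairs.

    `docs` maps a document id to its list of preprocessed tokens
    (assumed to be already lemmatized and free of stop words).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in docs.values() for word in set(tokens))
    representation = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        representation[doc_id] = [
            (word, (count / len(tokens)) * math.log(n_docs / df[word]))
            for word, count in counts.items()
        ]
    return representation


def score(pairs, query_words, alpha=1.0):
    """Hypothetical ranking criterion (an assumption, not the paper's).

    For every distinct query word present in the document, add its
    tf-idf weight plus `alpha` times the indicator function (which is 1
    when the word occurs in the document and 0 otherwise).
    """
    weights = dict(pairs)
    total = 0.0
    for word in set(query_words):
        if word in weights:  # indicator function equals 1
            total += weights[word] + alpha
    return total


def search(representation, query_words, top_k=3):
    """Score every document independently, then keep the best `top_k`."""
    # "Map" step: each document is scored on its own.
    scored = [
        (doc_id, score(pairs, query_words))
        for doc_id, pairs in representation.items()
    ]
    # "Reduce" step: merge the scores into one ranked list.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

Calling `search(tf_idf_representation(docs), ["protein", "binding"])` on a small dictionary of tokenized documents returns the document ids ordered by this score. Because each document is scored without reference to the others, the scoring step is the part that could be distributed across workers.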
