Abstract
Scientific documents and magazines contain a large number of mathematical expressions and formulae alongside text. The continuous growth of such documents calls for specialized tools and techniques that can handle and analyse mathematical expressions and formulae. Mathematical expressions are highly structured and quite different from ordinary text, so conventional text retrieval systems perform poorly when retrieving scientific documents for a query formulated as a mathematical expression. Mathematical information retrieval is concerned with finding information in documents that include mathematics. To address the challenges that mathematical formulae pose compared with text, this paper aims to construct a math-aware search engine that retrieves relevant scientific documents for a mathematical query. It proposes a novel signature-based hashing scheme for indexing raw mathematical web documents, which also takes mathematical notational equivalences into account. The proposed system demonstrates better precision and stability of the ranked results than other related state-of-the-art math-aware search engines.
Highlights
Mathematics is a very important constituent in the domain of Science, Technology, Engineering and Mathematics (STEM)
The field of information retrieval (IR) has been exhaustively explored for many decades but a distinct focus is required for Mathematical Information Retrieval (MIR) because conventional text retrieval systems are not suitable for retrieving mathematical expressions [3,4]
In an attempt to craft a better retrieval model for MIR systems, we theorized that a signature-based hashed indexing scheme would be a better alternative to tree-based or text-based models
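The idea of a signature-based hashed index can be illustrated with a minimal Python sketch. This is not the scheme proposed in the paper, whose details are not given here. It assumes that a signature is a hash of the token sequence after variables are renamed in order of first appearance, so that expressions differing only in variable names collide on purpose. The names TOKEN, signature, build_index and search are hypothetical, as is the rule that every single letter is a variable.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical tokenizer: LaTeX commands, single letters, numbers, other symbols.
TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z]|\d+|\S")


def signature(expr):
    """Hash an expression after renaming variables in order of appearance."""
    names = {}
    parts = []
    for tok in TOKEN.findall(expr):
        if len(tok) == 1 and tok.isalpha():
            # Assumption: every single letter is a variable.
            tok = names.setdefault(tok, f"v{len(names)}")
        parts.append(tok)
    digest = hashlib.sha1(" ".join(parts).encode("utf-8")).hexdigest()
    return digest[:12]


def build_index(docs):
    """Map each signature to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, exprs in docs.items():
        for expr in exprs:
            index[signature(expr)].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents holding an expression with the query's signature."""
    return sorted(index.get(signature(query), set()))


# Toy collection, invented for illustration only.
docs = {
    "doc1": ["x^2 + y^2 = z^2"],
    "doc2": ["a^2+b^2=c^2", "\\frac{a}{b}"],
    "doc3": ["x^2 + x^2 = z^2"],
}
index = build_index(docs)
hits = search(index, "p^2 + q^2 = r^2")
```

Here hits contains doc1 and doc2, whose expressions differ from the query only in variable names, but not doc3, where the repeated x changes the structure. Lookup is a single hash probe, which is the usual argument for hashed signatures over tree traversal. A real system would need a much richer notion of equivalence than variable renaming.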
Summary
Mathematics is a very important constituent in the domain of Science, Technology, Engineering and Mathematics (STEM). Scientific documents without a single mathematical expression (ME) or symbol are rare. In this digital era, with more and more scientific documents being generated, an information explosion was inevitable. Although the model is very simple and efficient to implement, it has some limitations: first, it fails to retrieve results with a partial match; second, general users find it very difficult to form complex queries. For these reasons, it yields either high precision with low recall or low precision with high recall. The strict Boolean and fuzzy-set models are preferable to other models in terms of computational requirements [8]
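The partial-match limitation can be seen in a small Python sketch. It assumes that the limitations above refer to strict Boolean retrieval, which the summary does not state explicitly. The toy documents and the function names build_index and boolean_and are invented for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Strict conjunctive query: a document must contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Toy collection, invented for illustration only.
docs = {
    1: "quadratic formula roots",
    2: "quadratic equation discriminant",
    3: "linear equation slope",
}
index = build_index(docs)
exact = boolean_and(index, "quadratic equation")
partial = boolean_and(index, "quadratic equation roots")
```

The first query returns document 2 only. The second returns nothing, even though documents 1 and 2 each match two of its three terms: a strict conjunction has no notion of a partial match and produces no ranking, which is consistent with the precision and recall trade-off described above.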
More From: APTIKOM Journal on Computer Science and Information Technologies