Abstract

In recent years, a growing amount of mathematical content has become available on the Web. When conventional search engines process mathematical expressions, the two-dimensional structure of the expressions is lost, which results in poor math retrieval performance, while retrieval technology designed specifically for mathematical expressions is not yet mature. To address these problems, an improved mathematical expression indexing and matching method is proposed, which employs a full-text index method to handle the two-dimensional structure of mathematical expressions. First, taking full account of the characteristics of LaTeX formulae, a feature representation method for mathematical expressions and a clustering method for feature keywords are put forward. Then, an improved inter-relevant successive trees index model is applied to construct the mathematical expression index, in which the clustering algorithm for mathematical expression features is employed to address the growth in the number of trees when processing large numbers of formulae. Finally, matching algorithms for mathematical expressions are given, which provide four query modes: exact matching, compatible matching, sub-expression matching and fuzzy matching. In browser/server mode, 110027 formulae were used as experimental samples. The index file size was 29.02 Mb, and the average retrieval time was 1.092 seconds. The experimental results show the effectiveness of the method.

Highlights

  • With the rapid increase in the number of scientific and technical documents on computers and networks that contain many mathematical formulae in formats such as LaTeX and MathML, finding and retrieving the required information based on the formulae in these documents has become an urgent task in the fields of information retrieval and search engines

  • When conventional search engines deal with mathematical expressions, the two-dimensional structure of mathematical expressions is lost, which results in a low performance of math retrieval

  • An improved inter-relevant successive trees index model was applied to construct the mathematical expression index, in which the clustering algorithm for mathematical expression features was employed to address the growth in the number of trees when processing large numbers of formulae


Summary

Introduction

With the rapid increase in the number of scientific and technical documents on computers and networks that contain many mathematical formulae in formats such as LaTeX and MathML, finding and retrieving the required information based on the formulae in these documents has become an urgent task in the fields of information retrieval and search engines. Based on the normalized presentation tree, terms are extracted using a hierarchical generalization technique; the inverted index is used to store the key information; and the system ranks results by calculating the similarity score between the keywords in the query expression and those in the index file, together with the matching degree of keywords at different levels. In this system, mathematical expressions in LaTeX form were converted into trees to construct the index, and similarity search over formulae was realized. Methods that extend existing text search engines for math retrieval need to convert formulae into character strings, and so cannot provide complete search functionality for formulae. The other strategy for math retrieval, designing a dedicated index and a corresponding matching algorithm, is still not mature. The method is characterized by using the extracted formula features to form predecessor and successor relationships, which are used to construct the math index in the form of an inter-relevant successive tree.
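The predecessor and successor idea can be illustrated with a deliberately simplified sketch. It is not the paper's inter-relevant successive trees model: it uses flat LaTeX tokens rather than the extracted formula features, omits feature clustering, and supports only a contiguous sub-expression lookup. The names `tokenize`, `build_index` and `find_subexpression` are hypothetical and chosen for illustration only.

```python
import re
from collections import defaultdict

# Assumed tokenization: LaTeX commands, single letters, integers, or any other symbol.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|[A-Za-z]|\d+|\S")


def tokenize(latex):
    """Split a LaTeX string into a flat list of tokens (whitespace is dropped)."""
    return TOKEN_RE.findall(latex)


def build_index(formulae):
    """Map each (token, successor) pair to the (formula id, position) pairs where it occurs."""
    index = defaultdict(set)
    for fid, latex in enumerate(formulae):
        tokens = tokenize(latex) + [None]  # None marks the end of the formula
        for pos in range(len(tokens) - 1):
            index[(tokens[pos], tokens[pos + 1])].add((fid, pos))
    return index


def find_subexpression(index, query):
    """Return ids of formulae that contain the query tokens as a contiguous run."""
    q = tokenize(query)
    if len(q) < 2:
        # Zero or one token: no successor chain to follow, so match on the token alone.
        return {fid
                for (token, _succ), postings in index.items()
                if q and token == q[0]
                for fid, _pos in postings}
    # Start from occurrences of the first pair, then follow successors one step at a time.
    candidates = set(index.get((q[0], q[1]), ()))
    for offset in range(1, len(q) - 1):
        nxt = index.get((q[offset], q[offset + 1]), set())
        candidates = {(fid, pos) for fid, pos in candidates
                      if (fid, pos + offset) in nxt}
    return {fid for fid, _pos in candidates}


formulae = [r"\frac{a}{b}+c", r"x^2+y^2", r"\sqrt{a}{b}"]
index = build_index(formulae)
matches = find_subexpression(index, "{a}{b}")
```

Here `matches` holds the ids of the first and third formulae, since both contain the token run `{a}{b}`. Storing each token with its successor is what lets a text-style index preserve order, which a plain bag-of-keywords index would lose.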

The Improved Math Index
Feature Extraction of Mathematical Expressions
Feature Clustering of Mathematical Expressions
Expressing Rules of Formulae in ISTR
Index Information in ISTR
Index Construction Algorithm
Case Analysis
The Algorithm of Mathematical Expression Retrieval
Experiments Results
Clustering Result
Space Efficiency
Contrast Experiment with the Original Index Structure
Conclusions
