When retrieving scientific documents whose main content is mathematical expressions, both the expressions and the features of their surrounding text must be considered. However, mathematical expressions differ from natural-language text in grammar and semantics, so integrating the two kinds of features into a single retrieval process is difficult. In this study, a retrieval method for scientific documents based on HFS (Hesitant Fuzzy Sets) and BERT (Bidirectional Encoder Representations from Transformers) is proposed. The method exploits the strengths of HFS in multi-attribute decision making and of BERT in context-dependent similarity calculation. First, mathematical expressions are parsed and the membership degrees of multiple symbol attributes are calculated, which yields the similarity between expressions and improves the accuracy of mathematical expression recall. Next, the text surrounding each expression is extracted, and BERT is used to calculate the context similarity. Finally, the recalled scientific documents are sorted by context similarity to produce the final retrieval result. Experiments were carried out on 10,372 Chinese and 11,770 English scientific documents in the NTCIR extended data set. The average MAP_k (k = 10) for the recall results of scientific documents was 74.13%. The average nDCG (n = 10) for the ranking of scientific documents was 86.04%.
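The two reported figures are standard ranking metrics, and a minimal Python sketch of how such scores are commonly computed may help in reading them. It is not the authors' evaluation code. The function names, the boolean and graded relevance inputs, the choice to normalize average precision by the number of relevant documents found within the cutoff, and the choice to build the ideal ranking from the retrieved list alone are all assumptions made for illustration.

```python
import math
from typing import Sequence


def average_precision_at_k(relevant: Sequence[bool], k: int = 10) -> float:
    """AP@k for one query: mean precision over the relevant ranks in the top k.

    Assumption: normalized by the number of relevant hits inside the top k.
    Other variants divide by min(k, total number of relevant documents).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant[:k], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(runs: Sequence[Sequence[bool]], k: int = 10) -> float:
    """MAP@k: the mean of AP@k over all queries."""
    return sum(average_precision_at_k(run, k) for run in runs) / len(runs)


def dcg_at_n(gains: Sequence[float], n: int = 10) -> float:
    """Discounted cumulative gain over the top n results (log2 discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:n], start=1))


def ndcg_at_n(gains: Sequence[float], n: int = 10) -> float:
    """DCG@n divided by the DCG@n of the same gains sorted in descending order.

    Assumption: the ideal ranking is built from the retrieved list only.
    """
    ideal = dcg_at_n(sorted(gains, reverse=True), n)
    return dcg_at_n(gains, n) / ideal if ideal > 0 else 0.0
```

Here `relevant` would hold one flag per recalled document in rank order, and `gains` would hold the graded relevance of each ranked document. Averaging `ndcg_at_n` over all queries gives a single summary score comparable in form to the one reported.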