Abstract

Classical BM25 scoring is designed for unstructured documents. In recent years, researchers have tried to adapt the BM25 ranking formula to structured documents. Most work on structured document retrieval focuses on combining field scores, but it is hard to determine the field weights before the document score is formed. We aim to establish a new method for determining the field weights. The motivation is twofold. First, an interval tree constructed over a text field captures retrieval results with higher-order proximity. By writing convention, the sentences or phrases that express the main idea frequently appear in the front or rear part of a text field. The proximity score should therefore differ across the parts of a text field, so we apply a higher factor when computing proximity scores in the front and rear parts than in the middle part. Second, the longer an interval that contains the query terms, the lower its proximity score; accordingly, a higher tf value for the terms appearing in an interval should also affect the proximity score. We therefore develop a new method for calculating the field weights based on the ranking score. The ranking score of each field can be calculated with an interval tree based on term relevance. The interval tree can be viewed as a tool for visualizing higher-order term proximity in text. The new field weights reflect term proximity and can be used to calculate document scores for term retrieval. Experimental results show that the new document scoring model reflects term proximity well, and that the new document scoring scheme ScoreComp, combined with interval scoring, is more sensitive than the scheme FreqComp combined with interval scoring.
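
The two ideas above (a position-dependent proximity factor, and field weights derived from per-field ranking scores) can be illustrated with a minimal Python sketch. It is not the paper's formulation. The function names, the 20% edge fraction, the 1.5 boost, the tf/length interval score, and the normalization of field scores into weights are all assumptions made for illustration, and the intervals are passed in directly rather than produced by an interval tree.

```python
from typing import Dict, List, Set, Tuple


def position_factor(start: int, end: int, field_len: int,
                    edge_fraction: float = 0.2,
                    edge_boost: float = 1.5) -> float:
    """Hypothetical factor: larger for intervals in the front or rear of a field."""
    if field_len <= 0:
        return 1.0
    # Relative position of the interval midpoint within the field.
    midpoint = (start + end) / 2 / field_len
    in_edge = midpoint < edge_fraction or midpoint > 1 - edge_fraction
    return edge_boost if in_edge else 1.0


def interval_score(tokens: List[str], start: int, end: int,
                   query_terms: Set[str]) -> float:
    """Assumed proximity score: grows with query-term tf, shrinks with interval length."""
    length = end - start + 1
    tf = sum(1 for tok in tokens[start:end + 1] if tok in query_terms)
    return position_factor(start, end, len(tokens)) * tf / length


def field_score(tokens: List[str], intervals: List[Tuple[int, int]],
                query_terms: Set[str]) -> float:
    """Sum the interval scores of one field (intervals are inclusive token ranges)."""
    return sum(interval_score(tokens, s, e, query_terms) for s, e in intervals)


def field_weights(scores: Dict[str, float]) -> Dict[str, float]:
    """Assumed weighting: normalize per-field ranking scores so they sum to 1."""
    total = sum(scores.values())
    if total == 0:
        # No field matched; fall back to uniform weights.
        return {name: 1.0 / len(scores) for name in scores}
    return {name: score / total for name, score in scores.items()}


# Usage: the same two-term match scores higher at the front than in the middle.
tokens = ["bm25", "ranking", "a", "b", "bm25", "ranking", "c", "d", "e", "f"]
query = {"bm25", "ranking"}
front = interval_score(tokens, 0, 1, query)
middle = interval_score(tokens, 4, 5, query)
weights = field_weights({"title": 3.0, "body": 1.0})
```

In this sketch an interval in the first or last fifth of the field is boosted, so `front` exceeds `middle` even though both intervals have the same length and the same query-term count. How the scores are actually turned into weights in ScoreComp is not specified in the abstract, so the normalization step is only a placeholder.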
