Abstract
The document-length normalization problem has been widely studied in the field of information retrieval. Cosine normalization (Baeza-Yates and Ribeiro-Neto, 1999), maximum tf normalization (Allan et al., 1997) and byte length normalization (Robertson et al., 1992) are the most commonly used normalization techniques. Singhal et al. (1996) studied the probability that a document is retrieved as a function of its size, using different similarity measures, and showed that none of the existing measures retrieves documents of different lengths with the same probability. We first show here that document and query sizes strongly influence the expected similarity score. We therefore propose to fit a statistical regression of the similarity score distribution with respect to document and query sizes and to use it to normalize the scores. Experimental results suggest that our approach, both in classical information retrieval and when applied to a document clustering process, yields fairer similarity judgments.
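To make the idea of regression-based normalization concrete, here is a minimal sketch. It is not the authors' method: the abstract does not specify the form of the regression, so the model below (a linear least-squares fit of the score on the logarithms of document and query size, with normalization by subtracting the fitted expectation) is an assumption chosen for illustration. The function names `fit_size_regression` and `normalize_scores` and the sample values are hypothetical.

```python
import numpy as np


def _design_matrix(doc_sizes, query_sizes):
    # Assumed model: intercept + log(document size) + log(query size).
    doc_sizes = np.asarray(doc_sizes, dtype=float)
    query_sizes = np.asarray(query_sizes, dtype=float)
    return np.column_stack([
        np.ones(len(doc_sizes)),
        np.log(doc_sizes),
        np.log(query_sizes),
    ])


def fit_size_regression(doc_sizes, query_sizes, scores):
    """Fit score ~ a + b*log(doc_size) + c*log(query_size) by least squares."""
    X = _design_matrix(doc_sizes, query_sizes)
    y = np.asarray(scores, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def normalize_scores(doc_sizes, query_sizes, scores, coef):
    """Subtract the score expected from sizes alone (assumed normalization)."""
    expected = _design_matrix(doc_sizes, query_sizes) @ coef
    return np.asarray(scores, dtype=float) - expected


# Illustrative, made-up (document size, query size, raw score) observations.
doc_sizes = [50, 100, 200, 400, 800, 1600]
query_sizes = [2, 5, 3, 8, 4, 6]
raw_scores = [0.31, 0.40, 0.38, 0.52, 0.47, 0.55]

coef = fit_size_regression(doc_sizes, query_sizes, raw_scores)
normalized = normalize_scores(doc_sizes, query_sizes, raw_scores, coef)
```

Under this assumed model, a normalized score above zero means the pair scored higher than its sizes alone would predict, so documents of different lengths are compared against their own expected score rather than on the raw scale.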