Abstract

This paper discusses topic distillation, an information retrieval problem that is emerging as a critical task for the WWW. Algorithms for this problem must distill a small number of high-quality documents addressing a broad topic from a large set of candidates. We review the literature and compare the problem with related tasks such as classification, clustering, and indexing. We then describe a general approach to topic distillation, with applications to searching and partitioning, based on the algebraic properties of matrices derived from particular documents within the corpus. Our method, which we call spectral filtering, combines the use of terms, hyperlinks, and anchor text to improve retrieval performance. We give results for broad-topic queries on the WWW, and also give some anecdotal results from applying the same techniques to US Supreme Court law cases, US patents, and a set of Wall Street Journal newspaper articles.
