  • Open Access
  • Front Matter
  • 10.1007/s10791-022-09409-8
Guest editorial: special issue on ECIR 2021
  • Jan 1, 2022
  • Information Retrieval
  • Djoerd Hiemstra + 1 more

Contains fulltext: 250434.pdf (Publisher’s version) (Open Access)

  • Open Access
  • Research Article
  • Citations: 1
  • 10.1007/s10791-018-9332-3
(CF)² architecture: contextual collaborative filtering
  • Jan 1, 2018
  • Information Retrieval
  • Dennis Bachmann + 6 more

Recommender systems have dramatically changed the way we consume content. Internet applications rely on these systems to help users navigate among the ever-increasing number of choices available. However, most current systems ignore the fact that user preferences can change according to context, resulting in recommendations that do not fit user interests. This research addresses these issues by proposing the (CF)² architecture, which uses local learning techniques to embed contextual awareness into collaborative filtering models. The proposed architecture is demonstrated on two large-scale case studies involving over 130 million and over 7 million unique samples, respectively. Results show that contextual models trained with a small fraction of the data provided similar accuracy to collaborative filtering models trained with the complete dataset. Moreover, the impact of taking into account context in real-world datasets has been demonstrated by higher accuracy of context-based models in comparison to random selection models.

  • Open Access
  • Research Article
  • Citations: 2
  • 10.1007/s10791-018-9334-1
A systematic approach to normalization in probabilistic models
  • Jan 1, 2018
  • Information Retrieval
  • Aldo Lipani + 3 more

Every information retrieval (IR) model embeds in its scoring function a form of term frequency (TF) quantification. The contribution of the term frequency is determined by the properties of the function of the chosen TF quantification, and by its TF normalization. The first defines how independent the occurrences of multiple terms are, while the second acts on mitigating the a priori probability of having a high term frequency in a document (estimation usually based on the document length). New test collections, coming from different domains (e.g. medical, legal), give evidence that not only document length but also verboseness of documents should be explicitly considered. Therefore we propose and investigate a systematic combination of document verboseness and length. To theoretically justify the combination, we show the duality between document verboseness and length. In addition, we investigate the duality between verboseness and other components of IR models. We test these new TF normalizations on four suitable test collections. We do this on a well-defined spectrum of TF quantifications. Finally, based on the theoretical and experimental observations, we show how the two components of this new normalization, document verboseness and length, interact with each other. Our experiments demonstrate that the new models never underperform existing models, while sometimes introducing statistically significantly better results, at no additional computational cost.

  • Open Access
  • Research Article
  • Citations: 1
  • 10.1007/s10791-014-9245-8
Guest editorial: Special issue on information retrieval in the intellectual property domain
  • Sep 10, 2014
  • Information Retrieval
  • Allan Hanbury + 4 more

Intellectual property (IP) management consists of all the processes linked to the recognition, the publication and the exploitation of creations of the human mind. Patents on inventions are granted by 74 patent offices worldwide under the condition that the patent claims are new, inventive and show industrial applications. Information retrieval (IR) has been recognized as the computing field appropriate to aid in assessing the novelty aspect of the IP processes by proposing methods to search, compare and analyse patent documents. Patents are complex search objects. They attempt to describe that most elusive concept, the “invention,” in a manner that distinguishes it from what has gone before. Unfortunately for the inventor, many of the modern methods for communicating new technical information, such as video, animations, CAD-CAM displays or even, most recently, 3D printing, are little accepted within the patent system, with the result that many new patent

  • Open Access
  • Research Article
  • Citations: 9
  • 10.1007/s10791-014-9243-x
Distance matters! Cumulative proximity expansions for ranking documents
  • Jul 12, 2014
  • Information Retrieval
  • Jeroen B P Vuurens + 1 more

In the information retrieval process, functions that rank documents according to their estimated relevance to a query typically regard query terms as being independent. However, it is often the joint presence of query terms that is of interest to the user, which is overlooked when matching independent terms. One feature that can be used to express the relatedness of co-occurring terms is their proximity in text. In past research, models that are trained on the proximity information in a collection have performed better than models that are not estimated on data. We analyzed how co-occurring query terms can be used to estimate the relevance of documents based on their distance in text, which is used to extend a unigram ranking function with a proximity model that accumulates the scores of all occurring term combinations. This proximity model is more practical than existing models, since it does not require any co-occurrence statistics, it obviates the need to tune additional parameters, and has a retrieval speed close to competing models. We show that this approach is more robust than existing models, on both Web and newswire corpora, and on average performs as well as or better than existing proximity models across collections.

  • Open Access
  • Research Article
  • Citations: 9
  • 10.1007/s10791-014-9242-y
Evaluating hierarchical organisation structures for exploring digital libraries
  • Jul 11, 2014
  • Information Retrieval
  • Mark M Hall + 5 more

Search boxes providing simple keyword-based search are insufficient when users have complex information needs or are unfamiliar with a collection, for example in large digital libraries. Browsing hierarchies can support these richer interactions, but many collections do not have a suitable hierarchy available. In this paper we present a number of approaches for automatically creating hierarchies and mapping items into them, including a novel technique which automatically adapts a Wikipedia-based taxonomy to the target collection. These approaches are applied to a large collection of cultural heritage items which is formed through the aggregation of other collections and for which no unified hierarchy is available. We investigate a number of novel user-evaluated metrics to quantify the hierarchies' quality and performance, showing that the proposed technique is preferred by users. From this we draw a number of conclusions as to what makes a hierarchy useful to the user.

  • Research Article
  • Citations: 5
  • 10.1007/s10791-014-9241-z
Identifying top news stories based on their popularity in the blogosphere
  • May 14, 2014
  • Information Retrieval
  • Yeha Lee + 1 more

A huge volume of news stories is reported by various news channels on a daily basis. Subscribing to all the stories and keeping track of the important ones day after day is very time-consuming. This paper proposes several approaches to identify important news stories. To this end, we take advantage of the blogosphere as an information source to evaluate the importance of news stories. Blogs reflect the diverse opinions of bloggers about news stories, and the attention that these stories receive can help estimate the importance of the stories. In this paper, we define the popularity of a news story in the blogosphere as the attention it attracts from users. We measure the popularity of the stories in the blogosphere from two viewpoints: content and timeline. In terms of content, we suggest several approaches to estimate language models for a news story and blog posts, and we evaluate the importance of the story using these language models. Furthermore, we generate a temporal profile of a news story by exploring the timeline of blog posts related to the story, and evaluate its importance based on the temporal profile. We experimentally verify the effectiveness of the proposed approaches for identifying top news stories.

  • Research Article
  • Citations: 28
  • 10.1007/s10791-014-9240-0
A survey of approaches for ranking on the web of data
  • Apr 4, 2014
  • Information Retrieval
  • Antonio J Roa-Valverde + 1 more

Ranking information resources is a task that usually happens within more complex workflows and that typically occurs in any form of information retrieval, being commonly implemented by Web search engines. By filtering and rating data, ranking strategies guide the navigation of users when exploring large volumes of information items. There exist a considerable number of ranking algorithms that follow different approaches focusing on different aspects of the complex nature of the problem, and reflecting the variety of strategies that can be applied. With the growth of the web of linked data, a new problem space for ranking algorithms has emerged, as the nature of the information items to be ranked is very different from the case of Web pages. As a consequence, existing ranking algorithms have been adapted to the case of Linked Data and some specific strategies have started to be proposed and implemented. Researchers and organizations deploying Linked Data solutions thus require an understanding of the applicability, characteristics and state of evaluation of ranking strategies and algorithms as applied to Linked Data. We present a classification that formalizes and contextualizes under a common terminology the problem of ranking Linked Data. In addition, an analysis and contrast of the similarities, differences and applicability of the different approaches is provided. We intend this work to be useful when comparing different approaches to ranking Linked Data and when implementing new algorithms.

  • Research Article
  • Citations: 7
  • 10.1007/s10791-014-9239-6
Dealing with temporal variation in patent categorization
  • Mar 25, 2014
  • Information Retrieval
  • Eva D’hondt + 5 more

In this paper, we quantify the existence of concept drift in patent data, and examine its impact on classification accuracy. When developing algorithms for classifying incoming patent applications with respect to their category in the International Patent Classification (IPC) hierarchy, a temporal mismatch between training data and incoming documents may deteriorate classification results. We measure the effect of this temporal mismatch and aim to tackle it by optimal selection of training data. To illustrate the various aspects of concept drift on IPC class level, we first perform quantitative analyses on a subset of English abstracts extracted from patent documents in the CLEF-IP 2011 patent corpus. In a series of classification experiments, we then show the impact of temporal variation on the classification accuracy of incoming applications. We further examine which training data selection method, combined with our classification approach, yields the best classifier, and how combining different text representations may improve patent classification. We found that using the most recent data is a better strategy than static sampling but that extending a set of recent training data with older documents does not harm classification performance. In addition, we confirm previous findings that using 2-skip-2-grams on top of the bag of unigrams structurally improves patent classification. Our work is an important contribution to the research into concept drift for text classification, and to the practice of classifying incoming patent applications.

  • Research Article
  • Citations: 14
  • 10.1007/s10791-014-9238-7
Using query logs of USPTO patent examiners for automatic query expansion in patent searching
  • Feb 22, 2014
  • Information Retrieval
  • Wolfgang Tannebaum + 1 more

In the patent domain, significant efforts are invested to assist researchers in formulating better queries, preferably via automated query expansion. Currently, automatic query expansion in patent search is mostly limited to computing co-occurring terms for the searchable features of the invention. Additional query terms are extracted automatically from patent documents based on entropy measures. Learning synonyms in the patent domain for automatic query expansion has been a difficult task. No dedicated sources providing synonyms for the patent domain, such as patent domain specific lexica or thesauri, are available. In this paper we focus on the highly professional search setting of patent examiners. In particular, we use query logs to learn synonyms for the patent domain. For automatic query expansion, we create term networks based on the query logs specifically for several USPTO patent classes. Experiments show good performance in automatic query expansion using these automatically generated term networks. Specifically, the performance of the learned term networks increases as more query logs become available for a specific US patent class.