Feature selection for ranking using boosted trees

Feng Pan,Franco Salvetti,David Ahn,Gianluca Donato,Tim Converse

doi:10.1145/1645953.1646292

Abstract

Modern search engines have to be fast to satisfy users, so there are hard back-end latency requirements. The set of features useful for search ranking functions, though, continues to grow, making feature computation a latency bottleneck. As a result, not all available features can be used for ranking, and in fact, much of the time, only a small percentage of these features can be used. Thus, it is crucial to have a feature selection mechanism that can find a subset of features that both meets latency requirements and achieves high relevance. To this end, we explore different feature selection methods using boosted regression trees, including both greedy approaches (selecting the features with highest relative importance as computed by boosted trees; discounting importance by feature similarity and a randomized approach. We evaluate and compare these approaches using data from a commercial search engine. The experimental results show that the proposed randomized feature selection with feature-importance-based backward elimination outperforms greedy approaches and achieves a comparable relevance with 30 features to a full-feature model trained with 419 features and the same modeling parameters.

Full Text