Abstract

In the era of increasing choices, recommender systems are becoming indispensable in helping users navigate the million or billion pieces of content on recommendation platforms. As the focus of these systems shifts from attracting short-term user attention toward optimizing long term user experience on these platforms, reinforcement learning (and bandits) have emerged as appealing techniques to power these systems [5, 9, 26, 27]. The exploration-exploitation tradeoff, being the foundation of bandits and RL research, has been extensively studied [1, 2, 4, 6, 8, 10, 11, 18, 20, 21, 22, 23]. An agent is incentivized to exploit to maximize its return, i.e., by repeating actions taken in the past that produced high rewards. On the other hand, the agent needs to explore previously unseen actions in order to discover potentially better ones. Exploration has been shown to be extremely useful in solving tasks of long horizons or sparse reward in many RL applications [2, 14, 15, 16, 19]. While effective exploration is believed to positively influence the user experience on the platform, the exact value of exploration in recommender systems has not been well established.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.