The exploration-exploitation trade-off is at the heart of any sequential decision-making process in uncertain environments, such as foraging for food [1] or learning to cycle. This trade-off characterises the balance between reaping the benefits of a known solution (which may or may not be optimal) and continuing to search in the hope of finding better solutions. It is difficult to infer or measure this trade-off directly from behavioural data in humans (or animals) on a trial-by-trial basis. Here, we use a set of reinforcement learning (RL) algorithms as ideal actor models to infer the internal representation of this trade-off from behavioural data. RL algorithms are well known to predict essential features of human and animal learning behaviour, and have moreover been shown to have direct electrophysiological and molecular correlates (such as reward and error learning signals) (see [2]).

To demonstrate our approach we conducted motor learning experiments on a preliminary set of N=5 right-handed subjects (20-30 years of age). Our task consisted of computer-based psychophysics experiments (4 blocks of 50 trials) using a small grid world (five states: a 2x2 arrangement of states surrounded by a terminal state) in which subjects were to reach a goal state (reward "+$10") from their starting position while avoiding the terminal state (reward "-$10"). Subjects acted in this grid world by moving in the 4 cardinal directions under stochastic dynamics: in each block, an action moved the subject in the chosen direction with probability 0.7, 0.8, 0.9 or 1. To exclude human context knowledge that RL algorithms do not possess, we represented states by colours (so as to mask the spatial structure of the world). Subjects moved either via unlabelled button presses or abstract gestures on a Wii Remote (NINTENDO, Kyoto, Japan). We found that both humans and our RL algorithms (TD, Q-Learning) required nearly the same number of episodes to reach comparable performance.

The exploration-exploitation trade-off is formalised as a fundamental parameter of our two model-free RL algorithms, the so-called epsilon-greediness (the probability with which the learner deviates from its currently best-valued action to explore other solutions). We inferred the implied human exploration-exploitation trade-off parameter by directly imposing the human state-action pairs on the algorithms. This allowed us to infer the internal representation of the subject's optimal policy as well as the epsilon-greediness parameter (under the assumption that humans learn using the corresponding algorithm based on the information available up to that episode). For example, for the task where the stochastic dynamics probability was 0.8, we find that the epsilon-greediness of human subjects increased rapidly in an initial exploration-intensive phase to values of 0.55 over the first 10 episodes. Once a near-optimal solution was found, their epsilon-greediness decreased rapidly and stabilised at around 0.3.

These results show how we can gain insight into important parameters of reward-based learning problems from behavioural data. This approach may allow neurophysiologists to find neuronal correlates of exploration-exploitation trade-offs in the nervous system during sequential decision-making tasks.
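To make the task and learner concrete, the following is a minimal sketch of an epsilon-greedy Q-learner in the 2x2 grid world described above. The state numbering, goal placement, learning rate and discount factor are illustrative assumptions, not the exact settings used in the experiments.

```python
import numpy as np

# Illustrative layout (an assumption): inner states 0..3 form the 2x2
# grid, state 4 is the surrounding terminal state, state 3 is the goal.
GRID = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
POS2STATE = {pos: s for s, pos in GRID.items()}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
GOAL, TERMINAL = 3, 4

def step(state, action, rng, p_move=0.8):
    """Stochastic dynamics: the chosen direction is executed with
    probability p_move, otherwise a random other direction is taken."""
    if rng.random() > p_move:
        action = int(rng.choice([a for a in range(4) if a != action]))
    dr, dc = ACTIONS[action]
    r, c = GRID[state]
    nxt = POS2STATE.get((r + dr, c + dc), TERMINAL)
    if nxt == TERMINAL:
        return TERMINAL, -10.0, True   # stepped into the terminal state
    if nxt == GOAL:
        return GOAL, +10.0, True       # reached the goal
    return nxt, 0.0, False

def q_learning(episodes=200, eps=0.3, alpha=0.1, gamma=0.95, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((5, 4))                         # tabular action values
    for _ in range(episodes):
        s, done = int(rng.integers(0, 3)), False  # random non-goal start
        while not done:
            # epsilon-greedy selection: explore with probability eps
            if rng.random() < eps:
                a = int(rng.integers(4))
            else:
                a = int(Q[s].argmax())
            s2, r, done = step(s, a, rng)
            # standard Q-learning update
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
            s = s2
    return Q
```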
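The inference procedure, replaying the subjects' state-action pairs through the learner, could then be sketched as follows. The per-episode estimator (counting deviations from the learner's current greedy action) and the trajectory format are our assumptions about one plausible implementation, not the exact estimator used in the study.

```python
def infer_epsilon(human_episodes, alpha=0.1, gamma=0.95):
    """Replay a subject's (state, action, reward, next_state) tuples
    through the Q-learner and estimate, per episode, the epsilon-
    greediness as the fraction of choices that deviate from the
    learner's current greedy action."""
    Q = np.zeros((5, 4))
    eps_per_episode = []
    for episode in human_episodes:
        deviations = 0
        for s, a, r, s2 in episode:
            if a != int(Q[s].argmax()):
                deviations += 1            # the subject explored
            # impose the human transition on the learner, so the model
            # sees exactly the information available to the subject
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        eps_per_episode.append(deviations / len(episode))
    return eps_per_episode
```

Smoothed over a few episodes, such a trace could then be compared against the rise-and-relaxation pattern of epsilon-greediness reported above.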