The permutahedron is the convex polytope with vertex set consisting of the vectors (π(1),…,π(n)) for all permutations (bijections) π over {1,…,n}. We study a bandit game in which, at each step t, an adversary chooses a hidden weight vector st, a player chooses a vertex πt of the permutahedron and suffers an observed instantaneous loss of ∑i=1nπt(i)st(i).We study the problem in two different approaches. In the two approaches, we assume that st is a point in the polytope dual to the permutahedron. Algorithm CombBand of Cesa-Bianchi et al. (2012) guarantees a regret of O(nTlogn) after T steps. Unfortunately, CombBand requires at each step an n-by-n matrix permanent computation, a #P-hard problem. Approximating the permanent is possible in the impractical running time of O(n10), with an additional heavy inverse-polynomial dependence on the sought accuracy. In the first approach, we provide an algorithm of slightly worse regret O(n3/2T) but with more realistic time complexity O(n3) per step. The technical contribution is a bound on the variance of the Plackett–Luce noisy sorting process's ‘pseudo loss’, obtained by establishing positive semi-definiteness of a family of 3-by-3 matrices of rational functions in exponents of 3 parameters.In the second approach, we present and analyze an algorithm based on Bubeck et al.'s (2012) OSMD approach with a novel projection and decomposition technique for the permutahedron. The second algorithm's running time and regret guarantees are similar to our first algorithm, modulo a numerical line search procedure the running time of which we have not been able to analyze. It is interesting that the two approaches are totally different.The main open problem from this work is whether there exists a bandit algorithm for this problem with both optimal regret of O(nT) and running time of O(n3) for either regime, or there is an inherent tradeoff between the two performance measures.
Read full abstract