Keywords: [ Game Theory and Mechanism Design ] [ Learning Theory ] [ Online Learning / Bandits ]

[
Abstract
]

Abstract:
Motivated by economic applications such as recommender systems, we study the behavior of stochastic bandits algorithms under \emph{strategic behavior} conducted by rational actors, i.e., the arms. Each arm is a \emph{self-interested} strategic player who can modify its own reward whenever pulled, subject to a cross-period budget constraint, in order to maximize its own expected number of times of being pulled. We analyze the robustness of three popular bandit algorithms: UCB, $\varepsilon$-Greedy, and Thompson Sampling. We prove that all three algorithms achieve a regret upper bound $\mathcal{O}(\max \{ B, K\ln T\})$ where $B$ is the total budget across arms, $K$ is the total number of arms and $T$ is the running time of the algorithms. This regret guarantee holds for \emph{arbitrary adaptive} manipulation strategy of arms. Our second set of main results shows that this regret bound is \emph{tight}--- in fact, for UCB, it is tight even when we restrict the arms' manipulation strategies to form a \emph{Nash equilibrium}. We do so by characterizing the Nash equilibrium of the game induced by arms' strategic manipulations and show a regret lower bound of $\Omega(\max \{ B, K\ln T\})$ at the equilibrium.