
Some Notes on Dueling Bandits
The dueling bandit problem natrually fits the description of a variety of recommendation systems that require ‘’learning on the fly’’, yet have no explicit access to a ‘‘reward’’ model. Instead, the ‘‘human’’ feedback part takes the form of ‘‘choices’’, ‘‘votes’’, or some discrete forms. Hence, oftentimes a learning protocol proceeds at time steps \(t = 1, \ldots, T\):
- The algorithm chooses a pair of arms \( a_i, a_j \) from \(K\) available ones;
- The oracle/human feedback/nature reveals the winner arm \( a_i \), with probability \( P(a_i \succ a_j)\), \( P(a_i \prec a_j) = 1 - P(a_i \succ a_j) \)
So the feedback is either \(a_i\) or \( a_j \), the preferences forms a matrix \(P \in \mathbb{R}^{K \times K}\) such that \( P + P^{\top} = I\) which defines the hidden information of the dueling bandit problem. The cumulative regret in the stochastic dueling bandit setting is: $$ \mathcal{R}_T = \sum_{t=1}^T P( a^* \succ a^t_i ) + P( a^* \succ a^t_j ) $$ where \(a^* \) is usually the Condorcet winner, i.e., \( P( a^* \succ a_j) > \frac{1}{2} \ \ \forall j \in [K], a_j \neq a^*\). Condorcet winner is a pretty straightforward idea, which might not exist in general cases. To see that, suppose there are three candidates: A, B, and C, and three voters with the following preferences:

Erdos-Szekeres Theorem
This post is dedicated to Erdos-Szekeres Theorem, which says:
ES Thm. (Monotone sequence)
For any positive integer $n$, any sequence of $n^2 + 1$ distinct real numbers contains a monotone subsequence of length at least $n+1$.
This means that within any sufficiently long sequence, there is either an increasing subsequence or a decreasing subsequence of a certain minimum length.
There’s another version of the theorem statement coming from the 1935 paper by Paul Erdos and George Szekeres1, which concerns combinatorial geometry. It says:

Wardrop Equilibrium
I remember having those unpleasant lines in our high school canteen, flooded by the starving students queeing for their lunch, I would always pick a window with fewer people waiting, compromising myself with awful food.
Another similar thought that always stricked me was the odds that tourists always pick the same time to travel, there’s almost always traffic congestion everywhere during holiday seasons. Yeah, lives have been always so hard.
On the first level of thinking, when there’s a lack of resource you attempt to find alternatives, e.g., picking another time for traveling, another food window, etc.. The level two thinking is maybe ‘‘since other people are avoiding holidays, what if I insist on going out on holidays’’? Or maybe there’s a twisted level three thinking, ‘‘what if everybody thinks like the level two thinker, ….’’ It’s somewhat intimidating to follow this infinite hierachy, but definitely rewarding, as there are certainly moments when people are regretting their travel decisions, thinking to themselves ‘‘I probably should not have gone the high way.’’