Contributor

*Decision Scientist at Flipkart*

Time to get to business! We have described the problem setting of Multi-Armed Bandit in the previous article and also talked about ideas of application to the real world.

Modelling the given Problem:

The multi-armed bandit can be seen as a set of real distributions each distribution being associated with the rewards delivered by one of the levers. Let be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. The objective is to maximise the sum of the collected rewards. The horizon is the number of rounds that remain to be played. The bandit problem is formally equivalent to a one-state Markov decision process. The regret after rounds is defined as the expected difference between the reward sum associated with an optimal strategy and the sum of the collected rewards: where is the maximal reward mean, and is the reward in round *t*.

A *zero-regret strategy* is a strategy whose average regret per round tends to zero when the number of played rounds tends to infinity. ^{}Intuitively, zero-regret strategies are guaranteed to converge to a (not necessarily unique) optimal strategy if enough rounds are played.

With respect to the advertisement use case discussed in the previous article, the above can be applied to the setting as follows:

The task here would be to find the best ad to be displayed with respect to the underlying distribution so as to maximise the total expected reward.

Solutions

The solutions for the MAB problem can be targeted in two different ways:

1. Asymptotic Correctness: As T -> ∞, the probability of picking the best arm -> 1.

2. Regret Optimality: The total regret accumulated over time is minimised.

Graphical explanation of the above – courtesy NPTEL

Optimal Solution: A major breakthrough was the construction of optimal population selection strategies or policies (that possess uniformly maximum convergence rate to the population with highest mean) in the work described below.

“Asymptotically efficient adaptive allocation rules“, Lai and Robbins^{
“Optimal adaptive policies for Markov decision processes“}

# Approximate Solution

Semi-uniform strategies: Semi-uniform strategies were the earliest (and simplest) strategies discovered to approximately solve the bandit problem. All those strategies have in common a greedy behaviour where the *best *lever (based on previous observations) is always pulled except when a (uniformly) random action is taken.

**Epsilon-greedy Strategy:** This is the simplest strategy and involves the dilemma of exploitation and exploration. The idea is to sample the arms for initial few steps and based on the estimate of the return value, for every other iteration with a probability of 1-ε choose the best performing arm i.e. the one with the maximum reward estimate and with a probability of ε sample any arm uniformly from within the entire set of Arms. The value of ε is decayed over iterations to ensure convergence to the best performing arm.

Regret Bound Algorithms:

**UCB and variants:** This strategy establishes a very popular result of the MAB setting. It gives the upper bound on the regret over time using a set of smart theorems. It is a very important result because following this algorithm we can ensure that the regret over time cannot grow over a*log(T) where ‘a’ is a scalar constant. Also, a lot of variants of UCB play with this scalar constant to look for improvements.

UCB intuition

Full derivation – courtesy NPTEL

Probability Matching Strategies :

**Thompson Sampling:** The Thompson sampler differs quite fundamentally from the e-greedy algorithm in three major ways:

- it is not greedy;
- its exploration is more sophisticated, and;
- it is Bayesian

1 & 2 are because of 3. Recent research on Thompson Sampling has caught a lot of eyes because it has proved to achieve bounds similar to those of UCB. Hence it remains an active area of research with a lot of scopes.

Thompson Sampling intuition

Pricing Strategies:

Pricing strategies establish a *price* for each lever. For example, as illustrated with the POKER algorithm,^{} the price can be the sum of the expected reward plus an estimation of extra future rewards that will gain through the additional knowledge. The lever of the highest price is always pulled.

Strategies with ethical constraints:

These strategies minimise the assignment of any patient to an inferior arm (“physician’s duty”). In a typical case, they minimise expected successes lost (ESL), that is, the expected number of favourable outcomes that were missed because of assignment to an arm later proved to be inferior. Another version minimises resources wasted on any inferior, more expensive, treatment.^{}

MAB though a huge branch on its own is just a trailer to the RL movie coming ahead. Hold on for even more exciting and fun Math along with some cool applications and examples. See you next time! Cheers!

#### Other articles of the series:

N. 1 Reinforcement Learning – The Art of Interactive Learning

N. 2 Supervised vs Unsupervised vs Reinforced

N. 3 Multi-Armed Bandit – Reinforcement Learning

Contributor

*Decision Scientist at Flipkart*