Multi-Armed Bandit – Reinforcement Learning

Let’s Go to the Casino! Let’s Pull a Lever!

Welcome back, folks! The links to the other articles in this series are here:

Article 1: Reinforcement Learning – The Art of Interactive Learning
Article 2: Supervised vs Unsupervised vs Reinforced
Article 4: Class of Algorithms for Solving Multi-Armed Bandit

With this article, we begin our exciting journey into RL. So without further ado, let's GO!
A series of lectures on the Bandit problems is available here: 
The all-time classic book for the whole of RL is available here: Sutton and Barto

MAB (Multi-Armed Bandit) – A single-state RL episode!

The term multi-armed bandit comes from the slot machines of casinos. Something like this!

Multi-Armed Bandit

A slot machine has a pull-down lever; pulling it draws a random reward from the probability distribution associated with that particular machine. Given K such slot machines, each with its own stationary or non-stationary reward distribution, the MAB problem is to identify the arm with the maximum expected reward, or equivalently to maximize the total reward collected over multiple time steps.
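The setup above can be sketched in a few lines of Python. This is a minimal, hypothetical model (the class name and the Bernoulli 0/1 reward choice are my assumptions, not something fixed by the problem): each arm pays reward 1 with a hidden probability, and pulling an arm samples from that arm's distribution.

```python
import random

class BernoulliBandit:
    """K slot machines, each paying reward 1 with a fixed hidden probability."""

    def __init__(self, probs):
        self.probs = probs      # hidden success probability of each arm
        self.k = len(probs)     # number of arms, K

    def pull(self, arm):
        """Pulling an arm samples a reward from that arm's distribution."""
        return 1 if random.random() < self.probs[arm] else 0

# Example: a 3-armed bandit; arm 2 has the highest expected reward (0.8),
# but the agent does not know these probabilities in advance.
bandit = BernoulliBandit([0.2, 0.5, 0.8])
reward = bandit.pull(2)   # either 0 or 1
```

A stationary bandit keeps `probs` fixed forever; a non-stationary one would drift these probabilities over time.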

It is a single-state RL problem: picking an action, which here means choosing and pulling an arm, generates the associated reward immediately, and the episode is over. There is no state transition and no continuation of the episode after the action. The reward arrives as soon as the action is taken, unlike other settings where the reward for an action may come later in time. For instance, buying a stock might not pay off in the short run but might do so in the long term. This delay in reward creates 2 classes of RL problems:

Immediate RL -> Non-delayed rewards
Full RL -> Delayed Rewards
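The "immediate RL" loop can be sketched as follows: every interaction is a one-step episode, i.e. one action and one immediate reward, with no next state to transition to. The arm probabilities here are made-up numbers, and the uniform-random arm choice is just for illustration; the incremental sample-average update is a standard way to estimate each arm's expected reward.

```python
import random

random.seed(0)
arm_probs = [0.2, 0.5, 0.8]        # hypothetical hidden arm probabilities
counts = [0] * len(arm_probs)      # number of pulls per arm
values = [0.0] * len(arm_probs)    # running sample-average reward per arm

for _ in range(10_000):
    # One full episode: pick an arm (uniformly at random here)...
    arm = random.randrange(len(arm_probs))
    # ...and receive the reward immediately. No state transition follows.
    reward = 1 if random.random() < arm_probs[arm] else 0
    counts[arm] += 1
    # Incremental mean: values[arm] tracks the average reward seen so far.
    values[arm] += (reward - values[arm]) / counts[arm]

# values[i] now approximates arm i's expected reward.
```

Contrast this with "full RL", where the reward for pulling an arm might only show up many steps later, and credit must be propagated back through the episode.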

Problem Applications – Arguably an Alternative for A/B Testing?

  1. News Recommender System: For every user visiting the website, news articles are shown, much like advertisements. The website has to be clever and pick an article from a set of candidates so that it earns a reward, which happens whenever the user clicks the article. Every user here leads to a distinct MAB setting, so a separate RL agent needs to be learned for each user; clustering users can help minimise the number of RL agents. Since the site's goal is to maximise its revenue, it wants to show the articles most likely to get a click. The exploration-exploitation dilemma arises in this scenario, which can therefore be modeled as a MAB problem.
    Advertising application
  2. Medical/Clinical trials and Medicine testing: In this application, several experimental treatments are available, and research in medicine keeps launching newer, more advanced medicines that have to be tested. The planner has to decide which treatment to use while minimising the patients' losses. Because the treatments are experimental, their capability can only be learned by administering them to patients. This can be modeled as a MAB problem with treatments as arms and their efficiency as the quantity to be learned: the planner can either explore the arms to learn their success rates (efficiency) or exploit the arm with the best success rate so far.
  3. Crowdsourcing: Crowdsourcing has emerged as a powerful means for employers or service requesters to get tasks completed in a timely and cost-effective manner. The employer's aim is to maximise the number of tasks completed. The available workers (or employees or freelancers) act as arms, each with a hidden quality unknown to the employer. The problem can thus be posed as a MAB problem where the employer can either explore the arms to learn their qualities or exploit the best arm identified so far.
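To make the news-recommendation example concrete, here is a small simulation sketch. The click-through rates (CTRs) are invented numbers, and epsilon-greedy is just one possible strategy, used here as a preview; the actual class of algorithms is covered in the next article.

```python
import random

random.seed(42)
ctrs = [0.05, 0.12, 0.09]          # hypothetical hidden CTRs; articles are arms
counts = [0] * len(ctrs)
values = [0.0] * len(ctrs)         # estimated CTR per article
epsilon = 0.1                      # fraction of time spent exploring
clicks = 0

for _ in range(50_000):
    if random.random() < epsilon:              # explore: try a random article
        arm = random.randrange(len(ctrs))
    else:                                      # exploit: show best-known article
        arm = max(range(len(ctrs)), key=lambda a: values[a])
    click = 1 if random.random() < ctrs[arm] else 0  # reward = user clicked
    counts[arm] += 1
    values[arm] += (click - values[arm]) / counts[arm]
    clicks += click

best_arm = max(range(len(ctrs)), key=lambda a: counts[a])
# best_arm should usually be 1, the article with the highest true CTR.
```

The same skeleton fits the clinical-trial and crowdsourcing examples: swap articles for treatments or workers, and clicks for treatment successes or completed tasks.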

Some other applications of MAB are in financial portfolio design, adaptive routing in a network, smart grids, etc.
In the article to follow, we will discuss the class of algorithms and the various methods for solving the MAB problem. Stick around for the REAL FUN! Cheers!

Opinions expressed by AI Time Journal contributors are their own.

About Prateek Singhi

Contributor | Decision Scientist at Flipkart
