A Brief Introduction to Reinforcement Learning: A New Approach to A/B Testing

Reinforcement learning is the problem of getting an agent (i.e., software) to act in the world (i.e., the environment) so as to maximize its rewards, often using a policy to pick from a set of actions based on contextual information. For example, consider teaching a robot how to accomplish a task: you cannot tell it what to do, but you can reward or punish it for doing the right or wrong thing. Eventually, it has to figure out which of its actions earned the reward or punishment. Similarly, we can use such methods to train learning machines to perform many tasks, such as playing backgammon, chess, or Atari games, scheduling jobs, and even flying a helicopter in simulation. More specifically, in the context of web analytics, the same approach can be used to conduct A/B testing automatically.

Reinforcement Learning is a type of Machine Learning, and thereby also a branch of Artificial Intelligence. It allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance. Simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal. Reinforcement Learning is most appropriate when data for learning does not yet exist, when it is impractical to wait for it to accumulate, or when the data changes so rapidly that the best outcome changes with it. Simply put, the Reinforcement Learning algorithm constructs its own data through experience, determining the ‘champion policy’ through trial and error and temporal learning. The following diagram illustrates the basic concept underlying the Reinforcement Learning method:

The agent and the environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, …
At each time step t, the agent receives some representation of the environment’s state, Sₜ ∈ S, where S is the set of possible states.
On that basis, the agent selects an action Aₜ ∈ A(Sₜ), where A(Sₜ) is the set of actions available in state Sₜ.
One time step later, as a consequence of its action, the agent receives a numerical reward Rₜ₊₁ ∈ R and finds itself in a new state Sₜ₊₁. The agent’s job is to maximize its cumulative reward.
In such a sequential decision process, the goal is to learn an action-value function and use it to select actions that maximize a discounted sum of future rewards.
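
To make the "discounted sum of future rewards" concrete, here is a minimal Python sketch. The function name `discounted_return` and the discount factor `gamma` are illustrative assumptions, not something the article specifies. The usual convention is that a reward received k steps in the future is weighted by gamma to the power k, with gamma between 0 and 1.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum of a reward sequence.

    rewards[0] plays the role of R_{t+1}, rewards[1] of R_{t+2}, and so on.
    gamma is an assumed discount factor (0 <= gamma <= 1); it is not
    given in the article.
    """
    total = 0.0
    # Work backwards so each step applies G = r + gamma * G_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With a smaller gamma the agent is effectively more short-sighted: distant rewards contribute less to the sum it is trying to maximize.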

A/B testing refers to randomized experiments in which subjects are randomly partitioned among treatments. A more advanced version, ‘multivariate testing’, runs many A/B tests in parallel. In such a website design setting, the set of possible layout options constitutes the environment, changing the layout is an action, and an increase in click-through rate (CTR) acts as the reward. Ultimately, the policy should pick the layout that maximizes CTR.
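
One simple way to automate this kind of layout selection is an epsilon-greedy strategy: show the layout with the best observed CTR most of the time, and a random layout occasionally. The sketch below is only an illustration under stated assumptions. The class name `EpsilonGreedyLayoutTester`, the `simulate` helper, the layout names and the click probabilities are all hypothetical, and the article does not say which algorithm is used in practice.

```python
import random


class EpsilonGreedyLayoutTester:
    """Hypothetical epsilon-greedy chooser over page layouts (actions)."""

    def __init__(self, layouts, epsilon=0.1, seed=None):
        self.layouts = list(layouts)
        self.epsilon = epsilon  # assumed fraction of visitors used to explore
        self.rng = random.Random(seed)
        self.shows = {layout: 0 for layout in self.layouts}
        self.clicks = {layout: 0 for layout in self.layouts}

    def ctr(self, layout):
        """Observed click-through rate; 0.0 before the layout is ever shown."""
        shows = self.shows[layout]
        return self.clicks[layout] / shows if shows else 0.0

    def choose(self):
        """Explore a random layout with probability epsilon, else exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.layouts)
        return max(self.layouts, key=self.ctr)

    def record(self, layout, clicked):
        """Feed back the reward: 1 for a click, 0 otherwise."""
        self.shows[layout] += 1
        self.clicks[layout] += int(clicked)


def simulate(click_probability, visitors=1000, epsilon=0.1, seed=0):
    """Run the tester against made-up per-layout click probabilities."""
    tester = EpsilonGreedyLayoutTester(click_probability, epsilon, seed)
    visitor_rng = random.Random(seed + 1)  # stands in for real visitors
    for _ in range(visitors):
        layout = tester.choose()
        clicked = visitor_rng.random() < click_probability[layout]
        tester.record(layout, clicked)
    return tester
```

For example, `simulate({"A": 0.05, "B": 0.15})` returns a tester whose `shows` and `clicks` counts can be inspected afterwards. Unlike a classic A/B test with a fixed traffic split, this approach shifts traffic toward the better-performing layout while the experiment is still running.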

It should be noted that Reinforcement Learning is an active research area in the machine learning community. As web analytics data becomes a valuable resource for enterprises, the marriage of new and evolving artificial intelligence algorithms with website optimization methodologies holds great promise for digital marketers.

Dr. Elan Sasson. Intlock LTD
