There are three main paradigms in ML, namely Supervised Learning (SL), Unsupervised Learning (UL), and Reinforcement Learning (RL). SL and UL involve tasks such as perception, classification, regression, and clustering, but they do not make decisions. RL goes beyond these tasks to make decisions, while utilizing supervised and unsupervised ML methods in doing so. Therefore, RL is a distinct yet closely related field to SL and UL.
Supervised learning
SL is about learning a mathematical function that maps a set of inputs to the corresponding outputs/labels as accurately as possible. The idea is that we don’t know the dynamics of the process that generates the output, but we try to figure it out using the data coming out of it.
For example:
- An image recognition model that classifies the objects in the camera feed of a self-driving car as a pedestrian, stop sign, truck, and so on
- A forecasting model that predicts the customer demand for a product during a particular holiday season using past sales data
It is extremely difficult to come up with the precise rules to visually differentiate objects, or what factors lead to customers demanding a product. Therefore, SL models infer them from labeled data.
Key points about how it works:
- During training, models learn from ground truth labels/output provided by a supervisor (which could be a human expert or a process).
- During inference, models make predictions about what the output might be given the input.
- Models use function approximators to represent the dynamics of the processes that generate the outputs.
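To make these points concrete, here is a minimal sketch of the supervised workflow, loosely modeled on the demand-forecasting example. The features, the synthetic data, and the choice of a random forest regressor are illustrative assumptions, not prescriptions:

```python
# A minimal supervised learning sketch: fit a function approximator to labeled
# data, assuming a toy demand-forecasting setup with made-up features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

# Hypothetical input features: [week_of_year, unit_price, promotion_flag].
X = np.column_stack([
    rng.integers(1, 53, size=500),      # week of the year
    rng.uniform(5.0, 15.0, size=500),   # unit price
    rng.integers(0, 2, size=500),       # is a promotion running?
])
# Ground-truth labels provided by the "supervisor": past observed demand.
y = 100 - 4 * X[:, 1] + 20 * X[:, 2] + rng.normal(0, 5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Training: learn an approximate mapping from inputs to outputs.
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Inference: predict the output for inputs the model has not seen before.
print("Predicted demand:", model.predict(X_test[:3]))
```

Any other function approximator, such as a neural network, could take the place of the random forest here; the essential ingredients are the labeled examples and the train-then-predict workflow.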
Unsupervised learning
UL algorithms identify patterns in data that were previously unknown. When using these models, we might have an idea of what to expect as a result, but we don’t supply the models with labels.
For example:
- Identifying homogeneous segments in an image provided by the camera of a self-driving car. The model is likely to separate the sky, road, buildings, and so on, based on the textures in the image.
- Clustering weekly sales data into three groups based on sales volume. The output is likely to be weeks with low, medium, and high sales volumes.
As you can tell, this is quite different from how SL works, namely in the following ways:
- UL models don’t know what the ground truth is, and there is no label to map the input to. They just identify the different patterns in the data. Even after doing so, the model would not be aware, for example, that it had separated the sky from the road, or a holiday week from a regular week.
- During inference, the model would assign the input to one of the groups it had identified, again, without knowing what that group represents.
- Function approximators, such as neural networks, are used in some UL algorithms, but not always.
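As a rough illustration, the following sketch clusters synthetic weekly sales volumes into three groups with k-means. The numbers and the choice of k-means are assumptions made only for the example:

```python
# A minimal unsupervised learning sketch: cluster weekly sales volumes into
# three groups. No labels are provided; the algorithm only finds the groups,
# it does not know which one corresponds to "low", "medium", or "high" sales.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=0)

# Hypothetical weekly sales volumes drawn from three different regimes.
weekly_sales = np.concatenate([
    rng.normal(100, 10, size=20),   # low-volume weeks
    rng.normal(300, 20, size=20),   # medium-volume weeks
    rng.normal(800, 50, size=12),   # high-volume (e.g., holiday) weeks
]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(weekly_sales)

# Inference: assign a new week to one of the discovered groups.
new_week = np.array([[820.0]])
print("Cluster id for the new week:", kmeans.predict(new_week)[0])
```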
Reinforcement learning
RL is a framework to learn how to make decisions under uncertainty to maximize a long-term benefit through trial and error. These decisions are made sequentially, and earlier decisions affect the situations and benefits that will be encountered later. This separates RL from both SL and UL, which don’t involve any decision-making.
For example:
- For a self-driving car, given the types and positions of all the objects detected by the camera, and the edges of the lanes on the road, the model might learn how to steer the wheel and what the speed of the car should be to pass the car ahead safely and as quickly as possible.
- Given the historical sales numbers for a product and the time it takes to bring the inventory from the supplier to the store, the model might learn when and how many units to order from the supplier so that seasonal customer demand is satisfied with high likelihood, while the inventory and transportation costs are minimized.
As you will have noticed, the tasks that RL is trying to accomplish are of a different nature and more complex than those addressed by SL and UL alone. Let’s elaborate on how RL is different:
- The output of an RL model is a decision given the situation, not a prediction or clustering.
- There are no ground-truth decisions provided by a supervisor that tell the model what the ideal decisions are in different situations. Instead, the model learns the best decisions from the feedback from its own experience and the decisions it made in the past. For example, through trial and error, an RL model would learn that speeding too much while passing a car may lead to accidents, and ordering too much inventory before holidays will cause excess inventory later.
- RL models often use outputs of SL models as inputs to make decisions. For example, the output of an image recognition model in a self-driving car could be used to make driving decisions. Similarly, the output of a forecasting model is often used as input to an RL model that makes inventory replenishment decisions.
- Even in the absence of such input from an auxiliary model, RL models, either implicitly or explicitly, predict what situations their decisions will lead to in the future.
- RL utilizes many methods developed for SL and UL, such as various types of neural networks as function approximators.
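To illustrate the trial-and-error flavor of RL, here is a minimal sketch of tabular Q-learning on a made-up five-state chain problem. The environment, the reward structure, and the hyperparameters are all assumptions chosen to keep the example self-contained:

```python
# A minimal reinforcement learning sketch: tabular Q-learning on a toy chain.
# The agent must move right to reach a rewarding terminal state, and it learns
# this from reward feedback through trial and error, not from labeled examples.
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = move left, 1 = move right
q_table = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1
rng = np.random.default_rng(seed=0)

def step(state, action):
    """Toy dynamics: a reward arrives only when the last state is reached."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

for episode in range(500):
    state, done, steps = 0, False, 0
    while not done and steps < 100:
        # Epsilon-greedy: mostly exploit current value estimates, sometimes explore.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            best = np.flatnonzero(q_table[state] == q_table[state].max())
            action = int(rng.choice(best))   # break ties randomly
        next_state, reward, done = step(state, action)
        # Q-learning update: the value of a decision depends on the situations
        # and rewards it leads to later, not on a ground-truth label.
        target = reward + gamma * np.max(q_table[next_state]) * (not done)
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state
        steps += 1

print("Learned policy (0 = left, 1 = right):", np.argmax(q_table, axis=1))
```

The agent is never told that moving right is the correct decision; it discovers this from the reward signal, and the learned values of early decisions reflect the situations they lead to later.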
Reinforcement learning is human
Think of a toddler learning how to construct a tower out of toy blocks. Typically, the child is happier when the tower is higher. Every increment in height is a success, and a collapse is a failure.
They quickly figure out that the closer the next block is to the center of the one beneath, the more stable the tower is. This is reinforced when a block that is placed too close to the edge more readily topples.
With practice, they manage to stack several blocks on top of each other. They realize that how they stack the earlier blocks creates a foundation that determines how tall a tower they can build. Thus, they learn.
Of course, the toddler did not learn these architectural principles from a blueprint. They learned from the commonalities in their failures and successes. The increasing height of the tower, or its collapse, provided a feedback signal based on which they refined their strategy. Learning from experience, rather than from a blueprint, is at the center of RL.
Just as the toddler discovers which block positions lead to taller towers, an RL agent identifies the actions with the highest long-term rewards through trial and error. This is what makes RL such a profound form of AI; it’s unmistakably human.