[This is the English version translated from Chinese.]
Reinforcement learning is a branch of machine learning that is particularly suitable for controlling an autonomous agent that interacts with an environment. The agent uses sensory perception and rewards to continually modify its behavior.
When you hear "reinforcement learning", it should evoke the image of a little critter such as this cockroach:
As for the idea of "
environment", you should think of a maze such as in this classic game:
It includes monsters chasing after you and food that you can eat to increase your score (these represent negative and positive rewards, respectively). Of course, in real applications the "environment" and "rewards" can be abstract; this game is just a concrete example.
What goes in and out?
Keep in mind, the inputs to the reinforcement learning algorithm are the following:
- States = the environment
For example: each square in the maze is a state
- Actions = under each state, which actions are allowed?
- Rewards = upon visiting each state, there is an associated positive or negative payoff
And the output of the algorithm is:
- Policy = under each state, which action would you choose?
So this 4-tuple (S, A, R, P) constitutes a reinforcement learning system. In abstract algebra, we often use such tuples to define systems or structures.
A more concrete example is:
- states S = each square of the maze, which can be represented by its coordinates, eg (1,3)
- actions A = on each square of the maze, you can go either { up, down, left, right }
- rewards R = under the current state, the square of the maze may contain food (+1) or a monster (-100)
- policy P = a function from states $\rightarrow$ actions, ie, given any state, it returns an action.
(S, A, R) are defined by the user; P is automatically calculated by the algorithm (see the sketch below).
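To make this concrete in code, here is a minimal sketch of such a tuple in Python (the grid size, coordinates, and reward values are made up for illustration and are not taken from any particular demo):

# A toy maze MDP: states are grid coordinates, actions are moves,
# rewards are attached to squares, and a policy maps states to actions.
GRID_W, GRID_H = 4, 3                                    # hypothetical 4 x 3 maze

states = [(x, y) for x in range(1, GRID_W + 1)           # S
                 for y in range(1, GRID_H + 1)]

actions = ['up', 'down', 'left', 'right']                # A

special_rewards = {(4, 3): +1, (4, 2): -100}             # food and monster squares

def R(state):                                            # R
    return special_rewards.get(state, -0.04)             # small penalty elsewhere

policy = {s: 'right' for s in states}                    # P: to be learned, not hand-written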
Between humans and insects
The first question that comes to mind: why not just use this technique to build strong AI? But the current state of the art of reinforcement learning can only handle relatively small and simple environments; it is still rather helpless when faced with larger and more complex worlds, such as the $10^\mbox{xxx}$ state space of chess.
The key lies in the fact that higher intelligent beings build up knowledge or world models in their brains, whereas naive reinforcement learning is concerned only with state-action pairs.
The leading researcher in reinforcement learning, Richard Sutton, thinks that RL is the only learning technique that takes into consideration elements such as autonomous agents, environments, and rewards; therefore it should be the top-level architecture in an AI system, and other modules such as logic and pattern recognition should be subsumed under its control. I think this makes a lot of sense.
Thus to build a strong AI, a possible scheme is to combine reinforcement learning with a certain ability to handle complex world models:
"You have made your way from worm to man, and much in you is still worm." -- Nietzsche, Thus spoke Zarathustra
"If men cease to believe that they will one day become gods then they will surely become worms." -- Henry Miller
Code
Learning AI is boring without programs. Here is a simple demo that I found on the web; the author is Travis DeWolf:
https://studywolf.wordpress.com/2012/11/25/reinforcement-learning-q-learning-and-exploration/
It only requires Python to run, but you may need to install PyGame first.
Cat, mouse, cheese:
The cat's behavior is simply to move towards the mouse (this requires no intelligence); the mouse's behavior is learned by the algorithm.
Note that the main program and cellular.py define the behavior of the maze world, but this is purely a game program with no intelligence in it: you can use { up, down, left, right } to control the agents' movements, and that is all there is to it.
The reinforcement learning algorithm is in qlearn.py, which is very short; the actual learning code is essentially just one line, namely the last line of this function:
def learnQ(self, state, action, reward, value):
    oldv = self.q.get((state, action), None)
    if oldv is None:
        self.q[(state, action)] = reward
    else:
        self.q[(state, action)] = oldv + self.alpha * (value - oldv)
Just this one line of code is able to make the mouse avoid the cat and eat the cheese; the principle will be explained below...
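Before that, a quick note on the signature: value is the learning target that the caller computes from the next state. A hedged sketch of what such a caller might look like (illustrative only; the names gamma, q, and actions are my assumptions, not necessarily the demo's exact code):

def learn(self, state1, action1, reward, state2):
    # Best Q value the mouse believes it can get from the next state:
    maxqnew = max(self.q.get((state2, a), 0.0) for a in self.actions)
    # The target passed as `value` = immediate reward + discounted future value.
    self.learnQ(state1, action1, reward, reward + self.gamma * maxqnew)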
How does RL work?
Chapter 21 of "AI: A Modern Approach" has a very good introduction to RL. AIMA is of course the classic AI textbook; many people say they fell in love with AI after studying it. The strength of this book is that it patiently explains all concepts and principles in plain words, so the reader does not feel lost. For example, Chapter 21 first explains passive reinforcement learning, which means holding the policy fixed and simply calculating the agent's expected utility (ie the total rewards). Having established this foundation, we then compare different policies. This way of thinking is very common in mathematics: first consider a case so simple that any idiot can solve it, and then gradually introduce more complexity. For example, in mathematical induction we start from the base case N=1 and then extend to all N.
To avoid repeating the book, I will just explain the minimum knowledge required to understand Q learning.
Utility
The utility U is the sum of rewards over a sequence of actions. For example, the utility of a chess move is not just the immediate reward of that move; it also includes the consequences that follow. A player may greedily capture a pawn, only to get checkmated 10 moves later. Or, faced with delicious food, some people may choose not to eat it for fear of getting fat.
The utility of a state is: assuming the policy is fixed, and considering all possible future transitions starting from this state, the expected value of the total reward:
$$ U(S_0) = \mathbb{E}[ \; \sum_{t=0}^{\infty} \; \gamma^t \; R(S_t) \; ] $$
where $\mathbb{E}[\;]$ denotes the expectation value and $\gamma$ is the discount factor, such as 0.9 or so.
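As a quick numerical illustration of this formula (the reward sequence below is made up), the discount factor makes distant rewards count for less:

# U = sum over t of gamma^t * R(S_t), for one sampled path of rewards.
gamma = 0.9
path_rewards = [-0.04, -0.04, -0.04, +1.0]
U = sum(gamma**t * r for t, r in enumerate(path_rewards))
print(U)   # -0.04 - 0.036 - 0.0324 + 0.729 = 0.6206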
Real example: consider a maze like this one:
The arrows represent one policy out of many possible policies. According to this policy, starting from (1,1) we may get this outcome:
The orange numbers below are the rewards for each state. In this example, every square that is not the goal has 0.04 deducted from the score.
However, even starting from the same state and using the exact same policy can give different outcomes. For example, when the creature tries to crawl left from (1,3), it may actually slip one square downwards. Such state transitions are dictated by probabilities in the external world. (For example, someone earns a university degree, but the economy falls into recession, so his actual salary may not be as high as he expected.)
Another outcome of the same policy could be:
Or this:
Bellman condition
The Bellman optimality condition is the central idea of dynamic programming. The artificial intelligence community calls it reinforcement learning, while in control theory it is called dynamic programming; the two are synonymous. Richard Bellman proposed this equation in 1953, while working at the RAND Corporation on operations research problems. He also coined the phrase "curse of dimensionality" to describe the major obstacle in dynamic programming.
The problem we are trying to solve is to make a series of sequential decisions.
The Bellman equation says: "if we cut off a tiny bit from one end of an optimal path, the remaining path is still an optimal path between the new endpoints."
In other words, if we have made a sequence of optimal decisions A B C D E ..., then this sequence with the starting point A removed, ie B C D E ..., is still an optimal decision sequence over the rest of the states.
(For example, you plan to drive from Los Angeles to New York:
$$ \mbox{Los Angeles} \rightarrow \mbox{Las Vegas} \rightarrow ... ... \rightarrow \mbox{New York} $$
and you have chosen the cheapest route, which passes through 6 stops, the 2nd stop being Las Vegas. Now if you remove the starting point Los Angeles, then the remaining route:
$$ \mbox{Las Vegas} \rightarrow ... ... \rightarrow \mbox{New York} $$
would still be the cheapest route for the remaining stops.)
Mathematically:
$$ U^*(S) = \max_a \{ R(a) + U^*(S') \} $$
$$ U^*(\mbox{whole path}) = \max_a \{ R(\mbox{choose action $a$ in current state}) + U^*(\mbox{rest of path}) \} $$
$*$ denotes "optimal". This seemingly simple formula is the entire content of dynamic programming; What it means is that: When seeking the path with the best value, we cut off a bit from the path, thus reducing the problem to a smaller problem; In other words, it is a recursive relation over time.
Delta rule
This is just a simple trick that appears often in machine learning. Suppose we have an ideal value; we want to gradually adjust the current value so that it asymptotically approaches this ideal. What we do is:
$$ \mbox{current value} := \mbox{current value} + \alpha ( \mbox{ideal value} - \mbox{current value}) $$
where $\alpha$ is called the "learning rate". "Delta" ($\Delta$) refers to the difference between the ideal and current values.
Obviously, as we repeatedly perform the above step, the current value will get closer and closer to the ideal value.
(The differential form of the Delta rule is the familiar gradient descent: $x \mbox{ -= } \eta \cdot \frac{dy}{dx}$)
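A minimal numerical sketch of the rule in action (the numbers are arbitrary):

# Repeatedly applying the delta rule pulls the current value toward the ideal value.
ideal, current, alpha = 1.0, 0.0, 0.3
for step in range(10):
    current += alpha * (ideal - current)
    print(step, round(current, 4))
# The gap shrinks by a factor of (1 - alpha) each step: 0.3, 0.51, 0.657, ...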
Temporal difference (TD) learning
Now we apply the Delta rule to the Bellman equation to find the optimal path; this is called temporal difference (TD) learning.
Again we start with the simple case: assume the policy is fixed; the aim is to learn the utilities of all states.
Ideally, the value of U(S) can be obtained by: starting from state S, trying all possible state transitions, summing the total rewards along these paths, and finding the average total reward among all paths.
But in reality, our agent can only experience one state transition after each action.
So we need to apply the Bellman equation: the U value of a state S equals its own reward, plus the expectation (probability-weighted average) of the U values of all possible subsequent states, multiplied by the discount factor $\gamma$:
$$ U(S) = R(S) + \gamma \sum_{S'} P(S \rightarrow S') \; U(S') $$
where P is the transition probability, S' is a subsequent state, and $\sum$ sums over all subsequent states. In other words, this is a recursive relation between the ideal U(S) and the U values of the successors of S.
For example, assume the agent's estimations of state (1,3) and state (2,3) are respectively 0.84 and 0.92. Also the agent observes that under the current policy, the transition from (1,3) to (2,3) always occurs. Then the U values of these 2 states should satisfy this constraint:
$$ U(1,3) = -0.04 + U(2,3) \;.$$
In other words, this is the local constraint on the U values between 2 states.
The idea of TD learning: assume the other estimates U(S') are correct, and use the Bellman constraint to adjust the U(S) of the current state. As the number of trials grows large, all the U estimates will approach their ideal values. The agent only needs to use this update rule:
$$ U(S) \mbox{ += } \alpha ( \; R(S) + \gamma U(S') - U(S) \; ) $$
$\alpha$ is the learning rate; it determines the speed of learning (but it cannot be too large, lest overshooting occur). The term after $\alpha$ is the difference between the current estimate of U(S) and the U(S) implied by the ideal constraint. For the ideal U(S) and U(S'), this difference would be 0. At each time step, we use $\alpha$ to close only part of this difference.
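Plugging in the numbers from the example above (taking $\gamma = 1$, as in that constraint, and an illustrative $\alpha = 0.1$):

$$ U(1,3) \mbox{ += } 0.1 \; ( \; -0.04 + 0.92 - 0.84 \; ) = 0.004, \qquad \mbox{so } U(1,3): 0.84 \rightarrow 0.844 $$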
Lastly, I should mention that in the ideal-constraint equation above there is a summation over the probabilities P, but P has disappeared from the update formula. That is because the agent, by acting in the environment, implicitly samples the state-transition probabilities. In other words, the summation is performed by the agent's own experience.
P gives the probabilities of state transitions, in other words a model of the world. TD learning does not need to learn P, which is why it is called model-free learning. But as we said in the beginning, model-free is not necessarily a good thing. Human intelligence lies in the fact that we do have some pretty good models of the external world.
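Here is a minimal sketch of passive TD learning under a fixed policy, using nothing but sampled transitions (the tiny corridor with two non-terminal states, its rewards, and the 10% slip probability are all made up for illustration):

import random

# Passive TD(0): estimate U(S) for a fixed policy from experienced transitions only.
R = {'A': -0.04, 'B': -0.04, 'GOAL': +1.0}

def step(s):
    # The fixed policy always moves "forward"; the world is stochastic (10% slip).
    if s == 'A':
        return 'B' if random.random() < 0.9 else 'A'
    return 'GOAL' if random.random() < 0.9 else 'B'

alpha, gamma = 0.1, 0.9
U = {'A': 0.0, 'B': 0.0, 'GOAL': R['GOAL']}   # terminal utility is just its reward

for _ in range(5000):
    s = 'A'
    while s != 'GOAL':
        s2 = step(s)
        # TD update: nudge U(s) toward the sampled target R(s) + gamma * U(s2).
        U[s] += alpha * (R[s] + gamma * U[s2] - U[s])
        s = s2

print(U)   # U('B') should settle near 0.85, U('A') somewhat lower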
Q value
The Q value is just a variation of the U value: there is a U value for each state, and Q decomposes that U over all the actions available in that state. In other words, Q(S, A) is the utility of doing action A in state S.
What is Q good for? Below we will introduce active learning; Q values, when combined with TD learning, can eliminate P in active learning, thus achieving the model-free effect.
The relation between Q and U is:
$$ U(S) = \max_A Q(A, S) $$
The above update rule only needs to be revised using this relation:
$$ Q(A, S) \mbox{ += } \alpha ( \; R(S) + \gamma \max_{A'} Q(A', S') - Q(A, S) \; ) $$
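This matches the learnQ code shown earlier, provided the caller passes in value = $R(S) + \gamma \max_{A'} Q(A', S')$ as the target. Here is a self-contained sketch of the full loop on a made-up corridor world (all names and constants are my own, chosen for illustration; this is not the cat-and-mouse demo):

import random

# Tabular Q-learning on a tiny 1-D corridor: start at cell 0, cheese at cell 3.
N_CELLS, CHEESE = 4, 3
actions = [-1, +1]                              # step left or right
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = {}                                          # (state, action) -> estimated value

def choose(s):
    # Epsilon-greedy: mostly pick the best known action, sometimes explore.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

for episode in range(2000):
    s = 0
    while s != CHEESE:
        a = choose(s)
        s2 = min(max(s + a, 0), N_CELLS - 1)    # walls at both ends
        r = 1.0 if s2 == CHEESE else -0.04
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s2, a').
        target = r + gamma * max(Q.get((s2, b), 0.0) for b in actions)
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = old + alpha * (target - old)
        s = s2

# The learned greedy policy should point toward the cheese (+1) in every cell.
print({s: max(actions, key=lambda a: Q.get((s, a), 0.0)) for s in range(CHEESE)})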
Active learning
In passive learning, with the policy fixed, we can already calculate the utility U(S) for each state S, or the utility Q(S,A) of doing action A under each state S.
If the policy can change, we just need to calculate the Q values for different policies, then choose the action A corresponding to the maximum Q value in each state; that would be the best policy, right?
When we actually execute the above, we find that the agent's policies are really bad! The reason is that during learning the Q values are only estimates, not the ideal values, and if we act greedily on such Q's the agent becomes very short-sighted and fails to find the optimal policy. (For example, a person always goes to the same restaurant, but if he walked a different path he might discover a better one.)
The agent needs to try some unknown states or actions in order to learn the optimal policy; this is the so-called exploration vs exploitation trade-off.
The way to do this is to artificially increase the utilities of unknown states:
$$ U(S) = R(S) + \gamma \max_A \mathcal{F}[ \; \sum_{S'} P(S \rightarrow S') U(S'), \; N(A, S) \; ] $$
where N(A, S) is the number of times the combination of state S and action A has occurred (ie has been experienced), and $\mathcal{F}$ is the exploration function: it normally returns the utility estimate, but when N is small (meaning we have little experience of S, A) it returns a relatively larger estimate, which represents the utility of "curiosity".
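One simple choice of exploration function, in the spirit of the one described in AIMA (the optimistic value R_PLUS and the threshold N_E below are assumptions to be tuned):

R_PLUS = 2.0    # optimistic utility assumed for rarely tried (state, action) pairs
N_E = 5         # insist on trying each (state, action) at least this many times

def exploration_f(u, n):
    # Return an optimistic value until (S, A) has been tried N_E times,
    # then fall back to the ordinary utility estimate u.
    return R_PLUS if n < N_E else u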
Conclusion
I set out to write an introduction to RL that everyone can understand, but found that even at this length I could barely explain everything. I hope my girlfriend can read and understand this :)