Learning and conditional probability: a summary
\[ \DeclarePairedDelimiter{\set}{\{}{\}} \DeclareMathOperator*{\argmax}{argmax} \]
The previous chapter, *17 Conditional probability and learning*, discussed many concepts that are important for what follows, and for artificial intelligence and machine learning in general. So let’s pause for a moment to emphasize some points to keep in mind.
What does it mean that an agent has “learned”? It means that the agent has acquired new information, knowledge, or data. But this acquisition is not just passive memory storage. As a result of it, the agent modifies its degrees of belief about any inferences it needs to draw, and consequently may make different decisions ([ch.@sec-basic-decisions]).
This change is an important aspect of learning. Think of a person who receives useful information but, in actions or words, doesn’t seem to make any use of it. We typically say that such a person “hasn’t learned anything”.
The relation between the acquired knowledge and the change in beliefs is perfectly represented and quantified by conditional probabilities. These probabilities take the acquired information into account in their conditionals (the right side of the bar “\(\nonscript\:\vert\nonscript\:\mathopen{}\)”), and the probability calculus automatically determines how the modified belief must be calculated from this new conditional.
In other words, the probability calculus already provides everything we need to represent and calculate “learning”. It is therefore the optimal, self-consistent way to deal with learning. We may use approximate versions of it in some situations, for instance when the exact computations would be too expensive. But we must keep in mind that such approximations are also deviations from optimality and self-consistency.
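Schematically (with generic symbols, not notation fixed by this chapter: \(Y\) for the quantity of interest, \(X\) for the learned information, \(I\) for the background information, “\(\land\)” for conjunction), the modified belief is computed from the joint distribution by a sum and a division:

\[
\mathrm{P}(Y \nonscript\:\vert\nonscript\:\mathopen{} X \land I)
= \frac{\mathrm{P}(Y \land X \nonscript\:\vert\nonscript\:\mathopen{} I)}{\sum_{y} \mathrm{P}(y \land X \nonscript\:\vert\nonscript\:\mathopen{} I)}
\]

where the sum in the denominator runs over all possible values \(y\) of \(Y\).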
The formula for conditional probability – that is, for the belief change corresponding to learning – involves and requires a joint distribution over several possibilities ([ch.@sec-prob-joint]). Such a distribution must therefore somehow be built into the agent from the beginning, for the agent to be able to learn.
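A minimal computational sketch of this point (in Python, with made-up names and numbers; nothing here is prescribed by the chapter): the joint distribution is a table given from the start, and learning is just restriction of that table followed by renormalization.

```python
# Hypothetical joint distribution over two binary quantities:
# "rain" (tomorrow's weather) and "wet" (the lawn this morning).
# The numbers are made up for illustration; they must sum to 1.
joint = {
    ("rain", "wet"):     0.30,
    ("rain", "dry"):     0.10,
    ("no-rain", "wet"):  0.15,
    ("no-rain", "dry"):  0.45,
}

def condition(joint, observed_wetness):
    """Belief about 'rain' after learning the lawn's wetness:
    keep the joint entries compatible with the observation,
    then divide by their sum (the probability of the observation)."""
    compatible = {rain: p for (rain, wet), p in joint.items()
                  if wet == observed_wetness}
    total = sum(compatible.values())          # probability of the observation
    return {rain: p / total for rain, p in compatible.items()}

print(condition(joint, "wet"))
# {'rain': 0.666..., 'no-rain': 0.333...}
```

Without the joint table, the function above would have nothing to operate on: the agent’s ability to learn is fixed by the joint distribution built into it from the beginning.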
The beliefs and behaviour arising from learning can be very different, depending on the context. For example, in some situations frequently observing a phenomenon may increase an agent’s belief in observing that phenomenon again; but in other situations such frequent observation may decrease an agent’s belief instead. Both kinds of behaviour can make sense in their specific circumstances.
These differences in behaviour are also encoded in the joint distribution built into the agent.
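A minimal sketch of both behaviours, assuming two hypothetical urn scenarios (illustrative, not taken from this chapter): agent A draws *with* replacement from an urn of unknown composition; agent B draws *without* replacement from an urn of known composition. Both agents initially give degree of belief 1/2 to a red draw; after observing one red draw, agent A’s belief in a second red increases, while agent B’s decreases.

```python
from itertools import product

def joint_with_replacement():
    """Agent A's joint distribution for two draws: the urn is equally
    believed to hold 3 red + 1 white or 1 red + 3 white balls,
    and draws are made WITH replacement.  (Hypothetical numbers.)"""
    joint = {}
    for d1, d2 in product("rw", repeat=2):
        p = 0.0
        for p_red in (0.75, 0.25):              # the two possible compositions
            p1 = p_red if d1 == "r" else 1 - p_red
            p2 = p_red if d2 == "r" else 1 - p_red
            p += 0.5 * p1 * p2                  # prior 1/2 for each composition
        joint[d1, d2] = p
    return joint

def joint_without_replacement():
    """Agent B's joint distribution for two draws: the urn is known
    to hold 2 red + 2 white balls; draws are made WITHOUT replacement."""
    joint = {}
    for d1, d2 in product("rw", repeat=2):
        red, total = 2, 4
        p1 = red / total if d1 == "r" else (total - red) / total
        if d1 == "r":
            red -= 1                            # the drawn ball is not replaced
        total -= 1
        p2 = red / total if d2 == "r" else (total - red) / total
        joint[d1, d2] = p1 * p2
    return joint

def second_red_given_first_red(joint):
    """Condition on 'first draw red': restrict the joint, then renormalize."""
    return joint["r", "r"] / (joint["r", "r"] + joint["r", "w"])

print(second_red_given_first_red(joint_with_replacement()))     # 0.625   (belief increased)
print(second_red_given_first_red(joint_without_replacement()))  # 0.333…  (belief decreased)
```

The same restrict-and-renormalize operation is applied in both cases; the opposite behaviours come entirely from the two different joint distributions.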
The formula for belief change arising from learning is amazingly flexible and universal: it holds whether the agent is learning about different kinds of quantities or about past instances of similar quantities.
From a machine-learning point of view, this must therefore be the formula underlying the use of “features” by a classifier, as well as its “training”.
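Schematically again (with hypothetical symbols: \(C\) and \(F\) for the class and features of a new unit, and \(C_1 \land F_1, \dotsc, C_n \land F_n\) for \(n\) previously observed “training” units), both the features and the training data simply appear in the conditional:

\[
\mathrm{P}(C \nonscript\:\vert\nonscript\:\mathopen{} F \land C_n \land F_n \land \dotsb \land C_1 \land F_1 \land I)
= \frac{\mathrm{P}(C \land F \land C_n \land F_n \land \dotsb \land C_1 \land F_1 \nonscript\:\vert\nonscript\:\mathopen{} I)}{\sum_{c} \mathrm{P}(c \land F \land C_n \land F_n \land \dotsb \land C_1 \land F_1 \nonscript\:\vert\nonscript\:\mathopen{} I)}
\]

with the sum running over all possible classes \(c\): nothing but additions and a division, applied to a joint distribution over the new unit and the training units together.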
This formula is moreover extremely simple in principle: it only involves addition and division! Computational difficulties arise from the huge number of terms that may need to be added in concrete data-science problems, not from complicated mathematics. (A data engineer should keep this in mind, in case new hardware technologies make it possible to handle larger numbers of terms.)
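To get a rough feel for where the difficulty lies (a back-of-the-envelope count, not a figure from this chapter): a joint distribution over \(n\) binary quantities has \(2^n\) entries, so the additions above may have to range over astronomically many terms, even though each term is trivial.

```python
# Entries in a joint distribution over n binary quantities: 2**n.
# A single conditioning step may need to add up a large fraction of them.
for n in (10, 50, 100):
    print(f"{n} binary quantities -> {2**n} joint-probability entries")
# 10 binary quantities -> 1024 joint-probability entries
# 50 binary quantities -> 1125899906842624 joint-probability entries
# 100 binary quantities -> 1267650600228229401496703205376 joint-probability entries
```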