Models of Behavioural Learning

NIPS Workshop at Whistler on December 10th 2005

Organisers:

Chris Watkins, Nicolo Cesa-Bianchi, Stefan Schaal, John Shawe-Taylor

Theme:

Reinforcement learning is a very attractive idea. When people first encounter RL, they usually find it compellingly persuasive. But although the last 10 years of RL research have produced deep advances in theory and a range of new algorithms, practical uses are still less widespread than we would all wish.

The intense research effort – and considerable progress – in RL theory is partly because RL is a clean and simple abstract model of behavioural learning.

The aim of this workshop is to go back to first principles, in the spirit of Cartesian systematic doubt, and to question whether there may be other plausible abstract models of behavioural learning that could serve as starting points for theoretical research.

The Deceptive Appeal of Reinforcement Learning

RL is a very appealing theory, for several reasons.

First, RL seems to provide a direct formalisation of intuitive folk notions of teaching with, and learning from, rewards and punishments.
Second, RL can be viewed as incremental dynamic programming, and dynamic programming is a general method for solving optimal control problems, with considerable existing theory, and excellent optimisation properties.
Third, RL appears biologically plausible: there are ‘direct’ or model-free methods that require only simple incremental updates, very short episodic memory, and no look-ahead. For more sophisticated organisms that are capable of forming predictive models about the effects of actions and of planning ahead, these look-aheads and plans could be progressively compiled into policies, and learning would be faster.
Lastly, RL seems to provide what a human designer needs: the designer of an agent usually has some definite intentions of what the agent should achieve, and these may be expressed as a real-time performance measure, which may be used as a reward function for RL. It is not necessarily straightforward to transform design requirements into reward functions for RL, but the intuitive correspondence between RL rewards and design requirements is attractive.
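
To make the "simple incremental updates" of a model-free method concrete, here is a minimal sketch of tabular Q-learning, one well-known example of incremental dynamic programming. The chain world, the function names, and the parameter values (alpha, gamma, number of episodes) are assumptions chosen purely for illustration; they are not part of the workshop description.

```python
import random


def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One model-free incremental update: no world model, no look-ahead."""
    best_next = max(q[next_state])
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


def run_chain(n_states=5, episodes=500, seed=0):
    """Hypothetical chain world: reaching the rightmost state yields reward 1."""
    rng = random.Random(seed)
    # Two actions per state: 0 = step left, 1 = step right.
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != n_states - 1:
            # Purely random behaviour; Q-learning still estimates greedy values.
            action = rng.randrange(2)
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == n_states - 1 else 0.0
            q_learning_update(q, state, action, reward, next_state)
            state = next_state
    return q


if __name__ == "__main__":
    for s, (left, right) in enumerate(run_chain()):
        print(f"state {s}: left={left:.3f} right={right:.3f}")
```

Each update uses only the current transition and a single stored table entry, which is the property that makes such methods look biologically plausible.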

On closer examination, however, these apparent attractions of RL may seem less convincing.

Subjectivity of Rewards

First, in animal learning, the rewards are subjective – they are computed by the animal itself. For robots this is not a problem: the designer can give the robot the capacity to compute its own rewards, or may broadcast the rewards to it. In animals the reward system needs to be innate. Is specifying an innate reward system the most effective way for evolution to specify complex or adaptive behaviour? This seems unclear.

For some types of task such as foraging for scraps of food, it seems common-sensical – though unformalised – that subjective rewards would be a natural way to specify behaviour. For other tasks, this is not so clear: for example, it is not nearly so plausible that learning to avoid predators can be naturally modelled as reinforcement learning.

Much evidence from animal studies shows that learning from rewards – also known as instrumental or operant conditioning – explains only a fraction of animal behavioural learning, and even if instrumental conditioning occurs it can be over-ridden by other learning mechanisms.

Single Reward Signal Inadequate for Learning Complex Behaviour

Second, optimising a single measure of reward seems an inadequate model of learning more complex skills. An example of a natural type of behavioural learning that is difficult to cast naturally as reinforcement learning is that of learning a competence to navigate from any starting point to any final destination within some locale. One can specify the agent’s state as including its current position and its intended destination, but this doubles the dimensionality of the state space without appearing to provide any natural simplification of the problem.

RL seems in some ways problematic both for very simple and for complex types of behavioural learning: are there other natural formal models that could be profitably studied?

Minimum Regret Formulations of Reinforcement Learning

The conventional formulation of RL assumes a stationary world, in which state transitions and rewards can be statistically modelled, either explicitly or implicitly. A surprising recent development is that competitive learning approaches, which make very weak assumptions about the statistical properties of the world, are nonetheless feasible for bandit problems, and for other types of RL also. Whether these approaches are appropriate models of animal learning is an intriguing question.
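
As one illustration of a learner that assumes no statistical model of the world, the following is a minimal sketch of the EXP3 algorithm for adversarial bandits. EXP3 is chosen here only as a familiar example of the minimum-regret style; the text above does not name a specific algorithm, and the function names, the exploration rate gamma, and the toy reward function are assumptions.

```python
import math
import random


def exp3(reward_fn, n_arms, n_rounds, gamma=0.1, seed=0):
    """Sketch of EXP3; reward_fn(t, arm) must return a value in [0, 1]."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    probs = [1.0 / n_arms] * n_arms
    total_reward = 0.0
    for t in range(n_rounds):
        total_w = sum(weights)
        # Mix the weight-based distribution with uniform exploration.
        probs = [(1 - gamma) * w / total_w + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        total_reward += reward
        # Importance-weighted estimate: only the pulled arm's reward is seen.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Rescale so the largest weight is 1, avoiding overflow on long runs.
        max_w = max(weights)
        weights = [w / max_w for w in weights]
    return total_reward, probs


if __name__ == "__main__":
    # Toy reward sequence (an assumption): only arm 2 ever pays off.
    total, final_probs = exp3(lambda t, arm: 1.0 if arm == 2 else 0.0, 3, 2000)
    print(total, final_probs)
```

The learner never estimates transition or reward distributions; it only reweights arms according to the rewards it actually observes.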
