Which one should I imitate?

https://doi.org/10.1016/S0304-4068(97)00068-2

Abstract

An individual repeatedly faces the same decision. Payoffs generated by his available actions are noisy. This individual forgets earlier experience and is limited to learning by observing the current success of two other individuals. We select among behavioral rules those that, in the long run, lead an infinite population of identical individuals to the expected payoff maximizing action. Optimal behavior requires learning by imitation. The second most successful individual in a sample must sometimes be imitated, although the most successful will always be imitated with a higher probability. Inertia is optimal. When each individual follows an optimal rule, the population evolves according to an aggregate monotone dynamic.

Introduction

We consider a very basic decision problem known in the literature as a multi-armed bandit (Rothschild, 1974; Berry and Fristedt, 1985). An individual must repeatedly choose from a finite set of actions, each action independently generating a noisy or uncertain payoff. Examples range from farmers choosing a particular crop, to consumers choosing a mechanic, to shops pricing their goods. Our individual would like to choose an action that maximizes expected payoffs (referred to as a best action). However, he does not know the payoff distributions underlying his actions. Instead, he must rely on what he can learn from others.

In this paper, we want to understand how a boundedly rational individual should learn from the experience of a limited number of other individuals who face the same decision. We specifically focus on double sampling: here, an individual may learn, between making choices, from the experience of two other randomly selected individuals. Bounded rationality is modelled by putting extreme bounds on memory. We assume that an individual forgets all that happened before the previous round. Hence, an individual is not able to update a prior over time. Instead, he must follow a much simpler behavior. Among behavioral rules with limited memory, we will select the one that is best or optimal. The first step is to define the notion of optimality used. Here, we both follow and extend the concepts developed in an earlier paper (Schlag, 1997) on single sampling. Under single sampling, the individual may learn between his choices only from the experience of one other individual. A characterization of optimal behavior under double sampling can shed some light on the robustness of the suspiciously clean single sampling results (Schlag, 1997). Moreover, moving from one to two observations opens the possibility of investigating basic questions of multiple sampling that do not arise under single sampling. Here are some of the questions we wish to answer. How do different observations influence optimal behavior? In search of a simple but optimal rule, can one limit attention to the individual who realized the highest payoff among those observed? Is the notion of optimality used in Schlag (1997) well-defined for multiple sampling? How much more sophisticated must behavior under double sampling be in order to cope optimally with the additional information? How much better off is an individual if he observes two individuals instead of one? We also expect hints from the double sampling setting for how to react to even larger sample sizes. Unfortunately, a formal analysis of larger samples is (as of yet) not tractable.

Behavioral rules that typically first come to mind do not perform well in this setting. For instance, `imitate the action that realized the highest average payoff among those observed', referred to as Imitate The Best Average, is often brought up as a plausible behavior (Ellison and Fudenberg, 1995; Eshel et al., 1996; Kirchkamp, 1996). Consider an infinite population of individuals, each using this rule. Assume that each action is chosen by a positive proportion of the population in the first round of choice. If each individual had access to an infinite sample, then all individuals would be choosing the best action from the second round onwards. Under single or double sampling, if each action generates a deterministic payoff, then individuals would at least be choosing the best action in the long run. However, we are interested in a more realistic setting; only a limited number of observations are available and payoffs are stochastic. Here, similar to the single sampling rule Imitate If Better (Schlag, 1997), Imitate The Best Average performs poorly in the sense described below. It cannot cope with the difference between realized and expected payoffs.

We present two alternative viewpoints to quantify when a rule `performs well'. The population-oriented approach considers an infinite population of individuals using the same rule. There we search for a rule under which: (i) all individuals are never worse off in the next round, and (ii) individuals eventually learn which action is best when initially each action is present. The alternative, individual approach considers a myopic individual who wishes to always increase his expected payoff from one round to the next. Both approaches have in common that good performance is an attribute of a rule which is independent of the particular payoff distributions of the multi-armed bandit. This independence is a natural consequence of the initial symmetry in actions. Imitate The Best Average does not perform well since it does not always lead the population to the best action. In fact, in some simple multi-armed bandits, all individuals will adopt the worst action in the long run (see Fig. 1).
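To make this failure mode concrete, the following is a minimal simulation sketch of Imitate The Best Average under double sampling with one-round memory. The two-action bandit, the population size, and the convention that an individual includes his own last payoff in the averages are illustrative assumptions rather than the paper's Fig. 1 example; the sketch simply lets one track how the share of the expected payoff maximizing action evolves under the rule.

```python
import random

# Illustrative two-action bandit (an assumption for this sketch):
# action 0 has the higher expected payoff but noisy realizations,
# action 1 yields a lower, deterministic payoff.
def draw_payoff(action):
    if action == 0:
        return random.choice([0.0, 1.0])  # expected payoff 0.5
    return 0.4                            # expected payoff 0.4

def imitate_best_average(own, obs1, obs2):
    """Imitate The Best Average with one-round memory: choose the action
    whose realized payoffs among the observations (here including one's
    own last payoff, an assumption about the rule's exact form) have the
    highest average.  Ties go to the first action encountered."""
    totals, counts = {}, {}
    for action, payoff in (own, obs1, obs2):
        totals[action] = totals.get(action, 0.0) + payoff
        counts[action] = counts.get(action, 0) + 1
    return max(totals, key=lambda a: totals[a] / counts[a])

def simulate(pop_size=5000, rounds=30, initial_share_best=0.5):
    actions = [0 if random.random() < initial_share_best else 1
               for _ in range(pop_size)]
    for t in range(rounds):
        payoffs = [draw_payoff(a) for a in actions]
        new_actions = []
        for i in range(pop_size):
            # sample two other, distinct individuals uniformly at random
            j = k = i
            while j == i:
                j = random.randrange(pop_size)
            while k == i or k == j:
                k = random.randrange(pop_size)
            new_actions.append(imitate_best_average(
                (actions[i], payoffs[i]),
                (actions[j], payoffs[j]),
                (actions[k], payoffs[k])))
        actions = new_actions
        share = sum(1 for a in actions if a == 0) / pop_size
        print(f"round {t + 1}: share choosing the best action = {share:.3f}")

if __name__ == "__main__":
    simulate()
```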

In the rest of the paper, we restrict attention to unbiased rules, which reflect that actions cannot initially be distinguished apart from their name. A major result of the paper (Theorem 1) shows that either formalization of performance leads to the same class of so-called strictly improving behavioral rules. In this result, we characterize the net switching probabilities of three individuals sampling each other who are using the same strictly improving rule. In particular, the following properties of a strictly improving rule come to light. It must learn by imitating, i.e., an individual should never choose an action that is not in his sample. Behavior must be stochastic, i.e., an individual needs a randomizing device to find the action to choose in the next round.

The existence of strictly improving rules has already been established by Schlag (1997). In particular, the optimal rule from the single sampling setting, applied randomly to one of the two individuals sampled, generates such a rule. Our interest lies in selecting among the strictly improving rules. Selection among the strictly improving single sampling rules is straightforward. Independent of the specific decision problem, strictly improving single sampling rules can be ordered according to the expected increase in payoffs they generate. This is no longer true in the double sampling setting. There are double sampling rules that cannot be ranked according to their performance without making assertions about additional characteristics of the decision at hand. We refrain from adding information about which decision problems are more likely. Instead, we limit attention to rules that always perform better than the optimal single sampling rule. In this class there is a unique ranking according to performance. The best rules are characterized in Theorem 2. When each individual uses the same such rule, the population dynamic exhibits a very regular behavior, namely, it is aggregate monotone (in the sense of Samuelson and Zhang, 1992).

The above elimination yields a set of best rules that perform equally well. Using some additional criteria, we single out two distinct rules, the Double Imitation Rule (DI) and the Sequential Proportional Observation Rule (SPOR), and argue that each is optimal in its own sense. The DI (Section 3) enables the individual to switch the fewest times (Proposition 1). This rule has the nice additional property that it never specifies to adopt an action that only realized payoffs below one's own. The drawback is that its functional form is quite messy. We find that bounded rationality should not only put restrictions on memory but also on functional form. Hence, we present an alternative, drastically simpler rule. The SPOR can be implemented by applying the single sampling rule called the Proportional Observation Rule (POR) in sequence to both observations. The POR is the single sampling rule that specifies to imitate an observed individual with a probability proportional to the observed payoff, independently of one's own last payoff. We show that SPOR is also unique in the following sense. There is no other way to generate one of these best double sampling rules from a single sampling rule, either by randomly choosing one of the sampled individuals or by sequentially evaluating a rule. In particular, we have shown that it is necessary to forget one's own payoffs when generating a sophisticated behavior from a single sampling rule. Among the common properties of DI and SPOR listed in Remark 2, we find that: (i) the second most successful individual sampled is sometimes imitated, although the most successful will always be imitated with a higher probability, and, (ii) inertia is optimal; the previous action is almost never given up with certainty.
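A rough sketch of how SPOR could be implemented from this description follows. Reading `a probability proportional to the observed payoff' as (observed payoff − α)/(ω − α) for payoffs in an interval [α, ω] is an assumption made here for concreteness, as is the fixed order in which the two observations are evaluated; the paper's exact functional forms should be taken from the full text.

```python
import random

ALPHA, OMEGA = 0.0, 1.0  # assumed payoff bounds [alpha, omega]

def por(current_action, obs_action, obs_payoff):
    """Proportional Observation Rule: imitate the observed individual with
    a probability proportional to the observed payoff (normalized to the
    assumed payoff interval), independently of one's own last payoff."""
    switch_prob = (obs_payoff - ALPHA) / (OMEGA - ALPHA)
    return obs_action if random.random() < switch_prob else current_action

def spor(own_action, obs1, obs2):
    """Sequential Proportional Observation Rule: apply POR in sequence to
    both observations; the individual's own payoff is never consulted.
    Evaluating obs1 before obs2 is an implementation choice in this sketch."""
    action = por(own_action, obs1[0], obs1[1])
    action = por(action, obs2[0], obs2[1])
    return action

# Example: own action 'a'; the sampled individuals played 'b' (payoff 0.9)
# and 'c' (payoff 0.2).
print(spor("a", ("b", 0.9), ("c", 0.2)))
```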

In Section 7, we show how the above approach can be extended to a setting where individuals are repeatedly randomly matched to play a game.

Regarding related literature, Börgers et al. (1997) search for strictly improving behavioral rules that find the best action in the long run and that do not rely on sampling. They find that the best action can only be reached with arbitrarily high probability. Where the infinite population washes out noise in our model, arbitrarily slow learning does so in theirs. For an extensive survey of the literature, see Schlag (1997).

The rest of the paper is organized as follows. In Section 2, we describe how payoffs are generated and information is gathered. Feasible individual behavior is specified. In Section 3, we provide examples of behavioral rules and point out some of their properties. In Section 4, strictly improving rules are defined and characterized. In Section 5, strictly improving rules that perform better than the best single sampling rule are derived. In Section 6, two alternative selection methods, based on switching and on computational criteria, reveal rules that are each optimal in their own right. In Section 7, our general model is extended to a game playing scenario. Appendix A contains the proofs.

The setting

Consider the following task. An individual must repeatedly choose one of a finite number of actions from a given set A, each action i yielding an uncertain payoff drawn from a probability distribution Q_i with finite support (i ∈ A, |A| ≥ 2). All this individual knows is that payoffs are realized independently and will be contained in a given bounded interval [α, ω]
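By way of illustration, here is a minimal sketch of such a multi-armed bandit: a finite action set, each action drawing independent payoffs from a finite-support distribution contained in a bounded interval. The specific actions, supports, and probabilities are placeholders, not taken from the paper.

```python
import random

# Finite action set A; each action i has a finite-support payoff
# distribution Q_i, with all payoffs contained in a bounded interval
# (assumed here to be [0, 1]).  The entries below are placeholders.
BANDIT = {
    "a": ([0.0, 1.0], [0.5, 0.5]),            # expected payoff 0.50
    "b": ([0.2, 0.6], [0.5, 0.5]),            # expected payoff 0.40
    "c": ([0.1, 0.4, 0.9], [0.3, 0.4, 0.3]),  # expected payoff 0.46
}

def draw_payoff(action):
    """Draw an independent payoff realization from Q_action."""
    values, probs = BANDIT[action]
    return random.choices(values, weights=probs, k=1)[0]

def expected_payoff(action):
    values, probs = BANDIT[action]
    return sum(v * p for v, p in zip(values, probs))

# A best action maximizes expected payoff; the individual does not know it.
best_action = max(BANDIT, key=expected_payoff)
```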

Examples

Two behavioral rules from the single sampling setting are of particular interest to us. Schlag (1997) argues that the Proportional Imitation Rule with switching rate 1/(ω − α) is the optimal single sampling rule. This imitating rule, denoted by f^p, satisfies f^p(i, x, j, y)_j = [y − x]^+ / (ω − α) for j ≠ i, where [x]^+ = max{x, 0}. The so-called Proportional Observation Rule (with switching rate 1/(ω − α), denoted by f^b) will play an important role for us when constructing simple double sampling rules. This is the
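A minimal sketch of these two single sampling rules, assuming a payoff interval [α, ω] (set to [0, 1] here for concreteness); the normalization used for the Proportional Observation Rule is one natural reading of a switching rate of 1/(ω − α) and is an assumption rather than a quotation of the paper's definition:

```python
import random

ALPHA, OMEGA = 0.0, 1.0  # assumed payoff interval [alpha, omega]

def proportional_imitation(own_action, own_payoff, obs_action, obs_payoff):
    """Proportional Imitation Rule f^p: switch to the observed action j
    with probability [y - x]^+ / (omega - alpha), where x is the own
    payoff and y the observed payoff."""
    switch_prob = max(obs_payoff - own_payoff, 0.0) / (OMEGA - ALPHA)
    return obs_action if random.random() < switch_prob else own_action

def proportional_observation(own_action, obs_action, obs_payoff):
    """Proportional Observation Rule f^b: switch to the observed action
    with probability (y - alpha) / (omega - alpha), independently of the
    individual's own payoff."""
    switch_prob = (obs_payoff - ALPHA) / (OMEGA - ALPHA)
    return obs_action if random.random() < switch_prob else own_action
```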

Improving rules

We are interested in finding a rule that performs well when it is used by each individual in the population. Extending the approach of Schlag (1997), we will consider very weak conditions on what it means for a rule to perform well, at the cost of demanding that these conditions hold in many environments. Thus, we depart from the traditional Bayesian approach of assessing a prior belief over the set of multi-armed bandits and states. Notice that the Bayesian approach of updating a prior

Dominating rules

There are many strictly improving rules. However, predicting population behavior using Eq. (2) requires that each individual uses the same such rule. Similarly, the conditions of being individually improving and globally efficient both rely on the fact that all individuals are using the same rule. In this section we aim to select among the strictly improving rules. If we are successful and find a unique optimal rule for everyone to use, then this is an endogenous justification for considering

Selecting among rules

Unfortunately, the characterization of switching functions given in Theorem 2 does not lead to a unique rule. In particular, both DI and SPOR are such rules (see the note following Theorem 1), i.e., rules that are best at outperforming the optimal single sampling rule. In the following we will show that each of the rules DI and SPOR is a unique best, and hence optimal, rule in its own (well defined) sense.

A game playing scenario

What behavior should an individual follow when the multi-armed bandit is not stationary over time? What can he learn? We will consider a popular example of such a situation; individuals will be repeatedly randomly matched to play a fixed game. Non-stationarity arises since behavior in the opposite population will generally change over time. Consider two infinite populations of equal cardinality, also referred to as populations one and two. Let A_i be the finite set of actions available to an

Acknowledgements

I wish to thank Avner Shaked for helpful comments. Financial support from the Deutsche Forschungsgemeinschaft, Sonderforschungsbereich 303 at the University of Bonn is gratefully acknowledged.

References (15)

  • Rothschild, M., 1974. A two-armed bandit theory of market pricing. J. Econ. Theory.
  • Samuelson, L., Zhang, J., 1992. Evolutionary stability in asymmetric games. J. Econ. Theory.
  • Axelrod, R.M., 1984. The Evolution of Cooperation. Basic Books, New...
  • Berry, D., Fristedt, B., 1985. Bandit Problems: Sequential Allocation of Experiments. Chapman & Hall,...
  • Björnerstedt, J., Schlag, K.H., 1996. On the evolution of imitative behavior. Discussion Paper B-378. University of...
  • Börgers, T., Morales, A., Sarin, R., 1997. Simple behaviour rules which lead to expected payoff maximizing choices....
  • Ellison, G., Fudenberg, D., 1995. Word-of-mouth communication and social learning. Q. J. Econ.
