Published in: EURASIP Journal on Wireless Communications and Networking 1/2010

Open Access 01.12.2010 | Research Article

Multiagent $Q$-Learning for Aloha-Like Spectrum Access in Cognitive Radio Systems

Author: Husheng Li


Abstract

An Aloha-like spectrum access scheme without negotiation is considered for multiuser and multichannel cognitive radio systems. To avoid the collisions incurred by the lack of coordination, each secondary user learns how to select channels according to its experience. Multiagent reinforcement learning (MARL) is applied for the secondary users to learn good strategies of channel selection. Specifically, the framework of $Q$-learning is extended from the single-user case to the multiagent case by considering the other secondary users as a part of the environment. The dynamics of the $Q$-learning are illustrated using a Metrick-Polak plot, which shows the traces of the $Q$-values in the two-user case. For both the complete and partial observation cases, rigorous proofs of the convergence of the multiagent $Q$-learning without communications, under certain conditions, are provided using the Robbins-Monro algorithm and contraction mapping, respectively. The learning performance (speed and gain in utility) is evaluated by numerical simulations.

1. Introduction

In recent years, cognitive radio has attracted intensive study in the wireless communications community. It allows users without a license (called secondary users) to access licensed frequency bands when the licensed users (called primary users) are not present. Therefore, the cognitive radio technique can substantially alleviate the problem of the underutilization of the frequency spectrum [1, 2].
The following two problems are key to cognitive radio systems.
(i)
Resource mining, that is, how to detect the available resource (the frequency bands that are not being used by primary users); this is usually done by spectrum sensing.
 
(ii)
Resource allocation, that is, how to allocate the available resource to different secondary users.
 
Substantial work has been done for the resource mining problem. Many signal processing techniques have been applied to sense the frequency spectrum [3], for example, cyclostationary feature [4], quickest change detection [5], and collaborative spectrum sensing [6]. Meanwhile, a significant amount of research has been conducted for the resource allocation in cognitive radio systems [7, 8]. Typically, it is assumed that secondary users exchange information about available spectrum resources and then negotiate on the resource allocation according to their own requirements of traffic (since the same resource cannot be shared by different secondary users if orthogonal transmission is assumed). These studies typically apply theories in economics, for example, game theory, bargaining theory, or microeconomics.
However, in many applications of cognitive radio, such negotiation-based resource allocation may incur significant overhead. In traditional wireless communication systems, the available resource is almost fixed (even if we consider the fluctuation of channel quality incurred by fading, the change of available resource is usually very slow and can thus be considered stationary). Therefore, the negotiation need not be carried out frequently, and the negotiation result can be applied over a long period of data communication, thus incurring only tolerable overhead. However, in many cognitive radio systems, the resource may change very rapidly since the activity of primary users may be highly dynamic. Therefore, the available resource needs to be updated very frequently, and the data communication period between two spectrum sensing periods should be fairly short since minimal violation to primary users should be guaranteed. In such a situation, the negotiation of resource allocation may be highly inefficient since a substantial portion of time needs to be used for the negotiation. To alleviate such inefficiency, high-speed transceivers need to be used to minimize the time consumed on negotiation. Particularly, the turn-around time, that is, the time needed to switch from receiving (transmitting) to transmitting (receiving), should be very small, which is a substantial challenge to hardware design.
Motivated by the above discussion and observation, in this paper, we study the problem of spectrum access without negotiation in multiuser and multichannel cognitive radio systems. The spectrum access without negotiation is achieved by applying the framework of reinforcement learning. In such a scheme, each secondary user senses channels and then chooses an idle frequency channel to transmit data, as if no other secondary user existed. If two secondary users choose the same channel for data transmission, they collide with each other and the corresponding data packets cannot be decoded. Such a procedure is illustrated in Figure 1, where three secondary users access an access point via four channels. Note that such a scheme is similar to Aloha [9], where no explicit collision avoidance is applied. We could also apply techniques similar to $p$-persistent Carrier Sensing Multiple Access (CSMA), that is, each secondary user transmits with probability $p$ when it finds an available channel; however, this is beyond the scope of this paper. In the Aloha-like approach, since there is no mutual communication among the secondary users, collisions are unavoidable. However, the secondary users can try to learn collision avoidance, as well as the channel qualities (we assume that the secondary users have no a priori information about the channel qualities), according to their experience. In such a context, the learning concerns not only the available frequency spectrum but also the behavior of the other secondary users.
To accomplish the learning of Aloha-like spectrum access, multiagent reinforcement learning (MARL) [10] is applied in this paper. One challenge of MARL in our context is that the secondary users do not know the payoffs (and thus do not know the strategies) of the other secondary users in each stage; thus the environment of each secondary user, which includes the other secondary users, is nonstationary and may not assure the convergence of learning. Due to the assumption that there is no mutual communication between different secondary users, many traditional MARL techniques like fictitious play [11, 12] and Nash-Q learning [13] cannot be used since they need information exchange among the players (e.g., exchanging their action information). To alleviate the lack of mutual communication, we extend the principle of single-agent $Q$-learning, that is, evaluating the values of different state-action pairs in an incremental way, to the multiagent situation without information exchange. By applying the theory of stochastic approximation [14], which has been used in many studies on wireless networks [15, 16], we prove the main result of this paper, that is, that the learning converges to a stationary point regardless of the initial strategies (Propositions 1 and 2).
Some studies on reinforcement learning in cognitive radio networks have been done [17–19]. In [17] and [19], the studies are focused on the resource competition in a spectrum auction system, where the channel allocation is determined by the spectrum regulator; this differs from this paper, in which no regulator exists. Reference [18] discusses the correlated equilibrium and achieves it by no-regret learning, that is, by minimizing the gap between the current reward and the optimal reward. In that approach, mutual communication is needed among the secondary users. However, in our study, no inter-secondary-user communication is assumed.
Note that the study in this paper has subtle similarities to evolutionary game theory [20], which has been successfully applied to cooperative spectrum sensing in cognitive radio systems [21]. Both our study and the evolutionary game focus on the dynamics of the strategy changes of users. However, there is a key difference between the two studies. Evolutionary game theory assumes pure strategies for the players (e.g., cooperate or free-ride in cooperative spectrum sensing [21]) and studies the proportions of players using different pure strategies. The key equation in evolutionary game theory, called the replicator equation, describes the dynamics of the corresponding proportions. In contrast to the evolutionary game, the players in our study use mixed strategies, and the basic equation (16) describes the dynamics of the $Q$-values for different channels. Although the convergence is proved by studying ordinary differential equations in both studies, the proofs are significantly different since the equations have totally different expressions.
The remainder of this paper is organized as follows. In Section 2, the system model is introduced. The basic elements of the game and the proposed multiagent $Q$-learning for the fully observable case (i.e., each secondary user can sense all channels) are introduced in Section 3. The corresponding convergence of the $Q$-learning is proved in Section 4. The $Q$-learning for the partially observable case (i.e., each secondary user can sense only a subset of the channels) is discussed in Section 5. Numerical results are provided in Section 6, while conclusions are drawn in Section 7.

2. System Model

We consider $N$ active secondary users accessing $M$ licensed frequency channels. (When there are more than $N$ channels, there is less competition, thus making the problem easier. We do not consider the case when the number of channels is less than the number of secondary users since a typical cognitive radio system can provide sufficient channels. Meanwhile, the proposed algorithm can also be applied to all possible cases of $M$ and $N$.) We index the secondary users by integers $1, 2, \ldots, N$ and the channels by $1, 2, \ldots, M$. For simplicity, we denote by $-i$ the set of users (channels) different from user (channel) $i$.
The following assumptions are made throughout this paper.
(i)
The secondary users are sufficiently close to each other such that they share the same activity of primary users. There is no communication among these secondary users, thus excluding the possibility of negotiation.
 
(ii)
We assume that the activity of primary users over each channel is a Markov chain. (A more reasonable model for the activity of primary users is the semi-Markov chain. The corresponding analysis is more tedious but similar to that in this paper. Therefore, for simplicity of analysis, we consider only the Markov chain model in this paper.) The chain has states 0 (busy: the channel is occupied by primary users and cannot be used by secondary users) and 1 (idle: there is no primary user over this channel). We denote by $s_m(t)$ the state of channel $m$ in the sensing period of the $t$-th spectrum access period. For channel $m$, the transition probability from state 0 to state 1 (resp., from state 1 to state 0) is denoted by $p_{01}^m$ (resp., $p_{10}^m$). We assume that the $M$ Markov chains for the $M$ channels are mutually independent; a minimal simulation sketch of this channel model is given at the end of this section. We also assume perfect spectrum sensing and do not consider possible errors of spectrum sensing.
 
(iii)
We assume that the channel state transition probabilities, as well as the channel rewards, are unknown to the secondary users at the beginning. They are fixed throughout the game, unless otherwise noted. Therefore, the secondary users need to learn the channel properties.
 
(iv)
The timing structure of spectrum sensing and data transmission is illustrated in Figure 2, where data is transmitted after the spectrum sensing period. We assume that each secondary user is able to sense only one channel during the spectrum sensing period and transmit over only one channel during the data transmission period.
 
In Sections 3 and 4, we consider the case in which all secondary users have full knowledge of the channel states in the previous spectrum access period (complete observation). Note that this does not contradict the assumption that a secondary user can sense only one channel during the spectrum sensing period, since the secondary user can continue to sense other channels during the data transmission period (suppose that the signal from primary users can be well distinguished from that of secondary users, e.g., using different cyclostationary features [22]). If we consider the set of channel states in the previous spectrum access period as the system state, denoted by $\mathbf{s}(t)$ at spectrum access period $t$, then the previous assumption implies a completely observable system state, which substantially simplifies the analysis. In Section 5, we will also study the case in which secondary users cannot continue to sense during the data transmission period (partial observation); thus each secondary user has only partial observations of the system state.
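To make assumption (ii) concrete, the following minimal sketch simulates $M$ independent two-state channel Markov chains. It is an illustration only: the function name and the transition probabilities passed in are our own assumptions, not values taken from the paper.

```python
import random

def simulate_channels(p01, p10, T, seed=0):
    """Simulate M independent two-state Markov channels for T periods.

    p01[m]: probability that channel m moves from busy (0) to idle (1).
    p10[m]: probability that channel m moves from idle (1) to busy (0).
    Returns one state vector per spectrum access period.
    """
    rng = random.Random(seed)
    M = len(p01)
    s = [1] * M                      # start with all channels idle
    history = []
    for _ in range(T):
        for m in range(M):
            if s[m] == 0:            # busy -> idle with probability p01[m]
                s[m] = 1 if rng.random() < p01[m] else 0
            else:                    # idle -> busy with probability p10[m]
                s[m] = 0 if rng.random() < p10[m] else 1
        history.append(list(s))
    return history

# Example: four channels with illustrative transition probabilities.
print(simulate_channels(p01=[0.8, 0.7, 0.6, 0.9], p10=[0.2, 0.3, 0.1, 0.2], T=5))
```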

3. Game and $Q$-Learning

In this section, we introduce the game associated with the learning procedure and the application of $Q$-learning to the Aloha-like spectrum access problem. Note that in this section and Section 4, we assume that each secondary user knows all channel states in the previous time slot, that is, the completely observable case.

3.1. Game of Aloha-Like Spectrum Access

The Aloha-like spectrum access problem is essentially an $N$-player game. When secondary user $i$ transmits over an idle channel $m$, it receives reward $c_{im}$ (e.g., channel capacity or successful transmission probability) if no other secondary user transmits over this channel, and reward 0 if one or more other secondary users transmit over this channel, since a collision happens. We assume that the reward $c_{im}$ does not change with time. When the channels change slowly, the learning algorithm proposed in this paper can also be applied to track the changes of the channels; when the channels change very fast, it is impossible for the secondary users to learn. Since there is no explicit information exchange among the secondary users, the collision avoidance is completely based on the received rewards. The payoff matrices for the case of $N = 2$ are given in Figure 3. Note that the actions in the game, denoted by $a_i(t)$ for user $i$ at time $t$, are the selections of channels. Obviously, the diagonal elements in the payoff matrices are all zero since a collision yields zero reward.
It is well known that a Nash equilibrium consists of strategies such that unilaterally changing one's strategy incurs a degradation of one's own performance. Mathematically, a Nash equilibrium is a set of strategies $\left(\pi_1^*, \ldots, \pi_N^*\right)$, where $\pi_i^*$ is the strategy of player $i$, which satisfies
$$u_i\left(\pi_i^*, \boldsymbol{\pi}_{-i}^*\right) \geq u_i\left(\pi_i, \boldsymbol{\pi}_{-i}^*\right), \quad \forall \pi_i, \ i = 1, \ldots, N,$$
(1)
where $u_i$ means the reward of player $i$ and $\boldsymbol{\pi}_{-i}$ means the strategies of all players except player $i$.
It is easy to verify that there are multiple Nash equilibrium points in this game. Obviously, orthogonal transmission strategies, that is, strategies in which each secondary user $i$ accesses a distinct channel $m_i$ with probability 1 ($m_i \neq m_j$ for $i \neq j$), are pure equilibria. The reason is the following. If a secondary user changes its strategy and transmits over other channels with nonzero probability, those transmissions will collide with the other secondary users (recall that, for the Nash equilibrium, all other secondary users do not change their strategies) and incur performance degradation. An orthogonal channel assignment can be achieved in the following way: let all secondary users sense the channels randomly at the very beginning; once a secondary user finds an idle channel, it accesses this channel forever; after a random number of rounds, all secondary users will have found different channels, thus achieving orthogonal transmission. We call this scheme the simple orthogonal channel assignment since it is simple and fast. However, in this scheme, the different rewards of different channels are ignored. As will be seen in the numerical simulation results, the proposed learning procedure can significantly outperform the simple orthogonal channel assignment.

3.2. $Q$-Value

We define the $Q$-function as the expected reward in one time slot (since the channel states are completely known to the secondary users and are not controlled by them, each secondary user needs to consider only the expected reward in one time slot, that is, a myopic strategy) of each action under different states; that is, for secondary user $i$ and system state $\mathbf{s}$, the $Q$-value of choosing channel $m$ is given by
$$Q_i(\mathbf{s}, m) = E\left[r_i \mid \mathbf{s}, a_i = m\right],$$
(2)
where $r_i$ is the reward obtained by secondary user $i$, which is dependent on the action, as well as the system state, and the expectation is over the randomness of the other users' actions, as well as the primary users' occupancies.

3.3. Exploration

In contrast to fictitious play [11], which is deterministic, the action in $Q$-learning is random. We assign nonzero probabilities to all channels such that all channels will be explored. Such an exploration guarantees that good channels will not be missed during the learning procedure. We consider the Boltzmann distribution [23] for the random exploration, that is,
$$P\left(a_i = m \mid \mathbf{s}\right) = \frac{\exp\left(Q_i(\mathbf{s}, m)/\gamma\right)}{\sum_{n=1}^{M} \exp\left(Q_i(\mathbf{s}, n)/\gamma\right)},$$
(3)
where $\gamma$ is called the temperature, which controls the randomness of the exploration. Obviously, the smaller $\gamma$ is (the colder), the more focused the actions are. When $\gamma \to 0$, each user chooses only the channel having the largest $Q$-value.
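As a concrete illustration of (3), the following sketch maps one user's $Q$-values to channel selection probabilities and samples an exploratory action. The function names and the max-subtraction for numerical stability are implementation choices of ours, not part of the paper.

```python
import math
import random

def boltzmann_probabilities(q_values, gamma):
    """Map one user's Q-values (for the current system state) to channel
    selection probabilities via the Boltzmann distribution (3)."""
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / gamma) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def sample_channel(q_values, gamma, rng=random):
    """Randomly explore: draw a channel index according to (3)."""
    probs = boltzmann_probabilities(q_values, gamma)
    return rng.choices(range(len(q_values)), weights=probs)[0]

# A small temperature concentrates the choice on the best channel.
print(boltzmann_probabilities([0.9, 0.5, 0.4], gamma=1.0))   # mild preference
print(boltzmann_probabilities([0.9, 0.5, 0.4], gamma=0.05))  # almost deterministic
print(sample_channel([0.9, 0.5, 0.4], gamma=0.05))
```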
When secondary user $i$ selects channel $m$ and the system state is $\mathbf{s}$, the expected reward is given by
$$E\left[r_i \mid \mathbf{s}, a_i = m\right] = c_{im}\, P\left(s_m(t) = 1 \mid \mathbf{s}\right) \prod_{j \neq i} \left(1 - P\left(a_j = m \mid \mathbf{s}\right)\right),$$
(4)
since secondary user $j$ ($j \neq i$) chooses channel $m$ with probability $P\left(a_j = m \mid \mathbf{s}\right)$ (in which case a collision happens and secondary user $i$ receives no reward) and channel $m$ is idle with probability $P\left(s_m(t) = 1 \mid \mathbf{s}\right)$; the product in (4) is then the probability that no other secondary user accesses channel $m$.
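Once the Boltzmann probabilities of the other users are known, the right-hand side of (4) can be evaluated directly. A minimal sketch, with illustrative rewards and probabilities:

```python
def expected_reward(i, m, c, idle_prob, action_probs):
    """Evaluate the expected one-slot reward in (4) for user i on channel m.

    c[i][m]      : reward of user i on channel m when the transmission succeeds.
    idle_prob[m] : probability that channel m is idle given the system state.
    action_probs : action_probs[j][m] is P(a_j = m) from the Boltzmann rule (3).
    """
    no_collision = 1.0
    for j in range(len(action_probs)):
        if j != i:
            no_collision *= (1.0 - action_probs[j][m])
    return c[i][m] * idle_prob[m] * no_collision

# Two users, two channels: user 1 (index 0) on channel 0 while user 2
# picks channel 0 with probability 0.3.
c = [[1.0, 0.6], [0.8, 0.9]]
print(expected_reward(0, 0, c, idle_prob=[0.9, 0.9],
                      action_probs=[[0.7, 0.3], [0.3, 0.7]]))
```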

3.4. Updating $Q$-Values

In the procedure of $Q$-learning, the $Q$-functions are updated after each spectrum access using the following rule:
$$Q_{t+1}^{i}(\mathbf{s}, m) = \left(1 - \alpha_t^{im}\right) Q_t^{i}(\mathbf{s}, m) + \alpha_t^{im}\, r_i(t)\, I\left(a_i(t) = m\right),$$
(5)
where $\alpha_t^{im}$ is a step factor (when channel $m$ is not selected by user $i$, we set $\alpha_t^{im} = 0$), $r_i(t)$ is the reward of secondary user $i$, and $I\left(a_i(t) = m\right)$ is the characteristic function of the event that channel $m$ is selected by secondary user $i$ at the $t$-th spectrum access. Note that this is the standard $Q$-learning without considering the future states. An intuitive explanation for (5) is that, once channel $m$ is accessed, the corresponding $Q$-value is updated by combining the old value and the new reward; if channel $m$ is not chosen, we keep the old value by setting $\alpha_t^{im} = 0$. Our study is focused on the dynamics of (5). To assure convergence, we assume that
$$\sum_{t=1}^{\infty} \alpha_t^{im} = \infty,$$
(6)
as well as
$$\sum_{t=1}^{\infty} \left(\alpha_t^{im}\right)^2 < \infty.$$
(7)
Note that, in a typical stochastic game setting and $Q$-learning, the updating rule in (5) should consider the reward of the future and add a discounted term of the future reward to the right-hand side of (5). However, in this paper, the optimal strategy is myopic since we assume that the system state is known, and thus the secondary users' actions do not affect the system state. For the case of partial observation (i.e., each secondary user knows only the state of a single channel), the action does change each secondary user's state (typically the belief on the system state), and the future reward should be included in the right-hand side of (5), which will be discussed in Section 5.
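A minimal sketch of the update rule (5) for one user follows. Setting the step factor to zero for unchosen channels is equivalent to leaving their $Q$-values untouched, which is how the code implements it.

```python
def update_q(q_values, chosen, reward, alpha):
    """Apply the update rule (5) for one user after one spectrum access:
    only the Q-value of the chosen channel moves toward the new reward;
    all other Q-values keep their old values (step factor 0)."""
    q_values[chosen] = (1.0 - alpha) * q_values[chosen] + alpha * reward
    return q_values

q = [0.5, 0.5]
q = update_q(q, chosen=0, reward=1.0, alpha=0.1)  # success on channel 0
q = update_q(q, chosen=0, reward=0.0, alpha=0.1)  # collision on channel 0
print(q)  # the Q-value of channel 1 is untouched
```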

3.5. Stationary Point

The $Q$-values of different users are mutually coupled: if one $Q$-value changes, the strategy of the corresponding user changes, which in turn changes the expected rewards, and thus the $Q$-values, of all other users. We define $Q$-values satisfying the following equations as a stationary point:
$$Q_i^*(\mathbf{s}, m) = c_{im}\, P\left(s_m = 1 \mid \mathbf{s}\right) \prod_{j \neq i} \left(1 - P^*\left(a_j = m \mid \mathbf{s}\right)\right), \quad \forall i, m, \mathbf{s},$$
(8)
where $P^*\left(a_j = m \mid \mathbf{s}\right)$ is the channel selection probability in (3) generated by the stationary $Q$-values.
Note that the stationarity is only in the statistical sense since the $Q$-values can fluctuate around the stationary point due to the randomness of the exploration. Obviously, as $\gamma \to 0$, the stationary point converges to a Nash equilibrium point. However, we are still not sure about the existence of such a stationary point. The following lemma assures the existence of a stationary point. The proof is given in the appendix.
Lemma 1.
For sufficiently small $\gamma$, there exists at least one stationary point satisfying (8).
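Lemma 1 only asserts existence; a stationary point of (8) can also be located numerically by repeatedly moving each $Q$-value toward the expected reward induced by the current Boltzmann strategies. The sketch below assumes two users and always-idle channels with illustrative rewards; the damping factor is our addition to stabilize the iteration.

```python
import math

def fixed_point(c, idle_prob, gamma, iters=2000, damping=0.5):
    """Search numerically for a stationary point of (8): repeatedly move each
    Q-value toward the expected reward induced by the Boltzmann strategies."""
    N, M = len(c), len(c[0])
    Q = [[0.5] * M for _ in range(N)]
    for _ in range(iters):
        # Boltzmann channel selection probabilities (3) for every user.
        probs = []
        for i in range(N):
            mx = max(Q[i])
            w = [math.exp((q - mx) / gamma) for q in Q[i]]
            probs.append([x / sum(w) for x in w])
        # Damped update toward the right-hand side of (8).
        for i in range(N):
            for m in range(M):
                no_collision = 1.0
                for j in range(N):
                    if j != i:
                        no_collision *= (1.0 - probs[j][m])
                target = c[i][m] * idle_prob[m] * no_collision
                Q[i][m] = (1 - damping) * Q[i][m] + damping * target
    return Q

print(fixed_point(c=[[1.0, 0.6], [0.8, 0.9]], idle_prob=[1.0, 1.0], gamma=0.1))
```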

4. Convergence of $Q$-Learning without Information Exchange

In this section, we study the convergence of the proposed $Q$-learning algorithm. First, we provide an intuitive explanation for the convergence in the $N = 2$ case. Then, we apply the tools of stochastic approximation and ordinary differential equations (ODEs) to prove the convergence rigorously.

4.1. Intuition on Convergence

As will be shown in Proposition 1, the updating rule of the $Q$-values in (5) converges to a stationary point close to a Nash equilibrium. Before the rigorous proof, we provide an intuitive explanation for the convergence using the geometric argument proposed in [24].
The intuitive explanation is provided in Figure 4 for the case of $N = M = 2$ (we call it the Metrick-Polak plot since it was originally proposed by A. Metrick and B. Polak in [24]). For simplicity, we ignore the indices of the state and assume that both channels are idle, writing $Q_{im}$ for the $Q$-value of user $i$ for channel $m$. The axes are $Q_{11} - Q_{12}$ and $Q_{21} - Q_{22}$, respectively. As labeled in the figure, the plane is divided into four regions separated by the two lines $Q_{11} = Q_{12}$ and $Q_{21} = Q_{22}$, in which the dynamics of the $Q$-learning are different. We discuss these four regions separately.
(i)
Region I: in this region, $Q_{11} > Q_{12}$; therefore, secondary user 1 prefers visiting channel 1; meanwhile, secondary user 2 prefers accessing channel 2 since $Q_{21} < Q_{22}$; then, with large probability, the strategies will converge to a stationary point at which secondary users 1 and 2 access channels 1 and 2, respectively.
 
(ii)
Region II: in this region, both secondary users prefer accessing channel 1 ($Q_{11} > Q_{12}$ and $Q_{21} > Q_{22}$), thus causing many collisions. Therefore, both $Q_{11}$ and $Q_{21}$ will be reduced until the trajectory enters either Region I or Region III.
 
(iii)
Region III: similar to Region I.
 
(iv)
Region IV: similar to Region II.
 
Then, we observe that the points in Regions II and IV are unstable and will move into Region I or III with large probability. In Regions I and III, the strategy will move close to the stationary point with large probability. Therefore, regardless of where the initial point is, the updating rule in (5) will converge to a stationary point with large probability.
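The region-crossing behavior described above can be reproduced with a short simulation. The following sketch assumes two users, two always-idle channels, illustrative rewards, and a constant step factor; it reports whether the learned preferences ended in the stable (orthogonal) regions I/III.

```python
import math
import random

rng = random.Random(1)
c = [[1.0, 0.6], [0.8, 0.9]]          # illustrative rewards c[i][m]
Q = [[rng.random(), rng.random()] for _ in range(2)]
gamma, alpha = 0.05, 0.1

def pick(q):
    """Sample a channel via the Boltzmann rule (3)."""
    mx = max(q)
    w = [math.exp((x - mx) / gamma) for x in q]
    return rng.choices([0, 1], weights=w)[0]

for t in range(500):
    a = [pick(Q[0]), pick(Q[1])]      # both channels assumed idle
    for i in range(2):
        r = c[i][a[i]] if a[0] != a[1] else 0.0   # collision -> zero reward
        Q[i][a[i]] = (1 - alpha) * Q[i][a[i]] + alpha * r

orthogonal = (Q[0][0] > Q[0][1]) != (Q[1][0] > Q[1][1])
print(Q, "->", "Region I/III (orthogonal)" if orthogonal else "Region II/IV (conflict)")
```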

4.2. Stochastic Approximation-Based Convergence

In this section, we prove the convergence of the $Q$-learning for the proposed Aloha-like spectrum access with Boltzmann-distributed exploration. First, we find the equivalence between the updating rule (5) and the Robbins-Monro iteration [25] for solving an equation with an unknown expression (a brief introduction is provided in the appendix). Then, we apply a conclusion in stochastic approximation [14] to relate the dynamics of the updating rule to an ODE and prove the convergence of the ODE.

4.2.1. Robbins-Monro Iteration

At a stationary point, the expected values of the $Q$-functions satisfy the equations in (8). For system state $\mathbf{s}$, define the vector of all $Q$-values,
$$\mathbf{Q}(\mathbf{s}) = \left(Q_1(\mathbf{s}, 1), \ldots, Q_1(\mathbf{s}, M), Q_2(\mathbf{s}, 1), \ldots, Q_N(\mathbf{s}, M)\right)^T.$$
(9)
Then, (8) can be rewritten as
$$f\left(\mathbf{Q}(\mathbf{s})\right) = \mathbf{0},$$
(10)
where the $k$-th element of $f$ is given by
$$f_k\left(\mathbf{Q}(\mathbf{s})\right) = c_{im}\, \pi_{im}(\mathbf{s}) - Q_i(\mathbf{s}, m), \quad i = \left\lceil k/M \right\rceil, \ m = k \ \mathrm{mod}\ M,$$
(11)
and (the function $k$ mod $M$ means the remainder of dividing the integer $k$ by the integer $M$)
$$\pi_{im}(\mathbf{s}) = P\left(s_m(t) = 1 \mid \mathbf{s}\right) \prod_{j \neq i} \left(1 - P\left(a_j = m \mid \mathbf{s}\right)\right),$$
(12)
with the convention that a zero remainder is identified with $M$. Obviously, $\pi_{im}(\mathbf{s})$ is the probability that channel $m$ can be used by secondary user $i$ without collision with other secondary users, when the current system state is $\mathbf{s}$.
Then, the updating rule in (5) is equivalent to solving (10) (the expression of the equation is unknown since the rewards, the channel transition probabilities, as well as the strategies of the other users, are all unknown) using the Robbins-Monro algorithm [14], that is,
$$\mathbf{Q}_{t+1}(\mathbf{s}) = \mathbf{Q}_t(\mathbf{s}) + \boldsymbol{\alpha}_t \circ \mathbf{Y}_t,$$
(13)
where $\boldsymbol{\alpha}_t$ is the vector of all step factors ($\circ$ denotes the elementwise product), $\mathbf{r}_t$ is the vector of rewards obtained at spectrum access period $t$, and $\mathbf{Y}_t$ is a random observation of the function $f$ contaminated by noise, that is,
$$\mathbf{Y}_t = f\left(\mathbf{Q}_t(\mathbf{s})\right) + \delta\mathbf{M}_t,$$
(14)
where $\delta\mathbf{M}_t$ is the noise and (recall that $r_i(t)$ means the reward of secondary user $i$ at time $t$)
$$\left(\delta\mathbf{M}_t\right)_{im} = r_i(t) - E\left[r_i(t) \mid \mathbf{Q}_t(\mathbf{s})\right].$$
(15)
Obviously, $E\left[\delta\mathbf{M}_t \mid \mathbf{Q}_t(\mathbf{s})\right] = \mathbf{0}$ since the expectation of the difference between the reward and the expected reward is equal to 0. Therefore, the observation noise $\delta\mathbf{M}_t$ is a martingale difference.

4.2.2. ODE and Convergence

The procedure of the Robbins-Monro algorithm (i.e., the updating of the $Q$-values) is the stochastic approximation of the solution of the equation. It is well known that the convergence of such a procedure can be characterized by an ODE. Since the noise $\delta\mathbf{M}_t$ in (14) is a martingale difference, it is easy to verify the conditions of Theorem 1 in the appendix and obtain the following lemma (the proof is given in the appendix).
Lemma 2.
With probability 1, the sequence $\left\{\mathbf{Q}_t(\mathbf{s})\right\}$, $t = 1, 2, \ldots$, converges to some limit set of the ODE
$$\dot{\mathbf{Q}}(\mathbf{s}) = f\left(\mathbf{Q}(\mathbf{s})\right).$$
(16)
What remains is to analyze the convergence property of the ODE (16). We obtain the following lemma by applying a Lyapunov function. The proof is given in the appendix.
Lemma 3.
If a stationary point determined by (10) exists, the solution of the ODE (16) converges to the stationary point for sufficiently large $\gamma$.
Combining Lemmas 1, 2, and 3, we obtain the main result in this paper.
Proposition 1.
Suppose that a stationary point determined by (10) exists. For any system state $\mathbf{s}$ and sufficiently large $\gamma$, the $Q$-learning converges to a stationary point with probability 1.
Note that a sufficiently small $\gamma$ guarantees the existence of a stationary point, while a sufficiently large $\gamma$ assures the convergence of the learning procedure. However, the two requirements do not conflict since neither is a necessary condition. As we found in our simulations, we can always choose a suitable $\gamma$ to guarantee both the existence of the stationary point and the convergence.
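The convergence asserted by Lemma 3 and Proposition 1 can be illustrated by integrating the ODE (16) with forward Euler steps. The sketch below assumes two users and two always-idle channels with illustrative rewards; the function f is our reconstruction of the right-hand side of (16) for this special case.

```python
import math

def f(Q, c, gamma):
    """Right-hand side of the ODE (16) for two users and two idle channels:
    expected reward under the current Boltzmann strategies minus the
    current Q-value, per user and channel."""
    probs = []
    for q in Q:
        mx = max(q)
        w = [math.exp((x - mx) / gamma) for x in q]
        probs.append([x / sum(w) for x in w])
    out = []
    for i in range(2):
        row = []
        for m in range(2):
            no_collision = 1.0 - probs[1 - i][m]       # only one other user
            row.append(c[i][m] * no_collision - Q[i][m])
        out.append(row)
    return out

c = [[1.0, 0.6], [0.8, 0.9]]
Q = [[0.9, 0.1], [0.85, 0.15]]   # start in a conflict region (both prefer channel 1)
h = 0.01                         # Euler step size
for _ in range(5000):
    dQ = f(Q, c, gamma=0.2)
    Q = [[Q[i][m] + h * dQ[i][m] for m in range(2)] for i in range(2)]
print(Q)                         # settles near a stationary point
```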

5. $Q$-Learning with Partial Observations

In this section, we remove the assumption that all secondary users know all channel states in the previous spectrum access period and assume that each secondary user knows the state of only the channel it sensed in the previous spectrum access period, thus making the system state partially observable. The difficulties of analyzing such a scenario are given below:
(i)
The system state is partially observable.
 
(ii)
The game is imperfectly monitored, that is, each player does not know other players' actions.
 
(iii)
The game has incomplete information, that is, each player does not know the strategies of other players, as well as their beliefs on the system state.
 
Note that the latter two difficulties are common to both the complete and partial observation cases. However, the imperfect monitoring and incomplete information add much more difficulty in the partial observation case. In this section, we formulate the $Q$-learning algorithm and then prove its convergence under certain conditions.

5.1. State Definition

It is well known that, in partially observable Markov decision process (POMDP) problems, the belief on the system state, that is, $P\left(\mathbf{s}(t) \mid \mathbf{O}_t\right)$ (where $\mathbf{O}_t$ is the observation history before period $t$), can play the role of the system state. Due to the special structure of the game, we can define the state of secondary user $i$ at period $t$ as
$$\mathbf{s}_i(t) = \left\{\left(\tau_{im}(t), \check{s}_{im}(t)\right)\right\}_{m=1}^{M},$$
(17)
where $\tau_{im}(t)$, $m = 1, \ldots, M$, means the number of consecutive periods during which channel $m$ has not been sensed before period $t$ (e.g., if the last time that channel $m$ was sensed by secondary user $i$ is time slot $t_0$, then $\tau_{im}(t) = t - t_0$) and $\check{s}_{im}(t)$ is the state of channel $m$ at the last time it was sensed before period $t$.
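The bookkeeping behind (17) is simple: after each sensing period, the counter of the sensed channel is reset and all other counters grow by one. A minimal sketch (the function name and the tuple representation are our assumptions):

```python
def update_partial_state(state, sensed_channel, observed):
    """Update the per-user state (17) after sensing one channel.

    state is a list of (tau, last_state) pairs, one per channel: tau counts
    the consecutive periods since the channel was last sensed, and
    last_state is the channel state (0 busy / 1 idle) observed at that time.
    """
    new_state = []
    for m, (tau, last) in enumerate(state):
        if m == sensed_channel:
            new_state.append((1, observed))    # just sensed: reset the counter
        else:
            new_state.append((tau + 1, last))  # one more period unobserved
    return new_state

s = [(1, 1), (3, 0)]   # channel 0 seen idle last slot, channel 1 busy 3 slots ago
print(update_partial_state(s, sensed_channel=1, observed=1))
```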

5.2. Learning in the POMDP Game

For the purpose of learning, we define the objective function of user $i$ as the discounted sum of the rewards in each spectrum access period with discount factor $\beta$, that is,
$$U_i = E\left[\sum_{t=1}^{\infty} \beta^t r_i(t)\right].$$
(18)
Then, to maximize the objective function, the corresponding $Q$-learning update is given by [23]
$$Q_{t+1}^{i}\left(\mathbf{s}_i(t), m\right) = \left(1 - \alpha_t^{im}\left(\mathbf{s}_i(t)\right)\right) Q_t^{i}\left(\mathbf{s}_i(t), m\right) + \alpha_t^{im}\left(\mathbf{s}_i(t)\right) \left(r_i(t) + \beta \max_{n} Q_t^{i}\left(\mathbf{s}_i(t+1), n\right)\right),$$
(19)
where $\mathbf{s}_i(t+1)$ is uniquely determined by $\mathbf{s}_i(t)$ and the new action and observation, and $\alpha_t^{im}\left(\mathbf{s}_i(t)\right)$ is the step factor dependent on the time, channel, user, and belief state. Note that $\mathbf{s}_i(t+1)$ is the state in the next time slot, which is random. Intuitively, the new $Q$-value is updated by combining the old value and the new estimate, which is the sum of the new reward and the discounted old $Q$-value.
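A sketch of the discounted update (19) for one user follows, with the per-user state represented as a hashable tuple in the form of (17). The dictionary-based $Q$-table and the default value of 0 are implementation assumptions.

```python
def update_q_partial(Q, state, chosen, reward, next_state, alpha, beta, M=2):
    """Apply the discounted update (19) for one user in the partially
    observable case. Q maps (state, channel) pairs to values; states are
    hashable tuples built from (17). M is the number of channels."""
    future = max(Q.get((next_state, n), 0.0) for n in range(M))
    old = Q.get((state, chosen), 0.0)
    Q[(state, chosen)] = (1 - alpha) * old + alpha * (reward + beta * future)
    return Q

Q = {}
s0 = ((1, 1), (2, 0))   # state of the form (17): one (tau, last) pair per channel
s1 = ((1, 1), (3, 0))   # next state after sensing channel 0 again
update_q_partial(Q, s0, chosen=0, reward=1.0, next_state=s1, alpha=0.2, beta=0.7)
print(Q)
```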
Similarly to the complete information situation, we have the following proposition, which states the convergence of the learning procedure with partial information and large $\gamma$. The proof is given in the appendix. Note that numerical simulations show that a small $\gamma$ also results in convergence; however, we are still unable to prove this rigorously.
Proposition 2.
When $\gamma$ is sufficiently large, the learning procedure in (19) converges.

6. Numerical Results

In this section, we use numerical simulations to demonstrate the theoretical results obtained in the previous sections. For the fully observable case, we use the following step factor:
$$\alpha_t^{im} = \frac{\alpha_0}{n_t^{im}},$$
(20)
where $\alpha_0$ is the initial learning factor and $n_t^{im}$ is the number of times that user $i$ has updated the $Q$-value of channel $m$ up to period $t$. A similar step factor $\alpha_t^{im}\left(\mathbf{s}_i\right)$ is used for the partially observable case. In Sections 6.1, 6.2, and 6.3, we consider the fully observable case and, in Section 6.4, we consider the partially observable case. Note that, in all simulations, we initialize the $Q$-values with random variables drawn uniformly from a fixed interval.
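A visit-count step factor consistent with (20), and with the conditions (6) and (7), can be sketched as follows; the class name and the bookkeeping are illustrative assumptions.

```python
class StepFactor:
    """Per (user, channel) step factor alpha_0 / n, where n counts how many
    times that pair has been updated; the 1/n decay satisfies (6) and (7)."""
    def __init__(self, alpha0):
        self.alpha0 = alpha0
        self.counts = {}

    def next(self, user, channel):
        key = (user, channel)
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.alpha0 / self.counts[key]

sf = StepFactor(alpha0=0.5)
print([round(sf.next(0, 1), 3) for _ in range(5)])  # 0.5, 0.25, 0.167, ...
```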

6.1. Dynamics

Figures 5 and 6 show the dynamics of $Q_{11} - Q_{12}$ versus $Q_{21} - Q_{22}$ (recall that $Q_{im}$ denotes the $Q$-value of user $i$ for channel $m$) along several typical trajectories, for the state in which both channels are idle, when $N = M = 2$. We assume that the reward $c_{im}$ is identical for all $i$ and $m$. Note that different temperatures $\gamma$ are used in Figures 5 and 6. We observe that the trajectories move from the unstable regions (II and IV in Figure 4) to the stable regions (I and III in Figure 4). We also observe that the trajectories for the smaller temperature $\gamma$ are smoother since fewer explorations are carried out.
Figure 7 shows the evolution of the probability of choosing channel 1 when $N = M = 2$ and both channels are idle. We observe that both secondary users prefer channel 1 at the beginning, and soon secondary user 1 turns to choosing channel 2, thus avoiding collisions.

6.2. CDF of Rewards

In this subsection, we consider the performance of the reward averaged over all system states. When $M = 3$, we set different transition probabilities $p_{01}^m$ and $p_{10}^m$ for the three channels, respectively. When $M = 2$, we use the first two channels of the $M = 3$ case. The rewards of different channels for different secondary users are randomly generated from a uniform distribution. The CDF curves of the performance gain, defined as the difference of the average rewards after and before the learning procedure, are plotted in Figure 8 for both $M = 2$ and $M = 3$. Note that the CDF curves are obtained from 100 realizations of the learning procedure. From a CDF curve, we can read the distribution of the performance gains. For example, on one of the curves, performance gain 0.4 on the horizontal axis corresponds to 0.6 on the vertical axis; this means that around 60% of the secondary users obtain a performance gain of less than 0.4. We observe that, when $M = 3$, most performance gains are positive. However, when $M = 2$, a small portion of the performance gains are negative; that is, the performance is decreased by the learning procedure. Such a performance reduction is reasonable since a Nash equilibrium may not be Pareto optimal. We also plot the average performance gains versus different $\gamma$ and $N$ in Figure 9. We observe that a larger $\gamma$ results in a worse performance gain.
When https://static-content.springer.com/image/art%3A10.1155%2F2010%2F876216/MediaObjects/13638_2009_Article_2051_IEq226_HTML.gif is small, smaller https://static-content.springer.com/image/art%3A10.1155%2F2010%2F876216/MediaObjects/13638_2009_Article_2051_IEq227_HTML.gif yields better performance, but decreases faster than larger https://static-content.springer.com/image/art%3A10.1155%2F2010%2F876216/MediaObjects/13638_2009_Article_2051_IEq228_HTML.gif when https://static-content.springer.com/image/art%3A10.1155%2F2010%2F876216/MediaObjects/13638_2009_Article_2051_IEq229_HTML.gif increases. The performance gain over the simple orthogonal channel assignment scheme is given in Figure 10. We observe that the learning procedure generates a much better performance than the simple orthogonal channel assignment.
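The CDF curves in Figure 8 are simple empirical distribution functions; the sketch below shows how they can be produced, where run_learning() is a hypothetical stand-in for one complete learning realization.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def run_learning():
    """Hypothetical stand-in for one learning realization: returns each
    user's performance gain (average reward after minus before learning)."""
    return rng.uniform(-0.2, 0.8, size=2)  # placeholder values only

# Collect the per-user gains over 100 independent realizations.
gains = np.concatenate([run_learning() for _ in range(100)])

x = np.sort(gains)                     # empirical CDF: sorted gains ...
y = np.arange(1, x.size + 1) / x.size  # ... versus cumulative probability
plt.step(x, y, where="post")
plt.xlabel("performance gain")
plt.ylabel("CDF")
plt.show()
```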

6.3. Learning Speed

We define the stopping time of learning as the time at which the relative fluctuation of the average reward, obtained from 2000 spectrum access periods using the current $Q$-values, has remained below 5 percent for 5 successive time slots. That is, at time slot $t$ we compute the relative fluctuation
$$\frac{\left\|\mathbf{Q}(t)-\mathbf{Q}(t-1)\right\|}{\left\|\mathbf{Q}(t-1)\right\|}, \qquad (21)$$
where $\mathbf{Q}(t)$ is the vector containing all $Q$-values and the norm is the 2-norm. When the relative fluctuation in (21) is smaller than 0.05 for 5 consecutive time slots, we declare the learning completed. The learning delay is the time elapsed before the stopping time; obviously, the smaller the learning delay, the faster the learning. Figures 11 and 12 show the learning delays, which characterize the learning speed, for different values of the learning factor and of the temperature, respectively, in the two-user, two-channel case. The initial $Q$-values are randomly selected. When the probability of choosing channel 1 is larger than 0.95 for one secondary user and smaller than 0.05 for the other, we declare the learning procedure completed. We observe that a larger learning factor results in a smaller delay, while a smaller temperature yields a faster learning procedure.
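The stopping rule can be implemented as a small online check. The sketch below assumes the relative-fluctuation form in (21): it compares consecutive snapshots of the stacked $Q$-value vector and declares learning complete after 5 consecutive slots below the 5% threshold.

```python
import numpy as np

def stopping_time(q_history, threshold=0.05, run_length=5):
    """Return the first slot at which the relative fluctuation of the
    Q-value vector has stayed below `threshold` for `run_length`
    consecutive slots, or None if learning has not yet converged."""
    consecutive = 0
    for t in range(1, len(q_history)):
        prev, curr = q_history[t - 1], q_history[t]
        fluct = np.linalg.norm(curr - prev) / np.linalg.norm(prev)
        consecutive = consecutive + 1 if fluct < threshold else 0
        if consecutive >= run_length:
            return t  # learning declared complete at this slot
    return None
```

Here q_history would hold one snapshot per time slot of all users' $Q$-values stacked into a single vector.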
The learning speed is compared for three system sizes in Figure 13 (both the learning factor and the temperature are fixed). We observe that, in more than 90% of the realizations, the learning can be completed within 20 spectrum access periods, although in some situations the learning procedure may last for a long period of time. We can also notice that the learning speeds of the two smaller configurations are similar to each other and that, even when the system is much larger, the increase in delay is not significant.

6.4. Time-Varying Channel Rewards

In the previous simulations, the rewards of successfully accessing the channels are assumed to be constant. In practical systems, they may change with time since wireless channels are usually dynamic. In Figure 14, we show the CDF of performance gains (with the same configuration as in Figure 8) when the channel changes slowly. We use a simple model for the channel reward, given by
(22)
where the random driving term in (22) is uniformly distributed between 0 and 1. From Figure 14, we observe that the learning algorithm still improves the performance significantly.
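Since the exact expression of (22) did not survive extraction, the following is only one plausible instance of a slowly varying reward consistent with the description: each period, the reward mixes slightly with a fresh uniform draw on [0, 1]. The mixing factor beta is an assumption, not the paper's parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

def evolve_rewards(R, beta=0.01):
    """One plausible slow-variation model (not the paper's (22)): drift the
    rewards toward a fresh uniform draw on [0, 1] by a small factor beta."""
    return (1.0 - beta) * R + beta * rng.uniform(0.0, 1.0, size=R.shape)

R = rng.uniform(0.0, 1.0, size=(2, 2))  # initial rewards: users x channels
for t in range(1000):
    R = evolve_rewards(R)  # the channel rewards drift slowly over time
```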

6.5. Partial Observation Case

Figure 15 shows the performance gain of learning in the case of partial observations. We adopt the $Q$-learning mechanism introduced in Section 5. Note that there are infinitely many belief states, since a channel could remain unsensed for an arbitrarily long period of time. For computational simplicity, we cap the elapsed time since each channel was last sensed at a maximum value (recall that this is the period of time for which channel $j$ has not been sensed by user $i$ before time $t$). From Figure 15, we observe that the performance is actually degraded in around 40% of the cases for one configuration and around 50% for the other. However, the magnitude of the performance degradation is, on average, smaller than that of the performance gain. We also observe that the performance gain decreases when the system size parameter is increased from 2 to 3.
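The truncation used above amounts to capping each user's "time since channel $j$ was last sensed" counter so that the belief state space stays finite. A minimal sketch of that bookkeeping, where the cap tau_max is an assumed value rather than the paper's setting:

```python
def update_unsensed_times(tau, sensed_channel, tau_max=4):
    """Advance one user's 'slots since channel j was last sensed' counters,
    capping them at tau_max so that the belief state space stays finite."""
    return [0 if j == sensed_channel else min(t + 1, tau_max)
            for j, t in enumerate(tau)]

tau = [0, 0, 0]  # three channels, all just sensed
tau = update_unsensed_times(tau, sensed_channel=1)
print(tau)  # [1, 0, 1]: channels 0 and 2 have now been unsensed for one slot
```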
The learning delay for the partial observation case is shown in Figure 16, where the simulation setup is similar to that of Figure 13. Again, we observe that the learning speeds of the two configurations are similar to each other.

7. Conclusions

We have discussed a learning procedure for Aloha-like spectrum access without negotiation in cognitive radio systems. During the learning, each secondary user considers the channels and the other secondary users as its environment, updates its $Q$-values, and takes the best action. An intuitive explanation for the convergence of the learning is provided using the Metrick-Polak plot. By applying the theory of stochastic approximation and ODEs, we have shown the convergence of the learning under certain conditions. We have also extended the case of full observations to the case of partial observations. Numerical results show that secondary users can learn to avoid collisions quickly; the performance after the learning is significantly better than that before the learning, and better than that of a simple scheme that achieves a Nash equilibrium. Note that our study addresses one extreme of the resource allocation problem, in which no negotiation is considered, while the other extreme is full negotiation to achieve optimal performance. Our future work will address the intermediate case, that is, limited negotiation for resource allocation.

Acknowledgment

This work was supported by the National Science Foundation under grant CCF-0830451.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Appendices

A. Stochastic Approximation

To keep the paper self-contained, we briefly introduce the theory of stochastic approximation and cite the result used in the proof of Lemma 2.
Essentially, stochastic approximation is used to solve an equation whose explicit form is unknown and which can be evaluated only through noisy observations. Consider the equation
$$g(\theta)=0, \qquad (A.1)$$
where $\theta$ is the unknown variable and the expression of the function $g$ is unknown. Denote by $\theta^*$ the solution to this equation (we assume that the solution is unique). Suppose that $g(\theta)>0$ for $\theta<\theta^*$ and $g(\theta)<0$ for $\theta>\theta^*$. We have a series of noisy observations of $g$, denoted by $\{Y_n\}$. Then, we can approximate the solution iteratively in the following way (called the Robbins-Monro algorithm):
$$\theta_{n+1}=\theta_n+\epsilon_n Y_n, \qquad (A.2)$$
where $\epsilon_n$ is the step size of the $n$th iteration.
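As a concrete illustration of (A.2), the sketch below applies the Robbins-Monro iteration to a toy root-finding problem in which $g$ is hidden behind a noisy oracle; the particular $g$, the noise level, and the steps $\epsilon_n = 1/n$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_g(theta):
    """Noisy observation Y_n of g(theta) = 2 - theta (root at theta* = 2);
    note g > 0 below the root and g < 0 above it, as assumed in the text."""
    return (2.0 - theta) + rng.normal(0.0, 0.5)

theta = 0.0
for n in range(1, 20001):
    eps = 1.0 / n                  # steps satisfy the usual summability conditions
    theta += eps * noisy_g(theta)  # the Robbins-Monro update (A.2)

print(theta)  # close to the root theta* = 2
```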
The convergence of (A.2) is deeply related to a "mean" ODE, which is given by
$$\dot{\theta}(t)=g(\theta(t)). \qquad (A.3)$$
The following theorem (part of a theorem in [14]) discloses the relationship between the convergence of the iteration (A.2) and the mean ODE in (A.3).
Theorem 1.
If the following assumptions are satisfied, then, with probability 1, $\theta_n$ converges to a limit set $L$ of the ODE (A.3) in which all points satisfy $g(\theta)=0$:
(A) $\sup_n E\|Y_n\|^2<\infty$.
(B) The noise is a martingale difference, that is,
$$E\left[Y_n-g(\theta_n)\mid\theta_0,Y_i,i<n\right]=0. \qquad (A.4)$$
(C) The function $g$ is continuous.
(D) The step sizes satisfy $\epsilon_n\ge 0$, $\sum_n\epsilon_n=\infty$, and $\sum_n\epsilon_n^2<\infty$.
(E) There exists a continuously differentiable function $V$ such that $g=-\nabla V$ and $V$ is constant on the limit set $L$.

B. Proof of Lemma 1

Proof.
For simplicity, we fix one system state, since the $Q$-learning procedures for different system states are mutually independent when the system state is fully observable and the action of each secondary user does not affect the system state. Consider a Nash equilibrium point at which there is no collision. Without loss of generality, we assume that the secondary users use distinct channels, user $i$ using channel $i$.
Now, we choose a set of initial $Q$-values such that
(B.1)
holds. Then, we can always choose a sufficiently small temperature such that
(B.2)
(B.3)
hold, since the right-hand sides of (B.2) and (B.3) converge to 1 and 0, respectively, as the temperature tends to zero.
Then, we carry out the following iterations; the $Q$-values at the $k$th iteration are given by
(B.4)
Next, we show by induction on $k$ that, for each secondary user, the $Q$-value of its equilibrium channel increases while the $Q$-values of the other channels decrease during the iterations. For the first iteration, the former is increased while the latter are decreased due to conditions (B.2) and (B.3). Suppose that the conclusion holds in the $k$th iteration. Then, in the $(k+1)$th iteration, the $Q$-value of the equilibrium channel is increased due to the expression on the right-hand side of (B.4) and the induction hypothesis; for the same reason, the $Q$-values of the other channels are decreased. This concludes the induction.
We have thus shown that the $Q$-value of each user's equilibrium channel is a monotonically increasing sequence, while the $Q$-values of the other channels are monotonically decreasing sequences. Since all of these sequences are bounded, they converge to their limits, which constitute the stationary point. This concludes the proof.

C. Proof of Lemma 2

Proof.
We verify the conditions in Theorem 1 one by one.
(i) Condition (A): this is obvious since $Y_n$ is bounded (recall that $Y_n$ is the difference between the instantaneous reward and the corresponding $Q$-value, both of which are bounded).
(ii) Condition (B): the martingale difference property of the noise has been proved right after (14).
(iii) Condition (C): the function $g$ is given by
(C.1)
where the channel-selection probability is defined in (12). We only need to verify the continuity of $g$. Obviously, each element of $g$ is differentiable with respect to the $Q$-values; therefore, $g$ is not only continuous but also differentiable.
(iv) Condition (D): it is guaranteed by (7).
(v) Condition (E): the function $V$ can be defined as the (negative of the) integral of $g$. It is continuously differentiable since $g$ is continuous, and it is constant on the limit set since the limit set contains only one point.

D. Proof of Lemma 3

Proof.
We apply Lyapunov's method to analyze the convergence of the ODE in (16). We define the Lyapunov function as
(D.1)
where the quantity being summed is the expected reward of secondary user $i$ at period $t$.
Then, we examine the derivative of the Lyapunov function with respect to time $t$, which takes the form
(D.2)
We have
(D.3)
where we have applied the ODE (16).
Then, we focus on the computation of the derivative terms appearing in (D.3).
For secondary user $i$ and channel $j$, we have
(D.4)
where the first equality is due to the definition of the expected reward and the second equality is due to the product rule for derivatives.
We consider the derivative in (D.4), that is,
(D.5)
where the last equality is obtained from the ODE (16).
Substituting (D.5) into (D.4), we obtain
(D.6)
Combining (D.2) and (D.6), we have
(D.7)
where the coefficient is given by
(D.8)
in the first case and by
(D.9)
in the second case, and equals zero otherwise.
It is easy to verify that
(D.10)
(D.11)
Therefore, when the quantity appearing in (D.14) is sufficiently large, we have
(D.12)
Then, we have
(D.13)
Therefore, when that quantity is sufficiently large, the derivative of the Lyapunov function is strictly negative, which implies that the ODE (16) converges to a stationary point. This concludes the proof.
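The mechanics of this argument can be checked numerically on any ODE with a known Lyapunov function. The sketch below uses a toy linear ODE (not the paper's (16)) and verifies that the Lyapunov function decreases strictly along an Euler-integrated trajectory.

```python
import numpy as np

def f(x):
    """Toy ODE dx/dt = f(x): a stable linear system, not the paper's (16)."""
    return np.array([-2.0 * x[0], -0.5 * x[1]])

def V(x):
    """Candidate Lyapunov function for the toy system."""
    return x[0] ** 2 + x[1] ** 2

x, dt = np.array([1.0, -1.5]), 1e-3
values = []
for _ in range(5000):
    values.append(V(x))
    x = x + dt * f(x)  # one forward-Euler step along the trajectory

# V decreases strictly along the trajectory, mirroring what the proof
# establishes for the ODE (16) via the sign of its time derivative.
print(all(b < a for a, b in zip(values, values[1:])))  # True
```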
Remark 1.
We can actually obtain a stronger conclusion from the last part of the proof; that is, the convergence can be assured if
(D.14)
holds.

E. Proof of Proposition 2

Proof.
We define a mapping from the set of all $Q$-values to another set of $Q$-values, given by
(E.1)
where the channel-selection probabilities are determined by the $Q$-values and the temperature, and the remaining term is the average reward when secondary user $i$ chooses channel $j$. Note that this average reward is a function of all of the $Q$-values.
What we need to prove is that this mapping is a contraction. Once this is proved, the remaining part is exactly the same as the proof of the convergence of $Q$-learning in [26]. Therefore, we focus on the analysis of the mapping.
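The role of the contraction property is easy to see numerically: once $\|T\mathbf{Q}-T\mathbf{Q}'\| \le c\,\|\mathbf{Q}-\mathbf{Q}'\|$ with $c<1$, repeated application of $T$ converges geometrically to a unique fixed point (Banach's fixed-point theorem), which is what the argument borrowed from [26] exploits. A generic sketch, with an arbitrary affine contraction standing in for (E.1):

```python
import numpy as np

c = 0.8                    # contraction modulus, c < 1
b = np.array([1.0, -0.5])  # arbitrary offset

def T(Q):
    """A generic contraction with modulus c (a stand-in for the mapping (E.1))."""
    return c * Q + b

Q = np.zeros(2)
for k in range(100):
    Q = T(Q)  # each application shrinks the distance to the fixed point by c

print(Q, b / (1.0 - c))  # Q has converged to the unique fixed point b / (1 - c)
```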
We consider two sets of $Q$-values, denoted by $\mathbf{Q}$ and $\mathbf{Q}'$, respectively. Considering the difference between the images of the two sets of $Q$-values under the mapping, we have
(E.2)
where the average reward is evaluated at the corresponding set of $Q$-values. Then, we have
(E.3)
We discuss the two terms in (E.3) separately. For the first term, we have
(E.4)
where the summation is over the states of the secondary users other than user $i$, weighted by the probability of those states conditioned on the state of secondary user $i$.
When the quantity in question is sufficiently large, we have
(E.5)
where the function appearing there is a polynomial whose order is determined by the system parameters.
Then, it is easy to verify that (E.4) can be rewritten as
(E.6)
where the two auxiliary functions are both polynomials of smaller order. Note that the coefficients of both polynomials are independent of the $Q$-values.
Then, we have
(E.7)
We can then always take a sufficiently large value of the parameter such that
(E.8)
Now, we turn to the second term in (E.3). Without loss of generality, we fix the ordering of the two terms being compared. We have
(E.9)
where, in the first inequality, we define the auxiliary quantity as
(E.10)
Due to symmetry, we have
(E.11)
Combining (E.8) and (E.11), we have
(E.12)
which implies
(E.13)
Therefore, the mapping is a contraction under the chosen norm. This concludes the proof.
Remark 2.
Note that, in contrast to the stochastic approximation approach used to prove convergence in the complete observation case, we used a different approach to prove the convergence of the learning with partial observations, since it is difficult to apply stochastic approximation in the partial observation case. Although the stochastic approximation approach is slightly more complicated, it allows us to find a finite value of the parameter in (D.14) that assures convergence; with the contraction mapping approach, we are still unable to find such a finite value.
References
1. Mitola J III: Cognitive radio for flexible mobile multimedia communications. Mobile Networks and Applications 2001, 6(5):435-441.
2. Mitola J III: Cognitive Radio, Licentiate Proposal. KTH, Stockholm, Sweden; 1998.
3. Zhao Q, Sadler BM: A survey of dynamic spectrum access. IEEE Signal Processing Magazine 2007, 24(3):79-89.
4. Kim K, Akbar IA, Bae KK, Um J-S, Spooner CM, Reed JH: Cyclostationary approaches to signal detection and classification in cognitive radio. Proceedings of the 2nd IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks, April 2007, 212-215.
5. Li H, Li C, Dai H: Quickest spectrum sensing in cognitive radio. Proceedings of the 42nd Annual Conference on Information Sciences and Systems (CISS '08), March 2008, Princeton, NJ, USA, 203-208.
6. Ghasemi A, Sousa ES: Collaborative spectrum sensing for opportunistic access in fading environments. Proceedings of the 1st IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks (DySPAN '05), November 2005, 131-136.
7. Kloeck C, Jaekel H, Jondral F: Multi-agent radio resource allocation. Mobile Networks and Applications 2006, 11(6):813-824.
8. Niyato D, Hossain E, Han Z: Dynamics of multiple-seller and multiple-buyer spectrum trading in cognitive radio networks: a game-theoretic modeling approach. IEEE Transactions on Mobile Computing 2009, 8(8):1009-1022.
9. Kuo FF: The ALOHA system. In Computer Networks. Prentice-Hall, Englewood Cliffs, NJ, USA; 1973:501-518.
10. Buşoniu L, Babuška R, De Schutter B: A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man and Cybernetics, Part C 2008, 38(2):156-172.
11. Fudenberg D, Levine DK: The Theory of Learning in Games. The MIT Press, Cambridge, Mass, USA; 1998.
13. Hu J, Wellman MP: Multiagent reinforcement learning: theoretical framework and an algorithm. Proceedings of the 15th International Conference on Machine Learning (ICML '98), July 1998, 242-250.
14. Kushner HJ, Yin GG: Stochastic Approximation and Recursive Algorithms and Applications. Springer, New York, NY, USA; 2003.
15. Liu J, Yi Y, Proutiere A, Chiang M, Poor HV: Towards utility-optimal random access without message passing. Wireless Communications and Mobile Computing 2010, 10(1):115-128.
16. Yi Y, de Veciana G, Shakkottai S: MAC scheduling with low overheads by learning neighborhood contention patterns. Submitted to IEEE/ACM Transactions on Networking.
17. Fu F, van der Schaar M: Learning to compete for resources in wireless stochastic games. IEEE Transactions on Vehicular Technology 2009, 58(4):1904-1919.
18. Han Z, Pandana C, Liu KJR: Distributive opportunistic spectrum access for cognitive radio using correlated equilibrium and no-regret learning. Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC '07), March 2007, 11-15.
19. van der Schaar M, Fu F: Spectrum access games and strategic learning in cognitive radio networks for delay-critical applications. Proceedings of the IEEE 2009, 97(4):720-739.
20. Hofbauer J, Sigmund K: Evolutionary Games and Population Dynamics. Cambridge University Press, Cambridge, UK; 1998.
21. Wang B, Liu KJR, Clancy TC: Evolutionary game framework for behavior dynamics in cooperative spectrum sensing. Proceedings of the IEEE Conference on Global Communications (Globecom '08), November-December 2008, New Orleans, La, USA, 3123-3127.
22. Sutton PD, Nolan KE, Doyle LE: Cyclostationary signatures in practical cognitive radio applications. IEEE Journal on Selected Areas in Communications 2008, 26(1):13-24.
23. Sutton RS, Barto AG: Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Mass, USA; 1998.
24. Metrick A, Polak B: Fictitious play in 2 × 2 games: a geometric proof of convergence. Economic Theory 1994, 4(6):923-933.
26. Tsitsiklis JN: Asynchronous stochastic approximation and Q-learning. Machine Learning 1994, 16(3):185-202.