Published in: EURASIP Journal on Wireless Communications and Networking 1/2010

Open Access 01.12.2010 | Research Article

Multiagent $Q$-Learning for Aloha-Like Spectrum Access in Cognitive Radio Systems

Author: Husheng Li


Abstract

An Aloha-like spectrum access scheme without negotiation is considered for multiuser and multichannel cognitive radio systems. To avoid the collisions incurred by the lack of coordination, each secondary user learns how to select channels according to its experience. Multiagent reinforcement learning (MARL) is applied for the secondary users to learn good strategies of channel selection. Specifically, the framework of $Q$-learning is extended from the single-user case to the multiagent case by considering the other secondary users as a part of the environment. The dynamics of the $Q$-learning are illustrated using a Metrick-Polak plot, which shows the traces of the $Q$-values in the two-user case. For both the complete and partial observation cases, rigorous proofs of the convergence of the multiagent $Q$-learning without communications, under certain conditions, are provided using the Robbins-Monro algorithm and contraction mapping, respectively. The learning performance (speed and gain in utility) is evaluated by numerical simulations.

1. Introduction

In recent years, cognitive radio has attracted intensive study in the wireless communications community. It allows users without a license (called secondary users) to access licensed frequency bands when the licensed users (called primary users) are not present. Therefore, the cognitive radio technique can substantially alleviate the problem of the underutilization of the frequency spectrum [1, 2].
The following two problems are key to cognitive radio systems.
(i)
Resource mining, that is, how to detect the available resource (the frequency bands that are not being used by primary users); this is usually done by spectrum sensing.
 
(ii)
Resource allocation, that is, how to allocate the available resource to different secondary users.
 
Substantial work has been done for the resource mining problem. Many signal processing techniques have been applied to sense the frequency spectrum [3], for example, cyclostationary feature [4], quickest change detection [5], and collaborative spectrum sensing [6]. Meanwhile, a significant amount of research has been conducted for the resource allocation in cognitive radio systems [7, 8]. Typically, it is assumed that secondary users exchange information about available spectrum resources and then negotiate on the resource allocation according to their own requirements of traffic (since the same resource cannot be shared by different secondary users if orthogonal transmission is assumed). These studies typically apply theories in economics, for example, game theory, bargaining theory, or microeconomics.
However, in many applications of cognitive radio, such negotiation-based resource allocation may incur significant overhead. In traditional wireless communication systems, the available resource is almost fixed (even if we consider the fluctuation of channel quality incurred by fading, the change of available resource is usually very slow and can thus be considered stationary). Therefore, the negotiation need not be carried out frequently, and the negotiation result can be applied over a long period of data communication, thus incurring only tolerable overhead. However, in many cognitive radio systems, the resource may change very rapidly since the activity of primary users may be highly dynamic. Therefore, the available resource needs to be updated very frequently, and the data communication period between two spectrum sensing periods should be fairly short since minimal violation to primary users should be guaranteed. In such a situation, the negotiation of resource allocation may be highly inefficient since a substantial portion of time needs to be used for the negotiation. To alleviate such inefficiency, high-speed transceivers need to be used to minimize the time consumed on negotiation. Particularly, the turn-around time, that is, the time needed to switch from receiving (transmitting) to transmitting (receiving), should be very small, which is a substantial challenge to hardware design.
Motivated by the above discussion and observation, in this paper, we study the problem of spectrum access without negotiation in multiuser and multichannel cognitive radio systems. The spectrum access without negotiation is achieved by applying the framework of reinforcement learning. In such a scheme, each secondary user senses channels and then chooses an idle frequency channel to transmit data, as if no other secondary user existed. If two secondary users choose the same channel for data transmission, they collide with each other and the corresponding data packets cannot be decoded. Such a procedure is illustrated in Figure 1, where three secondary users access an access point via four channels. Note that such a scheme is similar to Aloha [9], where no explicit collision avoidance is applied. We could also apply techniques similar to $p$-persistent Carrier Sensing Multiple Access (CSMA), that is, each secondary user transmits with probability $p$ when it finds an available channel; however, this is beyond the scope of this paper. In the Aloha-like approach, since there is no mutual communication among the secondary users, collisions are unavoidable. However, the secondary users can try to learn collision avoidance, as well as the channel qualities (we assume that the secondary users have no a priori information about the channel qualities), according to their experience. In such a context, the learning concerns not only the available frequency spectrum but also the behavior of the other secondary users.
To accomplish the learning of Aloha-like spectrum access, multiagent reinforcement learning (MARL) [10] is applied in this paper. One challenge of MARL in our context is that the secondary users do not know the payoffs (and thus do not know the strategies) of the other secondary users in each stage; thus the environment of each secondary user, which includes the other secondary users, is nonstationary and may not assure the convergence of learning. Due to the assumption that there is no mutual communication between different secondary users, many traditional MARL techniques like fictitious play [11, 12] and Nash-Q learning [13] cannot be used since they need information exchange among the players (e.g., exchanging their action information). To alleviate the lack of mutual communication, we extend the principle of single-agent $Q$-learning, that is, evaluating the values of different state-action pairs in an incremental way, to the multiagent situation without information exchange. By applying the theory of stochastic approximation [14], which has been used in many studies on wireless networks [15, 16], we prove the main result of this paper, that is, that the learning converges to a stationary point regardless of the initial strategies (Propositions 1 and 2).
Some studies on reinforcement learning in cognitive radio networks have been done [17–19]. In [17] and [19], the studies are focused on the resource competition in a spectrum auction system, where the channel allocation is determined by the spectrum regulator; this differs from this paper, in which no regulator exists. Reference [18] discusses the correlated equilibrium and achieves it by no-regret learning, that is, by minimizing the gap between the current reward and the optimal reward. In that approach, mutual communication is needed among the secondary users. However, in our study, no inter-secondary-user communication is assumed.
Note that the study in this paper has subtle similarities to evolutionary game theory [20], which has been successfully applied to cooperative spectrum sensing in cognitive radio systems [21]. Both our study and the evolutionary game focus on the dynamics of the strategy changes of users. However, there is a key difference between the two studies. Evolutionary game theory assumes pure strategies for the players (e.g., cooperate or free-ride in cooperative spectrum sensing [21]) and studies the proportions of players using different pure strategies. The key equation in evolutionary game theory, called the replicator equation, describes the dynamics of the corresponding proportions. In contrast to the evolutionary game, the players in our study use mixed strategies, and the basic equation (16) describes the dynamics of the $Q$-values for different channels. Although the convergence is proved by studying ordinary differential equations in both studies, the proofs are significantly different since the equations have totally different expressions.
The remainder of this paper is organized as follows. In Section 2, the system model is introduced. The basic elements of the game and the proposed multiagent $Q$-learning for the fully observable case (i.e., each secondary user can sense all channels) are introduced in Section 3. The corresponding convergence of the $Q$-learning is proved in Section 4. The $Q$-learning for the partially observable case (i.e., each secondary user can sense only a subset of the channels) is discussed in Section 5. Numerical results are provided in Section 6, while conclusions are drawn in Section 7.

2. System Model

We consider $N$ active secondary users accessing $M$ licensed frequency channels. (When there are more than $N$ channels, there is less competition, thus making the problem easier. We do not consider the case when the number of channels is less than the number of secondary users since a typical cognitive radio system can provide sufficient channels. Meanwhile, the proposed algorithm can also be applied to all possible cases of $M$ and $N$.) We index the secondary users by integers $1, 2, \ldots, N$ and the channels by $1, 2, \ldots, M$. For simplicity, we denote by $-i$ the set of users (channels) different from user (channel) $i$.
The following assumptions are made throughout this paper.
(i)
The secondary users are sufficiently close to each other such that they share the same activity of primary users. There is no communication among these secondary users, thus excluding the possibility of negotiation.
 
(ii)
We assume that the activity of primary users over each channel is a Markov chain. (A more reasonable model for the activity of primary users is the semi-Markov chain. The corresponding analysis is more tedious but similar to that in this paper. Therefore, for simplicity of analysis, we consider only the Markov chain model in this paper.) The chain has states 0 (busy: the channel is occupied by primary users and cannot be used by secondary users) and 1 (idle: there is no primary user over this channel). We denote by $s_m(t)$ the state of channel $m$ in the sensing period of the $t$-th spectrum access period. For channel $m$, the transition probability from state 0 to state 1 (resp., from state 1 to state 0) is denoted by $p_{01}^m$ (resp., $p_{10}^m$). We assume that the $M$ Markov chains for the $M$ channels are mutually independent; a minimal simulation sketch of this channel model is given at the end of this section. We also assume perfect spectrum sensing and do not consider possible errors of spectrum sensing.
 
(iii)
We assume that the channel state transition probabilities, as well as the channel rewards, are unknown to the secondary users at the beginning. They are fixed throughout the game, unless otherwise noted. Therefore, the secondary users need to learn the channel properties.
 
(iv)
The timing structure of spectrum sensing and data transmission is illustrated in Figure 2, where data is transmitted after the spectrum sensing period. We assume that each secondary user is able to sense only one channel during the spectrum sensing period and transmit over only one channel during the data transmission period.
 
In Sections 3 and 4, we consider the case in which all secondary users have full knowledge of the channel states in the previous spectrum access period (complete observation). Note that this does not contradict the assumption that a secondary user can sense only one channel during the spectrum sensing period, since the secondary user can continue to sense other channels during the data transmission period (suppose that the signal from primary users can be well distinguished from that of secondary users, e.g., using different cyclostationary features [22]). If we consider the set of channel states in the previous spectrum access period as the system state, denoted by $\mathbf{s}(t)$ at spectrum access period $t$, then the previous assumption implies a completely observable system state, which substantially simplifies the analysis. In Section 5, we will also study the case in which secondary users cannot continue to sense during the data transmission period (partial observation); thus each secondary user has only partial observations of the system state.
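To make assumption (ii) concrete, the following minimal sketch simulates $M$ independent two-state channel Markov chains. It is an illustration only: the function name and the transition probabilities passed in are our own assumptions, not values taken from the paper.

```python
import random

def simulate_channels(p01, p10, T, seed=0):
    """Simulate M independent two-state Markov channels for T periods.

    p01[m]: probability that channel m moves from busy (0) to idle (1).
    p10[m]: probability that channel m moves from idle (1) to busy (0).
    Returns one state vector per spectrum access period.
    """
    rng = random.Random(seed)
    M = len(p01)
    s = [1] * M                      # start with all channels idle
    history = []
    for _ in range(T):
        for m in range(M):
            if s[m] == 0:            # busy -> idle with probability p01[m]
                s[m] = 1 if rng.random() < p01[m] else 0
            else:                    # idle -> busy with probability p10[m]
                s[m] = 0 if rng.random() < p10[m] else 1
        history.append(list(s))
    return history

# Example: four channels with illustrative transition probabilities.
print(simulate_channels(p01=[0.8, 0.7, 0.6, 0.9], p10=[0.2, 0.3, 0.1, 0.2], T=5))
```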

3. Game and $Q$-Learning

In this section, we introduce the game associated with the learning procedure and the application of $Q$-learning to the Aloha-like spectrum access problem. Note that in this section and Section 4, we assume that each secondary user knows all channel states in the previous time slot, that is, the completely observable case.

3.1. Game of Aloha-Like Spectrum Access

The Aloha-like spectrum access problem is essentially an $N$-player game. When secondary user $i$ transmits over an idle channel $m$, it receives reward $c_{im}$ (e.g., channel capacity or successful transmission probability) if no other secondary user transmits over this channel, and reward 0 if one or more other secondary users transmit over this channel, since a collision happens. We assume that the reward $c_{im}$ does not change with time. When the channels change slowly, the learning algorithm proposed in this paper can also be applied to track the changes of the channels; when the channels change very fast, it is impossible for the secondary users to learn. Since there is no explicit information exchange among the secondary users, the collision avoidance is completely based on the received rewards. The payoff matrices for the case of $N = 2$ are given in Figure 3. Note that the actions in the game, denoted by $a_i(t)$ for user $i$ at time $t$, are the selections of channels. Obviously, the diagonal elements in the payoff matrices are all zero since a collision yields zero reward.
It is well known that a Nash equilibrium consists of strategies such that unilaterally changing one's strategy incurs a degradation of one's own performance. Mathematically, a Nash equilibrium is a set of strategies $\left(\pi_1^*, \ldots, \pi_N^*\right)$, where $\pi_i^*$ is the strategy of player $i$, which satisfies
$$u_i\left(\pi_i^*, \boldsymbol{\pi}_{-i}^*\right) \geq u_i\left(\pi_i, \boldsymbol{\pi}_{-i}^*\right), \quad \forall \pi_i, \ i = 1, \ldots, N,$$
(1)
where $u_i$ means the reward of player $i$ and $\boldsymbol{\pi}_{-i}$ means the strategies of all players except player $i$.
It is easy to verify that there are multiple Nash equilibrium points in this game. Obviously, orthogonal transmission strategies, that is, strategies in which each secondary user $i$ accesses a distinct channel $m_i$ with probability 1 ($m_i \neq m_j$ for $i \neq j$), are pure equilibria. The reason is the following. If a secondary user changes its strategy and transmits over other channels with nonzero probability, those transmissions will collide with the other secondary users (recall that, for the Nash equilibrium, all other secondary users do not change their strategies) and incur performance degradation. An orthogonal channel assignment can be achieved in the following way: let all secondary users sense the channels randomly at the very beginning; once a secondary user finds an idle channel, it accesses this channel forever; after a random number of rounds, all secondary users will have found different channels, thus achieving orthogonal transmission. We call this scheme the simple orthogonal channel assignment since it is simple and fast. However, in this scheme, the different rewards of different channels are ignored. As will be seen in the numerical simulation results, the proposed learning procedure can significantly outperform the simple orthogonal channel assignment.

3.2. $Q$-Value

We define the $Q$-function as the expected reward in one time slot (since the channel states are completely known to the secondary users and are not controlled by them, each secondary user needs to consider only the expected reward in one time slot, that is, a myopic strategy) of each action under different states; that is, for secondary user $i$ and system state $\mathbf{s}$, the $Q$-value of choosing channel $m$ is given by
$$Q_i(\mathbf{s}, m) = E\left[r_i \mid \mathbf{s}, a_i = m\right],$$
(2)
where $r_i$ is the reward obtained by secondary user $i$, which is dependent on the action, as well as the system state, and the expectation is over the randomness of the other users' actions, as well as the primary users' occupancies.

3.3. Exploration

In contrast to fictitious play [11], which is deterministic, the action in $Q$-learning is random. We assign nonzero probabilities to all channels such that all channels will be explored. Such an exploration guarantees that good channels will not be missed during the learning procedure. We consider the Boltzmann distribution [23] for the random exploration, that is,
$$P\left(a_i = m \mid \mathbf{s}\right) = \frac{\exp\left(Q_i(\mathbf{s}, m)/\gamma\right)}{\sum_{n=1}^{M} \exp\left(Q_i(\mathbf{s}, n)/\gamma\right)},$$
(3)
where $\gamma$ is called the temperature, which controls the randomness of the exploration. Obviously, the smaller $\gamma$ is (the colder), the more focused the actions are. When $\gamma \to 0$, each user chooses only the channel having the largest $Q$-value.
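As a concrete illustration of (3), the following sketch maps one user's $Q$-values to channel selection probabilities and samples an exploratory action. The function names and the max-subtraction for numerical stability are implementation choices of ours, not part of the paper.

```python
import math
import random

def boltzmann_probabilities(q_values, gamma):
    """Map one user's Q-values (for the current system state) to channel
    selection probabilities via the Boltzmann distribution (3)."""
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / gamma) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def sample_channel(q_values, gamma, rng=random):
    """Randomly explore: draw a channel index according to (3)."""
    probs = boltzmann_probabilities(q_values, gamma)
    return rng.choices(range(len(q_values)), weights=probs)[0]

# A small temperature concentrates the choice on the best channel.
print(boltzmann_probabilities([0.9, 0.5, 0.4], gamma=1.0))   # mild preference
print(boltzmann_probabilities([0.9, 0.5, 0.4], gamma=0.05))  # almost deterministic
print(sample_channel([0.9, 0.5, 0.4], gamma=0.05))
```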
When secondary user $i$ selects channel $m$ and the system state is $\mathbf{s}$, the expected reward is given by
$$E\left[r_i \mid \mathbf{s}, a_i = m\right] = c_{im}\, P\left(s_m(t) = 1 \mid \mathbf{s}\right) \prod_{j \neq i} \left(1 - P\left(a_j = m \mid \mathbf{s}\right)\right),$$
(4)
since secondary user $j$ ($j \neq i$) chooses channel $m$ with probability $P\left(a_j = m \mid \mathbf{s}\right)$ (in which case a collision happens and secondary user $i$ receives no reward) and channel $m$ is idle with probability $P\left(s_m(t) = 1 \mid \mathbf{s}\right)$; the product in (4) is then the probability that no other secondary user accesses channel $m$.
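Once the Boltzmann probabilities of the other users are known, the right-hand side of (4) can be evaluated directly. A minimal sketch, with illustrative rewards and probabilities:

```python
def expected_reward(i, m, c, idle_prob, action_probs):
    """Evaluate the expected one-slot reward in (4) for user i on channel m.

    c[i][m]      : reward of user i on channel m when the transmission succeeds.
    idle_prob[m] : probability that channel m is idle given the system state.
    action_probs : action_probs[j][m] is P(a_j = m) from the Boltzmann rule (3).
    """
    no_collision = 1.0
    for j in range(len(action_probs)):
        if j != i:
            no_collision *= (1.0 - action_probs[j][m])
    return c[i][m] * idle_prob[m] * no_collision

# Two users, two channels: user 1 (index 0) on channel 0 while user 2
# picks channel 0 with probability 0.3.
c = [[1.0, 0.6], [0.8, 0.9]]
print(expected_reward(0, 0, c, idle_prob=[0.9, 0.9],
                      action_probs=[[0.7, 0.3], [0.3, 0.7]]))
```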

3.4. Updating $Q$-Values

In the procedure of $Q$-learning, the $Q$-functions are updated after each spectrum access using the following rule:
$$Q_{t+1}^{i}(\mathbf{s}, m) = \left(1 - \alpha_t^{im}\right) Q_t^{i}(\mathbf{s}, m) + \alpha_t^{im}\, r_i(t)\, I\left(a_i(t) = m\right),$$
(5)
where $\alpha_t^{im}$ is a step factor (when channel $m$ is not selected by user $i$, we set $\alpha_t^{im} = 0$), $r_i(t)$ is the reward of secondary user $i$, and $I\left(a_i(t) = m\right)$ is the characteristic function of the event that channel $m$ is selected by secondary user $i$ at the $t$-th spectrum access. Note that this is the standard $Q$-learning without considering the future states. An intuitive explanation for (5) is that, once channel $m$ is accessed, the corresponding $Q$-value is updated by combining the old value and the new reward; if channel $m$ is not chosen, we keep the old value by setting $\alpha_t^{im} = 0$. Our study is focused on the dynamics of (5). To assure convergence, we assume that
$$\sum_{t=1}^{\infty} \alpha_t^{im} = \infty,$$
(6)
as well as
$$\sum_{t=1}^{\infty} \left(\alpha_t^{im}\right)^2 < \infty.$$
(7)
Note that, in a typical stochastic game setting and $Q$-learning, the updating rule in (5) should consider the reward of the future and add a discounted term of the future reward to the right-hand side of (5). However, in this paper, the optimal strategy is myopic since we assume that the system state is known, and thus the secondary users' actions do not affect the system state. For the case of partial observation (i.e., each secondary user knows only the state of a single channel), the action does change each secondary user's state (typically the belief on the system state), and the future reward should be included in the right-hand side of (5), which will be discussed in Section 5.
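A minimal sketch of the update rule (5) for one user follows. Setting the step factor to zero for unchosen channels is equivalent to leaving their $Q$-values untouched, which is how the code implements it.

```python
def update_q(q_values, chosen, reward, alpha):
    """Apply the update rule (5) for one user after one spectrum access:
    only the Q-value of the chosen channel moves toward the new reward;
    all other Q-values keep their old values (step factor 0)."""
    q_values[chosen] = (1.0 - alpha) * q_values[chosen] + alpha * reward
    return q_values

q = [0.5, 0.5]
q = update_q(q, chosen=0, reward=1.0, alpha=0.1)  # success on channel 0
q = update_q(q, chosen=0, reward=0.0, alpha=0.1)  # collision on channel 0
print(q)  # the Q-value of channel 1 is untouched
```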

3.5. Stationary Point

The $Q$-values of different users are mutually coupled: if one $Q$-value changes, the strategy of the corresponding user changes, which in turn changes the expected rewards, and thus the $Q$-values, of all other users. We define $Q$-values satisfying the following equations as a stationary point:
$$Q_i^*(\mathbf{s}, m) = c_{im}\, P\left(s_m = 1 \mid \mathbf{s}\right) \prod_{j \neq i} \left(1 - P^*\left(a_j = m \mid \mathbf{s}\right)\right), \quad \forall i, m, \mathbf{s},$$
(8)
where $P^*\left(a_j = m \mid \mathbf{s}\right)$ is the channel selection probability in (3) generated by the stationary $Q$-values.
Note that the stationarity is only in the statistical sense since the $Q$-values can fluctuate around the stationary point due to the randomness of the exploration. Obviously, as $\gamma \to 0$, the stationary point converges to a Nash equilibrium point. However, we are still not sure about the existence of such a stationary point. The following lemma assures the existence of a stationary point. The proof is given in the appendix.
Lemma 1.
For sufficiently small $\gamma$, there exists at least one stationary point satisfying (8).
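Lemma 1 only asserts existence; a stationary point of (8) can also be located numerically by repeatedly moving each $Q$-value toward the expected reward induced by the current Boltzmann strategies. The sketch below assumes two users and always-idle channels with illustrative rewards; the damping factor is our addition to stabilize the iteration.

```python
import math

def fixed_point(c, idle_prob, gamma, iters=2000, damping=0.5):
    """Search numerically for a stationary point of (8): repeatedly move each
    Q-value toward the expected reward induced by the Boltzmann strategies."""
    N, M = len(c), len(c[0])
    Q = [[0.5] * M for _ in range(N)]
    for _ in range(iters):
        # Boltzmann channel selection probabilities (3) for every user.
        probs = []
        for i in range(N):
            mx = max(Q[i])
            w = [math.exp((q - mx) / gamma) for q in Q[i]]
            probs.append([x / sum(w) for x in w])
        # Damped update toward the right-hand side of (8).
        for i in range(N):
            for m in range(M):
                no_collision = 1.0
                for j in range(N):
                    if j != i:
                        no_collision *= (1.0 - probs[j][m])
                target = c[i][m] * idle_prob[m] * no_collision
                Q[i][m] = (1 - damping) * Q[i][m] + damping * target
    return Q

print(fixed_point(c=[[1.0, 0.6], [0.8, 0.9]], idle_prob=[1.0, 1.0], gamma=0.1))
```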

4. Convergence of $Q$-Learning without Information Exchange

In this section, we study the convergence of the proposed $Q$-learning algorithm. First, we provide an intuitive explanation for the convergence in the $N = 2$ case. Then, we apply the tools of stochastic approximation and ordinary differential equations (ODEs) to prove the convergence rigorously.

4.1. Intuition on Convergence

As will be shown in Proposition 1, the updating rule of the $Q$-values in (5) converges to a stationary point close to a Nash equilibrium. Before the rigorous proof, we provide an intuitive explanation for the convergence using the geometric argument proposed in [24].
The intuitive explanation is provided in Figure 4 for the case of $N = M = 2$ (we call it the Metrick-Polak plot since it was originally proposed by A. Metrick and B. Polak in [24]). For simplicity, we ignore the indices of the state and assume that both channels are idle, writing $Q_{im}$ for the $Q$-value of user $i$ for channel $m$. The axes are $Q_{11} - Q_{12}$ and $Q_{21} - Q_{22}$, respectively. As labeled in the figure, the plane is divided into four regions separated by the two lines $Q_{11} = Q_{12}$ and $Q_{21} = Q_{22}$, in which the dynamics of the $Q$-learning are different. We discuss these four regions separately.
(i)
Region I: in this region, $Q_{11} > Q_{12}$; therefore, secondary user 1 prefers visiting channel 1; meanwhile, secondary user 2 prefers accessing channel 2 since $Q_{21} < Q_{22}$; then, with large probability, the strategies will converge to a stationary point at which secondary users 1 and 2 access channels 1 and 2, respectively.
 
(ii)
Region II: in this region, both secondary users prefer accessing channel 1 ($Q_{11} > Q_{12}$ and $Q_{21} > Q_{22}$), thus causing many collisions. Therefore, both $Q_{11}$ and $Q_{21}$ will be reduced until the trajectory enters either Region I or Region III.
 
(iii)
Region III: similar to Region I.
 
(iv)
Region IV: similar to Region II.
 
Then, we observe that the points in Regions II and IV are unstable and will move into Region I or III with large probability. In Regions I and III, the strategy will move close to the stationary point with large probability. Therefore, regardless of where the initial point is, the updating rule in (5) will converge to a stationary point with large probability.
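The region-crossing behavior described above can be reproduced with a short simulation. The following sketch assumes two users, two always-idle channels, illustrative rewards, and a constant step factor; it reports whether the learned preferences ended in the stable (orthogonal) regions I/III.

```python
import math
import random

rng = random.Random(1)
c = [[1.0, 0.6], [0.8, 0.9]]          # illustrative rewards c[i][m]
Q = [[rng.random(), rng.random()] for _ in range(2)]
gamma, alpha = 0.05, 0.1

def pick(q):
    """Sample a channel via the Boltzmann rule (3)."""
    mx = max(q)
    w = [math.exp((x - mx) / gamma) for x in q]
    return rng.choices([0, 1], weights=w)[0]

for t in range(500):
    a = [pick(Q[0]), pick(Q[1])]      # both channels assumed idle
    for i in range(2):
        r = c[i][a[i]] if a[0] != a[1] else 0.0   # collision -> zero reward
        Q[i][a[i]] = (1 - alpha) * Q[i][a[i]] + alpha * r

orthogonal = (Q[0][0] > Q[0][1]) != (Q[1][0] > Q[1][1])
print(Q, "->", "Region I/III (orthogonal)" if orthogonal else "Region II/IV (conflict)")
```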

4.2. Stochastic Approximation-Based Convergence

In this section, we prove the convergence of the $Q$-learning for the proposed Aloha-like spectrum access with Boltzmann-distributed exploration. First, we find the equivalence between the updating rule (5) and the Robbins-Monro iteration [25] for solving an equation with an unknown expression (a brief introduction is provided in the appendix). Then, we apply a conclusion in stochastic approximation [14] to relate the dynamics of the updating rule to an ODE and prove the convergence of the ODE.

4.2.1. Robbins-Monro Iteration

At a stationary point, the expected values of the $Q$-functions satisfy the equations in (8). For system state $\mathbf{s}$, define the vector of all $Q$-values,
$$\mathbf{Q}(\mathbf{s}) = \left(Q_1(\mathbf{s}, 1), \ldots, Q_1(\mathbf{s}, M), Q_2(\mathbf{s}, 1), \ldots, Q_N(\mathbf{s}, M)\right)^T.$$
(9)
Then, (8) can be rewritten as
$$f\left(\mathbf{Q}(\mathbf{s})\right) = \mathbf{0},$$
(10)
where the $k$-th element of $f$ is given by
$$f_k\left(\mathbf{Q}(\mathbf{s})\right) = c_{im}\, \pi_{im}(\mathbf{s}) - Q_i(\mathbf{s}, m), \quad i = \left\lceil k/M \right\rceil, \ m = k \ \mathrm{mod}\ M,$$
(11)
and (the function $k$ mod $M$ means the remainder of dividing the integer $k$ by the integer $M$)
$$\pi_{im}(\mathbf{s}) = P\left(s_m(t) = 1 \mid \mathbf{s}\right) \prod_{j \neq i} \left(1 - P\left(a_j = m \mid \mathbf{s}\right)\right),$$
(12)
with the convention that a zero remainder is identified with $M$. Obviously, $\pi_{im}(\mathbf{s})$ is the probability that channel $m$ can be used by secondary user $i$ without collision with other secondary users, when the current system state is $\mathbf{s}$.
Then, the updating rule in (5) is equivalent to solving (10) (the expression of the equation is unknown since the rewards, the channel transition probabilities, as well as the strategies of the other users, are all unknown) using the Robbins-Monro algorithm [14], that is,
$$\mathbf{Q}_{t+1}(\mathbf{s}) = \mathbf{Q}_t(\mathbf{s}) + \boldsymbol{\alpha}_t \circ \mathbf{Y}_t,$$
(13)
where $\boldsymbol{\alpha}_t$ is the vector of all step factors ($\circ$ denotes the elementwise product), $\mathbf{r}_t$ is the vector of rewards obtained at spectrum access period $t$, and $\mathbf{Y}_t$ is a random observation of the function $f$ contaminated by noise, that is,
$$\mathbf{Y}_t = f\left(\mathbf{Q}_t(\mathbf{s})\right) + \delta\mathbf{M}_t,$$
(14)
where $\delta\mathbf{M}_t$ is the noise and (recall that $r_i(t)$ means the reward of secondary user $i$ at time $t$)
$$\left(\delta\mathbf{M}_t\right)_{im} = r_i(t) - E\left[r_i(t) \mid \mathbf{Q}_t(\mathbf{s})\right].$$
(15)
Obviously, $E\left[\delta\mathbf{M}_t \mid \mathbf{Q}_t(\mathbf{s})\right] = \mathbf{0}$ since the expectation of the difference between the reward and the expected reward is equal to 0. Therefore, the observation noise $\delta\mathbf{M}_t$ is a martingale difference.

4.2.2. ODE and Convergence

The procedure of the Robbins-Monro algorithm (i.e., the updating of the $Q$-values) is the stochastic approximation of the solution of the equation. It is well known that the convergence of such a procedure can be characterized by an ODE. Since the noise $\delta\mathbf{M}_t$ in (14) is a martingale difference, it is easy to verify the conditions of Theorem 1 in the appendix and obtain the following lemma (the proof is given in the appendix).
Lemma 2.
With probability 1, the sequence $\left\{\mathbf{Q}_t(\mathbf{s})\right\}$, $t = 1, 2, \ldots$, converges to some limit set of the ODE
$$\dot{\mathbf{Q}}(\mathbf{s}) = f\left(\mathbf{Q}(\mathbf{s})\right).$$
(16)
What remains is to analyze the convergence property of the ODE (16). We obtain the following lemma by applying a Lyapunov function. The proof is given in the appendix.
Lemma 3.
If a stationary point determined by (10) exists, the solution of the ODE (16) converges to the stationary point for sufficiently large $\gamma$.
Combining Lemmas 1, 2, and 3, we obtain the main result in this paper.
Proposition 1.
Suppose that a stationary point determined by (10) exists. For any system state $\mathbf{s}$ and sufficiently large $\gamma$, the $Q$-learning converges to a stationary point with probability 1.
Note that a sufficiently small $\gamma$ guarantees the existence of a stationary point, while a sufficiently large $\gamma$ assures the convergence of the learning procedure. However, the two requirements do not conflict since neither is a necessary condition. As we found in our simulations, we can always choose a suitable $\gamma$ to guarantee both the existence of the stationary point and the convergence.
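The convergence asserted by Lemma 3 and Proposition 1 can be illustrated by integrating the ODE (16) with forward Euler steps. The sketch below assumes two users and two always-idle channels with illustrative rewards; the function f is our reconstruction of the right-hand side of (16) for this special case.

```python
import math

def f(Q, c, gamma):
    """Right-hand side of the ODE (16) for two users and two idle channels:
    expected reward under the current Boltzmann strategies minus the
    current Q-value, per user and channel."""
    probs = []
    for q in Q:
        mx = max(q)
        w = [math.exp((x - mx) / gamma) for x in q]
        probs.append([x / sum(w) for x in w])
    out = []
    for i in range(2):
        row = []
        for m in range(2):
            no_collision = 1.0 - probs[1 - i][m]       # only one other user
            row.append(c[i][m] * no_collision - Q[i][m])
        out.append(row)
    return out

c = [[1.0, 0.6], [0.8, 0.9]]
Q = [[0.9, 0.1], [0.85, 0.15]]   # start in a conflict region (both prefer channel 1)
h = 0.01                         # Euler step size
for _ in range(5000):
    dQ = f(Q, c, gamma=0.2)
    Q = [[Q[i][m] + h * dQ[i][m] for m in range(2)] for i in range(2)]
print(Q)                         # settles near a stationary point
```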

5. $Q$-Learning with Partial Observations

In this section, we remove the assumption that all secondary users know all channel states in the previous spectrum access period and assume that each secondary user knows the state of only the channel it sensed in the previous spectrum access period, thus making the system state partially observable. The difficulties of analyzing such a scenario are given below:
(i)
The system state is partially observable.
 
(ii)
The game is imperfectly monitored, that is, each player does not know other players' actions.
 
(iii)
The game has incomplete information, that is, each player does not know the strategies of other players, as well as their beliefs on the system state.
 
Note that the latter two difficulties are common to both the complete and partial observation cases. However, the imperfect monitoring and incomplete information add much more difficulty in the partial observation case. In this section, we formulate the $Q$-learning algorithm and then prove its convergence under certain conditions.

5.1. State Definition

It is well known that, in partially observable Markov decision process (POMDP) problems, the belief on the system state, that is, $P\left(\mathbf{s}(t) \mid \mathbf{O}_t\right)$ (where $\mathbf{O}_t$ is the observation history before period $t$), can play the role of the system state. Due to the special structure of the game, we can define the state of secondary user $i$ at period $t$ as
$$\mathbf{s}_i(t) = \left\{\left(\tau_{im}(t), \check{s}_{im}(t)\right)\right\}_{m=1}^{M},$$
(17)
where $\tau_{im}(t)$, $m = 1, \ldots, M$, means the number of consecutive periods during which channel $m$ has not been sensed before period $t$ (e.g., if the last time that channel $m$ was sensed by secondary user $i$ is time slot $t_0$, then $\tau_{im}(t) = t - t_0$) and $\check{s}_{im}(t)$ is the state of channel $m$ at the last time it was sensed before period $t$.
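The bookkeeping behind (17) is simple: after each sensing period, the counter of the sensed channel is reset and all other counters grow by one. A minimal sketch (the function name and the tuple representation are our assumptions):

```python
def update_partial_state(state, sensed_channel, observed):
    """Update the per-user state (17) after sensing one channel.

    state is a list of (tau, last_state) pairs, one per channel: tau counts
    the consecutive periods since the channel was last sensed, and
    last_state is the channel state (0 busy / 1 idle) observed at that time.
    """
    new_state = []
    for m, (tau, last) in enumerate(state):
        if m == sensed_channel:
            new_state.append((1, observed))    # just sensed: reset the counter
        else:
            new_state.append((tau + 1, last))  # one more period unobserved
    return new_state

s = [(1, 1), (3, 0)]   # channel 0 seen idle last slot, channel 1 busy 3 slots ago
print(update_partial_state(s, sensed_channel=1, observed=1))
```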

5.2. Learning in the POMDP Game

For the purpose of learning, we define the objective function of user $i$ as the discounted sum of the rewards in each spectrum access period with discount factor $\beta$, that is,
$$U_i = E\left[\sum_{t=1}^{\infty} \beta^t r_i(t)\right].$$
(18)
Then, to maximize the objective function, the corresponding $Q$-learning update is given by [23]
$$Q_{t+1}^{i}\left(\mathbf{s}_i(t), m\right) = \left(1 - \alpha_t^{im}\left(\mathbf{s}_i(t)\right)\right) Q_t^{i}\left(\mathbf{s}_i(t), m\right) + \alpha_t^{im}\left(\mathbf{s}_i(t)\right) \left(r_i(t) + \beta \max_{n} Q_t^{i}\left(\mathbf{s}_i(t+1), n\right)\right),$$
(19)
where $\mathbf{s}_i(t+1)$ is uniquely determined by $\mathbf{s}_i(t)$ and the new action and observation, and $\alpha_t^{im}\left(\mathbf{s}_i(t)\right)$ is the step factor dependent on the time, channel, user, and belief state. Note that $\mathbf{s}_i(t+1)$ is the state in the next time slot, which is random. Intuitively, the new $Q$-value is updated by combining the old value and the new estimate, which is the sum of the new reward and the discounted old $Q$-value.
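A sketch of the discounted update (19) for one user follows, with the per-user state represented as a hashable tuple in the form of (17). The dictionary-based $Q$-table and the default value of 0 are implementation assumptions.

```python
def update_q_partial(Q, state, chosen, reward, next_state, alpha, beta, M=2):
    """Apply the discounted update (19) for one user in the partially
    observable case. Q maps (state, channel) pairs to values; states are
    hashable tuples built from (17). M is the number of channels."""
    future = max(Q.get((next_state, n), 0.0) for n in range(M))
    old = Q.get((state, chosen), 0.0)
    Q[(state, chosen)] = (1 - alpha) * old + alpha * (reward + beta * future)
    return Q

Q = {}
s0 = ((1, 1), (2, 0))   # state of the form (17): one (tau, last) pair per channel
s1 = ((1, 1), (3, 0))   # next state after sensing channel 0 again
update_q_partial(Q, s0, chosen=0, reward=1.0, next_state=s1, alpha=0.2, beta=0.7)
print(Q)
```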
Similarly to the complete information situation, we have the following proposition, which states the convergence of the learning procedure with partial information and large $\gamma$. The proof is given in the appendix. Note that numerical simulations show that a small $\gamma$ also results in convergence; however, we are still unable to prove this rigorously.
Proposition 2.
When $\gamma$ is sufficiently large, the learning procedure in (19) converges.

6. Numerical Results

In this section, we use numerical simulations to demonstrate the theoretical results obtained in the previous sections. For the fully observable case, we use the following step factor:
$$\alpha_t^{im} = \frac{\alpha_0}{n_t^{im}},$$
(20)
where $\alpha_0$ is the initial learning factor and $n_t^{im}$ is the number of times that user $i$ has updated the $Q$-value of channel $m$ up to period $t$. A similar step factor $\alpha_t^{im}\left(\mathbf{s}_i\right)$ is used for the partially observable case. In Sections 6.1, 6.2, and 6.3, we consider the fully observable case and, in Section 6.4, we consider the partially observable case. Note that, in all simulations, we initialize the $Q$-values with random variables drawn uniformly from a fixed interval.
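A visit-count step factor consistent with (20), and with the conditions (6) and (7), can be sketched as follows; the class name and the bookkeeping are illustrative assumptions.

```python
class StepFactor:
    """Per (user, channel) step factor alpha_0 / n, where n counts how many
    times that pair has been updated; the 1/n decay satisfies (6) and (7)."""
    def __init__(self, alpha0):
        self.alpha0 = alpha0
        self.counts = {}

    def next(self, user, channel):
        key = (user, channel)
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.alpha0 / self.counts[key]

sf = StepFactor(alpha0=0.5)
print([round(sf.next(0, 1), 3) for _ in range(5)])  # 0.5, 0.25, 0.167, ...
```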

6.1. Dynamics

Figures 5 and 6 show the dynamics of $Q_{11} - Q_{12}$ versus $Q_{21} - Q_{22}$ (recall that $Q_{im}$ denotes the $Q$-value of user $i$ for channel $m$) along several typical trajectories, for the state in which both channels are idle, when $N = M = 2$. We assume that the reward $c_{im}$ is identical for all $i$ and $m$. Note that different temperatures $\gamma$ are used in Figures 5 and 6. We observe that the trajectories move from the unstable regions (II and IV in Figure 4) to the stable regions (I and III in Figure 4). We also observe that the trajectories for the smaller temperature $\gamma$ are smoother since fewer explorations are carried out.
Figure 7 shows the evolution of the probability of choosing channel 1 when $N = M = 2$ and both channels are idle. We observe that both secondary users prefer channel 1 at the beginning, and soon secondary user 1 turns to choosing channel 2, thus avoiding collisions.

6.2. CDF of Rewards

In this subsection, we consider the performance of the reward averaged over all system states. When $M = 3$, we set different transition probabilities $p_{01}^m$ and $p_{10}^m$ for the three channels, respectively. When $M = 2$, we use the first two channels of the $M = 3$ case. The rewards of different channels for different secondary users are randomly generated from a uniform distribution. The CDF curves of the performance gain, defined as the difference of the average rewards after and before the learning procedure, are plotted in Figure 8 for both $M = 2$ and $M = 3$. Note that the CDF curves are obtained from 100 realizations of the learning procedure. From a CDF curve, we can read the distribution of the performance gains. For example, on one of the curves, performance gain 0.4 on the horizontal axis corresponds to 0.6 on the vertical axis; this means that around 60% of the secondary users obtain a performance gain of less than 0.4. We observe that, when $M = 3$, most performance gains are positive. However, when $M = 2$, a small portion of the performance gains are negative; that is, the performance is decreased by the learning procedure. Such a performance reduction is reasonable since a Nash equilibrium may not be Pareto optimal. We also plot the average performance gains versus different $\gamma$ and $N$ in Figure 9. We observe that a larger $\gamma$ results in a worse performance gain.
When https://static-content.springer.com/image/art%3A10.1155%2F2010%2F876216/MediaObjects/13638_2009_Article_2051_IEq226_HTML.gif is small, smaller https://static-content.springer.com/image/art%3A10.1155%2F2010%2F876216/MediaObjects/13638_2009_Article_2051_IEq227_HTML.gif yields better performance, but decreases faster than larger https://static-content.springer.com/image/art%3A10.1155%2F2010%2F876216/MediaObjects/13638_2009_Article_2051_IEq228_HTML.gif when https://static-content.springer.com/image/art%3A10.1155%2F2010%2F876216/MediaObjects/13638_2009_Article_2051_IEq229_HTML.gif increases. The performance gain over the simple orthogonal channel assignment scheme is given in Figure 10. We observe that the learning procedure generates a much better performance than the simple orthogonal channel assignment.
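The CDF curves in Figure 8 are simple empirical distribution functions; the sketch below shows how they can be produced, where run_learning() is a hypothetical stand-in for one complete learning realization.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def run_learning():
    """Hypothetical stand-in for one learning realization: returns each
    user's performance gain (average reward after minus before learning)."""
    return rng.uniform(-0.2, 0.8, size=2)  # placeholder values only

# Collect the per-user gains over 100 independent realizations.
gains = np.concatenate([run_learning() for _ in range(100)])

x = np.sort(gains)                     # empirical CDF: sorted gains ...
y = np.arange(1, x.size + 1) / x.size  # ... versus cumulative probability
plt.step(x, y, where="post")
plt.xlabel("performance gain")
plt.ylabel("CDF")
plt.show()
```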

6.3. Learning Speed

We define the stopping time of learning as the time at which the relative fluctuation of the average reward, obtained from 2000 spectrum access periods using the current $Q$-values, has remained below 5 percent for 5 successive time slots. That is, at time slot $t$ we compute the relative fluctuation
$$\frac{\left\|\mathbf{Q}(t)-\mathbf{Q}(t-1)\right\|}{\left\|\mathbf{Q}(t-1)\right\|}, \qquad (21)$$
where $\mathbf{Q}(t)$ is the vector containing all $Q$-values and the norm is the 2-norm. When the relative fluctuation in (21) is smaller than 0.05 for 5 consecutive time slots, we declare the learning completed. The learning delay is the time elapsed before the stopping time; obviously, the smaller the learning delay, the faster the learning. Figures 11 and 12 show the learning delays, which characterize the learning speed, for different values of the learning factor and of the temperature, respectively, in the two-user, two-channel case. The initial $Q$-values are randomly selected. When the probability of choosing channel 1 is larger than 0.95 for one secondary user and smaller than 0.05 for the other, we declare the learning procedure completed. We observe that a larger learning factor results in a smaller delay, while a smaller temperature yields a faster learning procedure.
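The stopping rule can be implemented as a small online check. The sketch below assumes the relative-fluctuation form in (21): it compares consecutive snapshots of the stacked $Q$-value vector and declares learning complete after 5 consecutive slots below the 5% threshold.

```python
import numpy as np

def stopping_time(q_history, threshold=0.05, run_length=5):
    """Return the first slot at which the relative fluctuation of the
    Q-value vector has stayed below `threshold` for `run_length`
    consecutive slots, or None if learning has not yet converged."""
    consecutive = 0
    for t in range(1, len(q_history)):
        prev, curr = q_history[t - 1], q_history[t]
        fluct = np.linalg.norm(curr - prev) / np.linalg.norm(prev)
        consecutive = consecutive + 1 if fluct < threshold else 0
        if consecutive >= run_length:
            return t  # learning declared complete at this slot
    return None
```

Here q_history would hold one snapshot per time slot of all users' $Q$-values stacked into a single vector.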
The learning speed is compared for three system sizes in Figure 13 (both the learning factor and the temperature are fixed). We observe that, in more than 90% of the realizations, the learning can be completed within 20 spectrum access periods, although in some situations the learning procedure may last for a long period of time. We can also notice that the learning speeds of the two smaller configurations are similar to each other and that, even when the system is much larger, the increase in delay is not significant.

6.4. Time-Varying Channel Rewards

In the previous simulations, the rewards of successfully accessing the channels are assumed to be constant. In practical systems, they may change with time since wireless channels are usually dynamic. In Figure 14, we show the CDF of performance gains (with the same configuration as in Figure 8) when the channel changes slowly. We use a simple model for the channel reward, given by
(22)
where the random driving term in (22) is uniformly distributed between 0 and 1. From Figure 14, we observe that the learning algorithm still improves the performance significantly.
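Since the exact expression of (22) did not survive extraction, the following is only one plausible instance of a slowly varying reward consistent with the description: each period, the reward mixes slightly with a fresh uniform draw on [0, 1]. The mixing factor beta is an assumption, not the paper's parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

def evolve_rewards(R, beta=0.01):
    """One plausible slow-variation model (not the paper's (22)): drift the
    rewards toward a fresh uniform draw on [0, 1] by a small factor beta."""
    return (1.0 - beta) * R + beta * rng.uniform(0.0, 1.0, size=R.shape)

R = rng.uniform(0.0, 1.0, size=(2, 2))  # initial rewards: users x channels
for t in range(1000):
    R = evolve_rewards(R)  # the channel rewards drift slowly over time
```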

6.5. Partial Observation Case

Figure 15 shows the performance gain of learning in the case of partial observations. We adopt the $Q$-learning mechanism introduced in Section 5. Note that there are infinitely many belief states, since a channel could remain unsensed for an arbitrarily long period of time. For computational simplicity, we cap the elapsed time since each channel was last sensed at a maximum value (recall that this is the period of time for which channel $j$ has not been sensed by user $i$ before time $t$). From Figure 15, we observe that the performance is actually degraded in around 40% of the cases for one configuration and around 50% for the other. However, the magnitude of the performance degradation is, on average, smaller than that of the performance gain. We also observe that the performance gain decreases when the system size parameter is increased from 2 to 3.
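The truncation used above amounts to capping each user's "time since channel $j$ was last sensed" counter so that the belief state space stays finite. A minimal sketch of that bookkeeping, where the cap tau_max is an assumed value rather than the paper's setting:

```python
def update_unsensed_times(tau, sensed_channel, tau_max=4):
    """Advance one user's 'slots since channel j was last sensed' counters,
    capping them at tau_max so that the belief state space stays finite."""
    return [0 if j == sensed_channel else min(t + 1, tau_max)
            for j, t in enumerate(tau)]

tau = [0, 0, 0]  # three channels, all just sensed
tau = update_unsensed_times(tau, sensed_channel=1)
print(tau)  # [1, 0, 1]: channels 0 and 2 have now been unsensed for one slot
```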
The learning delay for the partial observation case is shown in Figure 16, where the simulation setup is similar to that of Figure 13. Again, we observe that the learning speeds of the two configurations are similar to each other.

7. Conclusions

We have discussed a learning procedure for Aloha-like spectrum access without negotiation in cognitive radio systems. During the learning, each secondary user considers the channels and the other secondary users as its environment, updates its $Q$-values, and takes the best action. An intuitive explanation for the convergence of the learning is provided using the Metrick-Polak plot. By applying the theory of stochastic approximation and ODEs, we have shown the convergence of the learning under certain conditions. We have also extended the case of full observations to the case of partial observations. Numerical results show that secondary users can learn to avoid collisions quickly; the performance after the learning is significantly better than that before the learning, and better than that of a simple scheme that achieves a Nash equilibrium. Note that our study addresses one extreme of the resource allocation problem, in which no negotiation is considered, while the other extreme is full negotiation to achieve optimal performance. Our future work will address the intermediate case, that is, limited negotiation for resource allocation.

Acknowledgment

This work was supported by the National Science Foundation under grant CCF-0830451.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Appendices

A. Stochastic Approximation

To keep the paper self-contained, we briefly introduce the theory of stochastic approximation and cite the result used in the proof of Lemma 2.
Essentially, stochastic approximation is used to solve an equation whose explicit form is unknown and which can be evaluated only through noisy observations. Consider the equation
$$g(\theta)=0, \qquad (A.1)$$
where $\theta$ is the unknown variable and the expression of the function $g$ is unknown. Denote by $\theta^*$ the solution to this equation (we assume that the solution is unique). Suppose that $g(\theta)>0$ for $\theta<\theta^*$ and $g(\theta)<0$ for $\theta>\theta^*$. We have a series of noisy observations of $g$, denoted by $\{Y_n\}$. Then, we can approximate the solution iteratively in the following way (called the Robbins-Monro algorithm):
$$\theta_{n+1}=\theta_n+\epsilon_n Y_n, \qquad (A.2)$$
where $\epsilon_n$ is the step size of the $n$th iteration.
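As a concrete illustration of (A.2), the sketch below applies the Robbins-Monro iteration to a toy root-finding problem in which $g$ is hidden behind a noisy oracle; the particular $g$, the noise level, and the steps $\epsilon_n = 1/n$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_g(theta):
    """Noisy observation Y_n of g(theta) = 2 - theta (root at theta* = 2);
    note g > 0 below the root and g < 0 above it, as assumed in the text."""
    return (2.0 - theta) + rng.normal(0.0, 0.5)

theta = 0.0
for n in range(1, 20001):
    eps = 1.0 / n                  # steps satisfy the usual summability conditions
    theta += eps * noisy_g(theta)  # the Robbins-Monro update (A.2)

print(theta)  # close to the root theta* = 2
```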
The convergence of (A.2) is deeply related to a "mean" ODE, which is given by
$$\dot{\theta}(t)=g(\theta(t)). \qquad (A.3)$$
The following theorem (part of a theorem in [14]) discloses the relationship between the convergence of the iteration (A.2) and the mean ODE in (A.3).
Theorem 1.
If the following assumptions are satisfied, then, with probability 1, $\theta_n$ converges to a limit set $L$ of the ODE (A.3) in which all points satisfy $g(\theta)=0$:
(A) $\sup_n E\|Y_n\|^2<\infty$.
(B) The noise is a martingale difference, that is,
$$E\left[Y_n-g(\theta_n)\mid\theta_0,Y_i,i<n\right]=0. \qquad (A.4)$$
(C) The function $g$ is continuous.
(D) The step sizes satisfy $\epsilon_n\ge 0$, $\sum_n\epsilon_n=\infty$, and $\sum_n\epsilon_n^2<\infty$.
(E) There exists a continuously differentiable function $V$ such that $g=-\nabla V$ and $V$ is constant on the limit set $L$.

B. Proof of Lemma 1

Proof.
For simplicity, we fix one system state, since the $Q$-learning procedures for different system states are mutually independent when the system state is fully observable and the action of each secondary user does not affect the system state. Consider a Nash equilibrium point at which there is no collision. Without loss of generality, we assume that the secondary users use distinct channels, user $i$ using channel $i$.
Now, we choose a set of initial $Q$-values such that
(B.1)
holds. Then, we can always choose a sufficiently small temperature such that
(B.2)
(B.3)
hold, since the right-hand sides of (B.2) and (B.3) converge to 1 and 0, respectively, as the temperature tends to zero.
Then, we carry out the following iterations; the $Q$-values at the $k$th iteration are given by
(B.4)
Next, we show by induction on $k$ that, for each secondary user, the $Q$-value of its equilibrium channel increases while the $Q$-values of the other channels decrease during the iterations. For the first iteration, the former is increased while the latter are decreased due to conditions (B.2) and (B.3). Suppose that the conclusion holds in the $k$th iteration. Then, in the $(k+1)$th iteration, the $Q$-value of the equilibrium channel is increased due to the expression on the right-hand side of (B.4) and the induction hypothesis; for the same reason, the $Q$-values of the other channels are decreased. This concludes the induction.
We have thus shown that the $Q$-value of each user's equilibrium channel is a monotonically increasing sequence, while the $Q$-values of the other channels are monotonically decreasing sequences. Since all of these sequences are bounded, they converge to their limits, which constitute the stationary point. This concludes the proof.

C. Proof of Lemma 2

Proof.
We verify the conditions in Theorem 1 one by one.
(i) Condition (A): this is obvious since $Y_n$ is bounded (recall that $Y_n$ is the difference between the instantaneous reward and the corresponding $Q$-value, both of which are bounded).
(ii) Condition (B): the martingale difference property of the noise has been proved right after (14).
(iii) Condition (C): the function $g$ is given by
(C.1)
where the channel-selection probability is defined in (12). We only need to verify the continuity of $g$. Obviously, each element of $g$ is differentiable with respect to the $Q$-values; therefore, $g$ is not only continuous but also differentiable.
(iv) Condition (D): it is guaranteed by (7).
(v) Condition (E): the function $V$ can be defined as the (negative of the) integral of $g$. It is continuously differentiable since $g$ is continuous, and it is constant on the limit set since the limit set contains only one point.

D. Proof of Lemma 3

Proof.
We apply Lyapunov's method to analyze the convergence of the ODE in (16). We define the Lyapunov function as
(D.1)
where the quantity being summed is the expected reward of secondary user $i$ at period $t$.
Then, we examine the derivative of the Lyapunov function with respect to time $t$, which takes the form
(D.2)
We have
(D.3)
where we have applied the ODE (16).
Then, we focus on the computation of the derivative terms appearing in (D.3).
For secondary user $i$ and channel $j$, we have
(D.4)
where the first equality is due to the definition of the expected reward and the second equality is due to the product rule for derivatives.
We consider the derivative in (D.4), that is,
(D.5)
where the last equality is obtained from the ODE (16).
Substituting (D.5) into (D.4), we obtain
(D.6)
Combining (D.2) and (D.6), we have
(D.7)
where the coefficient is given by
(D.8)
in the first case and by
(D.9)
in the second case, and equals zero otherwise.
It is easy to verify that
(D.10)
(D.11)
Therefore, when the quantity appearing in (D.14) is sufficiently large, we have
(D.12)
Then, we have
(D.13)
Therefore, when that quantity is sufficiently large, the derivative of the Lyapunov function is strictly negative, which implies that the ODE (16) converges to a stationary point. This concludes the proof.
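The mechanics of this argument can be checked numerically on any ODE with a known Lyapunov function. The sketch below uses a toy linear ODE (not the paper's (16)) and verifies that the Lyapunov function decreases strictly along an Euler-integrated trajectory.

```python
import numpy as np

def f(x):
    """Toy ODE dx/dt = f(x): a stable linear system, not the paper's (16)."""
    return np.array([-2.0 * x[0], -0.5 * x[1]])

def V(x):
    """Candidate Lyapunov function for the toy system."""
    return x[0] ** 2 + x[1] ** 2

x, dt = np.array([1.0, -1.5]), 1e-3
values = []
for _ in range(5000):
    values.append(V(x))
    x = x + dt * f(x)  # one forward-Euler step along the trajectory

# V decreases strictly along the trajectory, mirroring what the proof
# establishes for the ODE (16) via the sign of its time derivative.
print(all(b < a for a, b in zip(values, values[1:])))  # True
```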
Remark 1.
We can actually obtain a stronger conclusion from the last part of the proof; that is, the convergence can be assured if
(D.14)
holds.

E. Proof of Proposition 2

Proof.
We define a mapping from the set of all $Q$-values to another set of $Q$-values, given by
(E.1)
where the channel-selection probabilities are determined by the $Q$-values and the temperature, and the remaining term is the average reward when secondary user $i$ chooses channel $j$. Note that this average reward is a function of all of the $Q$-values.
What we need to prove is that this mapping is a contraction. Once this is proved, the remaining part is exactly the same as the proof of the convergence of $Q$-learning in [26]. Therefore, we focus on the analysis of the mapping.
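The role of the contraction property is easy to see numerically: once $\|T\mathbf{Q}-T\mathbf{Q}'\| \le c\,\|\mathbf{Q}-\mathbf{Q}'\|$ with $c<1$, repeated application of $T$ converges geometrically to a unique fixed point (Banach's fixed-point theorem), which is what the argument borrowed from [26] exploits. A generic sketch, with an arbitrary affine contraction standing in for (E.1):

```python
import numpy as np

c = 0.8                    # contraction modulus, c < 1
b = np.array([1.0, -0.5])  # arbitrary offset

def T(Q):
    """A generic contraction with modulus c (a stand-in for the mapping (E.1))."""
    return c * Q + b

Q = np.zeros(2)
for k in range(100):
    Q = T(Q)  # each application shrinks the distance to the fixed point by c

print(Q, b / (1.0 - c))  # Q has converged to the unique fixed point b / (1 - c)
```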
We consider two sets of $Q$-values, denoted by $\mathbf{Q}$ and $\mathbf{Q}'$, respectively. Considering the difference between the images of the two sets of $Q$-values under the mapping, we have
(E.2)
where the average reward is evaluated at the corresponding set of $Q$-values. Then, we have
(E.3)
We discuss the two terms in (E.3) separately. For the first term, we have
(E.4)
where the summation is over the states of the secondary users other than user $i$, weighted by the probability of those states conditioned on the state of secondary user $i$.
When the quantity in question is sufficiently large, we have
(E.5)
where the function appearing there is a polynomial whose order is determined by the system parameters.
Then, it is easy to verify that (E.4) can be rewritten as
(E.6)
where the two auxiliary functions are both polynomials of smaller order. Note that the coefficients of both polynomials are independent of the $Q$-values.
Then, we have
(E.7)
We can then always take a sufficiently large value of the parameter such that
(E.8)
Now, we turn to the second term in (E.3). Without loss of generality, we fix the ordering of the two terms being compared. We have
(E.9)
where, in the first inequality, we define the auxiliary quantity as
(E.10)
Due to symmetry, we have
(E.11)
Combining (E.8) and (E.11), we have
(E.12)
which implies
(E.13)
Therefore, the mapping is a contraction under the chosen norm. This concludes the proof.
Remark 2.
Note that, in contrast to the stochastic approximation approach used to prove convergence in the complete observation case, we used a different approach to prove the convergence of the learning with partial observations, since it is difficult to apply stochastic approximation in the partial observation case. Although the stochastic approximation approach is slightly more complicated, it allows us to find a finite value of the parameter in (D.14) that assures convergence; with the contraction mapping approach, we are still unable to find such a finite value.
References
1. Mitola J III: Cognitive radio for flexible mobile multimedia communications. Mobile Networks and Applications 2001, 6(5):435-441.
2. Mitola J III: Cognitive Radio, Licentiate Proposal. KTH, Stockholm, Sweden; 1998.
3. Zhao Q, Sadler BM: A survey of dynamic spectrum access. IEEE Signal Processing Magazine 2007, 24(3):79-89.
4. Kim K, Akbar IA, Bae KK, Um J-S, Spooner CM, Reed JH: Cyclostationary approaches to signal detection and classification in cognitive radio. Proceedings of the 2nd IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks, April 2007, 212-215.
5. Li H, Li C, Dai H: Quickest spectrum sensing in cognitive radio. Proceedings of the 42nd Annual Conference on Information Sciences and Systems (CISS '08), March 2008, Princeton, NJ, USA, 203-208.
6. Ghasemi A, Sousa ES: Collaborative spectrum sensing for opportunistic access in fading environments. Proceedings of the 1st IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks (DySPAN '05), November 2005, 131-136.
7. Kloeck C, Jaekel H, Jondral F: Multi-agent radio resource allocation. Mobile Networks and Applications 2006, 11(6):813-824.
8. Niyato D, Hossain E, Han Z: Dynamics of multiple-seller and multiple-buyer spectrum trading in cognitive radio networks: a game-theoretic modeling approach. IEEE Transactions on Mobile Computing 2009, 8(8):1009-1022.
9. Kuo FF: The ALOHA system. In Computer Networks. Prentice-Hall, Englewood Cliffs, NJ, USA; 1973:501-518.
10. Buşoniu L, Babuška R, De Schutter B: A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man and Cybernetics, Part C 2008, 38(2):156-172.
11. Fudenberg D, Levine DK: The Theory of Learning in Games. The MIT Press, Cambridge, Mass, USA; 1998.
13. Hu J, Wellman MP: Multiagent reinforcement learning: theoretical framework and an algorithm. Proceedings of the 15th International Conference on Machine Learning (ICML '98), July 1998, 242-250.
14. Kushner HJ, Yin GG: Stochastic Approximation and Recursive Algorithms and Applications. Springer, New York, NY, USA; 2003.
15. Liu J, Yi Y, Proutiere A, Chiang M, Poor HV: Towards utility-optimal random access without message passing. Wireless Communications and Mobile Computing 2010, 10(1):115-128.
16. Yi Y, de Veciana G, Shakkottai S: MAC scheduling with low overheads by learning neighborhood contention patterns. Submitted to IEEE/ACM Transactions on Networking.
17. Fu F, van der Schaar M: Learning to compete for resources in wireless stochastic games. IEEE Transactions on Vehicular Technology 2009, 58(4):1904-1919.
18. Han Z, Pandana C, Liu KJR: Distributive opportunistic spectrum access for cognitive radio using correlated equilibrium and no-regret learning. Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC '07), March 2007, 11-15.
19. van der Schaar M, Fu F: Spectrum access games and strategic learning in cognitive radio networks for delay-critical applications. Proceedings of the IEEE 2009, 97(4):720-739.
20. Hofbauer J, Sigmund K: Evolutionary Games and Population Dynamics. Cambridge University Press, Cambridge, UK; 1998.
21. Wang B, Liu KJR, Clancy TC: Evolutionary game framework for behavior dynamics in cooperative spectrum sensing. Proceedings of the IEEE Conference on Global Communications (Globecom '08), November-December 2008, New Orleans, La, USA, 3123-3127.
22. Sutton PD, Nolan KE, Doyle LE: Cyclostationary signatures in practical cognitive radio applications. IEEE Journal on Selected Areas in Communications 2008, 26(1):13-24.
23. Sutton RS, Barto AG: Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Mass, USA; 1998.
24. Metrick A, Polak B: Fictitious play in 2 × 2 games: a geometric proof of convergence. Economic Theory 1994, 4(6):923-933.
26. Tsitsiklis JN: Asynchronous stochastic approximation and Q-learning. Machine Learning 1994, 16(3):185-202.