Published in: Wireless Networks 8/2014

01.11.2014

Two timescale convergent Q-learning for sleep-scheduling in wireless sensor networks

Authors: L. A. Prashanth, Abhranil Chatterjee, Shalabh Bhatnagar



Abstract

In this paper, we consider an intrusion detection application for wireless sensor networks. We study the problem of scheduling the sleep times of the individual sensors, where the objective is to maximize the network lifetime while keeping the tracking error to a minimum. We formulate this problem as a partially observable Markov decision process (POMDP) with continuous state-action spaces, in a manner similar to Fuemmeler and Veeravalli (IEEE Trans Signal Process 56(5), 2091–2101, 2008). However, unlike their formulation, we consider infinite-horizon discounted and average cost objectives as performance criteria. For each criterion, we propose a convergent on-policy Q-learning algorithm that operates on two timescales, while employing function approximation. Feature-based representations and function approximation are necessary to handle the curse of dimensionality associated with the underlying POMDP. Our proposed algorithm incorporates a policy gradient update using a one-simulation simultaneous perturbation stochastic approximation estimate on the faster timescale, while the Q-value parameter (arising from a linear function approximation architecture for the Q-values) is updated in an on-policy temporal difference fashion on the slower timescale. The feature selection scheme employed in each of our algorithms manages the energy and tracking components in a manner that assists the search for the optimal sleep-scheduling policy. For the sake of comparison, in both the discounted and average settings, we also develop a function approximation analogue of the Q-learning algorithm. This algorithm, unlike the two-timescale variant, does not possess theoretical convergence guarantees. Finally, we also adapt our algorithms to include a stochastic iterative estimation scheme for the intruder's mobility model; this is useful in settings where the latter is not known.
Our simulation results on a synthetic 2-dimensional network setting suggest that our algorithms result in better tracking accuracy at the cost of only a few additional sensors, in comparison to a recent prior work.
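The two-timescale scheme described above can be sketched roughly as follows. This is an illustrative sketch, not the paper's exact algorithm (TQSA-A): the function names, feature shapes, and step sizes are assumptions. It shows a TD-like update of the linear Q-value parameter on the slower timescale, an SPSA-style policy-gradient step on the faster timescale, and the one-simulation SPSA gradient estimate.

```python
import numpy as np

def boltzmann_probs(w, feats_per_action):
    """Parameterized Boltzmann (softmax) policy over a finite action set.

    feats_per_action: one feature vector per action; w: policy parameter.
    """
    prefs = np.array([f @ w for f in feats_per_action])
    e = np.exp(prefs - prefs.max())  # subtract max for numerical stability
    return e / e.sum()

def spsa_gradient(J_perturbed, delta, Delta):
    """One-simulation SPSA estimate of the policy gradient:
    grad_i ~= J(w + delta * Delta) / (delta * Delta_i),
    where Delta is a vector of +/-1 (Rademacher) perturbations."""
    return J_perturbed / (delta * Delta)

def two_timescale_step(theta, w, phi_sa, phi_next, cost, gamma,
                       a_n, b_n, grad_estimate):
    """One iteration of the two-timescale update (discounted-cost sketch).

    theta: linear Q-value parameter (updated on the slower timescale a_n)
    w: policy parameter (updated on the faster timescale b_n)
    phi_sa, phi_next: features of the current and next state-action pairs
    """
    # Slower timescale: on-policy TD-like error under linear Q-approximation
    td_error = cost + gamma * (phi_next @ theta) - (phi_sa @ theta)
    theta = theta + a_n * td_error * phi_sa
    # Faster timescale: SPSA policy-gradient descent step
    w = w - b_n * grad_estimate
    return theta, w
```

In a full run, `a_n` and `b_n` would be diminishing step-size sequences with `a_n / b_n -> 0`, which is what separates the two timescales.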


Appendices
Footnotes
1
A short version of this paper, containing only the average cost setting and algorithms and no proofs, is available in [24]. The current paper additionally includes: (i) algorithms for the discounted cost setting; (ii) a detailed proof of convergence of the average cost algorithm using the theory of stochastic recursive inclusions; and (iii) detailed numerical experiments.
 
2
Since we study the long-run average sum of (2) (see (4) below), we can consider the problem of tracking an intruder over an infinite horizon, whereas a termination state was necessary in [14] because they considered a total cost objective.
 
3
A simple rule is to choose a state \(s\) such that the underlying MDP visits \(s\) with positive probability. This criterion ensures that the term (II) of (11) converges to the optimal average cost \(J^*\).
 
4
One may use an \(\epsilon\)-greedy policy for TQSA-A as well; however, this would result in additional exploration. Since TQSA-A updates the parameters of an underlying parameterized Boltzmann policy (which is itself randomized), no extra exploration step is needed in our algorithm.
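The distinction drawn in this footnote can be illustrated with a minimal sketch (hypothetical helper names; Q-values here are costs, so smaller is better): an \(\epsilon\)-greedy rule bolts an explicit exploration parameter onto a greedy choice, whereas a Boltzmann (softmax) policy assigns positive probability to every action by construction.

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(q_values, eps):
    """Greedy action with explicit epsilon-exploration (costs: argmin)."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmin(q_values))

def boltzmann_sample(q_values, temperature):
    """Boltzmann policy: randomized by construction, so no separate
    exploration parameter is needed; temperature controls spread."""
    prefs = -np.asarray(q_values) / temperature  # lower cost -> higher prob
    p = np.exp(prefs - prefs.max())
    p /= p.sum()
    return int(rng.choice(len(q_values), p=p))
```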
 
5
This is because the intruder remains at the starting location for at least one time step, and the initial exploration of actions gives a positive probability of a random action being chosen.
 
References
1. Abounadi, J., Bertsekas, D., & Borkar, V. (2002). Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization, 40(3), 681–698.
2. Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In ICML, pp. 30–37.
3. Beccuti, M., Codetta-Raiteri, D., & Franceschinis, G. (2009). Multiple abstraction levels in performance analysis of WSN monitoring systems. In International ICST conference on performance evaluation methodologies and tools, p. 73.
4. Bertsekas, D. P. (2007). Dynamic programming and optimal control (3rd ed., Vol. II). Belmont: Athena Scientific.
5. Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont: Athena Scientific.
7. Bhatnagar, S., Fu, M., Marcus, S., & Wang, I. (2003). Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation (TOMACS), 13(2), 180–209.
8. Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., & Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45(11), 2471–2482.
9. Bhatnagar, S., Prasad, H., & Prashanth, L. (2013). Stochastic recursive algorithms for optimization (Vol. 434). New York: Springer.
10. Borkar, V. (2008). Stochastic approximation: A dynamical systems viewpoint. Cambridge: Cambridge University Press.
11. Cui, Y., Lau, V. K., Wang, R., Huang, H., & Zhang, S. (2012a). A survey on delay-aware resource control for wireless systems: Large deviation theory, stochastic Lyapunov drift, and distributed stochastic learning. IEEE Transactions on Information Theory, 58(3), 1677–1701.
12. Cui, Y., Lau, V. K., & Wu, Y. (2012b). Delay-aware BS discontinuous transmission control and user scheduling for energy harvesting downlink coordinated MIMO systems. IEEE Transactions on Signal Processing, 60(7), 3786–3795.
13. Fu, F., & van der Schaar, M. (2009). Learning to compete for resources in wireless stochastic games. IEEE Transactions on Vehicular Technology, 58(4), 1904–1919.
14. Fuemmeler, J., & Veeravalli, V. (2008). Smart sleeping policies for energy efficient tracking in sensor networks. IEEE Transactions on Signal Processing, 56(5), 2091–2101.
15. Fuemmeler, J., Atia, G., & Veeravalli, V. (2011). Sleep control for tracking in sensor networks. IEEE Transactions on Signal Processing, 59(9), 4354–4366.
16. Gui, C., & Mohapatra, P. (2004). Power conservation and quality of surveillance in target tracking sensor networks. In Proceedings of the international conference on mobile computing and networking, pp. 129–143.
17. Jiang, B., Han, K., Ravindran, B., & Cho, H. (2008). Energy efficient sleep scheduling based on moving directions in target tracking sensor network. In IEEE international symposium on parallel and distributed processing, pp. 1–10.
18. Jianlin, M., Fenghong, X., & Hua, L. (2009). RL-based superframe order adaptation algorithm for IEEE 802.15.4 networks. In Chinese control and decision conference, IEEE, pp. 4708–4711.
19. Jin, G. Y., Lu, X. Y., & Park, M. S. (2006). Dynamic clustering for object tracking in wireless sensor networks. Ubiquitous Computing Systems, LNCS 4239, 200–209.
20. Khan, M. I., & Rinner, B. (2012). Resource coordination in wireless sensor networks by cooperative reinforcement learning. In IEEE international conference on pervasive computing and communications workshops, pp. 895–900.
21. Konda, V. R., & Tsitsiklis, J. N. (2004). Convergence rate of linear two-time-scale stochastic approximation. Annals of Applied Probability, pp. 796–819.
22. Liu, Z., & Elhanany, I. (2006). RL-MAC: A QoS-aware reinforcement learning based MAC protocol for wireless sensor networks. In IEEE international conference on networking, sensing and control, pp. 768–773.
23. Niu, J. (2010). Self-learning scheduling approach for wireless sensor network. In International conference on future computer and communication (ICFCC), IEEE, Vol. 3, pp. 253–257.
24. Prashanth, L., Chatterjee, A., & Bhatnagar, S. (2014). Adaptive sleep-wake control using reinforcement learning in sensor networks. In 6th international conference on communication systems and networks (COMSNETS), IEEE.
25. Prashanth, L. A., & Bhatnagar, S. (2011a). Reinforcement learning with average cost for adaptive control of traffic lights at intersections. In 14th international IEEE conference on intelligent transportation systems (ITSC), pp. 1640–1645.
26. Prashanth, L. A., & Bhatnagar, S. (2011b). Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 12(2), 412–421.
27. Premkumar, K., & Kumar, A. (2008). Optimal sleep-wake scheduling for quickest intrusion detection using sensor networks. In IEEE INFOCOM, Arizona, USA.
28. Puterman, M. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.
29. Rucco, L., Bonarini, A., Brandolese, C., & Fornaciari, W. (2013). A bird's eye view on reinforcement learning approaches for power management in WSNs. In Wireless and mobile networking conference (WMNC), IEEE, pp. 1–8.
30. Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3), 332–341.
31. Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press.
32. Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.
33. Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3), 279–292.
Metadata
Title
Two timescale convergent Q-learning for sleep-scheduling in wireless sensor networks
Authors
L. A. Prashanth
Abhranil Chatterjee
Shalabh Bhatnagar
Publication date
01.11.2014
Publisher
Springer US
Published in
Wireless Networks / Issue 8/2014
Print ISSN: 1022-0038
Electronic ISSN: 1572-8196
DOI
https://doi.org/10.1007/s11276-014-0762-6
