
2018 | Original Paper | Book Chapter

A Novel Evaluation Methodology for Assessing Off-Policy Learning Methods in Contextual Bandits

Authors: Negar Hassanpour, Russell Greiner

Published in: Advances in Artificial Intelligence

Publisher: Springer International Publishing


Abstract

We propose a novel evaluation methodology for assessing off-policy learning methods in contextual bandits. In particular, we provide a way to use data from any given Randomized Control Trial (RCT) to generate a range of observational studies with synthesized “outcome functions” that can match the user’s specified degrees of sample selection bias, which can then be used to comprehensively assess a given learning method. This is especially important in evaluating methods developed for precision medicine, where deploying a bad policy can have devastating effects. As the outcome function specifies the real-valued quality of any treatment for any instance, we can accurately compute the quality of any proposed treatment policy. This paper uses this evaluation methodology to establish a common ground for comparing the robustness and performance of the available off-policy learning methods in the literature.
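
The pipeline described in the abstract can be illustrated with a minimal sketch. Everything named here is a hypothetical placeholder rather than the authors' construction: `synth_outcome` stands in for a synthesized outcome function, `biased_propensity` for a user-controlled selection-bias mechanism (with `bias=0` reproducing a 50/50 randomized assignment), and the covariates are random numbers rather than real RCT data. The point is only that, once the outcome function is known for every treatment, the value of any policy can be computed exactly.

```python
import math
import random

def synth_outcome(x, t):
    """Hypothetical outcome function: real-valued quality of treatment t for instance x."""
    return math.sin(x[0]) + (0.5 * x[1] if t == 1 else -0.5 * x[1])

def biased_propensity(x, bias):
    """Probability that the logging policy assigns treatment 1.
    bias = 0 gives 0.5 for everyone (RCT-like); larger bias ties assignment to x[1]."""
    return 1.0 / (1.0 + math.exp(-bias * x[1]))

def make_observational(xs, bias, rng):
    """Build logged tuples (x, t, y, p): covariates, logged treatment,
    observed outcome, and the propensity of the logged treatment."""
    data = []
    for x in xs:
        p1 = biased_propensity(x, bias)
        t = 1 if rng.random() < p1 else 0
        p = p1 if t == 1 else 1.0 - p1
        data.append((x, t, synth_outcome(x, t), p))
    return data

def true_policy_value(policy, xs):
    """Exact average outcome of a deterministic policy, using the known outcome function."""
    return sum(synth_outcome(x, policy(x)) for x in xs) / len(xs)

# Usage: placeholder covariates, a strongly biased logging policy, one candidate policy.
rng = random.Random(0)
xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]
observational = make_observational(xs, bias=3.0, rng=rng)
candidate = lambda x: 1 if x[1] > 0 else 0
print(true_policy_value(candidate, xs))
```

Varying `bias` yields a family of observational datasets over the same covariates, so a learning method can be scored against the exact policy value at each bias level.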

Footnotes
1
This is not one-hot encoding as there may be instances with multiple associated labels – e.g., a news article concerning political initiatives on climate change.
 
2
Note that the test set remains intact for evaluating the learned policy.
 
3
This means the X values are realistic. By contrast, we do not know whether the X values from a supervised dataset look like realistic [medical] observational studies.
 
4
A low \(R^2\) measure suggests that there must exist [some] unobserved confounder(s) that [significantly] contribute to the outcome.
 
5
Our implementation of IPS (and SN below) is obtained from Policy Optimizer for Exponential Models (POEM [19]). We extended POEM substantially to include a way to deal with the missing components (i.e., OP and DR), as well as implementation of the proposed evaluation methodology.
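
For readers unfamiliar with the two estimators named in this footnote, the following is a minimal sketch of their standard textbook forms, not the POEM implementation or the authors' extension. It assumes logged tuples `(x, t, y, p)` where `p` is the logging propensity of the logged treatment `t`, and a hypothetical `policy_prob(x, t)` giving the evaluated policy's probability of choosing `t`.

```python
def ips_estimate(logs, policy_prob):
    """Inverse propensity scoring: average of outcomes reweighted by
    policy_prob(x, t) / p over all logged samples."""
    total = 0.0
    n = 0
    for x, t, y, p in logs:
        total += y * policy_prob(x, t) / p
        n += 1
    return total / n

def sn_estimate(logs, policy_prob):
    """Self-normalized variant: divides by the sum of importance weights
    instead of the sample count. Assumes at least one nonzero weight."""
    num = 0.0
    den = 0.0
    for x, t, y, p in logs:
        w = policy_prob(x, t) / p
        num += y * w
        den += w
    return num / den

# Usage: two logged samples and a policy that always chooses treatment 1.
logs = [((0.0,), 1, 2.0, 0.25), ((0.0,), 0, 4.0, 0.75)]
always_one = lambda x, t: 1.0 if t == 1 else 0.0
print(ips_estimate(logs, always_one), sn_estimate(logs, always_one))
```

The self-normalized form is invariant to adding a constant to all outcomes, which plain IPS is not.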
 
References
2.
Imbens, G.W., Rubin, D.B.: Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York (2015)
3.
Wang, C.C., Kulkarni, S.R., Poor, H.V.: Bandit problems with side observations. IEEE Trans. Autom. Control 50(3), 338–355 (2005)
4.
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, vol. 1. MIT Press, Cambridge (1998)
5.
Bottou, L., Peters, J., Candela, J.Q., Charles, D.X., Chickering, M., Portugaly, E., Ray, D., Simard, P.Y., Snelson, E.: Counterfactual reasoning and learning systems: the example of computational advertising. JMLR 14(1), 3207–3260 (2013)
6.
Li, L., Chen, S., Kleban, J., Gupta, A.: Counterfactual estimation and optimization of click metrics in search engines: a case study. In: Proceedings of the 24th International Conference on World Wide Web. ACM (2015)
7.
Schnabel, T., Swaminathan, A., Singh, A., Chandak, N., Joachims, T.: Recommendations as treatments: debiasing learning and evaluation. In: Proceedings of the 33rd International Conference on Machine Learning, vol. 48 (2016)
8.
Liu, Y.E., Mandel, T., Brunskill, E., Popovic, Z.: Trading off scientific knowledge and user learning with multi-armed bandits. In: Educational Data Mining (2014)
9.
Li, L., Chu, W., Langford, J., Schapire, R.E.: A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th International Conference on World Wide Web. ACM (2010)
10.
Li, L., Chu, W., Langford, J., Wang, X.: Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In: Proceedings of the 4th International Conference on Web Search and Data Mining, Hong Kong (2011)
11.
Horvitz, D.G., Thompson, D.J.: A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47(260), 663–685 (1952)
12.
Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55 (1983)
13.
Dudík, M., Langford, J., Li, L.: Doubly robust policy evaluation and learning. In: International Conference on Machine Learning (2011)
14.
Robins, J.M., Rotnitzky, A., Zhao, L.P.: Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 89(427), 846–866 (1994)
15.
Swaminathan, A., Joachims, T.: The self-normalized estimator for counterfactual learning. In: Advances in Neural Information Processing Systems (2015)
16.
Hirano, K., Imbens, G.W., Ridder, G.: Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4), 1161–1189 (2003)
17.
Swaminathan, A., Joachims, T.: Counterfactual risk minimization: learning from logged bandit feedback. In: International Conference on Machine Learning (2015)
18.
Beygelzimer, A., Langford, J.: The offset tree for learning with partial labels. In: Proceedings of the 15th ACM SIGKDD. ACM (2009)
19.
Swaminathan, A., Joachims, T.: Batch learning from logged bandit feedback through counterfactual risk minimization. JMLR 16, 1731–1755 (2015)
20.
Rasmussen, C.E., Williams, C.K.: Gaussian Processes for Machine Learning, vol. 1. MIT Press, Cambridge (2006)
21.
Vickers, A.J., Rees, R.W., Zollman, C.E., McCarney, R., Smith, C.M., Ellis, N., Fisher, P., Van Haselen, R.: Acupuncture for chronic headache in primary care: large, pragmatic, randomised trial. BMJ 328(7442), 744 (2004)
22.
Vickers, A.J.: Whose data set is it anyway? Sharing raw data from randomized trials. Trials 7(1), 15 (2006)
23.
Hypericum Depression Trial Study Group, et al.: Effect of Hypericum perforatum (St. John’s Wort) in major depressive disorder: a randomized controlled trial. JAMA 287(14), 1807–1814 (2002)
Metadata
Title
A Novel Evaluation Methodology for Assessing Off-Policy Learning Methods in Contextual Bandits
Authors
Negar Hassanpour
Russell Greiner
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-89656-4_3