nach oben

Erschienen in:

2018 | OriginalPaper | Buchkapitel

Harnessing Hundreds of Millions of Cases: Case-Based Prediction at Industrial Scale

verfasst von : Vahid Jalali, David Leake

Erschienen in: Case-Based Reasoning Research and Development

Verlag: Springer International Publishing

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Building predictive models is central to many big data applications. However, model building is computationally costly at scale. An appealing alternative is bypassing model building by applying case-based prediction to reason directly from data. However, to our knowledge case-based prediction still has not been applied at true industrial scale. In previous work we introduced a knowledge-light/data intensive approach to case-based prediction, using ensembles of automatically-generated adaptations. We developed foundational scaleup methods, using Locality Sensitive Hashing (LSH) for fast approximate nearest neighbor retrieval of both cases and adaptation rules, and tested them for millions of cases. This paper presents research on extending these methods to address the practical challenges raised by case bases of hundreds of millions of cases for a real world industrial e-commerce application. Handling this application required addressing how to keep LSH practical for skewed data; the resulting efficiency gains in turn enabled applying an adaptation generation strategy that previously was computationally infeasible. Experimental results show that our CBR approach achieves accuracy comparable to or better than state of the art machine learning methods commonly applied, while avoiding their model-building cost. This supports the opportunity to harness CBR for industrial scale prediction.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Vorheriges Kapitel A Textual Recommender System for Clinical Data

Nächstes Kapitel Case Base Elicitation for a Context-Aware Recommender System

https://www.kaggle.com/zurfer/rtb/data.

Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K. (ed.) ASWC/ISWC -2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76298-0_52CrossRef

Beaver, I., Dumoulin, J.: Applying mapreduce to learning user preferences in near real-time. In: Delany, S.J., Ontañón, S. (eds.) ICCBR 2013. LNCS, vol. 7969, pp. 15–28. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39056-2_2CrossRef

Bi, Z., Faloutsos, C., Korn, F.: The “DGX” distribution for mining massive, skewed data. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2001, pp. 17–26. ACM, New York (2001)

Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Lechevallier, Y., Saporta, G. (eds.) Proceedings of COMPSTAT 2010, pp. 177–186. Physica-Verlag HD, Heidelberg (2010). https://doi.org/10.1007/978-3-7908-2604-3_16CrossRef

Chawla, N.V.: Data mining for imbalanced datasets: an overview. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 875–886. Springer, Boston (2010). https://doi.org/10.1007/0-387-25465-X_40

Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG 2004, pp. 253–262. ACM, New York (2004)

Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. VLDB 99, 518–529 (1999)

Hanney, K., Keane, M.T.: Learning adaptation rules from a case-base. In: Smith, I., Faltings, B. (eds.) EWCBR 1996. LNCS, vol. 1168, pp. 179–192. Springer, Heidelberg (1996). https://doi.org/10.1007/BFb0020610CrossRef

Houeland, T.G., Aamodt, A.: The utility problem for lazy learners - towards a non-eager approach. In: Bichindaritz, I., Montani, S. (eds.) ICCBR 2010. LNCS (LNAI), vol. 6176, pp. 141–155. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14274-1_12CrossRef

10.

Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC 1998, pp. 604–613. ACM, New York (1998)

11.

Jalali, V., Leake, D.: CBR meets big data: a case study of large-scale adaptation rule generation. In: Hüllermeier, E., Minor, M. (eds.) ICCBR 2015. LNCS, vol. 9343, pp. 181–196. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24586-7_13CrossRef

12.

Jalali, V., Leake, D.: Scaling up ensemble of adaptations for classification by approximate nearest neighbor retrieval. In: Aha, D.W., Lieber, J. (eds.) ICCBR 2017. LNCS (LNAI), vol. 10339, pp. 154–169. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-61030-6_11CrossRef

13.

Jalali, V., Leake, D., Forouzandehmehr, N.: Ensemble of adaptations for classification: learning adaptation rules for categorical features. In: Goel, A., Díaz-Agudo, M.B., Roth-Berghofer, T. (eds.) ICCBR 2016. LNCS, vol. 9969, pp. 186–202. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47096-2_13CrossRef

14.

Jalali, V., Leake, D.: A context-aware approach to selecting adaptations for case-based reasoning. In: Brézillon, P., Blackburn, P., Dapoigny, R. (eds.) CONTEXT 2013. LNCS, vol. 8175, pp. 101–114. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40972-1_8CrossRef

15.

Jalali, V., Leake, D.: Extending case adaptation with automatically-generated ensembles of adaptation rules. In: Delany, S.J., Ontañón, S. (eds.) ICCBR 2013. LNCS, vol. 7969, pp. 188–202. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39056-2_14CrossRef

16.

Jalali, V., Leake, D.: Adaptation-guided case base maintenance. In: Proceedings of the Twenty-Eighth Conference on Artificial Intelligence, pp. 1875–1881. AAAI Press (2014)

17.

Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image search. In: IEEE International Conference on Computer Vision ICCV (2009)

18.

Leake, D., Smyth, B., Wilson, D., Yang, Q. (eds.): Maintaining Case-Based Reasoning Systems. Blackwell, Malden (2001). Special issue of Computational Intelligence 17(2) (2001)

19.

Leetaru, K., Schrodt, P.A.: GDELT: global data on events, location, and tone. ISA Annual Convention (2013)

20.

Lin, Y.B., Ping, X.O., Ho, T.W., Lai, F.: Processing and analysis of imbalanced liver cancer patient data by case-based reasoning. In: The 7th 2014 Biomedical Engineering International Conference, pp. 1–5, November 2014

21.

Malof, J., Mazurowski, M., Tourassi, G.: The effect of class imbalance on case selection for case-based classifiers: an empirical study in the context of medical decision support. Neural Netw. 25, 141–145 (2012)CrossRef

22.

Meng, X., et al.: MLlib: machine learning in apache spark. CoRR abs/1505.06807 (2015)

23.

Mühleisen, H., Bizer, C.: Web data commons - extracting structured data from two large web corpora. In: Bizer, C., Heath, T., Berners-Lee, T., Hausenblas, M. (eds.) WWW 2012 Workshop on Linked Data on the Web, Lyon, France, 16 April 2012. CEUR Workshop Proceedings, vol. 937. CEUR-WS.org (2012)

24.

Ontañón, S., Plaza, E.: Collaborative case retention strategies for CBR agents. In: Ashley, K.D., Bridge, D.G. (eds.) ICCBR 2003. LNCS, vol. 2689, pp. 392–406. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-45006-8_31CrossRefMATH

25.

Palmer, C.R., Faloutsos, C.: Density biased sampling: an improved method for data mining and clustering. SIGMOD Rec. 29(2), 82–92 (2000)CrossRef

26.

Rojas, J.A.R., Kery, M.B., Rosenthal, S., Dey, A.: Sampling techniques to improve big data exploration. In: 2017 IEEE 7th Symposium on Large Data Analysis and Visualization (LDAV), pp. 26–35, October 2017

27.

Salamó, M., López-Sánchez, M.: Adaptive case-based reasoning using retention and forgetting strategies. Knowl. Based Syst. 24(2), 230–247 (2011)CrossRef

28.

Smyth, B., Cunningham, P.: The utility problem analysed. In: Smith, I., Faltings, B. (eds.) EWCBR 1996. LNCS, vol. 1168, pp. 392–399. Springer, Heidelberg (1996). https://doi.org/10.1007/BFb0020625CrossRef

29.

Smyth, B., Keane, M.: Remembering to forget: a competence-preserving case deletion policy for case-based reasoning systems. In: Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 377–382. Morgan Kaufmann, San Mateo (1995)

30.

Smyt, B., McKenna, E.: Footprint-based retrieval. In: Althoff, K.-D., Bergmann, R., Branting, L.K. (eds.) ICCBR 1999. LNCS, vol. 1650, pp. 343–357. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48508-2_25CrossRef

31.

Upadhyaya, S.R.: Parallel approaches to machine learning a comprehensive survey. J. Parallel Distrib. Comput. 73(3), 284–292 (2013). Models and Algorithms for High-Performance Distributed Data Mining

32.

Zhu, J., Yang, Q.: Remembering to add: competence-preserving case-addition policies for case base maintenance. In: Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pp. 234–241. Morgan Kaufmann (1999)

Titel: Harnessing Hundreds of Millions of Cases: Case-Based Prediction at Industrial Scale
verfasst von: Vahid Jalali
David Leake
Verlag: Springer International Publishing
Buch: Case-Based Reasoning Research and Development
Print ISBN: 978-3-030-01080-5

Electronic ISBN: 978-3-030-01081-2

Copyright-Jahr: 2018
DOI: https://doi.org/10.1007/978-3-030-01081-2_11

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"