Published in: Information Systems Frontiers 5/2017

26-07-2016

Automatic classification of data-warehouse-data for information lifecycle management using machine learning techniques

Authors: Sebastian Büsch, Volker Nissen, Arndt Wünscher


Abstract

The aim of Information Lifecycle Management (ILM) is to govern data throughout its lifecycle as efficiently and effectively as possible from a technical point of view. A core question is where data should be stored, since storage locations differ in cost and access time. To answer it, data have to be classified, which at present is done either manually with considerable effort or using only a few data attributes, in particular access frequency. For data warehouse systems, this article introduces an automated, and therefore fast and cost-effective, data classification for ILM. Machine learning techniques, in particular an artificial neural network (multilayer perceptron), a support vector machine, and a decision tree approach, are compared on an SAP-based real-world data set from the automotive industry. The classification considers a large number of data attributes and thus attains results similar to those of human experts. The comparison covers not only classification accuracy but also the types of misclassification that occur, since these are important in ILM.
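
To illustrate why the type of misclassification can matter alongside accuracy, the following is a minimal sketch, not the authors' evaluation method. The storage class names, the cost values, and the two sets of predictions are hypothetical assumptions chosen for illustration; the abstract does not state which classes or error weights the study uses.

```python
from collections import Counter

# Hypothetical storage classes (assumption, not taken from the article).
CLASSES = ["online", "nearline", "archive"]

# Assumed cost of storing an object of class `actual` in class `predicted`.
# Here, placing frequently needed data on slow storage is penalised more
# heavily than placing rarely needed data on fast storage.
COST = {
    ("online", "nearline"): 2.0,
    ("online", "archive"): 5.0,
    ("nearline", "archive"): 2.0,
    ("nearline", "online"): 1.0,
    ("archive", "nearline"): 1.0,
    ("archive", "online"): 2.0,
}


def accuracy(actual, predicted):
    """Fraction of objects assigned to the correct class."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


def misclassification_cost(actual, predicted, cost):
    """Average cost per object; correct assignments cost nothing."""
    counts = confusion_counts(actual, predicted)
    total = sum(cost.get(pair, 0.0) * n for pair, n in counts.items())
    return total / len(actual)


# Made-up labels from a human expert and from two hypothetical classifiers.
actual = ["online", "online", "nearline", "nearline", "archive", "archive"]
pred_a = ["online", "archive", "nearline", "nearline", "archive", "archive"]
pred_b = ["online", "online", "nearline", "online", "archive", "archive"]

acc_a = accuracy(actual, pred_a)
acc_b = accuracy(actual, pred_b)
cost_a = misclassification_cost(actual, pred_a, COST)
cost_b = misclassification_cost(actual, pred_b, COST)

print(f"classifier A: accuracy={acc_a:.3f}, mean cost={cost_a:.3f}")
print(f"classifier B: accuracy={acc_b:.3f}, mean cost={cost_b:.3f}")
```

Both hypothetical classifiers make exactly one error and therefore have the same accuracy, yet under the assumed cost table their mean misclassification costs differ, which is the kind of distinction an accuracy-only comparison would miss.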


Metadata
Title
Automatic classification of data-warehouse-data for information lifecycle management using machine learning techniques
Authors
Sebastian Büsch
Volker Nissen
Arndt Wünscher
Publication date
26-07-2016
Publisher
Springer US
Published in
Information Systems Frontiers / Issue 5/2017
Print ISSN: 1387-3326
Electronic ISSN: 1572-9419
DOI
https://doi.org/10.1007/s10796-016-9680-8
