2016 | OriginalPaper | Book chapter

Influence of Outliers Introduction on Predictive Models Quality

Authors: Mateusz Kalisch, Marcin Michalak, Marek Sikora, Łukasz Wróbel, Piotr Przystałka

Published in: Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery

Publisher: Springer International Publishing

Abstract

The paper presents research on how the level of outliers in the data (training and test data considered separately) influences the quality of model predictions in a classification task. A set of 100 semi-artificial time series was considered, whose independent variables were close to real ones observed in an underground coal mining environment and whose dependent variable was generated with a decision tree. For every method considered (decision trees, naive Bayes, logistic regression and kNN), a reference model was built (no outliers in the data), and its quality was compared with the quality of two models: Out–Out (outliers in training and test data) and Non-out–Out (outliers only in test data). Fifty levels of outliers in the data were considered, from 1% to 50%. The models were compared statistically using the sign test.
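The abstract does not say which variant of the sign test was applied, so the following is only a minimal sketch under stated assumptions: a two-sided exact (binomial) sign test on paired quality scores, with ties dropped. The function name `sign_test` and the idea of pairing one score per time series are illustrative and not taken from the paper.

```python
from math import comb


def sign_test(reference, candidate):
    """Two-sided exact sign test on paired scores (assumed variant).

    reference, candidate: equal-length sequences of quality scores,
    e.g. one accuracy value per time series for the reference model
    and for an Out-Out or Non-out-Out model (hypothetical pairing).
    Returns (wins, losses, p_value); ties are ignored.
    """
    wins = sum(1 for r, c in zip(reference, candidate) if c > r)
    losses = sum(1 for r, c in zip(reference, candidate) if c < r)
    n = wins + losses
    if n == 0:
        # Every pair is tied, so there is no evidence of a difference.
        return wins, losses, 1.0
    k = min(wins, losses)
    # Probability of k or fewer successes under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the one-sided tail for a two-sided test, capped at 1.
    return wins, losses, min(1.0, 2 * tail)
```

Under these assumptions, the test would be run once per classifier and per outlier level, comparing the reference model's scores across the series with those of the model evaluated on contaminated data.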

Metadata
Title
Influence of Outliers Introduction on Predictive Models Quality
Authors
Mateusz Kalisch
Marcin Michalak
Marek Sikora
Łukasz Wróbel
Piotr Przystałka
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-34099-9_5