Published in: Neural Computing and Applications 13/2020

16.08.2019 | Original Article

An outlier detection approach in large-scale data stream using rough set

Authors: Manmohan Singh, Rajendra Pamula

Abstract

Outlier detection has become an important research area in stream data mining because of its wide range of applications. Many methods have been proposed in the literature, but they work well mainly for simple, positive regions of outliers and give little attention to boundary regions. Moreover, an algorithm that processes stream data must be effective and able to handle unbounded data in one pass or a limited number of passes. These problems motivated us to propose an outlier detection approach for large-scale data streams. The proposed algorithm employs relative cardinality, the entropy outlier factor from the theory of information-based systems, and a size-variant sliding window over the stream data. In addition, we propose a new methodology for concept drift adaptation on evolving data streams. The proposed method is run on nine benchmark datasets and compared with six existing methods: EXPoSE, iForest, OC-SVM, LOF, KDE, and FastAbod. Experimental results show that the proposed method outperforms all six in terms of the receiver operating characteristic curve, precision-recall, and computational time, for positive regions as well as boundary regions.
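
To make the general idea of one-pass, window-based, information-theoretic scoring concrete, the following is a minimal sketch and not the authors' algorithm. It assumes categorical records, a fixed-size window (the paper's window is size-variant), and a generic mean-surprisal score in place of the entropy outlier factor. The names `surprisal_scores`, `stream_outliers`, `window_size`, and `threshold` are hypothetical.

```python
import math
from collections import Counter, deque


def surprisal_scores(window):
    """Score each record in the window by its mean attribute surprisal.

    The surprisal of a value is -log2 of its relative frequency within the
    same attribute of the window, so rare values score higher.
    This is a generic stand-in, NOT the paper's entropy outlier factor.
    """
    n = len(window)
    n_attrs = len(window[0])
    # One frequency table per attribute position.
    counts = [Counter(rec[j] for rec in window) for j in range(n_attrs)]
    return [
        sum(-math.log2(counts[j][rec[j]] / n) for j in range(n_attrs)) / n_attrs
        for rec in window
    ]


def stream_outliers(stream, window_size=5, threshold=1.5):
    """Yield (index, record, score) for records scoring above threshold.

    Each record is scored once, on arrival, against the current window
    (which includes the record itself), so the stream is read in one pass.
    Assumption: the window size is fixed, unlike the size-variant window
    described in the abstract.
    """
    window = deque(maxlen=window_size)  # oldest record drops out automatically
    for i, rec in enumerate(stream):
        window.append(rec)
        if len(window) < window_size:
            continue  # wait until the window is full
        score = surprisal_scores(list(window))[-1]
        if score > threshold:
            yield i, rec, score


if __name__ == "__main__":
    demo = [("a", "x")] * 5 + [("b", "y")] + [("a", "x")] * 3
    for index, record, score in stream_outliers(demo):
        print(index, record, round(score, 3))
```

On the demo stream, only the record `("b", "y")` at index 5 exceeds the threshold, because both of its attribute values occur once in a window of five. Handling concept drift or boundary regions, as the abstract describes, would need more than this sketch provides.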

Metadata
Title
An outlier detection approach in large-scale data stream using rough set
Authors
Manmohan Singh
Rajendra Pamula
Publication date
16.08.2019
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 13/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-019-04421-4
