Abstract
Assume you are given a data population characterized by a certain number of attributes. Assume, moreover, you are provided with the information that one of the individuals in this data population is abnormal, but no reason whatsoever is given to you as to why this particular individual is to be considered abnormal. In several cases, you will be indeed interested in discovering such reasons. This article is precisely concerned with this problem of discovering sets of attributes that account for the (a priori stated) abnormality of an individual within a given dataset. A criterion is presented to measure the abnormality of combinations of attribute values featured by the given abnormal individual with respect to the reference population. In this respect, each subset of attributes is intended to somehow represent a “property” of individuals. We distinguish between global and local properties. Global properties are subsets of attributes explaining the given abnormality with respect to the entire data population. With local ones, instead, two subsets of attributes are singled out, where the former one justifies the abnormality within the data subpopulation selected using the values taken by the exceptional individual on those attributes included in the latter one. The problem of individuating abnormal properties with associated explanations is formally stated and analyzed. Such a formal characterization is then exploited in order to devise efficient algorithms for detecting both global and local forms of most abnormal properties. The experimental evidence, which is accounted for in the article, shows that the algorithms are both able to mine meaningful information and to accomplish the computational task by examining a negligible fraction of the search space.
- Aggarwal, C. C. and Yu, P. 2001. Outlier detection for high dimensional data. In Proceedings of the International Conference on Managment of Data (SIGMOD), 37--46. Google ScholarDigital Library
- Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the International Conference on Very Large Databases (VLDB), 487--499. Google ScholarDigital Library
- Angiulli, F. and Pizzuti, C. 2002. Fast outlier detection in large high-dimensional datasets. In Proceedings of the International Conference on Principles of Data Mining and Knowledge Discovery (PKDD), 15--26. Google ScholarDigital Library
- Angiulli, F. and Pizzuti, C. 2005. Outlier mining in large high-dimensional datasets. IEEE Trans. Knowl. Data Eng. 17, 2, 203--215. Google ScholarDigital Library
- Arning, A., Aggarwal, C., and Raghavan, P. 1996. A linear method for deviation detection in large databases. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 164--169.Google Scholar
- Barnett, V. and Lewis, T. 1994. Outliers in Statistical Data. John Wiley & Sons.Google Scholar
- Breiman, L., Friedman, J., Olshen, R., and Stone, C. 1984. Classification and Regression Trees. Wadsworth, Belmont.Google Scholar
- Breunig, M. M., Kriegel, H., Ng, R., and Sander, J. 2000. Lof: Identifying density-based local outliers. In Proceedings of the International Conference on Managment of Data (SIGMOD), 93--104. Google ScholarDigital Library
- Chaudhuri, S. and Dayal, U. 1997. An overview of data warehousing and olap technology. SIGMOD Rec. 26, 1, 65--74. Google ScholarDigital Library
- Chen, Z. and Toda, S. 1995. The complexity of selecting maximal solutions. Inf. Comput. 1192, 2, 231--239. Google ScholarDigital Library
- Codd, E., Codd, S., and Salley, C. 1993. Providing OLAP (on-line analytical processing) to user-analysts: An it mandate. Tech. rep., Codd & Date, Inc.Google Scholar
- De Benedictis, G., Rose, G., Carrieri, G., Luca, M. D., Falcone, E., Passarino, G., Bonafè, M., Monti, D., Baggio, G., Bertolini, S., Mari, D., Mattace, R., and Franceschi, C. 1999. Mitochondrial DNA inherited variants are associates with successful aging and longevity in humans. FASEB J. 13, 12, 1532--1536.Google ScholarCross Ref
- Garasto, S., Berardelli, M., Rango, F. D., Mari, V., Feraco, E., and Benedictis, G. D. 2004. A study of the average effect of the 3'apob-vntr polymorphism on lipidemic parameters could explain why the short alleles (< 35 repeats) are rare in centenarians. BMC Medical Genetics 5, 3.Google ScholarCross Ref
- Garey, M. R. and Johnson, D. S. 1979. Computer and Intractability. W. H. Freeman and Company, New York. Google ScholarDigital Library
- Gini, C. 1921. Measurement of inequality of incomes. The Economic J. 31, 124--126.Google ScholarCross Ref
- Griffiths, A. J. F., Miller, J. H., Suzuki, D. T., Lewontin, R. C., and Gelbart, W. M. 1996. An Introduction to Genetic Analysis. W. H. Freeman.Google Scholar
- Hodge, V. and Austin, J. 2004. A survey of outlier detection methodologies. Artif. Intell. Rev. 22, 2, 85--126. Google ScholarDigital Library
- Karp, R. 1972. Reducibility among combinatorial problems. In Complexity of Computer Computations. Plenum, New York, 85--103.Google Scholar
- Klösgen, W. 1996. Explora: A multipattern and multistrategy discovery assistant. In Advances in Knowledge Discovery and Data Mining (KDD), 249--271. Google ScholarDigital Library
- Knorr, E. and Ng, R. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Databases (VLDB), 392--403. Google ScholarDigital Library
- Knorr, E. and Ng, R. 1999. Finding intensional knowledge of distance-based outliers. In Proceedings of the International Conference on Very Large Databases(VLDB), 211--222. Google ScholarDigital Library
- Knorr, E., Ng, R., and Tucakov, V. 2000. Distance-based outlier: Algorithms and applications. VLDB J. 8, 3-4, 237--253. Google ScholarDigital Library
- Knuth, D. 1998. The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd ed. Addison-Wesley. Google ScholarDigital Library
- Krentel, M. 1988. The complexity of optimization problems. J. Comput. Syst. Sci. 36, 3, 490--509. Google ScholarDigital Library
- Mehta, M., Agrawal, R., and Rissanen, J. 1996. Sliq: A fast scalable classifier for data mining. In International Conference on Extending Database Technology (EDBT), 18--32. Google ScholarDigital Library
- Newman, D. J., Hettich, S., Blake, C. L., and Merz, C. 1998. UCI repository of machine learning databases.Google Scholar
- Papadimitriou, C. 1994. Computational Complexity. Addison-Wesley, USA.Google Scholar
- Passarino, G., Montesanto, A., Dato, S., Giordano, S., Domma, F., Mari, V., Feraco, E., and Benedictis, G. D. 2006. Sex and age specificity of susceptibility genes modulating survival at old age. Human Heredity (Int. J. Hum. Medical Genetics) 62, 4, 213--220.Google ScholarCross Ref
- Ramaswamy, S., Rastogi, R., and Shim, K. 2000. Efficient algorithms for mining outliers from large datasets. In Proceedings of the International Conference on Managment of Data (SIGMOD), 427--438. Google ScholarDigital Library
- Rymon, R. 1992. Search through systematic set enumeration. In Proceedings of the International Conference on Principles of Knowledge and Reasoning (KR), 539--550.Google Scholar
- Sarawagi, S., Agrawal, R., and Megiddo, N. 1998. Discovery-Driven exploration of OLAP data cubes. In Proceedings of the International Conference on Extending Database Technology (EDBT), 168--182. Google ScholarDigital Library
- Selman, A. 1994. A taxonomy of complexity classes of functions. J. Comput. Syst. Sci. 48, 2, 357--381. Google ScholarDigital Library
- Shafer, J., Agrawal, R., and Mehta, M. 1996. Sprint: A scalable parallel classifier for data mining. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 544--555. Google ScholarDigital Library
- Suzuki, E. 2006. Data mining methods for discovering interesting exceptions from an unsupervised table. J. Universal Comput. Sci. 12, 6, 627--653.Google Scholar
- Tan, P. N., Steinbach, M., and Kumar, V. 2005. Introduction to Data Mining. Addison-Wesley. Google ScholarDigital Library
- Wei, L., Qian, W., Zhou, A., Jin, W., and Yu, J. 2003. Hot: Hypergraph-Based outlier test for categorical data. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 399--410. Google ScholarDigital Library
- Zhang, J., Gao, Q., and Wang, H. 2006. A novel method for detecting outlying subspaces in high-dimensional databases using genetic algorithm. In Proceedings of the International Conference on Data Mining (ICDM), 731--740. Google ScholarDigital Library
- Zhang, J. and Wang, H. 2006. Detecting outlying subspaces for high-dimensional data: The new task, algorithms, and performance. Knowl. Inf. Syst. 10, 3, 333--355. Google ScholarDigital Library
- Zhu, C., Kitagawa, H., and Faloutsos, C. 2005. Example-Based robust outlier detection in high dimensional datasets. In Proceedings of the International Conference on Data Mining (ICDM), 829--832. Google ScholarDigital Library
Index Terms
- Detecting outlying properties of exceptional objects
Recommendations
Mining Optimized Association Rules with Categorical and Numeric Attributes
Mining association rules on large data sets has received considerable attention in recent years. Association rules are useful for determining correlations between attributes of a relation and have applications in marketing, financial, and retail ...
Dataless Transitions Between Concise Representations of Frequent Patterns
For many data mining problems in order to solve them it is required to discover frequent patterns. Frequent itemsets are useful e.g. in the discovery of association and episode rules, sequential patterns and clusters. Nevertheless, the number of ...
Mining N-most interesting itemsets without support threshold by the COFI-tree
Data mining is the discovery of interesting and hidden patterns from a large amount of collected data. Applications can be found in many organisations with large databases, for many different purposes such as customer relationships, marketing, planning, ...
Comments