skip to main content
research-article

Detecting outlying properties of exceptional objects

Published:23 April 2009Publication History
Skip Abstract Section

Abstract

Assume you are given a data population characterized by a certain number of attributes. Assume, moreover, you are provided with the information that one of the individuals in this data population is abnormal, but no reason whatsoever is given to you as to why this particular individual is to be considered abnormal. In several cases, you will be indeed interested in discovering such reasons. This article is precisely concerned with this problem of discovering sets of attributes that account for the (a priori stated) abnormality of an individual within a given dataset. A criterion is presented to measure the abnormality of combinations of attribute values featured by the given abnormal individual with respect to the reference population. In this respect, each subset of attributes is intended to somehow represent a “property” of individuals. We distinguish between global and local properties. Global properties are subsets of attributes explaining the given abnormality with respect to the entire data population. With local ones, instead, two subsets of attributes are singled out, where the former one justifies the abnormality within the data subpopulation selected using the values taken by the exceptional individual on those attributes included in the latter one. The problem of individuating abnormal properties with associated explanations is formally stated and analyzed. Such a formal characterization is then exploited in order to devise efficient algorithms for detecting both global and local forms of most abnormal properties. The experimental evidence, which is accounted for in the article, shows that the algorithms are both able to mine meaningful information and to accomplish the computational task by examining a negligible fraction of the search space.

References

  1. Aggarwal, C. C. and Yu, P. 2001. Outlier detection for high dimensional data. In Proceedings of the International Conference on Managment of Data (SIGMOD), 37--46. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the International Conference on Very Large Databases (VLDB), 487--499. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Angiulli, F. and Pizzuti, C. 2002. Fast outlier detection in large high-dimensional datasets. In Proceedings of the International Conference on Principles of Data Mining and Knowledge Discovery (PKDD), 15--26. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Angiulli, F. and Pizzuti, C. 2005. Outlier mining in large high-dimensional datasets. IEEE Trans. Knowl. Data Eng. 17, 2, 203--215. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Arning, A., Aggarwal, C., and Raghavan, P. 1996. A linear method for deviation detection in large databases. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 164--169.Google ScholarGoogle Scholar
  6. Barnett, V. and Lewis, T. 1994. Outliers in Statistical Data. John Wiley & Sons.Google ScholarGoogle Scholar
  7. Breiman, L., Friedman, J., Olshen, R., and Stone, C. 1984. Classification and Regression Trees. Wadsworth, Belmont.Google ScholarGoogle Scholar
  8. Breunig, M. M., Kriegel, H., Ng, R., and Sander, J. 2000. Lof: Identifying density-based local outliers. In Proceedings of the International Conference on Managment of Data (SIGMOD), 93--104. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Chaudhuri, S. and Dayal, U. 1997. An overview of data warehousing and olap technology. SIGMOD Rec. 26, 1, 65--74. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Chen, Z. and Toda, S. 1995. The complexity of selecting maximal solutions. Inf. Comput. 1192, 2, 231--239. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Codd, E., Codd, S., and Salley, C. 1993. Providing OLAP (on-line analytical processing) to user-analysts: An it mandate. Tech. rep., Codd & Date, Inc.Google ScholarGoogle Scholar
  12. De Benedictis, G., Rose, G., Carrieri, G., Luca, M. D., Falcone, E., Passarino, G., Bonafè, M., Monti, D., Baggio, G., Bertolini, S., Mari, D., Mattace, R., and Franceschi, C. 1999. Mitochondrial DNA inherited variants are associates with successful aging and longevity in humans. FASEB J. 13, 12, 1532--1536.Google ScholarGoogle ScholarCross RefCross Ref
  13. Garasto, S., Berardelli, M., Rango, F. D., Mari, V., Feraco, E., and Benedictis, G. D. 2004. A study of the average effect of the 3'apob-vntr polymorphism on lipidemic parameters could explain why the short alleles (< 35 repeats) are rare in centenarians. BMC Medical Genetics 5, 3.Google ScholarGoogle ScholarCross RefCross Ref
  14. Garey, M. R. and Johnson, D. S. 1979. Computer and Intractability. W. H. Freeman and Company, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Gini, C. 1921. Measurement of inequality of incomes. The Economic J. 31, 124--126.Google ScholarGoogle ScholarCross RefCross Ref
  16. Griffiths, A. J. F., Miller, J. H., Suzuki, D. T., Lewontin, R. C., and Gelbart, W. M. 1996. An Introduction to Genetic Analysis. W. H. Freeman.Google ScholarGoogle Scholar
  17. Hodge, V. and Austin, J. 2004. A survey of outlier detection methodologies. Artif. Intell. Rev. 22, 2, 85--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Karp, R. 1972. Reducibility among combinatorial problems. In Complexity of Computer Computations. Plenum, New York, 85--103.Google ScholarGoogle Scholar
  19. Klösgen, W. 1996. Explora: A multipattern and multistrategy discovery assistant. In Advances in Knowledge Discovery and Data Mining (KDD), 249--271. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Knorr, E. and Ng, R. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Databases (VLDB), 392--403. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Knorr, E. and Ng, R. 1999. Finding intensional knowledge of distance-based outliers. In Proceedings of the International Conference on Very Large Databases(VLDB), 211--222. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Knorr, E., Ng, R., and Tucakov, V. 2000. Distance-based outlier: Algorithms and applications. VLDB J. 8, 3-4, 237--253. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Knuth, D. 1998. The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd ed. Addison-Wesley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Krentel, M. 1988. The complexity of optimization problems. J. Comput. Syst. Sci. 36, 3, 490--509. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Mehta, M., Agrawal, R., and Rissanen, J. 1996. Sliq: A fast scalable classifier for data mining. In International Conference on Extending Database Technology (EDBT), 18--32. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Newman, D. J., Hettich, S., Blake, C. L., and Merz, C. 1998. UCI repository of machine learning databases.Google ScholarGoogle Scholar
  27. Papadimitriou, C. 1994. Computational Complexity. Addison-Wesley, USA.Google ScholarGoogle Scholar
  28. Passarino, G., Montesanto, A., Dato, S., Giordano, S., Domma, F., Mari, V., Feraco, E., and Benedictis, G. D. 2006. Sex and age specificity of susceptibility genes modulating survival at old age. Human Heredity (Int. J. Hum. Medical Genetics) 62, 4, 213--220.Google ScholarGoogle ScholarCross RefCross Ref
  29. Ramaswamy, S., Rastogi, R., and Shim, K. 2000. Efficient algorithms for mining outliers from large datasets. In Proceedings of the International Conference on Managment of Data (SIGMOD), 427--438. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Rymon, R. 1992. Search through systematic set enumeration. In Proceedings of the International Conference on Principles of Knowledge and Reasoning (KR), 539--550.Google ScholarGoogle Scholar
  31. Sarawagi, S., Agrawal, R., and Megiddo, N. 1998. Discovery-Driven exploration of OLAP data cubes. In Proceedings of the International Conference on Extending Database Technology (EDBT), 168--182. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Selman, A. 1994. A taxonomy of complexity classes of functions. J. Comput. Syst. Sci. 48, 2, 357--381. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Shafer, J., Agrawal, R., and Mehta, M. 1996. Sprint: A scalable parallel classifier for data mining. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 544--555. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Suzuki, E. 2006. Data mining methods for discovering interesting exceptions from an unsupervised table. J. Universal Comput. Sci. 12, 6, 627--653.Google ScholarGoogle Scholar
  35. Tan, P. N., Steinbach, M., and Kumar, V. 2005. Introduction to Data Mining. Addison-Wesley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Wei, L., Qian, W., Zhou, A., Jin, W., and Yu, J. 2003. Hot: Hypergraph-Based outlier test for categorical data. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 399--410. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Zhang, J., Gao, Q., and Wang, H. 2006. A novel method for detecting outlying subspaces in high-dimensional databases using genetic algorithm. In Proceedings of the International Conference on Data Mining (ICDM), 731--740. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Zhang, J. and Wang, H. 2006. Detecting outlying subspaces for high-dimensional data: The new task, algorithms, and performance. Knowl. Inf. Syst. 10, 3, 333--355. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Zhu, C., Kitagawa, H., and Faloutsos, C. 2005. Example-Based robust outlier detection in high dimensional datasets. In Proceedings of the International Conference on Data Mining (ICDM), 829--832. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Detecting outlying properties of exceptional objects

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Database Systems
      ACM Transactions on Database Systems  Volume 34, Issue 1
      April 2009
      349 pages
      ISSN:0362-5915
      EISSN:1557-4644
      DOI:10.1145/1508857
      Issue’s Table of Contents

      Copyright © 2009 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 23 April 2009
      • Accepted: 1 September 2008
      • Revised: 1 May 2008
      • Received: 1 September 2007
      Published in tods Volume 34, Issue 1

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader