DOI: 10.1145/3183713.3196914
research-article

Bias in OLAP Queries: Detection, Explanation, and Removal

Published: 27 May 2018

ABSTRACT

Online analytical processing (OLAP) is an essential element of decision-support systems. OLAP tools provide the insights and understanding needed for improved decision making. However, the answers to OLAP queries can be biased and lead to perplexing and incorrect insights. In this paper, we propose a system to detect, explain, and resolve bias in decision-support queries. We give a simple definition of a biased query and detect bias by performing a set of independence tests on the data. We propose a novel technique that gives explanations for bias, thus helping an analyst understand what is going on. Additionally, we develop an automated method for rewriting a biased query into an unbiased query that shows what the analyst intended to examine. In a thorough evaluation on several real datasets, we show both the quality and the performance of our techniques, including the completely automatic discovery of the revolutionary insights from a famous 1973 discrimination case.
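
As a rough illustration of the kind of bias described above, the following is a minimal Python sketch and not the paper's system or algorithm. The rows are hypothetical toy data, and the function names (`make_rows`, `naive_avg`, `chi2_statistic`, `adjusted_avg`) are invented for this example. It assumes a single known covariate (`dept`) and that every group appears in every department. The sketch computes a plain group-by average, a Pearson chi-squared statistic for independence between the grouping attribute and the covariate, and an average adjusted by reweighting per-department rates with the overall department distribution.

```python
from collections import Counter


def make_rows():
    """Build hypothetical toy rows (group, dept, admitted); not data from the paper."""
    spec = {
        # (group, dept): (number of applicants, number admitted)
        ("A", "easy"): (90, 72),
        ("A", "hard"): (10, 2),
        ("B", "easy"): (10, 9),
        ("B", "hard"): (90, 27),
    }
    rows = []
    for (group, dept), (n, admitted) in spec.items():
        rows += [(group, dept, 1)] * admitted
        rows += [(group, dept, 0)] * (n - admitted)
    return rows


def naive_avg(rows):
    """Roughly: SELECT group, AVG(admitted) FROM rows GROUP BY group."""
    total, hits = Counter(), Counter()
    for group, _, y in rows:
        total[group] += 1
        hits[group] += y
    return {g: hits[g] / total[g] for g in total}


def chi2_statistic(rows):
    """Pearson chi-squared statistic for independence of group and dept."""
    cell = Counter((g, d) for g, d, _ in rows)
    row_tot = Counter(g for g, _, _ in rows)
    col_tot = Counter(d for _, d, _ in rows)
    n = len(rows)
    stat = 0.0
    for g in row_tot:
        for d in col_tot:
            expected = row_tot[g] * col_tot[d] / n
            stat += (cell[(g, d)] - expected) ** 2 / expected
    return stat


def adjusted_avg(rows):
    """Per-dept rates of each group, reweighted by the overall dept distribution.

    Assumes every group has at least one row in every dept.
    """
    dept_tot = Counter(d for _, d, _ in rows)
    n = len(rows)
    total, hits = Counter(), Counter()
    for g, d, y in rows:
        total[(g, d)] += 1
        hits[(g, d)] += y
    groups = {g for g, _, _ in rows}
    return {
        g: sum(dept_tot[d] / n * hits[(g, d)] / total[(g, d)] for d in dept_tot)
        for g in groups
    }


if __name__ == "__main__":
    rows = make_rows()
    print("naive:", naive_avg(rows))
    print("chi-squared(group, dept):", chi2_statistic(rows))
    print("adjusted:", adjusted_avg(rows))
```

In this toy data the plain group-by average favors group A, while the adjusted average favors group B, because group membership and department are strongly dependent. A large chi-squared statistic is only a signal here; a real test would compare it against a reference distribution or use a permutation test.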

Published in

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN: 9781450347037
DOI: 10.1145/3183713

Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        SIGMOD '18 Paper Acceptance Rate: 90 of 461 submissions, 20%
        Overall Acceptance Rate: 785 of 4,003 submissions, 20%
