ABSTRACT
Online analytical processing (OLAP) is an essential element of decision-support systems. OLAP tools provide the insights and understanding needed for improved decision making. However, the answers to OLAP queries can be biased and lead to perplexing and incorrect insights. In this paper, we propose a system to detect, explain, and resolve bias in decision-support queries. We give a simple definition of a biased query, under which bias is detected by performing a set of independence tests on the data. We propose a novel technique that gives explanations for bias, thus helping an analyst understand what is going on. Additionally, we develop an automated method for rewriting a biased query into an unbiased query, which shows what the analyst intended to examine. In a thorough evaluation on several real datasets, we show both the quality and the performance of our techniques, including the completely automatic discovery of the revolutionary insights from a famous 1973 discrimination case.
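To make the idea of detecting and removing bias concrete, the following is a minimal sketch in Python, not the system described in the paper. It assumes a tiny synthetic table of admission counts (the numbers, the names `COUNTS`, `chi2_statistic`, `naive_rate`, and `adjusted_rate` are all hypothetical). It first checks whether the grouping attribute is independent of a covariate using a Pearson chi-squared statistic, then compares a plain group-by average with a covariate-adjusted average, which is one common way to rewrite such a query.

```python
from collections import Counter

# Synthetic toy counts (NOT real data): (group, dept) -> (applicants, admitted)
COUNTS = {
    ("M", "A"): (80, 48),
    ("F", "A"): (20, 12),
    ("M", "B"): (20, 4),
    ("F", "B"): (80, 16),
}

# Chi-squared critical value for 1 degree of freedom at the 0.05 level.
CRITICAL_VALUE_DF1 = 3.841


def chi2_statistic(counts):
    """Pearson chi-squared statistic for independence of group and dept."""
    row, col, total = Counter(), Counter(), 0
    for (g, d), (n, _) in counts.items():
        row[g] += n
        col[d] += n
        total += n
    stat = 0.0
    for g in row:
        for d in col:
            expected = row[g] * col[d] / total
            observed = counts.get((g, d), (0, 0))[0]
            stat += (observed - expected) ** 2 / expected
    return stat


def naive_rate(counts, group):
    """Admission rate from a plain GROUP BY on the group attribute."""
    n = sum(v[0] for (g, _), v in counts.items() if g == group)
    k = sum(v[1] for (g, _), v in counts.items() if g == group)
    return k / n


def adjusted_rate(counts, group):
    """Admission rate reweighted by the overall department distribution."""
    total = sum(n for n, _ in counts.values())
    dept_size = Counter()
    for (_, d), (n, _) in counts.items():
        dept_size[d] += n
    rate = 0.0
    for d, size in dept_size.items():
        n, k = counts[(group, d)]
        rate += (size / total) * (k / n)
    return rate


if __name__ == "__main__":
    stat = chi2_statistic(COUNTS)
    print("group and dept dependent:", stat > CRITICAL_VALUE_DF1)
    for g in ("M", "F"):
        print(g, "naive:", naive_rate(COUNTS, g), "adjusted:", adjusted_rate(COUNTS, g))
```

On this toy table the group and department are strongly dependent, so the plain group-by comparison (0.52 versus 0.28) is confounded by department. After adjusting for department, both groups have a rate of about 0.40, so the apparent gap disappears.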
Index Terms
- Bias in OLAP Queries: Detection, Explanation, and Removal