research-article

Bias in OLAP Queries: Detection, Explanation, and Removal

Authors:
Babak Salimi (University of Washington, Seattle, WA, USA),
Johannes Gehrke (Microsoft, Seattle, WA, USA),
Dan Suciu (University of Washington, Seattle, WA, USA)
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data, May 2018, Pages 1021–1035. https://doi.org/10.1145/3183713.3196914

Published: 27 May 2018

ABSTRACT

Online analytical processing (OLAP) is an essential element of decision-support systems. OLAP tools provide the insights and understanding needed for improved decision making. However, the answers to OLAP queries can be biased and lead to perplexing and incorrect insights. In this paper, we propose a system to detect, explain, and resolve bias in decision-support queries. We give a simple definition of a biased query, which performs a set of independence tests on the data to detect bias. We propose a novel technique that gives explanations for bias, thus assisting an analyst in understanding what is going on. Additionally, we develop an automated method for rewriting a biased query into an unbiased query, which shows what the analyst intended to examine. In a thorough evaluation on several real datasets, we show both the quality and the performance of our techniques, including the completely automatic discovery of the revolutionary insights from a famous 1973 discrimination case.
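
As a rough illustration of the kind of bias the abstract refers to, the sketch below compares a plain group-by average with an average adjusted for one extra attribute. It is not the method proposed in the paper: the data are hypothetical toy counts, the names `ROWS`, `naive_difference`, and `adjusted_difference` are invented for this example, and the adjustment is ordinary stratification by a single attribute that is assumed to be known in advance.

```python
from collections import defaultdict

# Hypothetical toy data: (group, stratum, number of rows, rows with outcome = 1).
ROWS = [
    (1, "a", 80, 64),
    (1, "b", 20, 6),
    (0, "a", 20, 18),
    (0, "b", 80, 32),
]


def naive_difference(rows):
    """avg(outcome | group = 1) - avg(outcome | group = 0), ignoring the stratum."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, _, count, pos in rows:
        totals[group] += count
        positives[group] += pos
    return positives[1] / totals[1] - positives[0] / totals[0]


def adjusted_difference(rows):
    """Per-stratum differences, weighted by each stratum's share of all rows."""
    cell = {(group, z): (count, pos) for group, z, count, pos in rows}
    strata = sorted({z for _, z, _, _ in rows})
    total = sum(count for _, _, count, _ in rows)
    result = 0.0
    for z in strata:
        n1, p1 = cell[(1, z)]
        n0, p0 = cell[(0, z)]
        weight = (n1 + n0) / total
        result += weight * (p1 / n1 - p0 / n0)
    return result


naive = naive_difference(ROWS)
adjusted = adjusted_difference(ROWS)
print(f"naive: {naive:+.2f}, adjusted: {adjusted:+.2f}")
```

On these toy counts the plain comparison favors group 1, while the comparison within each stratum favors group 0. This sign reversal is the Simpson's paradox pattern listed among the author tags.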


Index Terms

Bias in OLAP Queries: Detection, Explanation, and Removal
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Online analytical processing engines
  2. Information systems applications
    1. Decision support systems
      1. Data analytics

Published in
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN: 9781450347037
DOI: 10.1145/3183713
General Chairs: Gautam Das (University of Texas at Arlington, USA), Christopher Jermaine (Rice University, USA), Philip Bernstein (Microsoft Research, USA)
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags: algorithmic fairness, biased query, causal inference, OLAP, Simpson's paradox
Acceptance Rates
SIGMOD '18 Paper Acceptance Rate: 90 of 461 submissions, 20%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%