
2017 | Book

Transparent Data Mining for Big and Small Data


About this book

This book focuses on new and emerging data mining solutions that offer a greater level of transparency than existing ones. It covers transparent solutions with desirable properties (e.g., effective, fully automatic, scalable), presents experimental findings tailored to different domain experts, and introduces metrics for evaluating algorithmic transparency. The book also discusses the societal effects of black-box versus transparent approaches to data mining, as well as real-world use cases for these approaches.

As algorithms increasingly support different aspects of modern life, a greater level of transparency is sorely needed, not least because discrimination and biases must be avoided. With contributions from domain experts, this book provides an overview of an emerging area of data mining with profound societal consequences, and gives readers the technical background to contribute to the field or to put existing approaches to practical use.

Table of Contents

Frontmatter
Erratum
Tania Cerquitelli, Daniele Quercia, Frank Pasquale

Transparent Mining

Frontmatter
The Tyranny of Data? The Bright and Dark Sides of Data-Driven Decision-Making for Social Good
Abstract
The unprecedented availability of large-scale human behavioral data is profoundly changing the world we live in. Researchers, companies, governments, financial institutions, non-governmental organizations, and citizen groups are actively experimenting, innovating, and adapting algorithmic decision-making tools to understand global patterns of human behavior and provide decision support to tackle problems of societal importance. In this chapter, we focus our attention on social good decision-making algorithms, that is, algorithms that strongly influence decision-making and the optimization of public-good resources, such as public health, safety, access to finance, and fair employment. Through an analysis of specific use cases and approaches, we highlight both the positive opportunities that are created through data-driven algorithmic decision-making, and the potential negative consequences that practitioners should be aware of and address in order to truly realize the potential of this emergent field. We elaborate on the need for these algorithms to provide transparency and accountability, preserve privacy, and be tested and evaluated in context, by means of living lab approaches involving citizens. Finally, we turn to the requirements that would make it possible to leverage the predictive power of data-driven human behavior analysis while ensuring transparency, accountability, and civic participation.
Bruno Lepri, Jacopo Staiano, David Sangokoya, Emmanuel Letouzé, Nuria Oliver
Enabling Accountability of Algorithmic Media: Transparency as a Constructive and Critical Lens
Abstract
As the news media adopts opaque algorithmic components into the production of news, the question arises of how to maintain an accountable media system. One practical mechanism that can help expose the journalistic process, algorithmic or otherwise, is transparency. Algorithmic transparency can help to enable media accountability, but it is in its infancy and must be studied to understand how it can be employed in a productive and meaningful way in light of concerns over user experience, costs, manipulation, and privacy or legal issues. This chapter explores the application of an algorithmic transparency model that enumerates a range of possible information to disclose about algorithms in use in the news media. It applies this model both as a constructive tool, for guiding transparency around a news bot, and as a critical tool, for questioning and evaluating the disclosures around a computational news product and a journalistic investigation involving statistical inferences. These case studies demonstrate the utility of the transparency model but also expose areas for future research.
Nicholas Diakopoulos
The Princeton Web Transparency and Accountability Project
Abstract
When you browse the web, hidden “third parties” collect a large amount of data about your behavior. This data feeds algorithms to target ads to you, tailor your news recommendations, and sometimes vary prices of online products. The network of trackers comprises hundreds of entities, but consumers have little awareness of its pervasiveness and sophistication. This chapter discusses the findings and experiences of the Princeton Web Transparency Project (https://webtap.princeton.edu/), which continually monitors the web to uncover what user data companies collect, how they collect it, and what they do with it. We do this via a largely automated monthly “census” of the top 1 million websites, in effect “tracking the trackers”. Our tools and findings have proven useful to regulators and investigative journalists, and have led to greater public awareness, the cessation of some privacy-infringing practices, and the creation of new consumer privacy tools. But the work raises many new questions. For example, should we hold websites accountable for the privacy breaches caused by third parties? The chapter concludes with a discussion of such tricky issues and makes recommendations for public policy and regulation of privacy.
Arvind Narayanan, Dillon Reisman
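The census automates, at scale, a simple question: which third-party hosts does a page pull resources from? The project's real pipeline is built on OpenWPM's instrumented browsers; the toy sketch below (assuming the `requests` and `beautifulsoup4` packages are installed) illustrates only the core idea with a static fetch, which misses trackers injected by scripts at runtime.

```python
# Toy sketch of "tracking the trackers": list third-party hosts whose
# resources a page embeds. NOT the project's OpenWPM pipeline.
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

def third_party_hosts(page_url):
    first_party = urlparse(page_url).hostname
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    hosts = set()
    for tag in soup.find_all(["script", "img", "iframe"], src=True):
        host = urlparse(tag["src"]).hostname
        # Relative URLs (hostname None) are first-party by definition.
        if host and host != first_party:
            hosts.add(host)
    return sorted(hosts)

print(third_party_hosts("https://example.com/"))
```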

Algorithmic Solutions

Frontmatter
Algorithmic Transparency via Quantitative Input Influence
Abstract
Algorithmic systems that employ machine learning are often opaque—it is difficult to explain why a certain decision was made. We present a formal foundation to improve the transparency of such decision-making systems. Specifically, we introduce a family of Quantitative Input Influence (QII) measures that capture the degree of input influence on system outputs. These measures provide a foundation for the design of transparency reports that accompany system decisions (e.g., explaining a specific credit decision) and for testing tools useful for internal and external oversight (e.g., to detect algorithmic discrimination). Distinctively, our causal QII measures carefully account for correlated inputs while measuring influence. They support a general class of transparency queries and can, in particular, explain decisions about individuals and groups. Finally, since single inputs may not always have high influence, the QII measures also quantify the joint influence of a set of inputs (e.g., age and income) on outcomes (e.g., loan decisions) and the average marginal influence of individual inputs within such a set (e.g., income) using principled aggregation measures, such as the Shapley value, previously applied to measure influence in voting.
Anupam Datta, Shayak Sen, Yair Zick
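The QII measures are defined formally in the chapter; as a rough, hypothetical illustration of the Shapley-value aggregation the abstract mentions, the sketch below estimates each feature's average marginal influence on one decision by sampling random feature orderings and intervening with values drawn from a background sample. The function name, intervention scheme, and parameters are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a Shapley-value-based input influence
# estimate in the spirit of QII. Not the authors' implementation.
import numpy as np

def shapley_influence(model, x, X_background, n_samples=200, seed=0):
    """Estimate each feature's average marginal influence on model(x).

    Feature i's influence is its Shapley value: the average change in
    the model output when i is revealed, over random feature orderings.
    Unrevealed features take values from a random background row (a
    simple randomized intervention that respects correlated inputs).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_samples):
        z = X_background[rng.integers(len(X_background))].copy()
        prev = model(z)                  # no features of x revealed yet
        for i in rng.permutation(d):
            z[i] = x[i]                  # reveal feature i
            cur = model(z)
            phi[i] += cur - prev
            prev = cur
    return phi / n_samples
```

With a scikit-learn classifier, `model` could be `lambda z: clf.predict_proba(z.reshape(1, -1))[0, 1]`, so that `phi[i]` approximates how strongly feature i pushed this particular decision.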
Learning Interpretable Classification Rules with Boolean Compressed Sensing
Abstract
An important problem in the context of supervised machine learning is designing systems which are interpretable by humans. In domains such as law, medicine, and finance that deal with human lives, delegating the decision to a black-box machine-learning model carries significant operational risk, and often legal implications, thus requiring interpretable classifiers. Building on ideas from Boolean compressed sensing, we propose a rule-based classifier which explicitly balances accuracy against interpretability in a principled optimization formulation. We represent the problem of learning conjunctive clauses or disjunctive clauses as an adaptation of a classical problem from statistics, Boolean group testing, and apply a novel linear programming (LP) relaxation to find solutions. We derive theoretical results for recovering sparse rules which parallel the conditions for exact recovery of sparse signals in the compressed sensing literature. This is an exciting development in interpretable learning, where most prior work has focused on heuristic solutions. We also consider a more general class of rule-based classifiers, checklists and scorecards, learned using ideas from threshold group testing. We show competitive classification accuracy using the proposed approach on real-world data sets.
Dmitry M. Malioutov, Kush R. Varshney, Amin Emad, Sanjeeb Dash
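As a simplified, hypothetical illustration of the LP relaxation idea, the sketch below learns a sparse disjunctive (OR-of-features) rule: Boolean indicator variables are relaxed to the unit interval, slack variables absorb classification errors, and the fractional solution is rounded. The formulation and threshold are assumptions for illustration and carry none of the chapter's recovery guarantees.

```python
# Hypothetical LP relaxation for learning a sparse OR rule over
# binary features. Illustrative only, not the authors' formulation.
import numpy as np
from scipy.optimize import linprog

def learn_or_rule(A, y, C=1.0, threshold=0.5):
    """A: (n, d) binary feature matrix; y: (n,) binary labels.

    Solve  min  sum(w) + C * sum(xi)
           s.t. A_i . w >= 1 - xi_i  for positives (rule should fire)
                A_i . w <= xi_i      for negatives (rule should not)
                0 <= w <= 1,  xi >= 0
    then round w to select the rule's features.
    """
    n, d = A.shape
    c = np.concatenate([np.ones(d), C * np.ones(n)])
    rows, rhs = [], []
    for i in range(n):
        xi_row = np.zeros(n)
        xi_row[i] = -1.0
        if y[i] == 1:   # -A_i.w - xi_i <= -1
            rows.append(np.concatenate([-A[i], xi_row]))
            rhs.append(-1.0)
        else:           #  A_i.w - xi_i <= 0
            rows.append(np.concatenate([A[i], xi_row]))
            rhs.append(0.0)
    bounds = [(0, 1)] * d + [(0, None)] * n
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    w = res.x[:d]
    return np.flatnonzero(w > threshold)  # indices of selected features
```

Minimizing `sum(w)` favors rules with few features, which is the interpretability term traded off against the error term `C * sum(xi)`.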
Visualizations of Deep Neural Networks in Computer Vision: A Survey
Abstract
In recent years, Deep Neural Networks (DNNs) have been shown to outperform the state of the art in multiple areas, such as visual object recognition, genomics, and speech recognition. Due to their distributed encoding of information, DNNs are hard to understand and interpret. To this end, visualizations have been used to understand how deep architectures work in general, what the different layers of a network encode, what the limitations of a trained model are, and how to interactively collect user feedback. In this chapter, we provide a survey of visualizations of DNNs in the field of computer vision. We define a classification scheme describing visualization goals and methods as well as the application areas. This survey gives an overview of what can be learned from visualizing DNNs and which visualization methods were used to gain which insights. We found that most papers use Pixel Displays to show neuron activations. However, recently more sophisticated visualizations, such as interactive node-link diagrams, have been proposed. The presented overview can serve as a guideline when applying visualizations while designing DNNs.
Christin Seifert, Aisha Aamir, Aparna Balagopalan, Dhruv Jain, Abhinav Sharma, Sebastian Grottel, Stefan Gumhold
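The Pixel Displays the survey identifies as the most common technique are straightforward to reproduce: each feature map of a convolutional layer is rendered as a small grayscale image. A minimal sketch, assuming PyTorch/torchvision and using an untrained VGG16 on random input as stand-ins for a pretrained model and a real photograph:

```python
# Minimal "Pixel Display": render one conv layer's feature maps as a
# grid of grayscale images. Untrained VGG16 + random input are
# placeholders; in practice load pretrained weights and a real image.
import torch
import torchvision.models as models
import matplotlib.pyplot as plt

model = models.vgg16(weights=None).eval()
activations = {}

def hook(module, inputs, output):
    activations["maps"] = output.detach()

model.features[3].register_forward_hook(hook)  # ReLU after 2nd conv

x = torch.randn(1, 3, 224, 224)                # stand-in input image
with torch.no_grad():
    model(x)

maps = activations["maps"][0]                  # shape: (64, 224, 224)
fig, axes = plt.subplots(4, 8, figsize=(12, 6))
for ax, fmap in zip(axes.flat, maps):          # first 32 channels
    ax.imshow(fmap.numpy(), cmap="gray")       # one pixel display each
    ax.axis("off")
plt.show()
```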

Regulatory Solutions

Frontmatter
Beyond the EULA: Improving Consent for Data Mining
Abstract
Companies and academic researchers may collect, process, and distribute large quantities of personal data without the explicit knowledge or consent of the individuals to whom the data pertains. Existing forms of consent often fail to be appropriately readable, and ethical oversight of data mining may not be sufficient. This raises the question of whether existing consent instruments are sufficient, logistically feasible, or even necessary for data mining. In this chapter, we review the data collection and mining landscape, including commercial and academic activities and the relevant data protection concerns, to determine the types of consent instruments used. Through three case studies, we apply the new paradigm of human-data interaction to examine whether these existing approaches are appropriate. We then introduce an approach to consent that has been empirically demonstrated to improve on the state of the art and deliver meaningful consent. Finally, we propose some best practices for data collectors to ensure their data mining activities do not violate the expectations of the people to whom the data relate.
Luke Hutton, Tristan Henderson
Regulating Algorithms’ Regulation? First Ethico-Legal Principles, Problems, and Opportunities of Algorithms
Abstract
Algorithms are regularly used for mining data, offering unexplored patterns and deep non-causal analyses in what we term the “classifying society”. In the classifying society, individuals are no longer targeted as individuals but are instead selectively addressed according to how clusters of data that they (or one or more of their devices) share with a given model fit into the analytical model itself. In this way, the classifying society might bypass data protection as we know it. Thus, we argue for a change of paradigm: to consider and regulate anonymities, not only identities, in data protection. This requires a combined regulatory approach that blends together (1) the reinterpretation of existing legal rules in light of the central role of privacy in the classifying society; (2) the promotion of disruptive technologies for disruptive new business models, enabling more market control by data subjects over their own data; and, eventually, (3) new rules aiming, among other things, to give data generated by individuals some form of property protection similar to that enjoyed by data and models generated by businesses (e.g., trade secrets). The blend would be completed by (4) the timely insertion of ethical principles into the very generation of the algorithms sustaining the classifying society.
Giovanni Comandè
AlgorithmWatch: What Role Can a Watchdog Organization Play in Ensuring Algorithmic Accountability?
Abstract
In early 2015, Nicholas Diakopoulos’s paper “Algorithmic Accountability Reporting: On the Investigation of Black Boxes” sparked a debate in a small but international community of journalists, focusing on the question of how journalists can contribute to the growing field of investigating automated decision-making (ADM) systems and holding them accountable to democratic control. This prompted a group of four people (a journalist, a data journalist, a data scientist, and a philosopher) to think about what means were needed to increase public attention to this issue in Europe. It led to the creation of AlgorithmWatch, a watchdog and advocacy initiative based in Berlin. Its challenges are manifold: to develop criteria for deciding which ADM processes to watch, to develop criteria for the evaluation itself, to devise methods for carrying it out, to find sources of funding, and more. This chapter provides first thoughts on how AlgorithmWatch will tackle these challenges, detailing its “ADM manifesto” and mission statement, and argues that an ecosystem of initiatives from different stakeholder groups is developing in this rather new field of research and civic engagement.
Matthias Spielkamp
Metadata
Title
Transparent Data Mining for Big and Small Data
Editors
Tania Cerquitelli
Daniele Quercia
Frank Pasquale
Copyright Year
2017
Electronic ISBN
978-3-319-54024-5
Print ISBN
978-3-319-54023-8
DOI
https://doi.org/10.1007/978-3-319-54024-5
