1 Introduction
2 Background and related work
2.1 Definitions of software fairness and related problems
| Name | Description |
| --- | --- |
| Fairness definitions based on probability of predictions, a.k.a. group fairness definitions | |
| Group fairness, a.k.a. statistical parity | According to this definition, a binary machine learner is fair if individuals of a protected group, defined based on a protected attribute, are associated with the positive class with the same probability as individuals of non-protected groups. |
| Predictive parity, a.k.a. outcome test | According to this definition, a binary machine learner is fair if both protected and non-protected groups have the same probability of being assigned their real class, i.e., the probability of individuals being classified as true positives/negatives is the same. |
| False positive error rate balance, a.k.a. predictive equality | According to this definition, a binary machine learner is fair if the probability of individuals being associated with the positive class even though their real class is the opposite is the same for both protected and non-protected groups, i.e., the probability of being a false positive is the same. |
| False negative error rate balance, a.k.a. equal opportunity | According to this definition, a binary machine learner is fair if the probability of individuals being associated with the negative class even though their real class is the opposite is the same for both protected and non-protected groups, i.e., the probability of being a false negative is the same. |
| Equalized odds, a.k.a. disparate mistreatment | This definition combines the previous two (predictive equality and equal opportunity), considering a binary classifier fair if both the protected and non-protected groups have the same true positive and false positive rates. |
| Fairness definitions based on similarity of instances, a.k.a. individual fairness definitions | |
| Causal discrimination | According to this definition, a binary machine learner is fair if it produces the same classification for any two subjects whose attributes differ only in the protected ones. For instance, male and female applicants who otherwise have the same attributes will either both be hired or both be rejected. |
| Fairness through unawareness | According to this definition, a binary machine learner is fair if no sensitive attributes are explicitly used in the decision-making process. |
| Fairness through awareness | According to this definition, a binary machine learner is fair if similar individuals obtain similar classification results (this is the most elaborate and generic similarity-based definition of fairness). |
| Fairness definitions based on the concept of causal reasoning | |
| No unresolved discrimination | According to this definition, a binary machine learner is fair if the associated causal dependency graph contains no path between sensitive attributes and the learner’s results. |
| Counterfactual fairness | According to this definition, a binary machine learner is fair if the associated causal dependency graph contains no indirect path between sensitive attributes and the learner’s results. |
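To make the group fairness notions above more concrete, the sketch below computes two of them from binary predictions. This is a minimal illustration of ours, not code from the surveyed literature; the function names and toy data are invented for the example, and values close to zero indicate fairer behavior under the corresponding definition.

```python
import numpy as np

def statistical_parity_difference(y_pred, protected):
    """Group fairness: gap in positive-prediction rates between the
    protected and non-protected groups (0 means statistical parity)."""
    y_pred, protected = np.asarray(y_pred), np.asarray(protected)
    return y_pred[protected].mean() - y_pred[~protected].mean()

def equal_opportunity_difference(y_true, y_pred, protected):
    """False negative error rate balance: gap in true positive rates
    between groups (0 means equal opportunity)."""
    y_true, y_pred, protected = map(np.asarray, (y_true, y_pred, protected))
    def tpr(mask):
        # Share of actual positives predicted positive within the group.
        return y_pred[mask & (y_true == 1)].mean()
    return tpr(protected) - tpr(~protected)

# Toy data: 8 applicants, the first 4 belong to the protected group.
y_true = np.array([1, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 1, 0])
protected = np.array([True, True, True, True, False, False, False, False])

print(statistical_parity_difference(y_pred, protected))        # -0.25
print(equal_opportunity_difference(y_true, y_pred, protected)) # about -0.33
```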
2.2 State of the art
| Secondary study | Summary | Similarities and differences |
| --- | --- | --- |
| Mehrabi et al., A survey on bias and fairness in machine learning (Mehrabi et al. 2021) | They analyzed the existing literature and defined two taxonomies of (1) the most common fairness and bias definitions, and (2) the state-of-the-art strategies that researchers proposed to mitigate unfair outcomes in different machine learning application domains. | – Similar objective: investigate and generalize knowledge on the treatment of fairness in terms of definitions, metrics, strategies, and causes of bias; – Different method and target context: we perform a large-scale survey study involving practitioners working on ML-Intensive systems. |
| Pessach and Shmueli, A Review on Fairness in Machine Learning (Pessach and Shmueli 2022) | They performed a systematic literature review focusing on classification tasks and discussed trade-offs between fairness and model accuracy, categorizing fairness-enhancing mechanisms into pre-processing, in-processing, and post-processing approaches, depending on when they should be applied. | – Similar scope: investigate and systematize knowledge about fairness treatment strategies, metrics, and trade-offs; – Different method and goal: we are interested in understanding how fairness is perceived and treated by practitioners with respect to six other non-functional requirements, by performing a large-scale survey study. |
| Pagano et al., Bias and Unfairness in Machine Learning Models (Pagano et al. 2023) | They conducted a systematic literature review to collect the most used datasets, metrics, techniques, and tools to detect and mitigate bias. | – We rely on the findings of Pagano et al. to understand the available tools, and we assess whether they are currently leveraged in practice; – Different method and target context: we perform a survey study involving practitioners. |
| Le Quy et al., A survey on datasets for fairness-aware machine learning (Le Quy et al. 2022) | They analyzed datasets provided in the literature to investigate the relationships between protected attributes and class attributes via Bayesian networks. | – Different method and granularity: we conduct a large-scale survey study and collect practitioners’ insights on the common causes of bias in working projects. |
| Bird et al., Fairness-Aware Machine Learning [...] (Bird et al. 2019) | They drew an overview of the lessons learned in the literature and provided the community with a research roadmap toward a fairness-first approach, a new way of development that manages fairness from the first stages of a typical machine learning development process. | – Different method and goal: we perform a survey study to collect practitioners’ opinions that we believe will be useful to formulate theoretical fairness-oriented development frameworks. |
| Xivuri and Twinomurinzi, A Systematic Review of Fairness in AI Algorithms (Xivuri and Twinomurinzi 2021) | They analyzed 47 papers to understand how the research community dealt with fairness in terms of methods, domains, practices, and locations. They highlighted how fairness research currently focuses more on technical and social/human aspects than on economic ones, and how most studies have been conducted in Europe and North America. | – Similar broadness: we consider different geographical locations, application domains, levels of experience and backgrounds, roles, and job responsibilities; – Different method and target context: we perform a survey study involving practitioners. |
| Shrestha and Das, Exploring gender biases in ML and AI academic research [...] (Shrestha and Das 2022) | They analyzed 120 papers in the context of a systematic literature review on gender bias in machine learning and artificial intelligence, warning that gender-related biases are less explored and require more attention from the research community. | – Different method and scope: we perform a large-scale survey study to understand in which phases of a typical machine learning pipeline practitioners adopt specific strategies to deal with all kinds of bias and ethical issues. |
| Catania et al., Fairness & friends in the data science era (Catania et al. 2022) | They analyzed the existing literature to assess how researchers investigated unethical behaviour of data-driven automated decision systems in the context of complex data science pipelines. | – Different method and goal: we conduct a survey with practitioners to understand how fairness is treated in practice, throughout the whole development process. |
| Madaio et al., Assessing the Fairness of AI Systems [...] (Madaio et al. 2022) | They conducted semi-structured interviews and structured workshops with 33 AI practitioners to understand their perspectives on processes, challenges, and needs in the machine learning system development process. | – Similar target context: we are interested in grasping the practitioners’ perspectives on the state of the practice; – Different method and generalizability of results: we perform a large-scale survey study involving respondents with variegated backgrounds. |
| Fabris et al., Algorithmic Fairness Datasets: the Story so Far (Fabris et al. 2022) | By surveying the literature, they developed a structured ontology of more than 250 datasets that have been employed for different fairness-critical tasks in over 30 different application domains. | – We leverage Fabris et al.’s findings to design our research materials, i.e., the survey questions, considering the fairness-critical application domains listed in the ontology and providing participants with the possibility to list additional contexts they work in; – Different method and goal: we conduct a survey to understand the resources and strategies involved in a fairness-critical development scenario. |
| Saha et al., Measuring non-expert comprehension of ML fairness metrics (Saha et al. 2020) | They proposed a metric to represent non-experts’ comprehension of specific statistical fairness definitions, exploring the relationship between comprehension, sentiment, demographics, and the definitions themselves. They validated the metric via an online survey with non-expert participants, testing its reliability over three statistical fairness definitions, i.e., demographic parity, equal opportunity, and equalized odds. | – Similar method: we administered an online survey to involve industrial practitioners; – Different goal and target context: we surveyed expert practitioners with experience in fairness-critical machine learning projects to collect information on multiple aspects, i.e., the clarity, usefulness, and applicability of different fairness notions, and how relevant fairness is with respect to other non-functional attributes in a typical industrial context. |
2.3 Considerations on the State of the Art
3 Empirical study design
3.1 Research questions
3.2 Research method and study variables
- Variables involved in RQ\(_1\). Our first research question aimed at understanding how the notions of fairness are perceived by practitioners. Therefore, the independent variable was the notion of fairness, a categorical variable assuming three values, i.e., the main kinds of definitions provided by previous literature (Verma and Rubin 2018). We identified three dependent variables for this research question, i.e., the degree of clarity, usefulness, and feasibility of the definition.
- Variables involved in RQ\(_2\). Since RQ\(_2\) pushed us to understand the relevance of fairness compared to other quality characteristics, we identified such quality aspects as the independent variable and the compared relevance of fairness as the dependent variable. We recognized that the relevance of fairness depends on the application domain in which it is considered; therefore, we treated the domain as an additional independent variable, to gain a detailed understanding of how fairness is deemed important in different contexts.
- Variables involved in RQ\(_3\). Our third research question targeted understanding in which phases of an ML pipeline it is relevant to adopt strategies and employ tools to guarantee proper levels of fairness. The investigation driven by this question was three-fold. First, we were interested in comprehending in which phases of the pipeline it is relevant to take action; therefore, the first independent variable we identified was the pipeline stage, and the dependent variable consisted of the extent to which it is relevant to take fairness into account in that stage. Second, we wanted to assess which strategies and tools are currently employed in the state of the practice; hence, we considered the available tools as the independent variable and the extent to which they are used in practice as the dependent variable. Lastly, we were concerned with understanding in which specific phases of the ML pipeline the tools are actually useful. In this third part of the investigation, the tools again acted as the independent variable, and the phases in which it is useful to employ the tools represented the dependent variable.
- Variables involved in RQ\(_4\). As RQ\(_4\) drove us to understand the desirable composition of a team working on ML-Intensive, fairness-critical solutions, it was worth investigating (1) which professional roles should be involved in each phase of the development, and (2) which collaborations among roles are crucial to guarantee proper levels of fairness. Therefore, in the first part of the investigation, we identified the professional role and the pipeline phase as the independent variables, and the importance of leveraging the professional role in the phase as the dependent variable. For the second part, we considered the pairs of professional roles as the independent variable, and the relevance of the collaboration between roles as the dependent variable.
- Confounding factors. For all research questions, the confounding factors were the practitioners’ education, company size, and level of experience with fairness.
- Treatment. Our study included one treatment, namely the administration of a survey to gather information on the dependent variables and confounding factors.
| Sub-section | Topic | Related RQs |
| --- | --- | --- |
| Section - Survey Introduction | | |
| Welcome Page | Explain the main topics of the study, introduce the research team, and gather consent to treat and process personal data. | – |
| Section - Participant’s Background | | |
| Personal Information | Gather personal and demographic information for data analysis. | RQ\(_1\), RQ\(_2\), RQ\(_3\), RQ\(_4\) |
| Employment Information | Collect work experience data for data analysis. | RQ\(_1\), RQ\(_2\), RQ\(_3\), RQ\(_4\) |
| Section - Personal experience with fairness | | |
| Formal Definitions Used | Survey practitioners’ opinions on and expertise with different notions of fairness. | RQ\(_1\) |
| Application domains and tasks | Collect information about individuals’ expertise in fairness treatment across various application domains and machine learning tasks. | RQ\(_1\), RQ\(_2\), RQ\(_3\), RQ\(_4\) |
| Section - Fairness in practice | | |
| Quality trade-offs between fairness and other ML non-functional properties | Evaluate the relevance of fairness in comparison to other non-functional aspects of machine learning quality across different application domains. | RQ\(_2\) |
| Dealing with fairness among different steps of a typical machine learning pipeline | Collect information about the importance of addressing machine learning fairness across different phases of a well-defined machine learning pipeline. | RQ\(_3\) |
| Development tools to deal with fairness among different pipeline steps | Gather insights on well-known development tools and resources used to manage fairness in different steps of a machine learning pipeline. | RQ\(_3\) |
| Relevant roles to deal with fairness among different pipeline steps | Gather information on the relevance of fairness and the need for collaboration among various stakeholders in different stages of a machine learning pipeline. | RQ\(_4\) |
| Section - Conclusions | | |
| Greetings and invitation to future research | – | – |
3.3 Survey design
| Quality aspect | Definition |
| --- | --- |
| Usability | The ability of a machine learning module to provide appropriate conditions for system users to perform the tasks for which it was designed. |
| Reliability | The probability with which a machine learning module executes its tasks over time, for a specific number of users, without failure. |
| Performance | The ability of a machine learning module to perform actions within well-defined speed and time constraints. |
| Accuracy | The proportion of data points correctly predicted by a machine learning module, compared to the total number of predictions made. |
| Security & Safety | The ability of a machine learning module to detect the risk of malicious attackers potentially damaging the system, and to prevent failures or vulnerabilities that make the entire system potentially dangerous. |
| Maintainability & Retraining | The degree to which a machine learning module can be modified to adapt to changes in the usage environment, or can be re-trained with a new training set resulting from such changes. |
- We invited participants to share information that may be covered by strict business restrictions, so we remarked that they could abandon the survey at any time, nullifying their final submission;
- We guaranteed the practitioners’ privacy, using the collected information only for the explicit goals reported in the opening section of the survey. In any case, all direct references to people were anonymized before the analysis of the results;
- Participants were not obliged to share any of their sensitive business information. For this reason, every question provided a “Prefer not to say” option;
- Questions asking for potentially sensitive information, e.g., gender or age, were all made optional.
3.4 Survey validation
3.5 Survey administration and responses
- Software Engineers;
- Data Scientists;
- Data & Feature Engineers;
- Junior or Senior Machine Learning Engineers;
- Junior or Senior Managers of machine learning systems;
- Fluent knowledge of English;
- Work sector: Information Technology, Science, Technology, Engineering, or Mathematics;
- Experience and motivation at work: excited and highly motivated;
- Study level: diploma or higher;
- Required skills in computer programming;
- Required knowledge of data science-related programming languages.
4 Data collection and analysis
4.1 Data quality prescreening
- Six practitioners did not provide enough personal background details to prove their experience in machine learning development;
- 32 practitioners explicitly declared that they had never worked on fairness-critical machine learning projects;
- 12 participants failed one or both attention checks and provided careless, nonsensical answers to open-ended questions.
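As a rough illustration of how such a prescreening could be scripted, the sketch below filters a table of submissions. The column names are hypothetical, and in our case the checks on background details and open-ended answers were performed manually.

```python
import pandas as pd

# Hypothetical column names; the actual survey fields differ.
responses = pd.read_csv("survey_responses.csv")
before = len(responses)

# Keep only respondents with a verifiable ML development background.
responses = responses[responses["has_ml_background_details"]]
# Drop respondents who never worked on fairness-critical ML projects.
responses = responses[responses["fairness_project_experience"] != "never"]
# Drop respondents who failed either attention check.
responses = responses[responses["attention_check_1_passed"]
                      & responses["attention_check_2_passed"]]

print(f"Kept {len(responses)} of {before} submissions")
```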
4.2 Analysis strategies
5 Analysis of the results
5.1 Participants’ background
5.1.1 Experience in dealing with ML fairness
5.2 RQ1 - Definition of fairness in industrial contexts
5.2.1 Further insights complementing RQ1
5.3 RQ2 - On the relevance of software fairness in the development process
5.3.1 Further insights complementing RQ2
5.4 RQ3 - On the engineering of software fairness within an MLOps pipeline
| Tool | No Idea | Data Engineering | Model Engineering | Performance and Quality Monitoring | Analysis and Experimentation |
| --- | --- | --- | --- | --- | --- |
| IBM’s AI Fairness 360 | 46 | 23 | 29 | 42 | 24 |
| Google’s What-If Tool | 26 | 26 | 36 | 49 | 34 |
| Microsoft Fairlearn | 31 | 37 | 32 | 43 | 38 |
| PwC Responsible AI Toolkit | 46 | 16 | 32 | 36 | 24 |
| TensorFlow’s Fairness Indicators | 29 | 29 | 44 | 47 | 32 |
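As an example of what these tools offer in the performance and quality monitoring phase (the phase respondents rated as most relevant for most tools in the table above), the following sketch uses Microsoft Fairlearn’s MetricFrame to break metrics down by a sensitive feature. The toy data is ours and purely illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, selection_rate

# Toy monitoring data: true labels, model predictions, sensitive feature.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=group,
)
print(mf.by_group)      # accuracy and selection rate per group
print(mf.difference())  # largest between-group gap for each metric
```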
5.4.1 Further insights complementing RQ3
5.5 RQ4 - On the team composition for the development of fair ML-intensive systems
5.5.1 Machine learning pipeline
5.5.2 Further insights complementing RQ4
6 Discussion & implications
6.1 On the tight connection between fairness and its application domain
6.2 Toward the standardization of fairness in a machine learning pipeline
6.3 Defining and managing ML fairness: theory versus practice
6.4 To what extent can fairness be compared to other quality aspects?
6.5 Towards process-driven approaches for ML Fairness
7 Threats to validity
7.1 Conclusion validity
- Reliability of measures. The design quality of our survey was particularly critical to properly answer our research questions. We defined structured multiple-choice questions, whose possible answers were either values in a range or multiple selections. Considering the qualitative nature of the study, we also included open questions, one for each relevant closed-ended question, through which participants could provide further insights on the matter.
- Reliability of treatment. Still considering that a low-quality survey design could produce unreliable results, we directly included specific filtering instruments, i.e., attention-check questions, which allowed us to conduct a deep quality pre-screening process before formalizing the results, and alternative flows, which allowed practitioners to respond to domain-specific questions only in relation to their expertise in the field.
- Sample representativeness. Another relevant aspect that could have affected the reliability of the results is how representative the sample is of the study population or of specific subgroups of individuals. In our case, this threat was mainly connected to the domain-specific questions, e.g., comparisons between fairness and other quality aspects in different machine learning application domains. To address this threat, we first collected information on the respondents’ past experience with machine learning fairness and the application domains they were involved in; afterwards, we made sure to ask practitioners to answer only those domain-specific questions related to their own expertise. This increased our confidence in the reliability of the answers and in the representativeness of the sample for gathering insights related to specific application domains.
- Sample size. The number of respondents might have affected the reliability of the results. We aimed to collect responses from at least 100 experienced participants. The overall number of responses obtained, i.e., 117, is comparable to similar studies in the field of empirical software engineering (Rafi et al. 2012; Palomares et al. 2017; Garousi et al. 2017). In addition, the filters set and the quality assessment checks performed allowed us to rely on practitioners with experience in the development of fairness-critical systems. As such, we are confident that the findings of the paper represent well the more general perspective of practitioners with respect to the engineering of machine learning fairness. Yet, we acknowledge that for the domain-specific questions we had to analyze smaller sub-samples; as such, further replications of our work would be desirable to corroborate our findings.
7.2 Construct validity
- Hypothesis guessing. A short or badly designed survey may cause hypothesis guessing. In response to this threat, we structured the survey contents following an ad-hoc logical flow, placing within the same survey sections questions that are logically related to more than one research question. In addition, we made sure to phrase the survey questions so that they would not bias the practitioners’ ideas and responses; in this sense, the survey validation step conducted through the pilot test increased our confidence in the quality of the survey design.
- Level of constructs. Since we used an administration platform that involved practitioners with different backgrounds and expertise, the results could be affected by poor quality and comprehensibility of the survey and by the skills required to deal with the specific subject under study. To guarantee the highest level of comprehensibility, we formulated all questions in English, avoiding long sentences and technical vocabulary. In addition, we encouraged participation by emphasizing the need for specific knowledge in the areas of Software Engineering or Artificial Intelligence and fairness-critical machine learning development.
- Mono-method bias. Since we collected information through a single instrument, i.e., the survey, we conducted a pilot test with Ph.D. students in Computer Science to validate its contents and duration; the pilot highlighted some issues with the survey contents and required time that could have impacted our research findings. We solved the raised issues by clarifying the terminology used, particularly when referring to technical concepts and definitions, improving the phrasing of some questions, and better clarifying the study objectives and data management policies, in addition to fixing minor issues such as typos and grammar mistakes.
- Experience bias. Another possible threat to construct validity is connected to potential misalignments in experience between the Ph.D. students who took part in the pilot test (relying only on theoretical knowledge) and the practitioners we surveyed to answer our research questions (benefiting from theoretical knowledge augmented by experience). As such, practitioners may have taken less time to complete the study due to their experience, or perhaps more time given the higher level of detail they were able to provide. To partially mitigate this threat and ensure that the completion time measured in the pilot test was in line with the actual completion time taken by the practitioners, we involved five Ph.D. students working on research themes connected to machine learning engineering and software fairness, one of whom also had three years of previous working experience as a data scientist. Relying on these Ph.D. students allowed us to mitigate the risk of misalignment between pilot testers and our target population.
- Time-efficiency bias. Through the pilot test, we estimated that the mean time required to complete the survey ranged between 15 and 20 minutes; extreme deviations from this range might indicate participants not taking the task seriously, hence potentially biasing our results. After administering the survey, we verified that the actual mean completion time was indeed close to 20 minutes. In addition, we manually verified the reliability of all submissions received, paying special attention to responses from practitioners who took too much or too little time.
7.3 Internal validity
- Selection. In an empirical investigation, the people recruited could negatively influence the study if demotivated or not adequately rewarded, besides introducing self-selection or voluntary response bias. Under Prolific policies, participants were recruited on a voluntary basis and encouraged through a financial reward; this reward can be seen as an incentive. Incentives are well known to increase the response rate, as shown in previous studies targeting methods to increase response rates in survey studies (Avdeyeva and Matland 2013; Church 1993; Smith et al. 2019); the employed online recruitment platform itself, i.e., Prolific, is built upon such observations.
- Incentive bias. The use of rewards may even become detrimental, as it may introduce systematic bias in how practitioners respond. As recommended by previous work (Ebert et al. 2022), we mitigated this risk through multiple strategies: (1) the specification of exclusion criteria through the Prolific built-in filters, which allowed us to filter out individuals who did not meet the minimum requirements for participation, i.e., motivation at work, fluency in English, ability to work with data science-related programming languages, and employment in the computer science field(s); (2) the presence of attention checks, which test the participants’ attention and help spot cases where participants answered the survey questions just to obtain a reward, without providing detailed feedback; and (3) a data quality assessment performed upon survey completion, where we manually verified the time spent by each participant on the survey and the quality of the responses provided, in an effort to spot cases where participants neglected the task. While we cannot estimate the number of participants filtered out by strategy (1), since the Prolific built-in filters only provide an estimation of the prospective users who may participate in the survey, over 45% of the responses were manually removed upon completion as a consequence of strategies (2) and (3). In other terms, we did our best to exclude answers of evidently poor relevance and quality and to build our work upon reliable opinions. We are aware that some bias due to incentivizing participants may still be present, which is why we call for replications of our survey study.
- Mutation & history. Since the administration phase of our study required more than one day to complete, the consistency of the collected information could have been affected during the administration period. Recognizing the need to constantly monitor the evolution of the gathered data, we periodically checked the received responses according to our quality pre-screening criteria.
7.4 External validity
- Limited generalizability. Our results and conclusions are strictly related to the sample, hence potentially affected by limited generalizability. When conducting large-scale survey studies on domain- and task-specific topics, like machine learning fairness, it is critical to gather opinions from experienced practitioners having different backgrounds and working in different domains. As a consequence of the selection procedures employed and the quality checks performed, we may argue that our findings reasonably reflect the opinions of practitioners involved in multiple fairness-critical application domains and engaging with different machine learning tasks. However, we cannot claim that our results hold for the larger population of machine learning engineers working in fairness-critical domains that were not represented in the sample. Additional studies would be needed, particularly to investigate possible differences and implications in underrepresented application domains.
- Professional environment effect. Another factor possibly impacting the generalizability of the sample may be connected to the professional environment participants work in. For instance, companies implementing explicit strategies or policies to deal with fairness might train their employees on the matter; this might result in practitioners being more aware of the problem, its impact, and the practices to apply to deal with it. We did our best to collect background information that properly describes the sample taken into account in our study; nonetheless, we acknowledge that some external factors may have affected our conclusions. Our study may therefore be considered a baseline for further investigations into the matter. On the one hand, replications in specific application domains not targeted by our work might be beneficial to extend the knowledge of how fairness is managed in practice. On the other hand, researchers working in the field may complement and broaden the current body of knowledge through different research methods, e.g., mining software repositories to analyze how machine learning engineers deal with fairness requirements.
- Geographical dispersion. Looking at the demographics of our survey participants, we observed that most of them were from Europe. As such, our results may be biased toward the working habits of Europeans. To assess this potential threat in the context of our study, we performed an additional analysis considering only the answers provided by non-European practitioners, to see whether they were consistent with the general patterns observed. We found substantial consistency, which allows us to argue that the Europeans’ opinions indeed reflected a larger population’s view. However, we are aware that replications of the study targeting a more variegated socio-cultural and geographical background would be beneficial to corroborate our findings.