1 Introduction

Recommender systems help individuals explore vast catalogs of items. To this end, such systems adopt a model that implements a suitable way of ranking items. Conventionally, items are ranked in decreasing order of their relevance for a given user, estimated via machine learning. The literature has traditionally focused on optimizing user–item relevance for the users’ recommendation utility (Ricci et al. 2015). However, many recommendation scenarios involve multiple stakeholders and should account for the impact on more than one group of participants (Burke 2017; Boratto et al. 2021). For instance, the ranked lists may influence the profits and plans of item providers (Jannach and Jugovac 2019).

Context: The motivation driving this paper is that a model, optimized for users’ recommendation utility, can introduce indirect and unintentional discrimination against providers belonging to a legally protected minority class (e.g., when considering gender or ethnicity as a sensitive attribute) (Zliobaite 2017; Dwork et al. 2012). Given the primary role of recommender systems also for minority providers, having their items unfairly recommended would have human, ethical, social, and economic consequences (Ricci et al. 2015). Furthermore, due to these phenomena, providers might lose their trust in the platform and consequently leave it, affecting the ecosystem as a whole. Hence, it is imperative to uncover, characterize, and mitigate discrimination inherent in the recommendation model, so that no platform systematically and repeatedly disadvantages minority providers.

Problem Statement: The literature in ranking and recommendation has recently focused on aligning the exposure or the attention given to providers with their relevance or contribution in the catalog, at the individual or group level (Yang and Stoyanovich 2017; Liu et al. 2019; Kamishima et al. 2018; Biega et al. 2018). Our study encodes the idea of a group-level proportionality between the contribution in the catalog and the relevance, the visibility, and the exposure, following a distributive norm based on equity (Walster et al. 1973). Operationalizing this notion during the user–item relevance optimization stage may be envisioned as a proactive way of addressing provider fairness along the recommendation pipeline, given that relevance scores are the input for the final ranking stage. Being optimized for their ability to rank, the estimated relevance scores directly influence the chance of an item being ranked high (i.e., the higher the relevance is, the more likely the item appears at the top). If these relevances are biased against the minority group, the recommender system unfairly gives minority items less chance of being ranked high. Given its connection with the final ranking, relevance is thus an internal algorithmic asset to be allocated to provider groups, and not just a property of user–item pairs to be estimated. Therefore, controlling the relevance of items of a provider group can drive recommendation outcomes with lower disparities.

Despite potentially bringing fairness-related benefits on the suggested lists by itself, controlling predicted relevance scores may also help to deal with situations where the true expected relevances required by existing fairness-aware treatments are not available. Ensuring that a model improves its capability to deem the items of the minority as relevant is not trivial, since minority items tend to be under-represented in interactions. This may influence the predicted relevance and, in cascade, the recommendations involving minority providers. The disparate impact we address consists of items of a small minority group of providers systematically receiving unfairly low relevance and, by extension, an exposure not proportional to their contribution in the catalog. Our goal in this paper is thus to investigate whether, during the learning stage, taking actions to increase the relevance of items from the minority group positively impacts providers’ group fairness in recommendations.

Open Issues: While a range of frameworks to assess and mitigate provider unfairness has been introduced in the context of non-personalized people rankings (Biega et al. 2018; Singh and Joachims 2018; Lahoti et al. 2019) and item recommendation (Kamishima et al. 2018; Beutel et al. 2019), several issues remain open.

Despite being extendable to many-to-many item–provider associations, existing frameworks for provider fairness have been assessed on settings with a one-to-one association between items and providers (Beutel et al. 2019; Sapiezynski et al. 2019). This is natural in a people ranking, since the concepts of provider and item being ranked coincide. However, in a more general item recommendation scenario, items and providers may be linked by a many-to-many relationship (e.g., a movie having multiple directors or a director offering multiple movies). Hence, there is a need to assess how fair recommendations are for providers in the general context we described (e.g., for items having both female and male providers).

Furthermore, disparate exposure has been traditionally mitigated through a form of re-ranking, assuming access to true unbiased relevances (Singh and Joachims 2018; Biega et al. 2018). However, these relevances are typically estimated by means of a machine learning technique, leading to possibly biased relevance scores. Indeed, recommender systems are known to be biased from several perspectives (e.g., popularity, presentation, and, obviously, unfairness for users and providers). Predicting a relevance score on biased/unfair results and basing a re-ranking approach on a possibly biased relevance may lead to undesired effects, considering that relevance directly influences the chance of an item being ranked high. This issue calls for novel methods able to instill a fine-grained share of relevance across groups into the algorithmic mechanics. This would generate a tangible impact on disparity reduction in the final ranking.

To the best of our knowledge, no approach deals with controlling the balance of relevance estimates across provider groups under the above scenario. Indeed, while in-processing regularizations of relevance exist (Kamishima et al. 2018; Beutel et al. 2019), which would overcome the second issue, these treatments are fundamentally driven by a fairness objective different from ours, do not rely on controlling the share of relevance, and are still assessed on a one-to-one item–provider relationship.

Motivating Intuitions: The intuitions that drive our approach are depicted with concrete examples, taken from the MovieLens-10M dataset, presented in detail in Sect. 5.1.1. Considering a binary gender attribute and using movie directors as a proxy of providers, female directors appear in \(6.0\%\) of the items in the catalog, but end up being under-represented with only \(3.9\%\) of the interactions. With the pair-wise approach we employ in this work and the (un)fairness metric we will present, female providers receive \(2.9\%\) of the total item relevance (and \(2.8\%\) of the exposure), being affected by the disparate impact.

Considering that the items having more interactions are more likely to receive high relevance and be recommended at the top of the ranking (i.e., the well-known popularity phenomenon), we investigated whether upsampling the interactions involving the minority group of providers, to reach a percentage aligned with their representation in the catalog (i.e., \(6.0\%\) of the total interactions), can reduce disparities in relevance and, by extension, in exposure. Giving the upsampled set of interactions as an input to the same pair-wise algorithm led to female providers receiving \(5.4\%\) of relevance and \(5.2\%\) of exposure, still far from the \(6.0\%\) of representation in the catalog. Given this gap, we then regularized the share of relevance during the learning process, leveraging upsampled interactions (which are important to enable the regularization). This latter setting led to \(5.9\%\) of relevance and \(5.8\%\) of exposure for female providers, reducing the initial disparate impact. These preliminary practical results motivated us to investigate how upsampling and regularization can lead to higher relevance and lower disparate exposure for the minority.

Contributions: Compared to prior work, both in the fairness metric and the mitigation, we consider a many-to-many relationship between items and their providers and assess the representation of each value of a sensitive attribute in a given item (i.e., we would assess how represented each gender is in that item). Under this scenario, to reduce disparities in relevance and exposure, we propose a pre-processing strategy that upsamples interactions where the minority group is predominant (e.g., an item where the minority is represented by two providers is preferable to an item with only one provider of that group; moreover, the lower the representation of the majority in that item is, the more we can help the minority, by favoring an upsampling of these latter items). In addition, an in-processing component aims to control that the relevance given to the items of the minority group is proportional to the minority group contribution in the catalog. Our contribution is summarized as follows:

  • we characterize disparities in predicted relevance, visibility, and exposure against the minority group of providers and assess their existence on synthetic data that simulates diverse representations of the group in the catalog and the interactions, learning lessons that guide our mitigation;

  • we present a mitigation approach that relies on (i) tailored upsampling in pre-processing and (ii) a regularization term added to the original training optimization function, to operationalize our motivating intuitions;

  • we leverage two public datasets with gender information of the providers, enabling the consequent evaluation of the impact of our metrics and strategies on real-world datasets with very small minority groups.

Roadmap: The remainder of this paper is structured as follows: Sect. 2 formalizes key concepts and metrics, and Sect. 3 describes our exploratory analysis. Then, Sect. 4 introduces our mitigation approach, while Sect. 5 assesses its feasibility. Section 6 provides connections with prior work. Finally, Sect. 7 provides concluding remarks and future research directions.

2 Concepts and definitions

In this section, we outline the recommendation scenario we seek to investigate and the concepts and definitions used throughout this paper.

2.1 Recommender system formalization

Given a set of users U, a set of items I, and a set of providers P, we assume that each item \(i \in I\) is jointly offered by a subset of providers \(P_i \subset P\), with \(|P_i| > 0\), and a provider \(p \in P\) offers a subset of items \(I_p \subset I\), with \(|I_p| > 0\). For instance, in the context of course recommendation, if we consider instructors as providers of course items, a course could have two instructors who give lectures cooperatively. Similarly, the same instructor could deliver three different courses on the platform, two of them cooperatively and one alone, just as an example. Each provider \(p \in P\) is associated with N discrete sensitive attributes \((a_1^p, a_2^p, \cdots , a_n^p)\), with \(a_1^p \in A_1 \subset {\mathbb {N}}\), \(\ldots\), \(a_n^p \in A_n \subset {\mathbb {N}}\). For instance, a set \(A_{j}\) could be associated with the gender attribute and, thus, be defined as \(A_j = \{0:female, 1:male, \dots \}\), assuming that each attribute is discrete and that each discrete value is encoded as a unique integer.

We assume that users have interacted with a subset of items in I. The collected feedback from user–item interactions can be abstracted to a set of pairs (u, i) obtained from normal user activity or triplets (u, i, value), whose value is either provided by users (e.g., ratings) or computed by the system (e.g., frequency). In our study, we consider pairs derived from explicit feedback, by applying a pre-selected threshold to rating values, in order to model the recommendation task as a personalized ranking problem. We denote the user–item feedback matrix by \(R \in {\mathbb {R}}^{|U| \times |I|}\), where \(R_{u,i} > 0\) indicates that user u interacted with item i, and \(R_{u,i}=0\) otherwise. Furthermore, we denote the set of items that user \(u\in U\) interacted with by \(I_u=\{i\in I\,:\,R_{u,i} > 0\}\).

We assume that each user \(u \in U\) and item \(i \in I\) is internally represented by a D-sized numerical vector from a user-vector matrix W and an item-vector matrix X, respectively. The recommender system’s task is to optimize \(\theta = (W,X)\) for predicting unobserved user–item relevance. It can be abstracted as learning \({\widetilde{R}}_{u,i} = f_{\theta }(u,i)\), where \({\widetilde{R}}_{u,i}\) denotes the predicted relevance, \(\theta\) denotes the learnt user and item matrices, and f denotes the function predicting the relevance between \(W_u\) and \(X_i\). Given a user u, items \(i \in I \setminus I_u\) are ranked by decreasing \({\widetilde{R}}_{u,i}\), and the top-k items, with \(k\in {\mathbb {N}}\) and \(k>0\), are recommended. Our study will focus on \(k=10\) recommendations per user, since they likely receive most of the users’ attention and 10 is a widely employed cutoff. Finally, we denote the set of \(k\in {\mathbb {N}}\) items recommended to user u by \(\tilde{I}_u\).
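To make this formalization concrete, the sketch below instantiates \(f_{\theta }\) as a dot product between user and item vectors (the choice we adopt in our implementation) and builds a top-k list by excluding already observed items. It is a minimal illustration under these assumptions, not our exact code; variable names and shapes are illustrative.

```python
import numpy as np

def predict_relevance(W, X, u, i):
    """Predicted relevance R~_{u,i} = f_theta(u, i); here f is the dot product
    between the D-sized user vector W_u and item vector X_i."""
    return np.dot(W[u], X[i])

def recommend_top_k(W, X, u, interacted_items, k=10):
    """Rank the items the user has not interacted with (I \\ I_u) by decreasing
    predicted relevance and return the top-k indices, i.e., the list I~_u."""
    scores = W[u] @ X.T                       # relevance of every item for user u
    scores[list(interacted_items)] = -np.inf  # exclude items already in I_u
    return np.argsort(-scores)[:k]
```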

2.2 Associating providers’ sensitive attributes to items

Formalizing our target notion of fairness for provider groups, under the scenario depicted in Sect. 2.1, requires dealing with several aspects. Fairness studies in ranking and recommendation traditionally targeted people as entities to be ranked or recommended (Biega et al. 2018; Yang and Stoyanovich 2017; Lahoti et al. 2019). While individuals are still directly affected by how recommendations are generated, entities to be recommended are not always individuals and may include items (e.g., movies, courses). This gives rise to key challenges that emerge in cascade.

First, in many cases, there is no direct one-to-one mapping between an item and the individual who has created or offered it (i.e., the provider). Realistic scenarios need to consider items created cooperatively by more than one provider (e.g., a course with two instructors) and how the sensitive attributes are associated with the involved providers. It can even be difficult to come up with a one-to-many mapping for items offered by an entity not directly linked to individuals (e.g., a training company providing an online course).

Second, the fact that an item might have more than one provider behind it poses the problem of how to model the representation of a providers’ sensitive attribute, when considering that item (e.g., how each gender is represented in a given item), based on the individuals associated with it. Linking a unique variable, either binary or multi-class, discrete or continuous, to a sensitive attribute of a provider and claiming fairness on such a variable is often impractical. More sophisticated solutions should be considered. For instance, the metrics proposed by Biega et al. (2018) and in the TREC Fair Ranking Track (Biega et al. 2020) have been devised to handle items with multiple providers with different attributes.

Based on these observations, we define a notion of sensitive attribute representation for an item i, subjected to a sensitive attribute A. This notion requires to consider the membership of each provider \(p \in P_i\) to a class of the sensitive attribute A (which we previously denoted as \(a^p\)), while mapping sensitive attributes to items.

Definition 1

(Sensitive attribute representation) Given a sensitive attribute \(A \subset {\mathbb {N}}\), the sensitive attribute representation \(s_i^A\) of an item i with respect to A is defined as:

$$\begin{aligned} s_i^A = [ \, |P_i^a| \,, \, \forall \, a \in A] \end{aligned}$$
(1)

where \(P_i^a\) is the set of i’s providers with attribute value \(a \in A\). Each vector \(s_i^A\) has size |A| for all items \(i \in I\), and each of its entries represents the number of providers who belong to a given class of the attribute A, ranging in \([0, |P_i|]\). Similarly to us, Sapiezynski et al. (2019) use a function to map each ranked item to a vector. However, their vector is used as a proxy of uncertainty while assigning a sensitive attribute value to a person to be ranked (e.g., given a binary gender construct, if a system considers that a person is male with a probability of \(10\%\), the vector associated to that person is [0.10, 0.90]). Our notion differs both conceptually and operationally, as we model and compute, in magnitude, how each value a sensitive attribute can assume is represented across the providers associated to a given item. Furthermore, our notion could be extended to model uncertainty while obtaining the value of the sensitive attribute associated with a provider, assumed by us to be \(a \in A \subset {\mathbb {N}}\). To better highlight our contribution, our study leaves this combination as future work.
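As an illustration of Definition 1, the following sketch computes \(s_i^A\) for one item from the attribute values of its providers; the dictionaries used to encode providers and attribute classes are hypothetical.

```python
from collections import Counter

def sensitive_attribute_representation(item_providers, provider_attr, attr_values):
    """Eq. 1: count, for one item, how many of its providers belong to each
    class of the sensitive attribute A (classes encoded as integers)."""
    counts = Counter(provider_attr[p] for p in item_providers)
    return [counts.get(a, 0) for a in attr_values]

# Toy example with a binary gender construct encoded as {0: female, 1: male}:
provider_attr = {"p1": 0, "p2": 1, "p3": 1}
s_i = sensitive_attribute_representation({"p1", "p2", "p3"}, provider_attr, [0, 1])
# s_i == [1, 2]: one female and two male providers for this item
```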

2.3 Identifying the minority group

Our study considers groups of providers who belong to a given class of the attribute \(a \in A\). Each group is involved in the creation/delivering of a certain number of items in the catalog and, consequently, in a certain number of the user–item interactions. Specifically, given the definitions previously provided in Sect. 2.2, the representation of a group in the catalog and the interactions is computed in our study as follows:

Definition 2

(Provider group representation in the catalog) Given a sensitive attribute \(A \subset {\mathbb {N}}\), the representation of providers with a value of the sensitive attribute \(a \in A\) in the catalog, is defined as:

$$\begin{aligned} {\mathcal {C}}^a = \frac{1}{|I|} \sum _{i \in I} \frac{s_i^A(a)}{\sum _{{\widetilde{a}} \in A} s_i^A({\widetilde{a}})} \end{aligned}$$
(2)

where \(s_i^A(a)\) is the element of the vector \(s_i^A\) associated to the value a, as per the definition in Eq. 1. The representation \({\mathcal {C}}^a\) ranges in [0, 1] and accounts for the contribution of providers belonging to a given group to the delivery of items in the catalog. A value close to 0 means that providers in group a rarely contribute to items in the catalog, and vice versa for values close to 1. Similarly, we define the representation of a provider group in the interactions.

Definition 3

(Provider group representation in the interactions) Given a sensitive attribute \(A \subset {\mathbb {N}}\), the representation of providers with a value of the sensitive attribute equal to \(a \in A\), in the interactions R, where \(M = \{(u,i) \, : \, R_{u,i}>0\}\) are the observed interactions, is defined as:

$$\begin{aligned} {\mathcal {O}}^a = \frac{1}{|M|} \sum _{(u,i) \in M} \frac{s_i^A(a)}{\sum _{{\widetilde{a}} \in A} s_i^A({\widetilde{a}})} \end{aligned}$$
(3)

In our study, we are interested in investigating how recommendation decisions impact a group of providers identified as a minority. There exist different ways to identify a minority group \(\text {a}_{\min }\), one of them being the lowest representation in the catalog, i.e., \(\text {a}_{\min } = {\text {argmin}}_{a \in A} \, {\mathcal {C}}^a\). This choice better supports accounting for differences in contribution among provider groups, assuming that the catalog curation does not suffer from sampling bias (e.g., a course platform that refuses to add courses given by female teachers to its catalog). While it could be reasonable to assume that certain groups of providers are less represented than others in the catalog (e.g., because certain categories of items are traditionally offered by providers of a given gender), the recommendation loop may lead to under-representing the minority group in the interactions more and more with respect to its contribution in the catalog, i.e., \({\mathcal {C}}^{\text {a}_{\min }} > {\mathcal {O}}^{\text {a}_{\min }}\). This effect may inadvertently bias the learnt relevance and, consequently, hold back recommendations of minority group items.
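A minimal sketch of Definitions 2 and 3 and of the minority-group selection follows; `s` is assumed to map each item to its vector \(s_i^A\) from Eq. 1, indexed by the integer-encoded attribute values.

```python
def group_representation(items, s, a):
    """Eq. 2 when `items` is the catalog I; Eq. 3 when it lists the item of every
    observed (u, i) pair (with repetitions): average share of class-a providers."""
    shares = [s[i][a] / sum(s[i]) for i in items]
    return sum(shares) / len(shares)

def minority_group(catalog, s, attr_values):
    """a_min = argmin_a C^a, i.e., the least represented class in the catalog."""
    return min(attr_values, key=lambda a: group_representation(catalog, s, a))
```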

2.4 Formalizing disparities

To assess the extent to which the recommender system generates disparities, we define three core disparity metrics. One of them considers an internal perspective and monitors the difference of predicted relevance between providers’ groups. The other metrics operate on the final outcomes of the recommender system, monitoring differences in visibility and exposure.

More precisely, the disparity in relevance (\(\varDelta {\mathcal {R}}\)) is quantified as the absolute difference between the representation in the catalog (\({\mathcal {C}}^{\text {a}_{\min }}\)) and the percentage of relevance for the minority group:

$$\begin{aligned} \varDelta {\mathcal {R}} = \left| \frac{1}{|U|} \sum _{u \in U} \frac{\sum _{\text {pos}=1}^{k} {\widetilde{R}}_{u,\rho _{\theta }(u,\text {pos})} \cdot s_{\rho _{\theta }(u,\text {pos})}^A({\text {a}_{\min }})}{\sum _{\text {pos}=1}^{k} \sum _{a \in A} {\widetilde{R}}_{u,\rho _{\theta }(u,\text {pos})} \cdot s_{\rho _{\theta }(u,\text {pos})}^A(a)} - {\mathcal {C}}^{\text {a}_{\min }}\right| \end{aligned}$$
(4)

where \(\rho _{\theta }(u,\text {pos})\) represents the item recommended at position \(\text {pos}\) for user u and \({\widetilde{R}}_{u,\rho _{\theta }(u,\text {pos})}\) refers to the predicted relevance formalized in Sect. 2.1, while the terms \({\mathcal {C}}^{\text {a}_{\min }}\) and \(s_{\rho _{\theta }(u,\text {pos})}^A\) derive from Eqs. 2 and 1, respectively. Scores of \(\varDelta {\mathcal {R}}\) refer to top-k recommendations and range in [0, 1], with higher values indicating a higher disparity of the degree of relevance estimates with respect to the contribution in the catalog for the minority group.

A disparity in relevances might not necessarily imply that the minority group is discriminated based on its exposure or visibility in the recommendation lists (Singh and Joachims 2018), which is exactly what we aim to investigate in this paper. For this reason, we also define the difference between the contribution in the catalog and the percentage of visibility (\(\varDelta {\mathcal {V}}\)) and of exposure (\(\varDelta {\mathcal {E}}\)) for items of the minority group. Disparate visibility and exposure are formalized as follows:

$$\begin{aligned} \varDelta {\mathcal {V}}= & {} \left| \frac{1}{|U|} \sum _{u \in U} \frac{\sum _{\text {pos}=1}^{k} s_{\rho _{\theta }(u,\text {pos})}^A({\text {a}_{\min }})}{\sum _{\text {pos}=1}^{k} \sum _{a \in A} s_{\rho _{\theta }(u,\text {pos})}^A(a)} - {\mathcal {C}}^{\text {a}_{\min }}\right| \end{aligned}$$
(5)
$$\begin{aligned} \varDelta {\mathcal {E}}= & {} \left| \frac{1}{|U|} \sum _{u \in U} \frac{\sum _{\text {pos}=1}^{k} \frac{1}{log_2(\text {pos}+1)} s_{\rho _{\theta }(u,\text {pos})}^A({\text {a}_{\min }})}{\sum _{\text {pos}=1}^{k} \sum _{a \in A} \frac{1}{log_2(\text {pos}+1)} s_{\rho _{\theta }(u,\text {pos})}^A(a)} - {\mathcal {C}}^{\text {a}_{\min }} \right| \end{aligned}$$
(6)

where \(\rho\), \({\widetilde{R}}\), \({\mathcal {C}}^{\text {a}_{\min }}\), and \(s_{\rho _{\theta }(u,\text {pos})}^A\) are defined as above. Scores of \(\varDelta {\mathcal {V}}\) and \(\varDelta {\mathcal {E}}\) refer to top-k recommendations and range in [0, 1], with lower values indicating a lower disparity w.r.t. the contribution in the catalog.
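The disparity metrics can be computed directly from the recommended lists. The sketch below mirrors Eqs. 5 and 6 (Eq. 4 follows the same pattern with relevance-weighted terms); the dictionary of per-user top-k lists is an assumption of the example.

```python
import math

def disparate_visibility(rec_lists, s, a_min, c_min):
    """Eq. 5: |mean per-user share of minority providers in the top-k - C^{a_min}|."""
    shares = []
    for topk in rec_lists.values():                  # topk: ordered item list of one user
        num = sum(s[i][a_min] for i in topk)
        den = sum(sum(s[i]) for i in topk)
        shares.append(num / den)
    return abs(sum(shares) / len(shares) - c_min)

def disparate_exposure(rec_lists, s, a_min, c_min):
    """Eq. 6: as Eq. 5, but each position pos is discounted by 1 / log2(pos + 1)."""
    shares = []
    for topk in rec_lists.values():
        num = sum(s[i][a_min] / math.log2(pos + 1) for pos, i in enumerate(topk, 1))
        den = sum(sum(s[i]) / math.log2(pos + 1) for pos, i in enumerate(topk, 1))
        shares.append(num / den)
    return abs(sum(shares) / len(shares) - c_min)
```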

3 Optimizing under different catalog-interaction representations

To illustrate the unfairness against a minority group of providers and further emphasize the value of our analytical modeling, we simulate various imbalances in the catalog and the interactions for the minority group. Then, we characterize to what extent a model is unfair against the minority group. Specifically, the exploratory study presented in this section aims to assess the extent to which the share of relevance across groups depends on imbalances between catalog and observation representations, and whether reducing the degree of imbalance between the representations in the catalog and the interactions for a minority group leads to lower disparate exposure. Our hypothesis is that there is a strong, direct relationship between the imbalance in catalog-observation representations and the estimated disparities defined in Sect. 2.4.

3.1 Pair-wise optimization and exploratory protocols

Pair-wise optimization is one of the most influential approaches to train recommendation models and represents the foundation of many cutting-edge personalized algorithms (Chen et al. 2017; Xue et al. 2017; Xiao et al. 2017). The underlying Bayesian formulation (Rendle et al. 2012) aims to maximize a posterior probability that can be adapted to the parameter vector of an arbitrary model class (e.g., matrix factorization or neighborhood-based). In our study, we adopt matrix factorization (Koren et al. 2009), due to its popularity and flexibility. Model parameters \(\theta\), i.e., user and item matrices, are estimated through an objective function that maximizes the margin between (i) the relevance \(f_{\theta }(u,i)\) predicted for an observed item i and (ii) the relevance \(f_{\theta }(u,j)\) predicted for an unobserved item j. The optimization process considers a set of triplets D that are fed into the model during training:

$$\begin{aligned} D = \{(u,i,j) \, | \, u \in U, i \in I^{+}_{u}, j \in I^{-}_{u}\} \end{aligned}$$
(7)

where \(I^{+}_{u}\) and \(I^{-}_{u}\) are the sets of items for which user u’s feedback is observed and unobserved, respectively.

The original implementation proposed by (Rendle et al. 2012) requires that, for each user u, triplets (u, i, j) be created per observed item i; the unobserved item j is randomly selected. The objective function can be formalized as follows:

$$\begin{aligned} \underset{\theta }{{\text {argmax}}} \mathop {\sum }_{(u,i,j) \in D} \delta \left( f_{\theta }(u,i) - f_{\theta }(u,j)\right) - \left\Vert \theta \right\Vert _2^2 \end{aligned}$$
(8)

where \(\delta\) is a sigmoid function returning a value between 0 and 1.

The code for our study was implemented in Python on top of TensorFlow. User and item matrices, with vectors of size 100, were initialized with values uniformly distributed in [0, 1]. The optimization function was transformed into the equivalent minimization dual problem. For each user, we randomly set apart \(70\%\) of their interactions for training, \(10\%\) for validation, and \(20\%\) for testing. Given the training user–item interactions, the model was served with batches of 1,024 triplets. For each user u, we created 10 triplets (u, i, j) per observed item i; the unobserved item j was randomly selected for each triplet. The optimizer used for gradient updates was Adam. Training lasted until convergence on the validation set. Parameters were selected via grid search on the validation set.
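For reference, a minimal TensorFlow sketch of the pair-wise objective in Eq. 8 is shown below, written in the standard log-sigmoid form of the minimization problem for one batch of (u, i, j) triplets; the L2 weight and the embedding handling are simplifications, not our exact setup.

```python
import tensorflow as tf

def bpr_batch_loss(W, X, users, pos_items, neg_items, l2=1e-5):
    """Pair-wise loss for a batch of (u, i, j) triplets (Eq. 8, minimized):
    maximize the margin f(u, i) - f(u, j) via -log(sigmoid(margin)),
    plus an L2 penalty on the embeddings involved in the batch."""
    w_u = tf.gather(W, users)        # user vectors
    x_i = tf.gather(X, pos_items)    # observed item vectors
    x_j = tf.gather(X, neg_items)    # unobserved item vectors
    margin = tf.reduce_sum(w_u * x_i, axis=1) - tf.reduce_sum(w_u * x_j, axis=1)
    loss = -tf.reduce_mean(tf.math.log_sigmoid(margin))
    reg = l2 * (tf.nn.l2_loss(w_u) + tf.nn.l2_loss(x_i) + tf.nn.l2_loss(x_j))
    return loss + reg
```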

3.2 Observations on synthetic datasets

To investigate if and to what extent the share of relevance across providers’ groups depends on imbalances between catalog and observation representations, we consider a recommendation context that associates each provider \(p \in P\) with a generic binary sensitive attribute, and we assume that each item is associated with a single provider, leaving experiments on items associated with more than one provider to the real-world datasets leveraged in Sect. 5.

Specifically, the imbalances considered in this study fall into two forms: catalog imbalance and observation imbalance. Catalog imbalances emerge when providers from different groups occur in the catalog with different frequencies. For instance, there may be significantly fewer female/male providers than male/female providers who offer items to users. On the other hand, with observation imbalances, users interact with items from certain provider groups with different tendencies. This imbalance is often part of a feedback loop involving existing recommendation methods, introduced either by models or by humans. If users never receive any item offered by a provider belonging to a certain group, they will not interact with that class of providers. In cascade, models will be served with only little data on this preference relation. For instance, training data about female/male providers may be significantly scarcer than training data about male/female providers.

To assess the interplay among representations in the catalog, the interactions, and the relevance, we generate a range of synthetic datasets that simulate different catalog and observation imbalances. To create them, we use a procedure based on two stochastic block models (Yao and Huang 2017), whose description is provided in Appendix A. The popularity tails and the catalog and observation representations of the resulting 15 synthetic datasets are reported in Fig. 1. Through synthetic datasets, we explore a wider range of configurations, probing situations not usually observable in public real-world datasets but that might occur in the real world, e.g., datasets with different representations of the minority group.

Fig. 1 Synthetic Datasets Imbalance. (a) Popularity tail across items based on the observed interactions, for each of our synthetic datasets, generated according to the procedure in Appendix A. (b) Catalog and observation representations of the minority group in the synthetic data, where C stands for “Catalog”, O stands for “Interactions”, and \(\varDelta\)C–O is the difference between catalog and interaction representations

Once the synthetic datasets are generated, we run the pair-wise optimization procedure on all of them. Then, we analyze the resulting relevance scores for each provider group with respect to their contribution in the catalog and their representation in the interactions. To this end, Fig. 2 depicts the share of the items’ relevance for the minority group (left) and the difference between contribution and relevance shares for the minority group (right). It should be noted that the lower half-diagonal of the heatmap is not considered, given that we only generate synthetic datasets where the difference between contribution and interaction representations for the minority group is non-negative. Results show that the representation in relevance (left heatmap) is consistent across datasets having the same representation of the minority group in the interactions, i.e., within the same column (e.g., the 0.5-0.4 and 0.4-0.4 settings). Further, for each dataset, the relevance is similar to the representation in the interactions and grows as the representation in the interactions increases (from left to right). It follows that the representation in the interactions for the minority group appears to play a key role in shaping the share of relevance for the group. By extension, the disparate relevance may directly depend on the gap between the representation in contribution and in interactions (right heatmap): the smaller the gap, the lower the difference between (i) the representation of the minority group in the catalog and (ii) the share of relevance assigned to it. The heatmaps allow us to see to what extent the imbalance between catalog and interaction representations influences the disparate relevance. We can draw the following observation.

Fig. 2 Contribution–Relevance Relationship. The percentage of relevance given to the minority group (left) and the difference between contribution and relevance percentages (right)

Next, according to the relevance learnt by the recommendation model on each synthetic dataset, we recommended \(k=10\) items to each user; then, in Fig. 3, we measured the disparate visibility and exposure for the minority group, both ranging in [0, 1]. We consider visibility as the percentage of providers of a given group in the recommendations (regardless of their position in the ranking), while we use a definition of exposure inspired by (Singh and Joachims 2018). Both have been previously introduced in Sect. 2.4. The higher the value is, the higher the disparate impact is. Connecting all these results allows us to understand how much the imbalances in relevance for provider groups, learnt by the recommender system, result in inequalities in the recommended lists.

Fig. 3 Disparate Impacts. Disparate visibility (a) and exposure (b) for the minority group \(\text {a}_{\min }\) in top-10 lists, calculated with Eqs. 5 and 6, respectively

We can observe that the effect on exposure is more evident. We conjecture that this result might depend on the fact that, in the presence of a small minority, the items from the minority group are progressively inserted at lower positions of the top-10 or even excluded, because of the lower predicted relevance. These considerations suggest investigating treatments that act on the interaction and relevance distributions. Hence, we will play with the minority group representation in interactions and regularize the percentage of relevance given to items across groups.

4 Reducing disparities via upsampling and regularization

Building on our fairness goals and on the intuitions emerging from the exploratory study, this section describes how we can arrange a recommender system to reduce disparities while preserving utility.

Our exploratory analysis revealed that the share of relevance may depend on the representation of providers’ groups in both the catalog and the interactions, and that the more similar the two representations are for a group, the lower the resulting disparate relevance is. It is unlikely that this property is met in interactions collected from real-world platforms, as we will later show. It follows that controlling the balance among catalog-interaction representations for a group could require acting on the interactions. To this end, we will upsample interactions of the minority group, to reduce existing imbalances.

Balanced representations of the minority group between the catalog and the interactions would not necessarily ensure a lower disparate relevance in real-world situations. Differently from the synthetic data we generated, interactions in the real world show several imbalances (e.g., due to presentation, preferences, user interfaces), which are hard to simulate and may still distort the output relevance. It follows that, when an upsampling mechanism is not sufficient to accomplish our goals, we need a regularization approach that accounts for the distribution of relevance across groups during learning. Only regularizing the distribution of relevance across groups, with no upsampling, may not be enough either, if minority interactions are too few. Hence, our treatment will control the interplay between upsampling and regularization.

To deal with upsampling, we act on the data sampling strategies that generate interaction instances (i.e., observed user–item pairs); conversely, to account for relevance, we define a training loss function aimed at minimizing both the pair-wise error specified in Eq. 8 and the disparate relevance defined in Eq. 4. We will show empirically that, although the optimization relies on a given set of interactions, even artificially upsampled ones, the approach generalizes to real and unseen interactions. The treatment builds upon the following steps:

Interaction Upsampling. We propose to upsample interactions related to the minority group with different user–item selection techniques, with the aim of covering a range of alternative setups:

  • real consists of an upsampling of existing interactions belonging to the minority group, with repetitions. Specifically, we select the item of the existing user–item interaction to be upsampled based on a probability function that takes into account the contribution of the minority, \(s_i^A(\text {a}_{\min })\), for each item i. The higher the contribution of the minority group, the higher the probability of being selected. Then, the real interactions involving the selected item i are retrieved, and the one to be upsampled is randomly selected.

  • fake stands for a random upsampling of synthetic interactions, with no repetitions. This strategy adds new interactions related to items from the minority group. Similarly to real, the item involved in the upsampled interaction is selected based on a probability function that accounts for the contribution of the minority, \(s_i^A(\text {a}_{\min })\), for each item i. Then, the user to be included in the upsampled interaction is randomly selected among those users of U who have not already interacted with item i.

  • fake-by-pop refers to an upsampling of synthetic interactions based on item popularity, with no repetitions. Given the items with at least one provider from the minority, the item to be inserted in the upsampled observation is selected according to an item–popularity probability. The higher the popularity, the higher the probability of being selected. The user of the upsampled interaction is randomly chosen among those users of U who have not already interacted with item i.

These strategies upsample pairs (u, i) until the representation of the minority group in the interactions meets a target percentage of the total interactions. This percentage, investigated in the experimental section, will first target the representation of the minority group in the catalog.
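As a reference, the sketch below outlines the three strategies with shared machinery; the stopping rule, the weight definitions, and the data structures are simplifications of the description above, not our exact implementation.

```python
import random
from collections import Counter

def upsample(interactions, s, a_min, all_users, n_new, mode="real", seed=0):
    """Add n_new (u, i) pairs involving minority items, following the real,
    fake, or fake-by-pop strategy; s[i] is the vector from Eq. 1."""
    rng = random.Random(seed)
    minority_items = [i for i in s if s[i][a_min] > 0]
    if mode == "fake-by-pop":                       # weight items by popularity
        pop = Counter(i for _, i in interactions)
        weights = [pop[i] for i in minority_items]
    else:                                           # real / fake: weight by minority share
        weights = [s[i][a_min] / sum(s[i]) for i in minority_items]

    users_of = {i: {u for u, j in interactions if j == i} for i in minority_items}
    added = []
    while len(added) < n_new:
        i = rng.choices(minority_items, weights=weights, k=1)[0]
        if mode == "real":                          # repeat an existing interaction
            pool = list(users_of[i])
        else:                                       # fake / fake-by-pop: create a new pair
            pool = [u for u in all_users if u not in users_of[i]]
        if not pool:
            continue
        u = rng.choice(pool)
        if mode != "real":
            users_of[i].add(u)                      # no repetitions for synthetic pairs
        added.append((u, i))
    return interactions + added
```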

Regularized Optimization: Given a range of batches of training data samples \(T_{batch}\) (i.e., either pairs for a point-wise approach or triplets for a pair-wise approach), built on top of the upsampled interactions, each training batch is fed into a model that follows a regularized paradigm, derived from a traditional optimization setup. The loss function can be formalized as follows:

$$\begin{aligned} \underset{\theta }{{\text {argmax}}} \,\, (1 - \lambda ) \, \text {acc}(\text {T}_{\text {batch}}) - \lambda \, \text {reg}(\text {T}_{\text {batch}}) \end{aligned}$$
(9)

where \(acc(T_\mathrm{{batch}})\) is the original accuracy loss, computed over \(T_\mathrm{{batch}}\). In our experimental study, we deal with a pair-wise optimization, thus the accuracy loss is computed as in Eq. 8. The \(\lambda \in [0,1]\) parameter expresses the trade-off between accuracy and disparate relevance. With \(\lambda =0\), we obtain the output of the original recommender, without taking disparate relevance into account. Conversely, with \(\lambda =1\), the output of the recommender is discarded, and we focus only on minimizing disparate relevance.

The regularization term, \(reg(T_\mathrm{{batch}})\), operationalizes our strategy of disparate relevance minimization, based on Eq. 4. The proposed criterion is equivalent to computing, in percentage, the relevance received by minority group items in a batch with respect to the total relevance received by all items in that batch, and then aligning it with the percentage of contribution of the minority group in the catalog. Let \(C^{\text {a}_{\min }}\) be the contribution of the minority group in the catalog, computed as in Eq. 2; the regularization can be defined as follows:

$$\begin{aligned} \begin{aligned} reg(\text {T}_{\text {batch}}) = \left( \frac{\sum _{(u,i,\_) \in \text {T}_{\text {batch}}} f_{\theta }(u,i) \cdot S_i^A({\text {a}_{\min }})}{\sum _{(u,i,\_) \in \text {T}_{\text {batch}}} f_{\theta }(u,i)} - C^{\text {a}_{\min }} \right) ^2 \end{aligned} \end{aligned}$$
(10)

where \(S_i^A({\text {a}_{\min }}) = s_i^A({\text {a}_{\min }}) / \sum _{a \in A} s_i^A(a)\) is the percentage of minority providers who have been involved in the production/creation of item i. This regularized optimization implies that the model is penalized if the difference between relevance and contribution for the minority group of providers is high. The choice of the squared difference, instead of, e.g., an L2 norm or an Earth mover’s distance, proved beneficial during optimization, while being simple and effective. Our framework can be easily extended to other options. The contextualization with respect to the literature is presented in Sect. 6.2.
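A minimal TensorFlow sketch of Eqs. 9 and 10, written as a minimization objective for one batch, is shown below; how the relevances and minority shares are passed in is an assumption of the example, not our exact implementation.

```python
import tensorflow as tf

def regularized_loss(acc_loss, f_ui, minority_share_i, c_min, lam=1e-6):
    """Eq. 9 (minimized) with the regularizer of Eq. 10: acc_loss is the pair-wise
    loss of Eq. 8 over the batch, f_ui the predicted relevances f_theta(u, i) of the
    observed items in the batch triplets, minority_share_i the values S_i^A(a_min),
    and c_min the catalog contribution C^{a_min}."""
    minority_relevance = tf.reduce_sum(f_ui * minority_share_i)
    total_relevance = tf.reduce_sum(f_ui)
    reg = tf.square(minority_relevance / total_relevance - c_min)   # Eq. 10
    return (1.0 - lam) * acc_loss + lam * reg                       # Eq. 9
```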

5 Experimental treatment evaluation and analysis

In this section, we empirically study the effects of each component of our treatment and of the treatment as a whole on the needs of both users (i.e., recommendation utility) and providers (i.e., disparate relevance, visibility, and exposure). We answer the following four research questions:

RQ1.:

How much should we upsample minority group interactions to improve the trade-off between recommendation utility and disparities?

RQ2.:

To what extent do upsampling and regularization impact on the trade-off between recommendation utility and disparities, individually and jointly?

RQ3.:

How does our treatment concretely reduce disparities for the minority group? How does it impact on internal mechanisms?

RQ4.:

To what extent does our treatment affect disparities, utility, and coverage, compared with others? Can the latter benefit from regularized relevances?

5.1 Experimental setup

5.1.1 Datasets

In order to validate our proposal and ensure its reproducibility, we selected publicly available datasets covering different domains. We remark that this experimentation is made difficult by the fact that very few datasets target our scenario, and the datasets we consider are highly sparse.

Movielens-10M (ml-10m) (Harper and Konstan 2016) includes 10M ratings applied to 10k movies by 72k users. In order to be fed into a pair-wise model, interactions are binarized using a threshold (i.e., ratings equal to or higher than 3 are marked as 1, the other ones are changed to 0). This dataset does not contain sensitive attributes of the providers, and there is no notion of provider. Our study considers movie directors as providers to reflect a real-world scenario. To link movies to their corresponding directors, we capitalized on the methods offered by the TMDB APIs. Specifically, we used the getCredits(tmdbId) method to retrieve data about the people involved in a movie. We filtered records for individuals with “Director” as a role. Then, we called the getDetails(peopleId) method, passing the id retrieved for each director. The latter method outputs a list with the name and the gender of the director. Note that there are movies with more than one director. The representation of women directors is around \(6\%\) in the catalog and \(3.9\%\) in the interactions.
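For reproducibility, the director-extraction step can be sketched as follows; the endpoint paths and field names correspond to the public TMDB REST API behind the getCredits and getDetails methods mentioned above, but should be checked against the current documentation, and the API-key handling is only illustrative.

```python
import requests

TMDB = "https://api.themoviedb.org/3"
API_KEY = "<your-tmdb-api-key>"

def movie_directors(tmdb_id):
    """Return (name, gender) for each director of a movie: retrieve the credits,
    keep crew members whose job is "Director", then fetch each person's details
    to read the gender field (TMDB encodes 1 = female, 2 = male, 0 = unknown)."""
    credits = requests.get(f"{TMDB}/movie/{tmdb_id}/credits",
                           params={"api_key": API_KEY}).json()
    directors = [c for c in credits.get("crew", []) if c.get("job") == "Director"]
    people = []
    for d in directors:
        details = requests.get(f"{TMDB}/person/{d['id']}",
                               params={"api_key": API_KEY}).json()
        people.append((details.get("name"), details.get("gender")))
    return people
```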

COCO Course Collection (coco) (Dessì et al. 2018) includes 74k learners, who gave 600k ratings to 10k online courses. Similarly to ml-10m, ratings are binarized using a threshold (i.e., ratings equal to 5 are marked as 1, the other ones are changed to 0). We selected this threshold due to the extremely high imbalance among rating values, as reported in the original paper. In this scenario, we assume that instructors act as providers. Providers representing a company or an institution were removed, since there was no practical way to associate their items with gender representations. One or more instructors can cooperate on the same course. However, no information about their gender is reported. To extract this attribute, we considered their naming information. Specifically, we used the methods offered by GenderAPIs, which allow determining the gender from naming information with a certain confidence. Such a practice has been conducted in prior work to deal with the absence of gender labels (Chen et al. 2018; Mansoury et al. 2019). Only predictions with a confidence higher than \(75\%\) were kept. The representation of women instructors in the catalog is around \(17\%\), reduced to \(12\%\) in the interactions.

5.1.2 Evaluation metrics

In this section, we present the metrics we considered to assess the impact of our work. In addition to the disparity metrics introduced in Sect. 2.4, which cover the aspects associated with providers’ fairness, several other perspectives of the recommender system should be considered. Our study in this paper also includes an assessment (i) of personalization in terms of recommendation utility and (ii) of coverage of items for the provider groups and as a whole.

Personalization: To evaluate personalization, we compute the utility of recommended lists via Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen 2002).

$$\begin{aligned} \text {DCG}(k|\theta )= & {} \sum _{u \in U} \left( {\widetilde{R}}_{u,\rho _{\theta }(u,1)} + \sum _{pos=2}^{k} \frac{{\widetilde{R}}_{u,\rho _{\theta }(u,pos)}}{log_2(pos)} \right) \end{aligned}$$
(11)
$$\begin{aligned} \text {NDCG}(k|\theta )= & {} \frac{\text {DCG}(k|\theta )}{\text {IDCG}(k|\theta )} \end{aligned}$$
(12)

where \(\rho _{\theta }(u,pos)\) is the item i recommended to user u at position pos, and the values in \({\widetilde{R}}\) formalized in Sect. 2.1 are considered as user–item relevances, while computing DCG. The ideal DCG is calculated by sorting items based on decreasing true relevance (i.e., for an item, the true relevance is 1 if the user interacted with the item in the test set, 0 otherwise). The higher the NDCG score achieved by the recommender system is, the more effective the generated recommendations are for consumers.
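A per-user sketch of Eqs. 11 and 12 follows, assuming the predicted relevances of the recommended items for the DCG and a binary, test-set-based ideal ordering for the IDCG; the averaging across users and the exact IDCG handling are simplifications of the formulation above.

```python
import math

def ndcg_at_k(topk, pred_rel, test_items, k=10):
    """NDCG@k for one user: DCG uses the predicted relevances of the recommended
    items (Eq. 11); IDCG assumes an ideal ranking where all relevant test items
    (true relevance 1) are placed first."""
    topk = topk[:k]
    dcg = sum(pred_rel[i] if pos == 1 else pred_rel[i] / math.log2(pos)
              for pos, i in enumerate(topk, 1))
    ones = min(k, len(test_items))                 # number of achievable hits
    idcg = sum(1.0 if pos == 1 else 1.0 / math.log2(pos)
               for pos in range(1, ones + 1))
    return dcg / idcg if idcg > 0 else 0.0
```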

Item Coverage: In addition to personalization and disparate impacts, we measure the total coverage of items (\(\text {Cov}_{\text {tot}}\)) and the coverage of items delivered by providers in the minority (\(\text {Cov}_{\text {a}_{\min }}\)) and in the majority (\(\text {Cov}_{\overline{\text {a}_{\min }}}\)) group. Coverage is an important property (Kaminskas and Bridge 2017), since an approach that only increases the recommendation of the items of a single minority provider would likely not be fair within the minority group.

$$\begin{aligned} \text {Cov}_{\text {tot}}= & {} \frac{1}{|I|} \sum _{i \in I} \min \left( 1, \sum _{u \in U} |{\widetilde{I}}_u \cap \{i\}|\right) \end{aligned}$$
(13)
$$\begin{aligned} \text {Cov}_{\text {a}_{\min }}= & {} \frac{1}{|I^{\text {a}_{\min }}|} \sum _{i \in I^{\text {a}_{\min }}} \min \left( 1, \sum _{u \in U} |{\widetilde{I}}_u \cap \{i\}|\right) \end{aligned}$$
(14)
$$\begin{aligned} \text {Cov}_{\overline{\text {a}_{\min }}}= & {} \frac{1}{|I \setminus I^{\text {a}_{\min }}|} \sum _{i \in I \setminus I^{\text {a}_{\min }}} \min \left( 1, \sum _{u \in U} |{\widetilde{I}}_u \cap \{i\}|\right) \end{aligned}$$
(15)

where \(I^{\text {a}_{\min }} = \{i : s_i^A(\text {a}_{\min })>0\}\) is the set of items that have at least one provider belonging to the minority group. Each coverage score ranges in [0, 1], with values closer to 1 indicating higher coverage.
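The coverage metrics reduce to set operations over the recommended lists; a minimal sketch, assuming `rec_lists` maps each user to their top-k list, follows.

```python
def coverage(rec_lists, item_set):
    """Eqs. 13-15: fraction of items in `item_set` appearing in at least one
    top-k list; pass I, I^{a_min}, or I \\ I^{a_min} as `item_set`."""
    recommended = set()
    for topk in rec_lists.values():
        recommended.update(topk)
    return len(recommended & set(item_set)) / len(item_set)
```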

5.1.3 Experimental setting

We considered several optimization settings, each one characterized by a different combination of upsampling and regularization treatments, as proposed in Sect. 4. They are briefly identified as follows:

  • baseline: training without any upsampling and regularization treatment;

  • real: only real upsampling;

  • fake: only fake upsampling;

  • fake-by-pop: only fake-by-pop upsampling;

  • reg: only regularization;

  • real+reg: real upsampling, followed by regularization;

  • fake+reg: fake upsampling, followed by regularization;

  • fake-by-pop+reg: fake-by-pop upsampling, followed by regularization.

5.1.4 Implementation details

For each dataset, a temporal train–test split was performed by including the last \(20\%\) of the interactions released by a user into the test set, the next \(10\%\) into the validation set, and the remaining \(70\%\) oldest ones into the training set (Campos et al. 2014; Sánchez and Bellogín 2020). Embedding matrices, with vectors of size 100, were initialized with values uniformly distributed in [0, 1]. The optimization function was transformed into the equivalent minimization dual problem. During training, the model was served with batches of 1,024 training triplets, chosen from a pre-computed set of triplets. To populate it, for each user u, we create 10 triplets (u, i, j) per observed item i; the unobserved item j is randomly selected for each triplet. Before each epoch, we shuffle the training batches. The learning rate for the Adam optimizer is 0.01. The dot product was used to compute the similarity (i.e., the relevance) between user and item vectors. Each model was trained until convergence on the validation set, for a maximum of 100 epochs.

5.2 Experimental results

Fig. 4 Influence of Upsampling Degree on Trade-off. The trade-off between Normalized Discounted Cumulative Gain (NDCG: red line with bullet markers) and disparate exposure (\(\varDelta {\mathcal {E}}\): blue line with star markers) as the degree of upsampling varies, across upsampling techniques and datasets. Dotted lines indicate the degree of upsampling resulting in a good trade-off (i.e., high NDCG and low \(\varDelta {\mathcal {E}}\)). Disparate visibility and relevance showed similar patterns and are omitted for the sake of clarity and readability

5.2.1 Comparing upsampling techniques (RQ1)

With this experiment, we aim to understand to what degree upsampling influences recommendation utility and the disparate impacts on group relevance, visibility, and exposure, and to investigate how and how much we should upsample to obtain a good trade-off among the metrics. Although our exploratory study revealed that aligning the percentage of interactions for the minority group with its percentage of contribution in the catalog may be the best choice, interactions in the real world show several imbalances that may distort the output relevance. Hence, we experiment with different degrees of upsampling, not just targeting a minority group representation in the interactions equal to its representation in the catalog.

To this end, for each dataset and upsampling technique, we created a range of model instances fed with a different amount of upsampled data, using the upsampling techniques described in Sect. 4. Results in Fig. 4 depict NDCG and \(\varDelta {\mathcal {E}}\) at increasing percentages of minority observation upsampling. Patterns related to \(\varDelta {\mathcal {R}}\) and \(\varDelta {\mathcal {V}}\) were similar to the ones obtained on \(\varDelta {\mathcal {E}}\), so we do not report them for conciseness and readability. The plots show that NDCG tended to decrease as the amount of upsampled data became larger. The loss in recommendation utility depends on the dataset and the technique, with fake suffering from the largest loss. Conversely, we observed that \(\varDelta {\mathcal {E}}\) achieved its lowest value for an upsampling between \(15\%\) and \(20\%\), depending on the dataset. This latter behavior came from the fact that, for small upsampling amounts, the model tended to show a disparate impact in favor of the majority group. Increasing the upsampling leads the minority to gain more and more exposure; this can reach the point where the majority is affected by a disparate impact, i.e., the minority group is favored more than expected (e.g., in Fig. 4a, when the upsampling is greater than \(4\%\)).

Moving to the comparison of the results across datasets, coco experiences a lower loss in NDCG than ml-10m for the same upsampling technique. Interestingly, for small upsampling amounts, NDCG even increases in coco with respect to the baseline, which does not make use of upsampling. Furthermore, coco is more susceptible to the amount of upsampling, resulting in larger variations of \(\varDelta {\mathcal {E}}\). Considering the same dataset and observing the patterns of the different upsampling techniques, it can be observed that real preserves a good level of NDCG, even for high amounts of upsampling. Conversely, \(\varDelta {\mathcal {E}}\) follows similar patterns for all the upsampling techniques. An exception is real on ml-10m, which showed a decreasing yet noisy trend in \(\varDelta {\mathcal {E}}\). Therefore, while upsampling is in general beneficial for controlling \(\varDelta {\mathcal {E}}\), each technique preserves the originally achieved NDCG to a different extent, changing the trade-off between effectiveness and disparate impacts.

Table 1 Impact of Upsampling on Recommended Lists.

To characterize the peculiarities of each upsampling technique, Table 1 reports information on recommendation utility, disparate impact, and coverage for representative settings, which achieved a good trade-off. Results show that, in general, upsampling brings benefits to disparate impacts and coverage, while preserving recommendation utility. Specifically, on coco, real experienced a disparate impact lower than \(1\%\) at all levels (i.e., relevance, visibility, exposure) and doubled the coverage of minority group items (i.e., column \(\text {Cov}_{\text {a}_{\min }}\)). Conversely, fake-by-pop allowed us to improve the original recommendation utility, but disparate impact and coverage did not experience the same gains as with real. On ml-10m, similar patterns were observed for real, even though the loss in NDCG was larger. Compared with coco, fake and fake-by-pop achieved a better trade-off among metrics on ml-10m.

5.2.2 Benchmarking combined treatments (RQ2)

Even though upsampling made it possible to achieve good trade-offs, there are still disparities that should be reduced. Hence, in this experiment, we are interested in understanding the impact of the regularization on the representative settings considered in the previous section. To this end, we applied the regularization described in Sect. 4 to each of the settings reported in Table 1. Given that the disparate impacts to be reduced are often small, we adopt \(\lambda =10^{-6}\) as the regularization weight. Our empirical results with smaller or larger \(\lambda\) values led to unreasonable variations on the validation set.

Results in Table 2 show the recommendation utility, disparate impact, and coverage achieved by the model instances trained with upsampling and regularization jointly. When comparing results between baseline and reg, it can be observed that a plain regularization, without upsampling, fails to bring a proper reduction of the disparate impact. This is caused by the fact that the regularization depends on the amount of minority group interactions, and such data is scarce when upsampling is not performed. Conversely, the regularization can introduce benefits for the other settings, especially for the fake and fake-by-pop settings. We can draw the following observation.

The regularization is essential to fine-tune the trade-off in cases where upsampling alone does not allow reducing disparities any further. On both coco and ml-10m, this effect is observed for the fake and fake-by-pop settings. With a small loss in NDCG, disparate impact and coverage experienced substantial improvements. Under the real scenario, the regularization improves NDCG, with a small loss in the other metrics. Each upsampling technique, combined with regularization, leads to a good trade-off between utility and disparity.

Table 2 Impact of Regularization on Recommended Lists.

5.2.3 Provider-level walk-through inspection of the treatment (RQ3)

Next, we analyze how our treatment affects the internal mechanisms of the user–item relevance learning step, and how these internal changes influence the recommended lists. To this end, we focus on a walk-through example of the problem and how our treatment addresses it. The goal is to understand where and how our treatment supports minority providers.

Fig. 5 Walk-through Example. Model properties concerning minority providers on coco, considering a baseline recommender and treatments with fake upsampling (\(+0.09\) of minority data) and regularization (with \(\lambda =10^{-6}\)). (a) Number of triplets where the minority group is involved for the observed/unobserved item; (b) average number of triplets where a minority provider is involved for the observed item; (c) average margin between observed and unobserved items in a triplet, for triplets involving observed items of a minority provider; (d–f) average relevance, visibility, and exposure proportion assigned to items of the minority

To characterize our treatment, we consider the baseline recommender optimized on coco data. We are interested in showing how our treatment based on fake upsampling (\(+0.09\) of minority data), followed by a regularization (with \(\lambda =10^{-6}\)), changes the internal and external properties shown by the baseline. Similar observations still apply to the other settings. Figure 5a depicts the number of training triplets wherein an item delivered by a minority provider appears as an observed item (positive) or an unobserved item (negative). Being under-represented in the interactions, items of minority group providers appear less frequently as observed items under the baseline setting (leftmost pair of bars). It follows that the average number of triplets per provider, where a given minority provider is involved for the observed item, is limited, as reported in Fig. 5b (leftmost box plot). These imbalances strongly influence the ability of the pair-wise optimization to compute good margins between the observed and the unobserved item, when the former is delivered by a minority provider (Fig. 5c, leftmost box plot). With our upsampling, we introduce new user–item interactions involving minority providers, with more triplets for the minority group and a higher number of triplets per minority provider, on average (Fig. 5a and 5b, two rightmost bars and box plots). This results in larger positive margins between observed and unobserved items for items of a minority provider (see Fig. 5c, fake setting). Despite relying on the same upsampled data, the regularized version further condenses the margins for observed items of minority providers around the average value (Fig. 5c, fake+reg setting). This treatment fundamentally changes the relevance assigned to the items of each minority provider and, by extension, their visibility and exposure, as highlighted in Fig. 5d–f.

5.3 Comparing against other treatments (RQ4)

We next compare our treatment against representative state-of-the-art alternatives to assess (i) how the considered treatments differently influence recommendations in terms of disparities, utility, and coverage, and (ii) whether the regularized relevance scores obtained through our treatment can benefit state-of-the-art mitigation procedures that operate in post-processing settings. Our goal in this section is to assess how far an in-processing strategy that reduces disparate relevance is, in achieving good trade-offs, from a post-processing strategy that directly controls exposure or visibility in rankings. This experiment provides evidence on the benefit of controlling the relevance distribution via upsampling and regularization. To this end, for each of the considered datasets, we compare the recommendations generated after applying our real+reg treatment, which still uses only real users' interactions, against those generated by the following three state-of-the-art mitigation procedures:

  • far (Liu et al. 2019) is a fairness criterion that combines a personalization-induced term and a fairness-induced term, with a parameter \(\lambda\) controlling the trade-off between the two. The relevance score determined by the base recommender indicates the probability of a user being interested in an item, while the fairness score promotes items that belong to gender groups not yet covered in the ranking (a simplified sketch of this criterion is given after this list). We set the size of the ranked lists \(k=10\), the trade-off parameter \(\lambda =8.0\), and the desired percentage of minority items \(p=C^{\text {a}_{\min }}\).

  • fa*ir (Zehlike et al. 2017) is a fairness criterion that maximizes utility while ensuring that the proportion of minority items in every prefix of the top-k ranking remains statistically above, or indistinguishable from, a given minimum, as long as there are enough minority items to achieve that minimum proportion. We set the size of the considered ranking window \(k=50\), the statistical significance parameter \(\alpha =0.1\), and the target proportion parameters \(p=0.35\) for ml-10m and \(p=0.50\) for coco. These values ensure that at least two and one ranking prefixes, respectively, are used by the re-ranking, two and one being the numbers of minority items that a top-10 ranking should contain for the two datasets according to their \(C^{\text {a}_{\min }}\) values.

  • fair-rec (Patro et al. 2020) is a fairness criterion that, while maximizing utility and consumer fairness, aims to guarantee a uniform exposure distribution across providers. Its first phase ensures user fairness among all the customers and tries to provide a minimum guarantee on the exposure of the providers. Given that the first phase may not allocate exactly k items to every consumer, a second phase ensures this property while simultaneously maintaining provider fairness. We set the size of the considered rankings \(k=10\) and the fraction of the exposure share guaranteed to every provider \(\alpha =0.5\).
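As anticipated in the first bullet above, the core of the far criterion can be illustrated with a simplified greedy re-ranking. The sketch below is an approximation under stated assumptions—one group label per item and a binary coverage bonus—and is not the original implementation by Liu et al. (2019), in which the fairness term is personalized.

def far_like_rerank(candidates, relevance, item_group, k=10, lam=8.0):
    """Simplified far-style re-ranking sketch: relevance plus a coverage bonus.

    candidates: list of candidate item ids for one user.
    relevance: dict mapping item -> predicted relevance score.
    item_group: dict mapping item -> provider group label.
    """
    ranked, covered = [], set()
    pool = set(candidates)
    while pool and len(ranked) < k:
        # Bonus for items whose provider group is not yet covered in the partial ranking,
        # traded off against relevance by the parameter lam.
        best = max(pool, key=lambda i: relevance[i] + lam * (item_group[i] not in covered))
        ranked.append(best)
        covered.add(item_group[best])
        pool.remove(best)
    return ranked

The \(\lambda\) value reported above plays the role of lam here: higher values push uncovered provider groups into the list earlier, at the cost of ranking some less relevant items higher.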

These algorithms have been selected due to their different underlying approaches and their ability to operate on recommended lists. The parameter values reported above were chosen after monitoring the trade-off between utility and disparate exposure. The first of the three algorithms has been re-implemented from scratch; the other two are based on the original code provided by the authors (see footnote 6).

To answer the first question, Table 3 provides ranking utility, disparity, and coverage scores for (i) our real+reg setting and (ii) the three re-ranking algorithms fed with the relevance scores returned by the original recommender. It can be observed that our real+reg treatment achieves a good trade-off between utility and disparity. Specifically, our setup yields a \(19\%\) improvement in NDCG on both datasets when compared with the best NDCG among the alternatives, i.e., far's NDCG. In parallel, our treatment leads to the lowest disparate exposure, with a decrease of \(97\%\) in coco and \(48\%\) in ml-10m against fa*ir (the second best treatment in terms of disparate exposure). It follows that our treatment reduces disparate exposure by moving up minority items that are also of interest to the consumers. In terms of coverage, fair-rec beats all the other treatments. However, in coco, this higher coverage does not involve more items of the minority group: while achieving a lower overall coverage, real+reg covers more items of the minority group than fair-rec. Conversely, in ml-10m, the latter outperforms our treatment in minority-group item coverage, but it leads to an NDCG \(26\%\) lower and a disparate exposure \(78\%\) higher, both differences being statistically significant. This allows us to draw the following observation:

Table 3 Comparison against Other Treatments.
Table 4 Benefits of Our Regularized Relevances to Other Treatments.

On the other hand, Table 4 allows us to answer the second question, i.e., whether the regularized relevance scores obtained through our treatment lead to benefits for state-of-the-art mitigation procedures that operate in post-processing settings. This table reports the utility, disparity, and coverage scores of the three state-of-the-art re-ranking algorithms fed with the relevance scores returned by our real+reg treatment, together with the relative improvement with respect to feeding them the non-regularized relevance scores. It can be observed that applying our treatment before re-ranking reduces disparate exposure by between \(37\%\) and \(86\%\). This improvement comes at the price of a limited loss in utility in coco (\(-5\%\) and \(-13\%\)), while utility improves thanks to our regularized relevance scores in ml-10m (\(+6\%\) and \(+10\%\)). It follows that our treatment acts as a driver for improving the trade-off between effectiveness and disparities, highlighting the role of relevance scores in this context. In addition, all the settings show a higher overall coverage and a higher coverage of minority items with respect to the non-regularized counterpart. Hence, we make the following observation:

Interestingly, comparing the results achieved with our real+reg treatment against those obtained with the other treatments under a regularized relevance setting, it can be observed that feeding the considered post-processing approaches with regularized relevances does not lead to substantially better utility–disparity trade-offs than real+reg alone.
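For reference, the disparate exposure compared throughout Tables 3 and 4 contrasts the exposure share obtained by the minority group with its contribution in the catalog. The sketch below is only indicative: it assumes a standard logarithmic position discount and a signed gap, whereas the exact metric definition appears earlier in the paper.

import numpy as np

def disparate_exposure(rankings, item_is_minority, contribution_share, k=10):
    """Indicative disparate-exposure score: minority exposure share minus catalog share.

    rankings: list of ranked item-id lists (one per user).
    item_is_minority: dict mapping item -> True if delivered by a minority provider.
    contribution_share: minority group's share of the catalog (the equity target).
    """
    minority_exp, total_exp = 0.0, 0.0
    for ranking in rankings:
        for rank, item in enumerate(ranking[:k], start=1):
            exposure = 1.0 / np.log2(rank + 1)  # assumed position discount
            total_exp += exposure
            if item_is_minority[item]:
                minority_exp += exposure
    return minority_exp / total_exp - contribution_share

A value close to zero indicates that the minority group receives exposure proportional to its contribution; negative values indicate under-exposure, which is the disparity the treatments above try to reduce.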

Despite being related to disparate exposure, the fairness objectives originally pursued by the considered countermeasures slightly differ from the one targeted in this paper. Therefore, we also monitored the influence of our regularized relevances on those original fairness objectives. For far, Liu et al. (Liu et al. 2019) monitored the extent to which each re-ranked list covers as many provider groups as possible. Under a baseline setting, the percentage of rankings covering both provider groups is \(55\%\). This percentage increases to \(98.4\%\) with far and to \(99\%\) with far fed with the relevance scores returned by real+reg. Conversely, Zehlike et al. (Zehlike et al. 2017) used a ranked group fairness criterion that declares a ranking unfair if the observed proportion of items from the minority group falls far below the target one. Specifically, this criterion can be abstracted as comparing the number of protected items in every prefix of the ranking with the expected number of protected items if they were picked at random using Bernoulli trials. Under a baseline setting, the percentage of rankings that satisfy this criterion is \(34.6\%\). This percentage increases to \(78.7\%\) with fa*ir and to \(80.2\%\) with fa*ir fed with the relevance scores returned by real+reg. Finally, given that the total exposure of the platform remains limited to \(k \cdot |U|\), Patro et al. (Patro et al. 2020) aimed to guarantee that the items of each provider are recommended at least \((k \cdot |U|) / |P|\) times (i.e., this goal refers to the maximin share of exposure guaranteed to providers). Under a baseline setting, the percentage of providers that satisfy this criterion is \(23.3\%\). This percentage increases to \(76.4\%\) with fair-rec and to \(77.2\%\) with fair-rec fed with the relevance scores returned by real+reg. It follows that our approach not only leads to lower disparity, but also preserves the original objective of each post-processing algorithm and its utility.
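The ranked group fairness criterion mentioned above for fa*ir can be checked prefix by prefix with a binomial test. The sketch below omits the multiple-testing adjustment of the original fa*ir algorithm and is meant only to show how a satisfaction percentage of this kind can be computed; the function and variable names are illustrative.

from scipy.stats import binom

def satisfies_ranked_group_fairness(ranking, item_is_minority, p, alpha=0.1):
    """Check that every ranking prefix holds enough minority items under a binomial test.

    ranking: ordered list of item ids.
    item_is_minority: dict mapping item -> True if delivered by a minority provider.
    p: target proportion of minority items; alpha: significance level.
    """
    minority_seen = 0
    for prefix_len, item in enumerate(ranking, start=1):
        minority_seen += int(item_is_minority[item])
        # Minimum number of minority items a fair ranking should show in this prefix,
        # if items were drawn with probability p (Bernoulli trials).
        required = int(binom.ppf(alpha, prefix_len, p))
        if minority_seen < required:
            return False
    return True

The percentages reported above (e.g., \(34.6\%\) under the baseline) correspond to the fraction of users whose ranking passes a check of this kind.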

5.4 Discussion

Our experiments demonstrate that controlling the share of relevance assigned to provider groups, in proportion to their contribution in the catalog, is feasible, and that the corresponding disparity metric can be directly optimized.

Beyond our empirical work, we believe that our mapping approach, which associates providers' sensitive attributes to items, sheds light on new perspectives of fairness in recommender systems. Many platforms include a range of items whose mapping with the sensitive attributes of the providers is not as direct as in the case of items representing individuals. Existing approaches should move in this direction, and future fairness-aware recommendation approaches will need to embed this mapping to realistically capture real-world conditions. Indeed, this aspect will also drive the creation of new evaluation metrics and protocols that allow investigating algorithmic facets that are so far under-explored.

Our study uncovered key connections among core components of the optimization of recommendation models when dealing with provider fairness. These results should further promote the inspection of internal mechanisms of traditional optimization strategies (e.g., pair-wise and point-wise), enabling a pro-active reaction to unfairness. Despite being relatively simple, our combination of upsampling and regularization provides fairness to target groups of providers, which neither component achieves on its own. Beyond being applied alone, our treatment can be envisioned as a pre-processing step for procedures that seek fine-grained control of fairness by acting directly on recommended lists. In this case, our adjusted relevance scores can be used in post-processing fairness-aware procedures, possibly leading to a new optimization space between fairness and recommendation utility. Our treatment is flexible enough to incorporate other strategies for controlling the share of relevance obtained through a recommendation algorithm, opening up interesting future work.

Since our study relied on a range of assumptions, we identified the main limitations of the approach presented in this paper, as listed below.

  • The validity of the fairness notion we used depends on the integrity of the platform catalog, requiring the catalog curation to be audited for sampling bias and direct discrimination (e.g., an educational platform that refuses to add courses provided by female instructors to its database).

  • Our empirical work dealt with scenarios including a very small minority, accounting for only \(5\%\)–\(17\%\), depending on the dataset. Many domains (or attributes) do not exhibit this kind of minority; covering them may require novel extensions and variants, starting from those suggested in this paper.

  • Experiments were based on a binary gender construct, since the datasets provide only two genders, “male” and “female”. Although we had no possibility of considering non-binary constructs, our formulation can still be applied to attributes with more than two values. We refer readers to (Hamidi et al. 2018) for considerations on the possible consequences of gender inference.

  • Grouping individuals in the COCO dataset relied on gender inference. However, this inference does not consider important elements, such as the intersectionality of gender with other sensitive attributes (e.g., geographic origin), the possibility of inferring non-binary gender labels, and how individuals self-identify, since such inference methods capitalize on large historical databases. Being aware of this limitation, we did not provide any gender-specific observation, and we used this dataset to assess the validity of our approach when a minority is present, regardless of which gender constitutes the minority.

  • To better characterize our contributions, we focused on a matrix factorization approach optimized via pair-wise comparisons. Other variants could be tested with our framework as well, since our treatment does not rely on any peculiarity of the pair-wise optimization (we adopted it because it aligns better with top-k recommendation problems).

  • Our approach does not provide any mathematical guarantee on other notions of fairness in the recommended lists; however, we showed that it leads to a more balanced share of relevance and reduces the disparate impact on visibility and exposure w.r.t. the contribution of the minority in the catalog. Further, our approach can be used as a pre-processing step for relevance scores, before using them with other treatments, e.g., (Biega et al. 2018).

Despite these limitations, we believe that the intervention on relevances we performed contributes to a better understanding of recommender systems.

6 Related work

Our research is inspired by work in two areas that impact recommender system research: (i) notions recently formalized in the context of fairness-aware rankings and (ii) unfairness mitigation procedures for recommended lists.

6.1 Provider fairness notions in ranking and recommendation

Group fairness traditionally requires that exposure be equally distributed across groups characterized by sensitive attributes (e.g., gender, race). Biega et al. (Biega et al. 2018), Singh and Joachims (Singh and Joachims 2018), and Yadav et al. (Yadav et al. 2019) consider a notion of fairness based on equity. While also working on provider groups, our work situates fairness in the context of recommender systems, allowing us to (i) account for situations where multiple providers lie behind an item and the same provider can appear more than once in a list, (ii) relate these notions to the objectives and formalism of recommendation metrics, and (iii) introduce a new experimentation on disparities among provider groups, starting from disparate relevance. Indeed, we control unfairness at an earlier stage, using the catalog contribution, rather than the system-predicted relevance, as the reference for proportionality. Further, provider unfairness is traditionally mitigated by assuming access to true, unbiased relevances. In practice, these relevances are estimated via machine learning, leading to biased estimates of the relevance scores. Recommender systems are known to be biased from several perspectives (e.g., popularity, presentation, unfairness towards users and providers). With this in mind, we control how relevance scores are distributed to groups.

Comparing an outcome distribution (e.g., ranked lists) with a population distribution was explored by Yang and Stoyanovich (Yang and Stoyanovich 2017) and Sapiezynski et al. (Sapiezynski et al. 2019). Differently from us, Sapiezynski et al. model the uncertainty of the group membership of a given individual, without dealing with contexts where more than one provider lies behind an item. Further, their outcome distribution is linked to a population distribution, assuming that the items the vendor chooses to show in the top-k are a proportional representation of a subset of the catalog, subsampled via machine learning. Compared with ours, this assumption may underestimate the real representation in the catalog. Moreover, Yang and Stoyanovich compute the difference in the proportion of members of the protected group at top-k and in the overall population. Compared to them, we internally control relevance according to contribution. Their formulations complement our ideas, as they drive fairness optimization at different levels.

Other fairness definitions in practice lead to enhanced fairness in exposure, for instance by requiring equal proportions of individuals from different groups in ranking prefixes (Celis et al. 2018; Zehlike et al. 2017; Zehlike and Castillo 2020). Mehrotra et al. (Mehrotra et al. 2018) achieved fairness through a re-ranking function that balances accuracy and fairness by adding a personalized bonus to items of uncovered providers. Similarly, Burke et al. (Burke et al. 2018) defined the concept of local fairness and identified protected groups based on local conditions. In contrast, we study metrics that link contribution, interactions, and relevance. The setup we study in this paper is very different, assuming that providers obtain relevance and, possibly, visibility and exposure according to their contribution in the catalog.

Furthermore, Patro et al. (Patro et al. 2020) account for uniform exposure over providers, while we deal with a relevance proportional to the provider group's contribution. Moreover, their definition assumes that items are not shareable, i.e., no item is allocated to multiple providers. Kamishima et al. (Kamishima et al. 2018) model fairness as independence between the predicted rating values and the sensitive values of the providers, without taking any measure against relevances (i.e., predicted ratings) that are biased with respect to the contribution in the catalog. Beutel et al. (Beutel et al. 2019) shape provider fairness in the context of pair-wise optimization, claiming fairness if the likelihood of an observed item being ranked above another relevant unclicked item is the same across both groups. Similarly, Narasimhan et al. (Narasimhan et al. 2019) propose a notion of pair-wise equal opportunity, requiring pairs to be equally likely to be ranked correctly regardless of the group membership of the items in a pair. Compared with prior work, our approach aims to bind relevance to catalog contribution, in order to reduce disparate visibility and exposure.

6.2 Treatments for provider fairness

There are relationships between our treatment and existing approaches, even though it is straightforward to see that treatments fundamentally vary due to the different fairness notions they are driven by.

Pre-processing for fairness in recommender systems has been considered in the context of consumer fairness. Rastegarpanah et al. (Rastegarpanah et al. 2019) proposed to add new fake users who provide ratings on existing items, to minimize the losses of all user groups, computed as the mean squared estimation error over all known ratings in each group. Working instead on the provider side, our upsampling extends the interactions of real users and items and aims to adjust the interactions involving minority providers.

In-processing regularization in recommender systems has traditionally focused on point-wise scenarios. Kamishima et al. (Kamishima et al. 2018) introduce a regularization requiring that the distance between the distributions of predicted ratings for items belonging to two different groups is as small as possible. However, compared with the pair-wise optimization we leveraged, this way of optimizing says little about the recommended lists that users actually see, and it does not take into account the degree to which the ratings of different groups are proportional to the groups' contribution in the catalog.

Beutel et al. (Beutel et al. 2019) targeted provider fairness optimization under a pair-wise optimization scenario, similarly to us. However, while pair-wise comparisons are at the basis of their fairness definition, our treatment is merely tested under a pair-wise optimization scenario and does not leverage any peculiarity of this scenario. Further, while both approaches have been tested on binary attributes, ours generalizes to a wider variety of groups and to contexts where items are associated with more than one provider. Their training methodology is also very different: the fixed regularization term they add to the loss function is based on a correlation between the residual estimate and the group membership. These conceptual and operative differences lead us to investigate clearly different, under-explored facets. Furthermore, compared to our work, they are driven by a different fairness objective. It would be interesting to see how the two could be integrated, taking the benefits of both notions, but this requires non-trivial extensions left as future work.

Finally, other fairness-aware approaches, whose notions of fairness were presented in the previous section, are operationalized in quite different ways. Biega et al. (Biega et al. 2018) solve an integer linear program. Patro et al. (Patro et al. 2020) implement a greedy-round-robin strategy. Similarly to us, Zehlike and Castillo (Zehlike and Castillo 2020) use stochastic gradient descent, but they operationalize it in a list-wise manner.

7 Conclusions

In this paper, we assessed the extent to which a recommender system emphasizes disparities from three different perspectives. The first one monitors the difference in predicted relevance between provider groups, according to their representation in the catalog. The other two operate on the final outcomes of the recommender system, by monitoring differences in visibility and exposure across groups, with the same proportional setting. To reduce the emerged disparities, we proposed a treatment that combines an upsampling of interactions from the minority group and a regularization on the share of relevance across provider groups, throughout the optimization process.

Our experimental study analyzes the relevance scores and recommended lists generated on fifteen synthetic datasets, which simulate specific situations of imbalance in the catalog and in the interactions, and on two real-world datasets that represent existing conditions in modern platforms. Our first exploratory results highlight that the discrepancy between the relevance given to provider groups by recommendation models and their contribution in the catalog is not negligible. This effect results in lower-than-expected visibility and exposure for the minority group. With our treatment, it has been possible to reduce disparities in relevance, visibility, and exposure without sacrificing recommendation utility. Incorporating our treatment allows us to act indirectly on the output of the recommendation model and is a viable strategy to mitigate distortions at an earlier step. Our treatment has also proved useful for post-processing fairness procedures, helping them achieve lower disparities.

Future work will embrace the insights provided in this paper to further explore the connection between relevance, visibility, and exposure. Moreover, we plan to design mitigation methods that treat provider fairness promotion as a temporal process. The improvement in provider fairness might not be large immediately, and we believe that repeating our treatment over time will lead to increasingly fair recommendations, better fitting real-world situations and platforms. We will also investigate the relationship between the recommendations returned by the algorithm and the tendency of each user to prefer items from different groups of providers. Finally, it is our goal to devise other treatments that link internal model parameters to ranking metrics.