Fairness in machine learning has attracted the attention of researchers in the AI, Software Engineering, and Law communities with a steady increase in related work over the past few years. Many efforts have been made to try to ‘debias’ the data and create a ‘fair’ model. But what is fairness? And how can this translate to machine learning?
What is fairness?
To begin, let’s start with what it isn’t. Fairness is not a technical or statistical concept, and there can never be a tool or software that can fully ‘de-bias’ data or make a model ‘fair’. Fairness is an ethical concept and therefore a contextual one. At best, we can select some ideal of what it means to be ‘fair’ in a specific context, and then make progress towards satisfying it in our particular setting.
We can start by examining data science and research through the lens of power. This means asking: fair to whom, for whom, by whom, and whose interests and goals are at the centre of the data, the model, and the system? The answers to these questions are contextual in both society and mathematics, and are core to a basic and ethical design process.
There are different schools of thought about fairness and ethical decision making. Currently there are over 80 different ethical guidelines for AI. In mathematics, at last count, there are 21 mathematical definitions of fairness. Just as there is no single definition of fairness or ethics, there is no clear agreement on which definition to apply in a given situation. These definitions can even conflict with one another, as captured by what is called the impossibility theorem [1], which states that, outside of special cases such as equal base rates across groups or a perfect predictor, no more than one of the three fairness metrics (risk assignments) of demographic parity, predictive parity, and equalized odds can hold at the same time for a well-calibrated classifier [2]. Therefore, trade-offs need to be made.
Figure: Number of publications related to fairness in ML. [3]
But why are there so many definitions? Why can the same case be considered fair according to some definitions and unfair according to others?
To explore, let’s look at three different fairness metrics, two for groups and one for individuals, see where they differ conceptually, where they work, and what gaps they create.
We will use the example of job applications between groups of male and female candidates.
Group Fairness Metrics
One way to look at fairness is using group fairness metrics (or group statistical property) - where ‘groups should receive similar treatments or outcomes’, meaning groups like males or females should receive similar job acceptance rates.
Two popular definitions are Demographic Parity and Equalized Odds. [4]
Demographic parity means that the acceptance rates of the applicants from the two groups must be equal (for example, 50% of male and 50% of female applicants get the job).
There is guideline precedent behind this rationale, called the four-fifths rule [7]: the selection rate for female applicants should be no lower than four-fifths of the selection rate for male applicants.
Immediate and long-term benefits: the intervention rebalances the numbers of female and male hires (in this case), which may have long-term benefits as well. [8]
Figure: Demographic parity (with laziness). [9]
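To make the acceptance-rate comparison and the four-fifths check concrete, here is a minimal sketch. The function names and the applicant counts are our own hypothetical choices, not taken from any cited source or library, and the check simply compares the lower selection rate against the higher one.

```python
from fractions import Fraction


def selection_rate(hired: int, applicants: int) -> Fraction:
    """Share of a group's applicants who receive an offer."""
    if applicants <= 0:
        raise ValueError("applicants must be positive")
    return Fraction(hired, applicants)


def impact_ratio(rate_a: Fraction, rate_b: Fraction) -> Fraction:
    """Lower selection rate divided by the higher one (1 means exact parity)."""
    low, high = sorted((rate_a, rate_b))
    if high == 0:
        return Fraction(1)  # nobody was hired in either group
    return low / high


def passes_four_fifths(rate_a: Fraction, rate_b: Fraction) -> bool:
    """True when the lower rate is at least four-fifths of the higher rate."""
    return impact_ratio(rate_a, rate_b) >= Fraction(4, 5)


# Hypothetical numbers, chosen only for illustration.
male_rate = selection_rate(hired=50, applicants=100)
female_rate = selection_rate(hired=30, applicants=80)

print(float(male_rate), float(female_rate))         # 0.5 0.375
print(float(impact_ratio(male_rate, female_rate)))  # 0.75
print(passes_four_fifths(male_rate, female_rate))   # False
```

Exact demographic parity corresponds to an impact ratio of 1; the four-fifths rule is a looser threshold on the same quantity.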
What can go wrong:
This notion permits a classifier to select qualified applicants in the demographic A=0 but unqualified individuals in A=1, so long as the acceptance percentages match. This scenario can arise naturally when less training data is available about a minority group. As a result, the hiring company might have a much better understanding of whom to hire in the majority group, while essentially guessing at random within the minority group. [10]
Where to use it:
Use it when we are aware of historic bias that may have affected our data and we need a plan of action to support the historically marginalized group.
For example, Oxford University aims to improve diversity by admitting more students from disadvantaged backgrounds. Their plan also includes extra support for students from disadvantaged backgrounds before beginning their degree courses.
Equalized odds requires that the probability of a qualified applicant being hired and the probability of an unqualified applicant not being hired be the same for males and females.
In contrast to the demographic parity example, if a large number of unqualified male applicants apply for the job, the hiring of qualified female applicants, or of qualified applicants in other protected groups, is not affected. Unlike Demographic Parity, which can be satisfied by hiring 50-50 at random, this definition selects among suitable people from both groups. It therefore penalizes laziness, because hiring unqualified applicants is penalized.
Figure: Equalized odds. [13]
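The equalized odds condition can be checked by comparing two rates per group: the true positive rate (qualified applicants who are hired) and the false positive rate (unqualified applicants who are hired). Equal false positive rates are equivalent to equal probabilities of an unqualified applicant not being hired. The sketch below uses hypothetical toy labels and helper names of our own; it is an illustration, not a reference implementation.

```python
def true_and_false_positive_rates(y_true, y_pred):
    """Return (TPR, FPR) for one group; 1 = qualified / hired, 0 = otherwise."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("each group needs qualified and unqualified applicants")
    return tp / (tp + fn), fp / (fp + tn)


def equalized_odds_gaps(group_a, group_b):
    """Absolute TPR and FPR differences; both are 0 under equalized odds."""
    tpr_a, fpr_a = true_and_false_positive_rates(*group_a)
    tpr_b, fpr_b = true_and_false_positive_rates(*group_b)
    return abs(tpr_a - tpr_b), abs(fpr_a - fpr_b)


# Hypothetical toy data: (actual qualification, hiring decision) per group.
males = ([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0])
females = ([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0])

tpr_gap, fpr_gap = equalized_odds_gaps(males, females)
print(tpr_gap, fpr_gap)  # 0.25 0.25
```

In this toy data neither gap is zero, so the hiring decisions do not satisfy equalized odds.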
What can go wrong:
An equalized odds classifier must classify everyone as poorly as the hardest group, which is why it costs over twice as much in this case. It also leads to a more conservative allocation of job positions, so it is slightly harder for suitable people of all groups to get the job. [14]
It still might not help close the gap between the two groups in the long term.
Assume group A has 100 applicants, 58 of whom are qualified, while group B also has 100 applicants but only 2 of them are qualified. If the company decides to accept 30 applicants and satisfy equality of opportunity, 29 offers will go to group A while only 1 offer will go to group B.
If the job is well paid, members of group A will tend to have better living conditions and be able to afford better education for their children, enabling them to qualify for such well-paid jobs when they grow up. The gap between group A and group B will tend to widen over time.
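The 29-to-1 split in the example above follows from requiring the same true positive rate in both groups. A small sketch, assuming every offer goes to a qualified applicant (the function name is hypothetical), searches for the splits that satisfy this:

```python
from fractions import Fraction


def equal_opportunity_splits(qualified_a: int, qualified_b: int, offers: int):
    """List (offers_a, offers_b) pairs whose true positive rates are equal.

    Assumes every offer goes to a qualified applicant, so a group's true
    positive rate is simply offers / qualified for that group.
    """
    splits = []
    for offers_a in range(min(offers, qualified_a) + 1):
        offers_b = offers - offers_a
        if offers_b > qualified_b:
            continue  # cannot give more offers than there are qualified people
        if Fraction(offers_a, qualified_a) == Fraction(offers_b, qualified_b):
            splits.append((offers_a, offers_b))
    return splits


# Numbers from the example: 58 qualified in group A, 2 in group B, 30 offers.
print(equal_opportunity_splits(58, 2, 30))  # [(29, 1)]
```

Both groups end up with a true positive rate of one half (29 of 58 and 1 of 2), even though the offers themselves are split very unevenly.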
Where to use it:
There is a strong emphasis on predicting the positive outcome correctly (e.g. correctly identifying who should get a loan, as it drives profits for the bank), and
We strongly care about minimising costly False Positives (e.g. reducing the number of loans granted to people who would not be able to pay them back).
Individual Fairness Metrics
Another way to define fairness metrics is from the point of view of individuals. These are called individual fairness metrics, where ‘similar people should be treated similarly’. We will use the Generalized entropy index, a measure of inequality used in economics to measure the distribution of income and economic inequality.
Generalized entropy index [15]
Measures how evenly members of gender groups are distributed within the applications. A value of zero represents perfect equality and higher values denote increasing levels of inequality.
Individual fairness is more fine-grained than any group notion of fairness: it imposes restrictions on the treatment of each pair of individuals.
It has been proposed as a measure of income inequality in a population.
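As a rough sketch of how the index can be computed, the code below uses the standard generalized entropy formula and, following the formulation in Speicher et al. (listed in the references), assigns each individual a ‘benefit’ of prediction minus true label plus one. The function names and the toy labels are our own assumptions, and only values of alpha other than 0 and 1 are handled.

```python
def generalized_entropy_index(benefits, alpha=2.0):
    """Generalized entropy index of a list of non-negative benefits.

    Returns 0 when everyone receives the same benefit; larger values
    mean more inequality. Only alpha values other than 0 and 1 are handled.
    """
    if alpha in (0, 1):
        raise ValueError("this sketch does not cover alpha = 0 or alpha = 1")
    n = len(benefits)
    if n == 0:
        raise ValueError("benefits must not be empty")
    mean = sum(benefits) / n
    if mean <= 0:
        raise ValueError("mean benefit must be positive")
    total = sum((b / mean) ** alpha - 1 for b in benefits)
    return total / (n * alpha * (alpha - 1))


def prediction_benefits(y_true, y_pred):
    """Benefit per individual: 2 = false positive, 1 = correct, 0 = false negative."""
    return [predicted - actual + 1 for actual, predicted in zip(y_true, y_pred)]


# Hypothetical toy labels: 1 = qualified / hired, 0 = unqualified / not hired.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]

print(generalized_entropy_index(prediction_benefits(y_true, y_pred)))  # 0.25
print(generalized_entropy_index([1, 1, 1, 1]))                         # 0.0
```

A perfectly accurate classifier gives everyone a benefit of 1 and therefore an index of zero under this benefit definition.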
What can go wrong:
Individual fairness relies entirely on how you define "similarity" between applicants, and you run the risk of introducing new fairness problems if your similarity metric misses important information. It is hard to determine an appropriate metric function to measure the similarity of two inputs.
Imagine three job applicants, A, B and C. A has a bachelor's degree and 1 year of related work experience. B has a master's degree and 1 year of related work experience. C has a master's degree but no related work experience. Is A closer to B than C? If so, by how much? It becomes even worse when sensitive attributes (like gender or race) come into play. Should we account for differences in group membership in our metric function, and if so, how?
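The three-applicant example can be sketched in code to show how much the answer depends on the chosen metric. Everything below is a hypothetical illustration: the numeric encoding of degrees, the weighted distance, and the weights are our own assumptions rather than an established similarity measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Applicant:
    name: str
    degree_level: int      # hypothetical encoding: 1 = bachelor's, 2 = master's
    years_experience: int


def distance(x: Applicant, y: Applicant, w_degree: float, w_experience: float) -> float:
    """Weighted absolute difference; the weights encode what we treat as important."""
    return (w_degree * abs(x.degree_level - y.degree_level)
            + w_experience * abs(x.years_experience - y.years_experience))


A = Applicant("A", degree_level=1, years_experience=1)
B = Applicant("B", degree_level=2, years_experience=1)
C = Applicant("C", degree_level=2, years_experience=0)


def closest_to_b(w_degree: float, w_experience: float) -> str:
    """Name of the applicant (A or C) judged most similar to B."""
    d_ab = distance(A, B, w_degree, w_experience)
    d_cb = distance(C, B, w_degree, w_experience)
    if d_ab == d_cb:
        return "tie"
    return "A" if d_ab < d_cb else "C"


print(closest_to_b(w_degree=1, w_experience=2))  # A
print(closest_to_b(w_degree=2, w_experience=1))  # C
```

Swapping the two weights flips which applicant counts as ‘similar’ to B, so any individual fairness guarantee built on this metric changes with it.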
Group Fairness vs Individual Fairness
Fairness metrics usually emphasize either individual or group fairness, but fail to combine both.
Many approaches to group fairness tackle between-group issues (for example, between groups of different gender or race), but this can in fact increase within-group unfairness (between members of the same group). Reducing between-group unfairness can therefore exacerbate individual unfairness, so that overall unfairness in fact goes up. [16]
The blue bar represents a simple model and the orange bar represents a model corrected for fairness. We see that when the group fairness metrics (statistical parity, equal opportunity) improve, the individual fairness metrics worsen. [17]
Demographic Parity vs Equalized Odds
Even among the group fairness metrics, we generally can’t satisfy both Demographic Parity and Equalized Odds.
The example below illustrates this impossibility. In our example, a company has 30 open positions and below there is a table separating the candidates into groups (ex. Gender: males/females), and separating the groups by who fits the job description.
Using demographic parity, we look at the two groups and select half of the hires from one group and half from the other. In the end, the company will hire 15 applicants from each group (so inevitably hiring some unqualified applicants).
Using equalized odds, we first look at the people who fit the job description. That way, the company will hire 29 applicants from group A and one applicant from group B. It has been shown mathematically that, in a case like this, either Demographic Parity or Equalized Odds can hold, but not both. [18]
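To see the conflict in numbers, the sketch below evaluates both hiring policies on both metrics. It assumes the pool from the earlier example (100 applicants per group, 58 qualified in group A and 2 in group B) and assumes that qualified applicants are hired first within each group; the helper names are hypothetical.

```python
from fractions import Fraction

# Assumed applicant pool, reusing the figures from the earlier example.
POOL = {
    "A": {"applicants": 100, "qualified": 58},
    "B": {"applicants": 100, "qualified": 2},
}


def group_metrics(applicants: int, qualified: int, hired: int) -> dict:
    """Selection rate, TPR and FPR, assuming qualified applicants are hired first."""
    true_positives = min(hired, qualified)
    false_positives = hired - true_positives
    return {
        "selection_rate": Fraction(hired, applicants),
        "tpr": Fraction(true_positives, qualified),
        "fpr": Fraction(false_positives, applicants - qualified),
    }


def evaluate(hires: dict) -> dict:
    """Metrics per group for a policy given as {group: number hired}."""
    return {g: group_metrics(hired=hires[g], **POOL[g]) for g in POOL}


demographic_parity = evaluate({"A": 15, "B": 15})
equalized_odds = evaluate({"A": 29, "B": 1})

for name, result in [("demographic parity", demographic_parity),
                     ("equalized odds", equalized_odds)]:
    for group, metrics in result.items():
        print(name, group, {k: str(v) for k, v in metrics.items()})
```

Under the 15/15 policy the selection rates match (3/20 each) but the true positive rates are 15/58 and 1, and group B has a false positive rate of 13/98. Under the 29/1 policy both true positive rates are 1/2 and both false positive rates are 0, but the selection rates are 29/100 and 1/100.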
Fairness versus Accuracy Trade-off
Figure: Trade-off between fairness and accuracy. [19]
Choosing what is fair comes with a cost. Creating generalized notions of fairness quantification is challenging. Quantitative definitions allow fairness to become an additional performance metric in the evaluation of an ML algorithm. However, increasing fairness often results in lower overall accuracy or related metrics, leading to the necessity of analyzing potentially achievable trade-offs in a given scenario. But sometimes greater accuracy in a model can lead to greater unfairness.
Most technical students are trained to optimize for accuracy, getting the most correct predictions. But why not also think about how a model affects people, and try to minimize unfairness and harm?
So to whom do we want to be fair? And what happens with intersectionality? An algorithm that is fair in terms of gender and race taken separately could still be unfair at their intersection (for example, towards women of color).
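A small sketch with made-up counts shows how this can happen: selection rates are equal when we look at gender alone or race alone, yet differ sharply across the intersectional subgroups. The counts, labels, and function name are hypothetical and chosen only to illustrate the effect.

```python
# Hypothetical (applicants, hired) counts per intersectional subgroup.
COUNTS = {
    ("female", "of_color"): (10, 2),
    ("female", "white"): (10, 8),
    ("male", "of_color"): (10, 8),
    ("male", "white"): (10, 2),
}


def selection_rates(counts, attribute_index=None):
    """Selection rate per subgroup, or per value of one attribute.

    attribute_index=None keeps the full (gender, race) subgroups;
    0 aggregates by gender and 1 aggregates by race.
    """
    totals = {}
    for subgroup, (applicants, hired) in counts.items():
        key = subgroup if attribute_index is None else subgroup[attribute_index]
        seen_applicants, seen_hired = totals.get(key, (0, 0))
        totals[key] = (seen_applicants + applicants, seen_hired + hired)
    return {key: hired / applicants for key, (applicants, hired) in totals.items()}


print(selection_rates(COUNTS, attribute_index=0))  # by gender: both 0.5
print(selection_rates(COUNTS, attribute_index=1))  # by race: both 0.5
print(selection_rates(COUNTS))                     # subgroups: 0.2 or 0.8
```

A check that only compares the gender rates and the race rates would pass here, while women of color are hired at a quarter of the rate of white women.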
Fairness is complex
In conclusion, we see that the discussion about fairness is not an easy one.
First, fairness is highly contextual. There is no one-size-fits-all approach; it depends on the stakeholders and the application.
Second, we see that there are no set answers. Often, cost-benefit decisions have to be made.
Finally, fairness is not a “measurement” but a continuous process. The term “measurement” is precarious because it implies a straightforward process. Fairness should instead be seen as an investigative process that requires detection, explanation and mitigation. There is no single fairness checkpoint; harmful properties can enter a system through biased data and/or through data science practices and decisions. This triggers the need for strong internal governance, checklists and monitoring.
Prevue HR. ‘Adverse Impact Analysis / Four-Fifths Rule’, 6 May 2009. https://www.prevuehr.com/resources/insights/adverse-impact-analysis-four-fifths-rule/.
‘Approaching Fairness in Machine Learning’. Accessed 7 May 2021. http://blog.mrtz.org/2016/09/06/approaching-fairness.html.
D’Ignazio, Catherine, and Lauren F. Klein. Data Feminism. Strong Ideas Series. Cambridge, Massachusetts: The MIT Press, 2020.
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. ‘Datasheets for Datasets’. ArXiv:1803.09010 [Cs], 19 March 2020. http://arxiv.org/abs/1803.09010.
Hardt, Moritz, Eric Price, and Nathan Srebro. ‘Equality of Opportunity in Supervised Learning’. ArXiv:1610.02413 [Cs], 7 October 2016. http://arxiv.org/abs/1610.02413.
Hu, Lily, and Yiling Chen. ‘A Short-Term Intervention for Long-Term Fairness in the Labor Market’. In Proceedings of the 2018 World Wide Web Conference on World Wide Web - WWW ’18, 1389–98. Lyon, France: ACM Press, 2018. https://doi.org/10.1145/3178876.3186044.
Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan. ‘Inherent Trade-Offs in the Fair Determination of Risk Scores’. ArXiv:1609.05807 [Cs, Stat], 17 November 2016. http://arxiv.org/abs/1609.05807.
Shorrocks, A. F. ‘The Class of Additively Decomposable Inequality Measures’. Econometrica 48, no. 3 (April 1980): 613. https://doi.org/10.2307/1913126.
Speicher, Till, Hoda Heidari, Nina Grgic-Hlaca, Krishna P. Gummadi, Adish Singla, Adrian Weller, and Muhammad Bilal Zafar. ‘A Unified Approach to Quantifying Algorithmic Unfairness: Measuring Individual & Group Unfairness via Inequality Indices’. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2239–48. London United Kingdom: ACM, 2018. https://doi.org/10.1145/3219819.3220046.
Zafar, Muhammad Bilal, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. ‘Fairness Constraints: Mechanisms for Fair Classification’. ArXiv:1507.05259 [Cs, Stat], 23 March 2017. http://arxiv.org/abs/1507.05259.