FIPSER: Improving Fairness Testing of DNN by Seed Prioritization
As a rapidly evolving AI technology, deep neural networks are becoming increasingly integrated into human society, raising concerns about fairness. Previous studies have proposed a metric called causal fairness to measure the fairness of machine learning models, along with search algorithms to mine individual discrimination instance pairs (IDIPs). Fairness issues can be alleviated by retraining models with corrected IDIPs. However, the number of samples used as seeds by these methods is often limited in the pursuit of efficiency. Moreover, the quantity of IDIPs generated from different seeds varies, so it makes sense to select appropriate samples as seeds, a consideration that has received little attention in past studies. In this paper, we study the imbalance in IDIP quantities across datasets and sensitive attributes, highlighting the need to select and rank seed samples. We then propose FIPSER, a seed prioritization method based on feature importance and perturbation potential. Our experimental results show that, on average, when applied to the current state-of-the-art IDIP mining method, FIPSER improves its effectiveness by 45% and its efficiency by 11%.
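The core object here, the individual discrimination instance pair, can be sketched in a few lines: a pair of inputs that differ only in a sensitive attribute yet receive different predictions. Everything below is illustrative; the function names, the attribute index, and the toy scorer are hypothetical, not taken from the paper.

```python
# Minimal sketch of an IDIP check (illustrative; names are hypothetical).
# An IDIP is a pair of inputs that differ only in a sensitive attribute
# yet receive different predictions from the model.

def is_idip(predict, x, sensitive_idx, sensitive_values):
    """Return (x, x') if flipping only the sensitive attribute of x
    changes the prediction, else None."""
    original = predict(x)
    for v in sensitive_values:
        if v == x[sensitive_idx]:
            continue
        x_prime = list(x)
        x_prime[sensitive_idx] = v
        if predict(x_prime) != original:
            return (x, x_prime)
    return None

# Toy classifier biased on feature 1 (the "sensitive" attribute):
# approve iff income + 2 * gender > 3.
toy = lambda x: int(x[0] + 2 * x[1] > 3)

pair = is_idip(toy, [2, 1], sensitive_idx=1, sensitive_values=[0, 1])
```

Search-based fairness testing looks for seeds `x` on which a check like this succeeds after perturbation; a seed prioritization method in the spirit of FIPSER would rank candidate seeds so the search starts from samples more likely to yield such pairs.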
- Research Article
- 10.1377/hlthaff.27.2.581
- Mar 1, 2008
- Health Affairs
This slim book is a trenchant guide to the methods, uses, and politics of “fair tests” of the effectiveness of interventions for preventing, diagnosing, and treating disease. By fair tests the authors mean research that evaluates interventions by identifying bias and taking proper account of the laws of chance. The authors avoid the ambiguous and often embattled phrase “evidence-based” in discussing this research. The methodology of fair testing, elaborated over many years, has advanced especially rapidly since the 1970s. These methods are now being used globally to evaluate drugs, diagnostic and screening tests, and surgical procedures. In the United States, national policy to prioritize, subsidize, and disseminate the results of fair tests that compare the clinical and cost-effectiveness of competing interventions has recently become politically plausible. The best-known fair-test methodologies are randomized controlled trials (RCTs) and systematic reviews. These reviews, which are currently the most rigorous fair tests, once only aggregated and evaluated data from RCTs. In recent years, however, reviewers have been taking account of data from less rigorous trials, as well as from observational and even qualitative studies. Other approaches to fair testing are evolving: for example, simulations, patient registries, and the development of evidence as a condition of coverage. Effective Care in Pregnancy and Childbirth (ECPC), two volumes published in 1989, applied the methodology of fair testing to an entire field of patient care for the first time. Iain (now Sir Iain) Chalmers, a coauthor of Testing Treatments, was a principal organizer and author of ECPC. Several years later Chalmers took the lead in organizing an international collaboration to set standards for systematic reviews, as well as to conduct and publish them. 
More than 14,000 reviewers in about ninety countries now participate in the Cochrane Collaboration (named after Archie Cochrane, a pioneer of fair testing). In 1987, two years before the publication of ECPC, fewer than 100 systematic reviews appeared in the international literature of the health sector; in 2006, around 2,500 did. Many other organizations also promote, conduct, and sponsor fair tests of interventions to maintain and improve health. For most of the 1990s the United States lagged behind Australia, Canada, and the United Kingdom in developing and applying the methods of fair testing. During the current decade, however, attention to fair tests in the United States has increased, especially among agencies of the federal government and the states, integrated delivery systems and insurers, nonprofit research organizations, and the pharmaceutical industry. Testing Treatments is the best available introduction to the methods, uses, and value of fair testing. The authors draw most of their examples…
- Research Article
- 10.1080/00098655.1995.9957243
- Apr 1, 1995
- The Clearing House: A Journal of Educational Strategies, Issues and Ideas
America is in the midst of a testing explosion. Tests, especially norm-referenced, multiple choice tests, have proliferated so greatly over the past several decades here and, increasingly, worldwide that people feel their impact at many different points in their lives. A growing concern in the United States about this proliferation brought national leaders of civil rights, consumer, education, and student organizations together in 1985 for a conference to discuss testing issues. Though all the groups were deeply concerned about the growing impact of standardized exams, none had the resources to make testing reform a top priority. Conference participants wanted to create an organization that would function as a bridge between the civil rights community and education reformers. The National Center for Fair and Open Testing (FairTest) was the result, and it remains the only national organization devoted solely to testing reform. FairTest is committed to advancing assessment systems, policies, and practices that help ensure fairness, equity, and excellence for all members of society. Testing often creates and reinforces barriers to equal opportunity based on race, class, gender, language, culture, and disability. The history of testing is rife with examples of how an often flawed technology is misused to determine the fates of individuals and to shape policy decisions:
- Research Article
- 10.1080/0267152950100104
- Mar 1, 1995
- Research Papers in Education
At its most fundamental level, experimental design has three major structural characteristics: the independent variable, the dependent variable and the control variable(s). Pupil competence in experimental design might involve simply recognizing these three types of variable, and being able to design a ‘fair test’. However, the relationships between these variables within a coherent design strategy must be understood before pupils could be said to have developed a systematic approach to experimental design. There can be little doubt that the ability to design an experiment constitutes a major part of the rationale for the recent development of a process approach to science and the way in which it is taught in schools. In such a rationale it is the pursuit of the methodology of science which is seen as the best route to ensuring a complete science education. It is therefore timely to consider pupil performance in this area. This paper reports performance results in this skill area, derived from packages of questions designed to shed light on the extent to which 11‐ and 13‐year‐old pupils can control variables, and how factors such as question format and context affect their performance. The performance results, augmented by analyses of variance, indicate that it is factors within the questions, rather than the skill itself, which lead to large variations in facility. It is also shown that the theoretical relationship between questions aimed at assessing similar aspects of experimental design is not reflected in terms of pupil performance. This volatility of performance and lack of association between theoretically related questions is also apparent for individual pupils. Moreover, there are aspects of the question itself, rather than its supposed cognitive ‘demand’, which are the most significant performance determinants. It is therefore unwise, and indeed spurious, to measure attainment in one context or in one format only. 
Quite subtle variations in the way questions are asked, the criteria against which pupils’ responses are judged, and the way in which their responses are scored, can give rise to very different conclusions about levels of attainment. This would indicate that continuous assessment by teachers will be an essential adjunct to the more formal but narrowly focused external tests (Standard Assessment Tasks). There is now an urgent need to clarify what is meant when we talk about the need to ensure a ‘fair test’. Are we simply seeking to make pupils aware of the structural elements of an experiment such as the dependent variable and how to measure it, or are we aiming to make them aware of the need to carry out and to plan well‐designed experiments? If the former is the case, then these results suggest that we are achieving a certain amount of success, presumably because of the recent emphasis on teaching science as a process rather than as a body of facts. However, if the latter is the case, then the data presented in this paper show that there is still much to be done, not least if pupils are to fulfil National Curriculum attainment criteria, and to apply their expertise to novel contexts.
- Research Article
- 10.1111/j.1744-6570.1968.tb00320.x
- Jun 1, 1968
- Personnel Psychology
Summary: A job‐related, “fair test” of ability, when used as one tool and not as the sole determining factor, has been established by arbitrators to be an appropriate selection instrument. If a test has been determined to be job‐related (i.e., related to the actual performance of the job) and has been administered and scored both fairly and consistently, it will be considered a “fair test.” Generally, in union‐company contracts that mention testing, a “fair test” contains the above qualifications. Arbitrators have indicated that the union involved should be afforded the opportunity to see the test; it was made quite clear, however, that the union should not have the test. In all cases reviewed, tests were upheld by arbitrators when they (1) were “fair tests,” and (2) did not conflict with the contract language. Past testing practice was not a determining factor when these two conditions existed. A total of 69 cases between 1953 and 1967 were found relevant: 27 from 1953 to 1962 and 42 from 1963 to 1967. In the first study, 13 cases were lost by a company; in six of these cases a company violated the union‐company contract, and in seven cases the test used did not qualify as a “fair test.” A total of twelve cases were lost by a company in the second study; in two cases a company violated the union‐company contract, and in ten cases the test used did not qualify as a “fair test.”
- Research Article
- 10.2139/ssrn.1381293
- Apr 14, 2009
- SSRN Electronic Journal
Many countries have fair employment laws to protect racial, gender, religious or ethnic minorities from discrimination and courts in the USA can order remedies such as one out of every three new hires should be a member of a protected group after finding an employer discriminated. What steps can an employer undertake to ensure its employment practices do not disadvantage minorities when it does not need to comply with a court order? This issue arose in Ricci v. DeStefano, a ‘reverse discrimination’ case under review by the U.S. Supreme Court. Seventeen Whites and 1 Hispanic who achieved sufficiently high scores qualifying them for promotion to lieutenant or captain of the New Haven Fire Department sued the city because it cancelled the examinations after seeing that no African American could be appointed to an existing vacancy. The City of New Haven justified its action on the basis that both examinations had a disparate impact on African Americans and Hispanics because the ratios of their pass rates to that of Whites were less than 80%, contrary to a ‘rule of thumb’ in the government's Uniform Guidelines. The city did not conduct statistical tests, which are referred to in the guidelines. The lower courts accepted New Haven's explanation and granted summary judgement to it. A statistical study of the various criteria considered by the city and lower courts in their review of the data demonstrates that nearly 70% of the time a fair non-discriminatory test for either position will fail the government's ‘80% rule’ and at least 60% of the time both fair tests would fail this ‘four-fifths rule’. Since the city created a new criterion after seeing the results, it is difficult to formulate precisely the other ‘rare’ or ‘unusual’ outcomes that would lead to cancellation of the examination. Would New Haven reject a list with no Hispanics or no Whites eligible for an immediate promotion? 
Would it require that all three groups be represented in the pool eligible for advancement to each position? From the viewpoint of statistical theory, the hypothesis being tested and the definition of pass or selection rates that will be compared should be decided before examining the data. Formal statistical tests on several relevant pass rates show that the lieutenant examination had a disparate impact on minority applicants, but the differences in the pass rates on the captain examination were not close to statistical significance. Furthermore, when the city cancelled both examinations, it only focused on the demographic mix of the high scorers who could receive an immediate promotion and ignored the 2-year life cycle of the list. Neither likely retirements nor job turnover during the 2-year life cycle of the results were considered. If this had been done, the city might have realized that three African Americans were likely to be appointed lieutenants along with two Hispanic captains.
- Research Article
- 10.1093/lpr/mgp017
- Jun 1, 2009
- Law, Probability and Risk
Many countries have fair employment laws to protect racial, gender, religious or ethnic minorities from discrimination and courts in the USA can order remedies such as one out of every three new hires should be a member of a protected group after finding an employer discriminated. What steps can an employer undertake to ensure its employment practices do not disadvantage minorities when it does not need to comply with a court order? This issue arose in Ricci v. DeStefano, a ‘reverse discrimination’ case under review by the U.S. Supreme Court. Seventeen Whites and 1 Hispanic who achieved sufficiently high scores qualifying them for promotion to lieutenant or captain of the New Haven Fire Department sued the city because it cancelled the examinations after seeing that no African American could be appointed to an existing vacancy. The City of New Haven justified its action on the basis that both examinations had a disparate impact on African Americans and Hispanics because the ratios of their pass rates to that of Whites were less than 80%, contrary to a ‘rule of thumb’ in the government’s Uniform Guidelines. The city did not conduct statistical tests, which are referred to in the guidelines. The lower courts accepted New Haven’s explanation and granted summary judgement to it. A statistical study of the various criteria considered by the city and lower courts in their review of the data demonstrates that nearly 70% of the time a fair non-discriminatory test for either position will fail the government’s ‘80% rule’ and at least 60% of the time both fair tests would fail this ‘four-fifths rule’. Since the city created a new criterion after seeing the results, it is difficult to formulate precisely the other ‘rare’ or ‘unusual’ outcomes that would lead to cancellation of the examination. Would New Haven reject a list with no Hispanics or no Whites eligible for an immediate promotion? 
Would it require that all three groups be represented in the pool eligible for advancement to each position? From the viewpoint of statistical theory, the hypothesis being tested and the definition of pass or selection rates that will be compared should be decided before examining the data. Formal statistical tests on several relevant pass rates show that the lieutenant examination had a disparate impact on minority applicants, but the differences in the pass rates on the captain examination were not close to statistical significance. Furthermore, when the city cancelled both examinations, it only focused on the demographic mix of the high scorers who could receive an immediate promotion and ignored the 2-year life cycle of the list. Neither likely retirements nor job turnover during the 2-year life cycle
- Research Article
- 10.2139/ssrn.1847007
- May 22, 2011
- SSRN Electronic Journal
Executive Summary: 1. The Work Choices reforms substantially altered the rules for making agreements. This report identifies at least 15 ways in which the legal framework has shifted the balance of bargaining power away from employees. 2. The introduction of a ‘Fairness Test’ purports to remedy the effects of just one of these changes: the removal of the ‘no disadvantage test’. It is clear that this single measure will be unable to address the multiple ways in which the framework undermines the bargaining position of employees. 3. Contrary to the Government’s assertions, the Fairness Test is not, by any measure, stronger than the former ‘no disadvantage test’: the new test is clearly narrower in scope and provides fewer procedural protections than the former test. Overall, there must be considerable doubt that the Fairness Test will provide outcomes which are procedurally or substantively fair for employees. 4. The emerging evidence of outcomes under workplace agreements confirms that the potential for the new framework to undermine the bargaining position of employees has been realised. Data on employer greenfields agreements strongly suggests that substantial numbers of employees have received no compensation for the removal of protected award conditions via these agreements. A combination of statistical and anecdotal evidence leads to a similar conclusion in relation to AWAs. 5. In the case of collective agreements, the report highlights a number of templates which are being used to set the terms and conditions of employment for retail and hospitality workers. Following a reduction in the involvement of traditional third parties in agreement-making (i.e., the AIRC and unions), these templates have been adopted (often without alteration) by many employers. The effect is to allow an alternative third party, the industrial relations consultant, to exercise significant influence over the content of agreements. 6.
The widespread replication of these templates in collective agreements in the retail and hospitality industries suggests that there is very little genuine bargaining taking place. A study of the templates themselves reveals the extent to which it is possible for an employer to reduce and remove employee benefits through the powerful mechanism of the Work Choices workplace agreement. The templates provide instances of the reduction of employee rights of control over hours of work, rostering, job location and job functions. The effect of these provisions is not only to displace conditions from awards and State legislation, but also to jeopardise the rights of an employee under his or her individual contract of employment. 7. The report also highlights some of the problems which have arisen because of the removal of a certification or vetting process before agreements are approved. The existence of provisions in agreements which fall below the ‘safety net’, or which mislead employees about their legal entitlements, suggests that the new framework is failing to ensure compliance with the basic legal rules. 8. The legal framework also appears to legitimate certain unfair employer bargaining practices by removing any positive requirement for employers to explain the effect of workplace agreements to employees, or to obtain genuine approval for these agreements, and by providing only weak protections against false or misleading conduct and duress. These unfair (but not unlawful) practices include: offering AWAs on a take-it-or-leave-it basis to new employees; using employer greenfields agreements on new projects to set a low base of employment conditions and to create a union-free environment; and ‘starving out’ employees by holding back pay rises until the employees enter into AWAs. 9.
Perhaps emboldened by the environment created by Work Choices, some employers are engaging in unlawful bargaining practices, such as targeting employees who refuse to sign AWAs by reducing their shifts, threatening to remove other employee benefits, or ending their employment. 10. Fundamental changes, not stop-gap measures, are required to address the bargaining practices and agreement outcomes which are permitted, and to some extent encouraged, under the Work Choices framework. Without legislative reform to ensure genuine bargaining and compliance with the agreement-making rules, it is inevitable that the working conditions of vulnerable employees will be further diminished.
- Research Article
- 10.53350/pjmhs22162964
- Feb 26, 2022
- Pakistan Journal of Medical and Health Sciences
Objectives: To determine the prevalence of piriformis muscle tightness among allied health students, and its relationship with age, gender and year of study. Methods: A cross-sectional study was conducted. A sample size of 259 was calculated using Open Epi v3.01. Allied health students from physical therapy, occupational therapy and prosthetics & orthotics programs participated in this study, which lasted 4 months. After obtaining informed consent, data was collected through a self-developed questionnaire. Piriformis muscle tightness, and symptom recurrence, was determined using the FAIR test. Results: The average age of participants was 21.94±1.81 years. Females comprised 79.9% of the study population. A high percentage (85.3%) were from the physical therapy program. The most common posture was crossed-leg sitting (48.3%). A positive FAIR test was found in 41.7% of the population. A significant association between age group and FAIR test result was noted (p=0.036). Conclusion: Piriformis muscle tightness is prevalent in those who engage in prolonged sitting postures. Furthermore, an association of piriformis tightness with age is present, whereas no relationship with gender or year of study was observed. Keywords: Piriformis muscle tightness, Piriformis muscle syndrome, Low back pain, Sedentary individuals, FAIR test.
- Research Article
- 10.1145/3652155
- Jun 4, 2024
- ACM Transactions on Software Engineering and Methodology
Unfair behaviors of Machine Learning (ML) software have garnered increasing attention and concern among software engineers. To tackle this issue, extensive research has been dedicated to conducting fairness testing of ML software, and this article offers a comprehensive survey of existing studies in this field. We collect 100 papers and organize them based on the testing workflow (i.e., how to test) and testing components (i.e., what to test). Furthermore, we analyze the research focus, trends, and promising directions in the realm of fairness testing. We also identify widely adopted datasets and open-source tools for fairness testing.
- Research Article
- 10.1007/s10664-022-10116-7
- Mar 30, 2022
- Empirical Software Engineering
Context: Machine learning (ML) software systems are permeating many aspects of our life, such as healthcare, transportation, banking, and recruitment. These systems are trained with data that is often biased, resulting in biased behaviour. To address this issue, fairness testing approaches have been proposed to test ML systems for fairness; these predominantly focus on assessing classification-based ML systems. Such methods are not applicable to regression-based systems: for example, they do not quantify the magnitude of the disparity in predicted outcomes, which we identify as important in the context of regression-based ML systems. Method: We conduct this study as design science research. We identify the problem instance in the context of emergency department (ED) wait-time prediction. In this paper, we develop an effective and efficient fairness testing approach to evaluate the fairness of regression-based ML systems. We propose fairness degree, a new fairness measure for regression-based ML systems, and a novel search-based fairness testing (SBFT) approach for testing regression-based machine learning systems. We apply the proposed solutions to ED wait-time prediction software. Results: We experimentally evaluate the effectiveness and efficiency of the proposed approach with ML systems trained on real observational data from the healthcare domain. We demonstrate that SBFT significantly outperforms existing fairness testing approaches, with up to 111% and 190% increases in effectiveness and efficiency compared to the best-performing existing approaches. Conclusion: These findings indicate that our novel fairness measure and the new approach for fairness testing of regression-based ML systems can identify the degree of fairness in predictions, which can help software teams to make data-informed decisions about whether such software systems are ready to deploy.
The scientific knowledge gained from our work can be phrased as a technological rule: to measure the fairness of regression-based ML systems in the context of emergency department wait-time prediction, use fairness degree and search-based techniques to approximate it.
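The general idea of quantifying outcome disparity for a regression model can be sketched as follows. This is an illustration of that idea only, not the paper's exact definition of fairness degree; the predictor and attribute names are invented.

```python
# Illustrative disparity measure for a regression model (hypothetical names;
# not the paper's exact "fairness degree" definition): the largest change in
# the predicted outcome caused by varying only the protected attribute.

def outcome_disparity(predict, x, protected_idx, protected_values):
    preds = []
    for v in protected_values:
        x_v = list(x)
        x_v[protected_idx] = v
        preds.append(predict(x_v))
    return max(preds) - min(preds)

# Toy wait-time predictor; x[1] plays the role of the protected attribute.
wait_time = lambda x: 30 + 5 * x[0] + 12 * x[1]

d = outcome_disparity(wait_time, [2, 0], protected_idx=1, protected_values=[0, 1])
# d == 12: the prediction shifts by 12 minutes on the protected flip alone
```

A search-based tester in the spirit of SBFT would then search the input space for the `x` that maximizes such a disparity.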
- Research Article
- 10.1093/qjmed/hcab094.009
- Oct 1, 2021
- QJM: An International Journal of Medicine
Background The most important prognostic factor in squamous cell carcinoma of the head and neck (HNSCC) is the presence or absence of clinically involved neck nodes. The presence of metastases in a lymph node is said to reduce the 5-years survival rate by about 50%. The appropriate diagnosis of the presence of metastatic node is very important for the management of HNSCC Aim To compare difTerent diagnostic modalities for assessment of the clinically non palpable lymph nodes in HNSCC including by meta-analysis: CT, MRI, US, USFNAC and PET/CT for the proper cervical lymph node staging. Methods Met-analysis study on patients with HNSCC of clinically non palpable lymph nodes (cN0). Results Analysis was divided in 6 groups .Each group contain analysis of one modality according to available studies per patient, per level and per lesion .US is fair test per patient and per lesion.CT is good test per patient and excellent test per lesion.MRI is poor test per patient and fair test per lesion.CT-MRl combined is fair per patient and excellent per level.PET/CT is good per patient, fair per lesion and excellent per level. USFNAC is excellent per lesion. Conclusion CT, CT-MRI combined, PET/CT and USFNAC proved to be excellent in detecting cN0.MRI was poor test in detecting cN0.US was a fair test in detecting cN0 if used alone.
- Research Article
- 10.1145/3737697
- Feb 13, 2026
- ACM Transactions on Software Engineering and Methodology
Fairness testing aims at mitigating unintended discrimination in the decision-making process of data-driven AI systems. Individual discrimination may occur when an AI model makes different decisions for two distinct individuals who are distinguishable solely according to protected attributes, such as age and race. Such instances reveal biased AI behavior, and are called Individual Discriminatory Instances (IDIs). In this article, we propose an approach for the selection of the initial seeds used to generate IDIs for fairness testing. Previous studies mainly used random initial seeds to this end. However, this phase is crucial, as these seeds are the basis of the follow-up IDI generation. We dub our proposed seed selection approach I&D. It generates a large number of initial IDIs exhibiting great diversity, aiming at improving the overall performance of fairness testing. Our empirical study reveals that I&D is able to produce a larger number of IDIs than four state-of-the-art IDI generation approaches, generating 1.86X more IDIs on average. When using the IDIs generated with I&D for retraining a machine learning model, the percentage of IDIs in the input space \(\mathbb{I}\) is decreased by 24.9% on average, implying that I&D is effective for improving the model’s fairness.
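The role diversity plays in seed selection can be illustrated with a greedy max-min sketch. This is not the I&D algorithm itself, merely one standard way to pick seeds that are spread out in the input space; all names are hypothetical.

```python
# Greedy max-min diversity selection (illustrative only; not the I&D
# algorithm): repeatedly pick the candidate farthest from the seeds so far.

def select_diverse_seeds(candidates, k):
    seeds = [candidates[0]]
    while len(seeds) < k:
        def min_sq_dist(c):
            # squared distance from c to its nearest already-chosen seed
            return min(sum((a - b) ** 2 for a, b in zip(c, s)) for s in seeds)
        remaining = [c for c in candidates if c not in seeds]
        seeds.append(max(remaining, key=min_sq_dist))
    return seeds

points = [(0, 0), (0.1, 0), (5, 5), (0, 5), (5, 0)]
chosen = select_diverse_seeds(points, 3)
# (0.1, 0) is never chosen: it adds almost nothing beyond (0, 0)
```

The intuition matches the abstract: near-duplicate seeds waste the generation budget on the same region of the input space, while spread-out seeds give the follow-up IDI generation more distinct starting points.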
- Research Article
- 10.21592/eucj.2023.42.65
- Aug 31, 2023
- European Constitutional Law Association
Proportionality plays an important role in the case law of the European Court of Human Rights. The Court’s proportionality test is classified as a ‘horizontal’ proportionality test, setting it apart from the vertical proportionality test used by the Constitutional Court of Korea. The horizontal proportionality test of the European Court of Human Rights does not adhere to any particular order of assessment but instead centres its reasoning on a fair balance test. The fair balance test is similar to the fourth prong of the proportionality test of the Korean Constitutional Court. The essence of proportionality, from a theoretical standpoint, lies in determining whether the specific level of restriction on a right is worth enduring compared to the degree of benefit achieved by the actions of public authorities. In the jurisprudence of the European Court of Human Rights, the fair balance test performs this balancing task. Hence, the European Court of Human Rights’ reasoning structure, which places a fair balance at its core, is theoretically sound. In applying the fair balance test, the European Court of Human Rights considers not only the relevance to the fundamental values of the European Convention on Human Rights but also the various specific circumstances of each case. This enables the Court to assess the concrete importance of the rights guaranteed by the Convention and the concrete value of the measures taken by member states, moving beyond their abstract value. While the European Court of Human Rights applies the “margin of appreciation” doctrine in its assessment of the fair balance, the reasoning structure still remains within the framework of a balancing test. The Constitutional Court of Korea also considers the balance of interests, although decisions where this prong predominates are rare. Nevertheless, there have been recent decisions that demonstrate a dedicated commitment to the balance of interests, both quantitatively and qualitatively.
This new trend is commendable, both from a theoretical perspective and in comparison to the practices of the European Court of Human Rights.
- Research Article
- 10.1109/tse.2021.3101478
- Sep 1, 2022
- IEEE Transactions on Software Engineering
Although deep learning has demonstrated astonishing performance in many applications, there are still concerns about its dependability. One desirable property of deep learning applications with societal impact is fairness (i.e., non-discrimination). Unfortunately, discrimination might be intrinsically embedded into models due to discrimination in the training data. As a countermeasure, fairness testing systematically identifies discriminative samples, which can be used to retrain the model and improve its fairness. Existing fairness testing approaches, however, have two major limitations. First, they only work well on traditional machine learning models and have poor performance (e.g., effectiveness and efficiency) on deep learning models. Second, they only work on simple tabular data and are not applicable to domains such as text. In this work, we bridge the gap by proposing a scalable and effective approach for systematically searching for discriminative samples while extending fairness testing to address a challenging domain, i.e., text classification. Compared with state-of-the-art methods, our approach only employs lightweight procedures like gradient computation and clustering, which makes it significantly more scalable. Experimental results show that on average, our approach explores the search space more effectively (9.62 and 2.38 times more than the state-of-the-art methods on tabular and text datasets, respectively) and generates many more individual discriminatory instances (24.95 and 2.68 times) within reasonable time. The retrained models reduce discrimination by 57.2% and 60.2% on average, respectively.
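The gradient-guided part of such approaches can be sketched with a toy example: step a seed toward the decision boundary, where small protected-attribute flips are most likely to flip the decision. This is a schematic illustration under invented names, not the paper's algorithm; a finite-difference gradient stands in for backpropagation on a real DNN.

```python
# Schematic gradient-guided step toward the decision boundary (illustrative;
# a finite-difference gradient stands in for backpropagation on a real DNN).

def numeric_grad(f, x, eps=1e-4):
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def step_toward_boundary(score, x, lr=0.5):
    # score > 0 means "approve"; stepping against the score's sign along the
    # gradient shrinks |score|, moving x closer to the decision boundary.
    g = numeric_grad(score, x)
    sign = 1.0 if score(x) > 0 else -1.0
    return [xi - lr * sign * gi for xi, gi in zip(x, g)]

score = lambda x: x[0] + 0.5 * x[1] - 3  # toy linear scorer
x = [4.0, 2.0]
x_next = step_toward_boundary(score, x)
```

Near the boundary, a candidate discriminatory pair is then formed by flipping the protected attribute, exactly as in the IDIP check that these testing approaches apply to each perturbed sample.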
- Discussion
- 10.1016/s0140-6736(06)69093-4
- Jul 1, 2006
- The Lancet
Cuttings