Abstract

The advent of large-scale bibliographic databases and powerful prediction algorithms has led to calls for data-driven approaches to targeting scarce funds at researchers with high predicted future scientific impact. The potential side effects and fairness implications of such approaches are unknown, however. Using a large-scale bibliographic data set of N = 111,156 Computer Science researchers active from 1993 to 2016, I build and evaluate a realistic scientific impact prediction model. Given the persistent under-representation of women in Computer Science, the model is audited for disparate impact based on gender. Random forests and gradient boosting machines are used to predict researchers’ h-index in 2010 from their bibliographic profiles in 2005. Based on the model predictions, it is determined whether a researcher will become a high-performer, defined as having an h-index in the top 25% of the discipline-specific h-index distribution. The models predict the future h-index with an accuracy of R^2 = 0.875 and correctly classify 91.0% of researchers as high-performers or low-performers. Overall accuracy does not vary strongly by researcher gender. Nevertheless, there are indications of disparate impact against women. The models under-estimate the true h-index of female researchers more strongly than that of male researchers. Further, women are 8.6% less likely than men to be predicted to become high-performers. In practice, hiring, tenure, and funding decisions based on model predictions risk perpetuating the under-representation of women in Computer Science.
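
The quantities named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's code: the function names, the quantile-based cutoff, and the choice of a simple rate ratio and mean signed error as audit measures are all assumptions made for illustration, and the abstract does not say which exact disparate impact measure the study reports.

```python
import numpy as np

def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def high_performer_flags(h_values, top_share=0.25):
    """Flag h-index values at or above the (1 - top_share) quantile.

    Assumption: the top 25% is taken with a plain quantile cutoff;
    ties at the cutoff can flag more than top_share of the sample.
    """
    h_values = np.asarray(h_values, dtype=float)
    threshold = np.quantile(h_values, 1.0 - top_share)
    return h_values >= threshold

def selection_rates(predicted_flags, is_female):
    """Share predicted to be high-performers, by gender, and their ratio."""
    flags = np.asarray(predicted_flags, dtype=bool)
    female = np.asarray(is_female, dtype=bool)
    rate_f = flags[female].mean()
    rate_m = flags[~female].mean()
    return rate_f, rate_m, rate_f / rate_m

def mean_signed_error(h_true, h_pred, group_mask):
    """Mean of (predicted - true) within a group; negative means under-estimation."""
    h_true = np.asarray(h_true, dtype=float)
    h_pred = np.asarray(h_pred, dtype=float)
    mask = np.asarray(group_mask, dtype=bool)
    return float((h_pred[mask] - h_true[mask]).mean())
```

Applied to model output, `selection_rates` would compare how often women and men are predicted to become high-performers, while `mean_signed_error` computed separately for each group would show whether one group's h-index is under-estimated more strongly than the other's.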
