Abstract

Objective: We recently showed that genderize.io is not a sufficiently powerful gender detection tool due to a large number of nonclassifications. In the present study, we aimed to assess whether the accuracy of inference by genderize.io can be improved by manipulating the first names in the database.

Methods: We used a database containing the first names, surnames, and gender of 6,131 physicians practicing in a multicultural country (Switzerland). We uploaded the original CSV file (file #1), the file obtained after removing all diacritic marks, such as accents and cedilla (file #2), and the file obtained after removing all diacritic marks and retaining only the first term of the compound first names (file #3). For each file, we computed three performance metrics: proportion of misclassifications (errorCodedWithoutNA), proportion of nonclassifications (naCoded), and proportion of misclassifications and nonclassifications (errorCoded).

Results: naCoded, which was high for file #1 (16.4%), was reduced after data manipulation (file #2: 11.7%, file #3: 0.4%). As the increase in the number of misclassifications was small, the overall performance of genderize.io (i.e., errorCoded) improved, especially for file #3 (file #1: 17.7%, file #2: 13.0%, and file #3: 2.3%).

Conclusions: A relatively simple manipulation of the data improved the accuracy of gender inference by genderize.io. We recommend using genderize.io only with files that were modified in this way.
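The three performance metrics named in the Methods can be illustrated with a minimal Python sketch. The function name `gender_metrics` and the use of `None` to mark a nonclassification are hypothetical choices made for this example. The denominators are also an assumption, since the abstract does not state them: here errorCodedWithoutNA is computed over classified names only, while naCoded and errorCoded are computed over all names.

```python
def gender_metrics(true_genders, predicted_genders):
    """Return (errorCodedWithoutNA, naCoded, errorCoded) as fractions.

    A prediction of None stands for a nonclassification (assumption).
    """
    if len(true_genders) != len(predicted_genders) or not true_genders:
        raise ValueError("inputs must be non-empty and of equal length")

    total = len(true_genders)
    # Names for which the tool returned no gender at all.
    n_na = sum(1 for p in predicted_genders if p is None)
    # Names that received a gender, but the wrong one.
    n_wrong = sum(
        1
        for t, p in zip(true_genders, predicted_genders)
        if p is not None and p != t
    )
    classified = total - n_na

    # Assumed denominators: classified names for errorCodedWithoutNA,
    # all names for naCoded and errorCoded.
    error_without_na = n_wrong / classified if classified else 0.0
    na_coded = n_na / total
    error_coded = (n_wrong + n_na) / total
    return error_without_na, na_coded, error_coded


truth = ["female", "male", "female", "male"]
predicted = ["female", "female", None, "male"]
print(gender_metrics(truth, predicted))
```

In this toy input, one of four names is unclassified and one of the three classified names is wrong, so errorCoded counts both kinds of failure against the full list.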

Highlights

  • Gender detection tools are increasingly used in medical research to explore the gender gap in scientific publications, grant allocations, salaries, or career advancement processes [1,2,3]

  • When uploading the original database as a CSV file, we found that first names with diacritical marks, such as accents and cedilla, and compound first names with or without hyphens were often not recognized by genderize.io

  • By removing all diacritical marks and shortening all compound first names, we were able to greatly improve the accuracy of gender inference by genderize.io
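The two manipulations described in the highlights, removing diacritical marks and shortening compound first names, can be approximated in Python as sketched below. This is not the authors' procedure, which is not described here. The function names are hypothetical, and the sketch assumes that compound first names are separated by hyphens or whitespace.

```python
import re
import unicodedata


def strip_diacritics(name):
    """Remove accents, cedillas, and other combining marks from a name."""
    # NFD splits e.g. "ç" into "c" plus a combining cedilla.
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def first_term(name):
    """Keep only the first term of a compound first name.

    Assumption: terms are separated by hyphens or whitespace.
    """
    return re.split(r"[\s\-]+", name.strip())[0]


def normalize_first_name(name):
    """Apply both steps, analogous to how file #3 is described."""
    return first_term(strip_diacritics(name))


for raw in ["Jean-François", "José María", "Zoë"]:
    print(raw, "->", strip_diacritics(raw), "->", normalize_first_name(raw))
```

Letters that have no Unicode decomposition, such as "ø" or "ł", pass through this sketch unchanged and would need an explicit mapping.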


Summary

Introduction

Gender detection tools are increasingly used in medical research to explore the gender gap in scientific publications, grant allocations, salaries, or career advancement processes [1,2,3]. Their main advantage is that large CSV or Excel files can be uploaded directly; the tool then adds a new column (gender) to the file. This procedure does not require extensive computer skills. Using genderize.io [5], Cevik et al. found that women were significantly underrepresented as principal investigators of COVID-19 studies

