Abstract

The correlation between the demographics of users and the text they write has been investigated through literary texts and, more recently, social media. However, differences pertaining to language use in search engines has not been thoroughly analyzed, especially for age and gender differences. Such differences are important especially due to the growing use of search engine data in the study of human health, where queries are used to identify patient populations. Using three datasets comprising of queries from multiple general-purpose Internet search engines we investigate the correlation between demography (age, gender, and income) and the text of queries submitted to search engines. Our results show that females and younger people use longer queries. This difference is such that females make approximately 25% more queries with 10 or more words. In the case of queries which identify users as having specific medical conditions we find that females make 53% more queries than expected, and that this results in patient cohorts which are highly skewed in gender and age, compared to known patient populations. We show that methods for cohort selection which use additional information beyond queries where users indicate their condition are less skewed. Finally, we show that biased training cohorts can lead to differential performance of models designed to detect disease from search engine queries. Our results indicate that studies where demographic representation is important, such as in the study of health aspect of users or when search engines are evaluated for fairness, care should be taken in the selection of search engine data so as to create a representative dataset.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.