Abstract

Objectives

Text classification models can be used to automatically categorize occupations and causes of death recorded in historical documents. Classifying (coding) these entries matters because different words or textual descriptions can refer to the same occupation or cause of death. Given the many historical documents that are becoming available for research, accurate classification systems can be valuable resources.

Approach

We explore different text classification techniques, from traditional machine learning to deep learning, and investigate methods that map occupation and cause-of-death descriptions into a vector space and use these representations as features to train text classification systems. Our data come from IPUMS USA, IPUMS International, and SCADR.

Results

Historians have coded occupations and causes of death for some census collections (e.g., US, Canada), but not yet for others (e.g., Scotland). We train and evaluate our classification systems using data from the US and Canada and then deploy them on data from Scotland. We quantitatively measure the performance of the classification systems on historical documents that have codes available. Additionally, once we deploy the models on data that do not yet have codes, we qualitatively evaluate our results by engaging with historians working on those data. We report and discuss these results to understand where the models perform well and where they underperform.

Conclusions

Results suggest that there is value in building and deploying these classification models. We recommend using such models in conjunction with engaging domain experts.
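
As a rough illustration of the general idea described in the approach (map each free-text description to a vector, then train a classifier on those vectors), the sketch below uses character trigram counts and a nearest-centroid rule with cosine similarity. This is a stand-in chosen for brevity, not the method used in the paper: the function names, the toy strings, and the codes `AGRI`, `TEXTILE`, and `MINING` are all hypothetical and do not come from IPUMS or SCADR.

```python
from collections import Counter, defaultdict
import math


def char_ngrams(text, n=3):
    """Map a raw description to a sparse vector of character n-gram counts."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """Sum the n-gram vectors of each code to get one centroid per code."""
    centroids = defaultdict(Counter)
    for text, code in examples:
        centroids[code].update(char_ngrams(text))
    return dict(centroids)


def predict(centroids, text):
    """Return the code whose centroid is most similar to the description."""
    vec = char_ngrams(text)
    return max(centroids, key=lambda code: cosine(vec, centroids[code]))


# Hypothetical toy training set: (description, code) pairs.
TRAIN = [
    ("farmer", "AGRI"),
    ("farm servant", "AGRI"),
    ("cotton weaver", "TEXTILE"),
    ("wool weaver", "TEXTILE"),
    ("coal miner", "MINING"),
    ("lead miner", "MINING"),
]

model = train(TRAIN)

# Misspelled variants, as often found in transcribed historical records.
for description in ["farmr", "cotton weever", "coal minr"]:
    print(description, "->", predict(model, description))
```

Character n-grams are used here because they tolerate spelling variation, so differently written descriptions of the same occupation end up close together in the vector space. A realistic system would be trained on the coded collections and checked against held-out codes and expert review, as the abstract describes.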
