Abstract
BackgroundIn many research studies, the identification of social determinants is an important activity, in particular, information about occupations is frequently added to existing patient data. Such information is usually solicited during interviews with open-ended questions such as “What is your job?” and “What industry sector do you work in?” Before being able to use this information for further analysis, the responses need to be categorized using a coding system, such as the Canadian National Occupational Classification (NOC). Manual coding is the usual method, which is a time-consuming and error-prone activity, suitable for automation.ObjectiveThis study aims to facilitate automated coding by introducing a rigorous algorithm that will be able to identify the NOC (2016) codes using only job title and industry information as input. Using manually coded data sets, we sought to benchmark and iteratively improve the performance of the algorithm.MethodsWe developed the ACA-NOC algorithm based on the NOC (2016), which allowed users to match NOC codes with job and industry titles. We employed several different search strategies in the ACA-NOC algorithm to find the best match, including exact search, minor exact search, like search, near (same order) search, near (different order) search, any search, and weak match search. In addition, a filtering step based on the hierarchical structure of the NOC data was applied to the algorithm to select the best matching codes.ResultsThe ACA-NOC was applied to over 500 manually coded job and industry titles. The accuracy rate at the four-digit NOC code level was 58.7% (332/566) and improved when broader job categories were considered (65.0% at the three-digit NOC code level, 72.3% at the two-digit NOC code level, and 81.6% at the one-digit NOC code level).ConclusionsThe ACA-NOC is a rigorous algorithm for automatically coding the Canadian NOC system and has been evaluated using real-world data. It allows researchers to code moderate-sized data sets with occupation in a timely and cost-efficient manner such that further analytics are possible. Initial assessments indicate that it has state-of-the-art performance and is readily extensible upon further benchmarking on larger data sets.
Highlights
In many research studies, and for governmental or other statistical purposes, data collection includes gathering information on occupation
We developed the Automated Coding Algorithm (ACA)-National Occupational Classification (NOC) algorithm based on the NOC (2016), which allowed users to match NOC codes with job and industry titles
The design of the ACA-NOC algorithm shown in Figure 1 was arrived at through an iterative process of design, deployment, and testing
Summary
For governmental or other statistical purposes, data collection includes gathering information on occupation. Occupation is a widely used explanatory variable in health research, representing social status and class as well as exposure to environmental hazards [1] Such data are collected either by a self-completed questionnaire or by an interviewer. The identification of social determinants is an important activity, in particular, information about occupations is frequently added to existing patient data. Such information is usually solicited during interviews with open-ended questions such as “What is your job?” and “What industry sector do you work in?” Before being able to use this information for further analysis, the responses need to be categorized using a coding system, such as the Canadian National Occupational Classification (NOC). Manual coding is the usual method, which is a time-consuming and error-prone activity, suitable for automation
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.