Abstract
This paper proposes two natural language processing models for extracting useful information from multilingual, unstructured (free-form) CV documents. The models identify the relevant document sections (personal information, education, employment, etc.) and, at the lower level of the hierarchy, the specific information within them (names, addresses, roles, skills and competences, etc.). Our approach employs the transformer architecture and a multilingual implementation of its encoder in the form of the BERT language model. The models are trained and tested on a large, manually annotated CV dataset and achieve high scores on standard accuracy measures. The proposed models support end-to-end training and are interpretable, which was investigated by visualizing the models' attention and vector representations.
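The two-level output described above (sections, then the specific items inside each section) can be pictured as a small nested data structure. The following is only an illustrative sketch: the class names, the `group_by_section` helper, and the label strings are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Entity:
    """A lower-level item found inside a section (hypothetical labels)."""
    label: str
    text: str


@dataclass
class Section:
    """A top-level CV section together with the entities found in it."""
    label: str
    entities: List[Entity] = field(default_factory=list)


def group_by_section(
    tagged: List[Tuple[str, Optional[str], str]]
) -> List[Section]:
    """Group (section_label, entity_label, text) triples into sections.

    Consecutive triples with the same section label are merged into one
    Section. A triple whose entity label is None contributes no entity.
    """
    sections: List[Section] = []
    for section_label, entity_label, text in tagged:
        # Start a new section whenever the section label changes.
        if not sections or sections[-1].label != section_label:
            sections.append(Section(section_label))
        if entity_label is not None:
            sections[-1].entities.append(Entity(entity_label, text))
    return sections


# Made-up example input; in practice these labels would come from the models.
tagged = [
    ("personal_information", "name", "Jane Doe"),
    ("personal_information", "address", "1 Example Street"),
    ("employment", "role", "Data Engineer"),
    ("education", None, "Education"),
]
cv = group_by_section(tagged)
```

Here `cv` holds three sections in document order, and the section-level and entity-level predictions stay separate, mirroring the hierarchy the abstract describes.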
Highlights
Automatic extraction of useful information from CVs given in free form is a difficult task in the area of natural language processing (NLP)
In our work, machine learning techniques are applied to NLP in order to extract the desired information with a high degree of accuracy from CVs of arbitrary format in five languages
This paper proposes a new architecture for processing sequential inputs using the transformer and an implementation of its encoder in the form of the Bidirectional Encoder Representations from Transformers (BERT) language model
Summary
Automatic extraction of useful information from CVs given in free form is a difficult task in the area of natural language processing (NLP). A system that could convert a free-form CV into a given, highly organized structure would be a very valuable tool for recruiters and job market websites. Useful information in this case includes personal information such as first and last name, residential address and spoken languages, as well as information about the person's past employment, education, and skills or competences.
D. Vukadin et al.: Information Extraction from Free-Form CV Documents in Multiple Languages
[...] precision, recall and F1 scores on a dataset consisting of 1686 annotated CVs in five languages: English, Swedish, Norwegian, Finnish and Polish.
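As a reminder of what the precision, recall and F1 scores measure, the sketch below computes them per label from two aligned label sequences. It assumes token-level scoring with an "O" label for tokens outside any entity; the source does not state how the scores were computed, and the function name and example labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_label_prf(
    gold: List[str], pred: List[str], ignore: str = "O"
) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} for aligned label sequences.

    Assumption: scoring is token-level and `ignore` marks "no entity".
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != ignore:
                tp[g] += 1          # correct prediction of a real label
        else:
            if p != ignore:
                fp[p] += 1          # predicted label p where it was wrong
            if g != ignore:
                fn[g] += 1          # missed the true label g

    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Made-up labels for illustration only.
gold = ["NAME", "NAME", "O", "ROLE", "SKILL", "SKILL"]
pred = ["NAME", "O", "O", "ROLE", "SKILL", "ROLE"]
scores = per_label_prf(gold, pred)
```

In this toy example the NAME label has precision 1.0 and recall 0.5, since one of its two tokens was missed and none was predicted wrongly. Entity-level (span-based) scoring would be stricter, counting an entity as correct only when all of its tokens match.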