Abstract
This paper presents a method for classifying the ancestry of Brazilian surnames based on historical sources. The information obtained forms the basis for applying fuzzy matching and machine learning classification algorithms to assign more than 46 million workers to five categories: Iberian, Italian, Japanese, German and East European. The vast majority (96.7%) of the single surnames were identified using fuzzy matching and the rest using a method proposed by Cavnar and Trenkle (1994). A comparison of the results with data on foreigners in the 1920 Census and with the geographic distribution of non-Iberian surnames underscores the accuracy of the procedure. The study shows that surname ancestry is associated with significant differences in wages and schooling.
Highlights
Official census surveys in Brazil do not register information on the population’s ancestry
Only 293,634 of the 531,009 unique surnames found in the RAIS data were identified by fuzzy matching; these surnames correspond to 96.7% of the workers
This result was obtained even with the conservative choice of setting the maximum distance in the Optimal String Alignment (OSA) algorithm to 1
The 3.3% of individuals in the RAIS whose names were not classified by fuzzy matching were classified by the machine learning algorithm
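As an illustration of the fuzzy matching step, the sketch below computes the OSA distance and accepts a reference surname only when that distance is at most 1, mirroring the conservative threshold mentioned above. The dictionary `REFERENCE`, the function names and the example surnames are hypothetical and are not taken from the study; the authors' actual implementation may differ.

```python
from typing import Dict, Optional, Tuple


def osa_distance(a: str, b: str) -> int:
    """Optimal String Alignment distance: insertions, deletions,
    substitutions and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


# Hypothetical reference list: surname -> ancestry category (illustrative only).
REFERENCE: Dict[str, str] = {
    "silva": "Iberian",
    "rossi": "Italian",
    "tanaka": "Japanese",
    "schmidt": "German",
    "kowalski": "East European",
}


def fuzzy_match(
    surname: str, reference: Dict[str, str], max_dist: int = 1
) -> Optional[Tuple[str, str]]:
    """Return (reference surname, ancestry) for the closest reference
    surname within max_dist, or None if no candidate is close enough."""
    best = None
    best_dist = max_dist + 1
    for ref_name, ancestry in reference.items():
        dist = osa_distance(surname.lower(), ref_name.lower())
        if dist < best_dist:
            best, best_dist = (ref_name, ancestry), dist
    return best


print(fuzzy_match("Schmitd", REFERENCE))   # one transposition away from "schmidt"
print(fuzzy_match("Nakamura", REFERENCE))  # no reference surname within distance 1
```

With a maximum distance of 1, a surname is matched only if it differs from a reference surname by a single edit, which limits false matches at the cost of leaving more surnames for the second-stage classifier.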
Summary
Official census surveys in Brazil do not register information on the population’s ancestry. (IBGE, the Brazilian Statistical Office, uses the term “color/race”; we use this expression to follow the national standard.) Although the color/race categories do have social significance, they are often far too broad to allow for specific applications such as socioeconomic or epidemiological studies. This article contributes to the classification of the ancestry of Brazilian surnames. It innovates by using historical databases to associate surnames with ancestry and by applying machine learning algorithms to classification. To obtain the contemporary distribution of surnames, the study made use of the 2013 Annual Social Information Report (Relação Anual de Informações Sociais), hereafter referred to as the RAIS [1]. The database is a very large restricted-access administrative file that contains 46.8 million observations on all Brazilian workers in the formal labor market.
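The method of Cavnar and Trenkle (1994) categorizes text by comparing rank-ordered character n-gram profiles with an "out-of-place" distance. The following is a minimal sketch of that idea applied to surnames, not the study's implementation: the training surnames in `TRAINING`, the parameter values (`n_max`, `top_k`), the fixed penalty for unseen n-grams and all function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable

TOP_K = 300  # assumed profile size; also used as the penalty for unseen n-grams


def ngram_profile(words: Iterable[str], n_max: int = 3, top_k: int = TOP_K) -> Dict[str, int]:
    """Map each of the top_k most frequent character n-grams to its rank (0 = most frequent)."""
    counts: Counter = Counter()
    for word in words:
        padded = f"_{word.lower()}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically so ranks are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc: Dict[str, int], category: Dict[str, int]) -> int:
    """Sum of rank differences; n-grams missing from the category get a fixed penalty."""
    return sum(
        abs(rank - category[gram]) if gram in category else TOP_K
        for gram, rank in doc.items()
    )


def classify(surname: str, profiles: Dict[str, Dict[str, int]]) -> str:
    """Return the category whose profile is closest to the surname's profile."""
    doc = ngram_profile([surname])
    return min(profiles, key=lambda cat: out_of_place(doc, profiles[cat]))


# Hypothetical training surnames per category (illustrative only).
TRAINING = {
    "Italian": ["rossi", "bianchi", "ferrari", "romano", "colombo"],
    "German": ["schmidt", "schneider", "fischer", "weber", "hoffmann"],
    "Japanese": ["tanaka", "suzuki", "yamamoto", "nakamura", "watanabe"],
}
PROFILES = {cat: ngram_profile(names) for cat, names in TRAINING.items()}

print(classify("Yamada", PROFILES))
print(classify("Schmitz", PROFILES))
```

Because the approach relies only on character sequences, it can assign a category to surnames that have no close counterpart in the reference list, which is the situation of the surnames left unmatched by fuzzy matching.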