Abstract

This paper presents a method for classifying the ancestry of Brazilian surnames based on historical sources. The information obtained forms the basis for applying fuzzy matching and machine learning classification algorithms to more than 46 million workers in five categories: Iberian, Italian, Japanese, German and East European. The vast majority (96.7%) of the single surnames were identified using fuzzy matching and the rest using a method proposed by Cavnar and Trenkle (1994). A comparison of the results with data on foreigners in the 1920 Census and with the geographic distribution of non-Iberian surnames underscores the accuracy of the procedure. The study shows that surname ancestry is associated with significant differences in wages and schooling.

Highlights

  • Official census surveys in Brazil do not register information on the population’s ancestry

  • Fuzzy matching identified only 293,634 of the 531,009 unique surnames found in the RAIS data, but these surnames correspond to 96.7% of the workers

  • This result was obtained even with the conservative choice of setting the maximum distance in the Optimal String Alignment (OSA) algorithm to 1. The 3.3% of individuals in the RAIS whose names were not classified by fuzzy matching were classified by the machine learning algorithm
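
To make the fuzzy matching step concrete, the following is a minimal sketch of OSA-based matching with a maximum distance of 1. It is an illustration under stated assumptions, not the authors' implementation: the `REFERENCE` dictionary, the function names, and the tie-breaking rule (first reference surname found at the smallest distance) are all hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal String Alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i leading characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j leading characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "hc" <-> "ch"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Hypothetical toy reference list; the paper builds its list from historical sources.
REFERENCE = {
    "silva": "Iberian",
    "rossi": "Italian",
    "tanaka": "Japanese",
    "schmidt": "German",
}


def match_surname(surname, reference, max_dist=1):
    """Return the ancestry of the closest reference surname within
    max_dist, or None if no reference surname is close enough."""
    best_label, best_dist = None, max_dist + 1
    for ref_name, label in reference.items():
        dist = osa_distance(surname.lower(), ref_name.lower())
        if dist < best_dist:  # ties keep the first match found (assumption)
            best_label, best_dist = label, dist
    return best_label


if __name__ == "__main__":
    print(match_surname("Rosi", REFERENCE))      # Italian (one insertion)
    print(match_surname("Shcmidt", REFERENCE))   # German (one transposition)
    print(match_surname("Kowalski", REFERENCE))  # None (left to the classifier)
```

With a threshold of 1, a surname is matched only if it is identical to a reference surname or differs from it by a single edit, which is why the choice is described as conservative.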


Summary

Introduction

Official census surveys in Brazil do not register information on the population’s ancestry. (IBGE, the Brazilian Statistical Office, uses the term “color/race”. We use this expression as a way to follow the national standard.) Although those categories do have social significance, they are often far too broad to allow for specific applications such as socioeconomic or epidemiological studies. This article contributes to the classification of the ancestry of Brazilian surnames. It innovates by using historical databases to associate surnames with ancestry and by applying machine learning algorithms to classification. To obtain the contemporary distribution of surnames, the study made use of the 2013 Annual Social Information Report (Relação Anual de Informações Sociais), hereafter referred to as the RAIS [1]. The database is a very large restricted-access administrative file that contains 46.8 million observations on all Brazilian workers in the formal labor market.
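
The abstract attributes the classification of the surnames left unmatched to the method of Cavnar and Trenkle (1994), which ranks character n-grams by frequency and compares rank profiles with an "out-of-place" distance. The sketch below illustrates that idea on surnames. The training lists, the parameter values (`max_n`, `top_k`), and all function names are assumptions chosen for illustration; the paper's actual settings and training data may differ.

```python
from collections import Counter


def ngram_profile(words, max_n=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank
    (0 = most frequent). Words are padded with '_' to mark boundaries."""
    counts = Counter()
    for word in words:
        padded = f"_{word.lower()}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically for determinism.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, cat_profile, max_penalty):
    """Sum of rank differences; n-grams absent from the category profile
    receive the maximum penalty."""
    return sum(
        abs(rank - cat_profile[gram]) if gram in cat_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def classify(surname, category_profiles, max_n=3, top_k=300):
    """Return the category whose profile is closest to the surname's."""
    doc = ngram_profile([surname], max_n, top_k)
    return min(
        category_profiles,
        key=lambda cat: out_of_place(doc, category_profiles[cat], top_k),
    )


# Hypothetical toy training data; real profiles would come from the
# surnames already classified through the historical sources.
TRAINING = {
    "Italian": ["rossi", "bianchi", "ferrari", "esposito", "romano"],
    "Japanese": ["tanaka", "suzuki", "yamamoto", "nakamura", "watanabe"],
}
PROFILES = {cat: ngram_profile(names) for cat, names in TRAINING.items()}

if __name__ == "__main__":
    print(classify("Yamada", PROFILES))  # Japanese
    print(classify("Rosso", PROFILES))   # Italian
```

Because the profiles rely on character sequences rather than whole names, the classifier can assign a category to a surname that appears in no reference list, which is the situation of the surnames that fuzzy matching leaves unclassified.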

