Evaluation of the Bayesian Method to Derive Migration Patterns from Changes in Surname Distributions over Time [Human Biology]

[December 24, 2013]

(Human Biology Via Acquire Media NewsEdge)

Abstract

Known migration in The Netherlands between the periods 1950-1969 and 2007, for 4.5 million individuals, was used to evaluate a Bayesian method that estimates the origin of migration from the surname distributions in these two periods. Results of the method depend on the geographic specificity of the surnames and tend to be positioned between population density and actual probability of migration origin. An optimum in the correlation between estimated and actual percentages of origin of migration, and in their differentiation as expressed by the correlation between the estimated and actual entropy across 40 distinguished areas, was found after a few iterations. The optimal correlation was 0.806 (Spearman), which shows that the Bayesian method provides a reasonable proxy of the rank order of a migrant's origin.


(ProQuest: ... denotes formulae omitted.)

A Bayesian method can be used to infer the geographical origin of migrants, based on comparison of changes in surname distribution observed in successive periods of time. Initially published in 2001, this method has been successively applied to data from various periods and regions, such as the Western Pyrenees in the 21st and 20th centuries (Degioanni and Darlu 2001); French cities in the 19th and 20th centuries (Darlu and Degioanni 2007); Saint Germain des Près in the 9th century (Chareille and Darlu 2010); Savoy in the 18th and 19th centuries (Darlu et al. 2011); and Québec in the 18th and 19th centuries (Brunet et al. 2012). However, although the method is theoretically correct, and although it provides consistent and interpretable results in terms of population migration, it deserves to be evaluated against controlled data. The obstacle has been the difficulty of finding suitable data for a comparison of predicted and known migration.

The modern Dutch Civil Registration (Bloothooft et al. 2012) allows for such an evaluation. A sample was drawn from this register that includes the surname, the place of birth, and the municipality of residence half a century later, in 2007, of all 4.5 million individuals of the cohort born in The Netherlands during the period 1950-1969 (with a surname that occurred more than twice in all data). Hence, the origin of those who moved from their place of birth along a direct or complex migration trajectory to their place of residence in 2007 can be determined, leading to a demographic migration matrix based on exhaustive data. Since the surname of all individuals is known as well, with a total of 67,638 different names, one can simultaneously apply the Bayesian method (Degioanni and Darlu 2001) to the distribution of the surnames across places of birth and current places of residence in the same sample, which provides a matrix of probability of geographical origins (pgo). The aim of this article is to present and discuss the results of the comparison of the actual demographic migration matrix with the matrix of predicted geographic origins, reserving for a later publication the comparison with other methods, such as that of Wijsman et al. (1984).

Materials and Methods

The modern Civil Registration of The Netherlands contains full population data for about 16 million persons. For each of them, records include, among other items, family name; date, place, and country of birth; and place of residence. From this database, we could extract a sample including all 4,554,805 persons (male and female) born in the period 1950-1969 and still alive and living in The Netherlands in 2007. This sample therefore describes the results of internal migrations during this time interval of most Dutch individuals who were between 38 and 57 years of age in 2007.

As a geographic scale, we used a division of The Netherlands into 40 areas that are defined by the National Bureau of Statistics on the basis of a certain homogeneity of their population. Together, these geographic areas, named COROP (Coördinatiecommissie Regionaal Onderzoeksprogramma) areas, cover The Netherlands in full (see Figure 1). Some descriptive statistics of the COROP areas, among which are the number of inhabitants used in this study and the number of different surnames in 2007, are given in Table 1.

We computed an asymmetric migration matrix M with the elements m_ij, which are the proportions of individuals living in 2007 in the jth COROP area after some migration trajectory from the ith COROP area where they were born. Although this migration trajectory can be complex, for simplicity we refer to it as "the migration."

The migration between two areas can also be estimated by looking at surnames s_kj that are absent at time t_1 (1950-1969) and present at time t_2 (2007) in a given COROP area a_j. The indices in s_kj indicate surnames k that newly arrive in area j. The persons with these newly arriving surnames necessarily come from somewhere else. It is assumed that migration probabilities estimated for this group are representative of all newcomers (including those who have a surname that was already present in area j).

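
The two constructions described above, the observed migration matrix M and the set of surnames newly arriving in each area, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the record layout `(surname, birth_area, residence_area)`, the function names, and the toy records are hypothetical, and the sketch assumes that non-migrants are excluded and that m_ij is normalised over the arrivals in area j.

```python
from collections import Counter, defaultdict

def migration_matrix(records, areas):
    """Return {(i, j): m_ij}, the share of arrivals in area j born in area i.

    Assumption: people who stayed in their area of birth are left out, and
    each recipient area j is normalised so that its m_ij sum to 1.
    """
    counts = Counter((birth, res) for _, birth, res in records if birth != res)
    arrivals = Counter()
    for (_, res), n in counts.items():
        arrivals[res] += n
    return {
        (i, j): counts[(i, j)] / arrivals[j]
        for j in areas if arrivals[j] > 0
        for i in areas if i != j
    }

def new_surnames(records):
    """Return {j: surnames present in j at t_2 but absent from j at t_1}."""
    at_t1 = defaultdict(set)   # surnames by area of birth (1950-1969)
    at_t2 = defaultdict(set)   # surnames by area of residence (2007)
    for surname, birth, res in records:
        at_t1[birth].add(surname)
        at_t2[res].add(surname)
    return {j: at_t2[j] - at_t1[j] for j in at_t2}

# Hypothetical toy records, not data from the study.
records = [
    ("Jansen", "A", "A"),
    ("Jansen", "A", "B"),
    ("Visser", "B", "B"),
    ("Visser", "B", "A"),
    ("Dijkstra", "C", "B"),
    ("Dijkstra", "C", "C"),
]
areas = ["A", "B", "C"]

M = migration_matrix(records, areas)
arriving = new_surnames(records)
```

In this toy case area B receives one migrant each from A and C, so both proportions are 0.5, and the surnames new to B are Jansen and Dijkstra.
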
The geographic origin of this specific group of migrants with new surnames for areas j can be detected using a Bayesian approach (Degioanni and Darlu 2001), which is briefly reported here. For any area a_j under investigation (a COROP area in this case), called a "recipient area," by Bayes's theorem the probability that the surname s_kj migrated to this area a_j from another area a_i, called the source area, is ...(1) where p(s_kj | a_i) is the probability of observing the surname s_kj within the area a_i, which can be estimated as the ratio of the number of people with surname s_kj in area i to the total number of persons who bear this name at time t_1, and π_j(a_i) is the a priori probability of migrating into area a_j from area a_i, whatever the surname. The sum in the denominator is taken over all geographic areas.
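
The Bayesian step for a single surname can be sketched from the verbal description above: the numerator is the product of p(s_kj | a_i) and the prior π_j(a_i), and the denominator sums that product over all areas. The function names and the counts below are hypothetical, and the sketch should be read as one plausible rendering of the omitted eq. 1, not as the published formula.

```python
def surname_likelihood(bearers_t1):
    """p(s | a_i): bearers of the surname in area i at t_1 over all bearers."""
    total = sum(bearers_t1.values())
    return {area: n / total for area, n in bearers_t1.items()}

def posterior_origin(likelihood, prior):
    """Posterior probability that a surname came from each source area.

    likelihood: {area: p(s | a_i)}; prior: {area: pi_j(a_i)} for one
    recipient area j. Areas missing from `likelihood` count as 0.
    """
    joint = {a: likelihood.get(a, 0.0) * prior[a] for a in prior}
    evidence = sum(joint.values())          # denominator: sum over all areas
    if evidence == 0.0:
        return {a: 0.0 for a in prior}      # surname has no support under prior
    return {a: v / evidence for a, v in joint.items()}

# Hypothetical example: a surname with 30 bearers in A and 10 in B at t_1
# that newly appears in area C at t_2.
likelihood = surname_likelihood({"A": 30, "B": 10})
prior_C = {"A": 0.25, "B": 0.5, "C": 0.25}
posterior = posterior_origin(likelihood, prior_C)
```

Here the surname is three times more frequent in A than in B, but the prior favours B twice as strongly, so the posterior comes out at 0.6 for A and 0.4 for B.
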

Since, for an area a_j, this probability of origin is estimated for each surname s_kj, one obtains a more accurate estimate of the overall probability of geographic origin of all surnames of migrants from the area a_i, pgo_ij, by taking a weighted sum of p_j(a_i | s_kj) over all surnames. This pgo_ij, based on all surnames newly arriving in recipient area a_j from source area a_i between two instances of time, is ...(2) where the weight of each surname is the number of persons bearing surname s_kj in recipient area a_j at time t_2. The sum of pgo_ij over all areas i equals 1. Once these pgo_ij probabilities are obtained, they are used as a new estimate of the a priori probability π_j(a_i). These are substituted into the Bayesian formula, which is then recalculated. This iterative process is carried out until a convergence criterion ε is met, defined as the mean of the absolute differences between the sums of all pgo values obtained from two consecutive iterations. Finally, we obtain an asymmetric matrix P, with elements pgo_ij, the probability of being born in COROP area a_i for migrated individuals living in COROP area a_j in 2007.
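
The iterative procedure can be sketched as below. Several points are assumptions made for illustration and are not taken from the paper: the weights are normalised by the total number of bearers of new surnames in the recipient area, the starting prior is uniform, areas without new surnames simply keep their prior, and the convergence criterion is taken as the mean absolute change of the pgo elements between consecutive passes. All names and toy values are hypothetical.

```python
def estimate_pgo(likelihood, new_counts, areas, max_iter=8, tol=1e-6):
    """Iterate the Bayesian update; return (pgo, history of the criterion).

    likelihood[s][i] : p(s | a_i) at t_1
    new_counts[j][s] : bearers at t_2 of surname s, new to recipient area j
    pgo[j][i]        : estimated probability that arrivals in j come from i
    """
    prior = {j: {i: 1.0 / len(areas) for i in areas} for j in areas}
    history = []
    for _ in range(max_iter):
        pgo = {}
        for j in areas:
            weights = new_counts.get(j, {})
            total = sum(weights.values())
            if total == 0:                      # no new surnames: keep prior
                pgo[j] = dict(prior[j])
                continue
            pgo[j] = {i: 0.0 for i in areas}
            for s, n in weights.items():
                joint = {i: likelihood[s].get(i, 0.0) * prior[j][i]
                         for i in areas}
                evidence = sum(joint.values())
                if evidence == 0.0:
                    continue
                for i in areas:                 # weighted sum over surnames
                    pgo[j][i] += (n / total) * joint[i] / evidence
        eps = sum(abs(pgo[j][i] - prior[j][i])
                  for j in areas for i in areas) / len(areas) ** 2
        history.append(eps)
        prior = pgo                             # pgo becomes the new prior
        if eps < tol:
            break
    return pgo, history

# Hypothetical toy input: two surnames newly arriving in area C.
areas = ["A", "B", "C"]
likelihood = {"Visser": {"A": 0.5, "B": 0.5}, "Dijkstra": {"A": 1.0}}
new_counts = {"C": {"Visser": 2, "Dijkstra": 2}}

pgo_first, _ = estimate_pgo(likelihood, new_counts, areas, max_iter=1)
pgo_final, history = estimate_pgo(likelihood, new_counts, areas, max_iter=8)
```

In this toy case the first pass gives area A a share of 0.75 of the arrivals in C, and every further pass moves more weight to A while the share of B shrinks toward zero.
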

As a measure of correspondence between the M and P matrices, the correlation between their elements is calculated by a nonparametric Spearman rank-order correlation and by a Bravais-Pearson correlation after log-normalization of the skewed distribution of the m and pgo values.
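
Both correlation measures can be computed with a few lines of standard-library Python, as sketched below. The helper names and the example vectors are hypothetical, and the sketch assumes that element pairs with a zero in either matrix are dropped before taking logarithms, which the text does not state. The base of the logarithm does not affect the Bravais-Pearson coefficient, and the rank correlation is unchanged by the log transform.

```python
import math

def pearson(x, y):
    """Bravais-Pearson product-moment correlation of two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def compare(m_values, pgo_values):
    """Return (Spearman, Pearson on logs) for paired matrix elements."""
    pairs = [(a, b) for a, b in zip(m_values, pgo_values) if a > 0 and b > 0]
    ms, ps = zip(*pairs)
    log_m = [math.log10(a) for a in ms]
    log_p = [math.log10(b) for b in ps]
    return spearman(ms, ps), pearson(log_m, log_p)

# Hypothetical flattened elements of M and P for one recipient area.
m_values = [0.5, 0.3, 0.15, 0.05, 0.0]
pgo_values = [0.6, 0.2, 0.05, 0.15, 0.0]
rho, r_log = compare(m_values, pgo_values)
```

For these toy vectors the two smallest non-zero elements swap rank order, which gives a Spearman coefficient of 0.8.
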

To compare the geographic diversity of the origin of migrants, for each COROP area j, the entropy for both the M and the P matrices was computed as ...(3) The sum is over all COROP areas. The Spearman rank-order correlation between these two indices of diversity was also calculated. The entropy has a maximum value of log(40) ≈ 1.6 for a uniform distribution of the origins of migrants over the 40 COROP areas.
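
A short sketch of the entropy computation follows. It assumes the usual Shannon form, the negative sum of p log p over source areas, with base-10 logarithms, since only base 10 reproduces the stated maximum of log(40) ≈ 1.6; the function name is hypothetical and zero entries are taken to contribute nothing.

```python
import math

def origin_entropy(shares):
    """Shannon entropy (base 10) of the origin shares for one recipient area."""
    return -sum(p * math.log10(p) for p in shares if p > 0)

# Uniform origins over 40 areas give the maximum entropy.
h_max = origin_entropy([1 / 40] * 40)

# All migrants from a single source area give zero entropy.
h_min = origin_entropy([1.0] + [0.0] * 39)
```

With natural logarithms the maximum for 40 areas would be about 3.69 instead, so the value quoted in the text pins down the base.
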

Results and Discussion

Because the Bayesian method uses only the subset of newly arriving surnames in each area, it was verified that migration derived from this subset is representative of the whole data set. The overall correlation between the elements m_ij of the total actual migration matrix M and those of the matrix M* based on this subset of surnames is 0.93 (Bravais-Pearson correlation) and 0.96 (Spearman rank-order correlation). This supports the assumption that migration estimated on the basis of new surname appearances is representative of total migration.

The Spearman rank-order correlation and the Bravais-Pearson correlation between log(m_ij) and log(pgo_ij) were computed for a series of iterations (Table 2). The table shows that the highest correlation between the elements of the M and P matrices is found immediately at the second iteration, which involves the computation of pgo_ij on the basis of the a priori probability π_j(a_i) = 1/40 of immigration from all 40 areas. It follows from eq. 1 that in this special case p_j(a_i | s_kj) = p(s_kj | a_i) (a value that is subsequently weighted according to eq. 2 to arrive at pgo_ij).

When the set of surnames s_kj that are new to area j is geographically nonspecific, p(s_kj | a_i) for each area j would follow the population distribution d(a_i) across the i areas where they are present and would affect pgo_ij accordingly. In the opposite case, when surnames s_kj are entirely indicative of some source area g (i.e., p(s_kj | a_i) = 1 for i = g and p(s_kj | a_i) = 0 for i ≠ g), the relation p_j(a_i | s_kj) = p(s_kj | a_i) at first iteration will immediately describe the probability of origin of migration pgo_ij correctly.
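
The two limiting cases can be checked numerically with a small sketch. The function name and all numbers are hypothetical: `population_share` stands in for the population distribution d(a_i) followed by a nonspecific surname, and `specific` for a surname found only in one source area.

```python
def posterior(likelihood, prior):
    """Bayes update over source areas; both arguments are aligned lists."""
    joint = [l * p for l, p in zip(likelihood, prior)]
    total = sum(joint)
    return [v / total for v in joint]

n_areas = 4
uniform_prior = [1 / n_areas] * n_areas

# Nonspecific surname: p(s | a_i) follows the population shares d(a_i).
population_share = [0.4, 0.3, 0.2, 0.1]
post_nonspecific = posterior(population_share, uniform_prior)

# Fully specific surname: present only in source area g (index 1 here).
specific = [0.0, 1.0, 0.0, 0.0]
skewed_prior = [0.7, 0.1, 0.1, 0.1]
post_specific = posterior(specific, skewed_prior)
```

With a uniform prior the posterior simply reproduces the population shares, whereas a fully specific surname points to its source area whatever the prior, as long as that area has non-zero prior probability.
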

Obviously, every surname will have some intermediate status between being nondifferential among geographic areas and uniquely describing one of them. The weighted mean (eq. 2) of all surnames, whatever their status, is a way to bring together all the information into pgo_ij, which is refined in successive iterations.

The geographic specificity of the surnames in our study is shown in Figure 2, which gives the percentage of surnames (of all names that are new to at least one area) as a function of the number of COROP areas where they are present. It shows that 68.4% of the surnames are present in six or fewer areas, and 31.4% in more than six areas. The first group therefore has a realistic potential to reflect the origin of migration accurately, whereas the second group provides less reliable information on this point and will put an emphasis on population density.

The pgo_ij values at the second iteration already deviate significantly from the a priori value of 1/40, but the diversity in the origin of migrants, as estimated by the entropy of the P matrix, h_j^p, is still weaker and more uniformly distributed than the entropy of the M matrix, h_j^m (as shown in Figure 3 for all 40 COROP areas); the Spearman rank-order correlation between h^p and h^m is low (ρ = 0.511, n = 40). At this point, h^p is still close to the maximum value of log(40) ≈ 1.6, and little differentiation of its value among COROP areas (between 1.35 and 1.5) is seen.

For higher numbers of iterations, the correlation between log(m_ij) and log(pgo_ij) decreases, while the Spearman rank-order correlation between h^p and h^m increases. We chose an optimum at eight iterations, at the point where the convergence criterion ε stabilizes near zero (Figure 4). Figure 5 shows the distribution of m_ij and pgo_ij and their covariation after the eighth iteration of pgo_ij. The corresponding M and P matrices are shown in Table 3. The Bravais-Pearson correlation of 0.747 at iteration 8 implies that about 56% of the variance of the Bayesian estimation of migration can be explained by the actual migration. One cannot expect a perfect correlation between estimated and actual migration, since the Bayesian estimation takes into account only the surnames of the individuals who are newly arriving in a given area.
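
The step from the correlation of 0.747 to "about 56% of the variance" is the usual squaring of the correlation coefficient, which a one-line check confirms (variable names are illustrative only):

```python
r = 0.747                      # Bravais-Pearson correlation at iteration 8
shared_variance = r ** 2       # coefficient of determination
print(f"{shared_variance:.3f}")  # 0.558
```
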

Table 3 shows that differences between M and P can be considerable for the COROP areas with higher populations (which often include a major city; see Table 1). The migration from these areas is usually overestimated in P. We already mentioned that the pgo_ij describe migration probabilities with certainty only when each newly arriving surname refers uniquely to some source area. Any deviation from this requirement leads the pgo_ij to be geared towards population density and thus to estimate higher migration from more populated areas. This tendency is unavoidable and should be taken into account in the interpretation of results obtained by the Bayesian method.

The Bayesian method is based on the assumption that the newly arriving surnames are representative of all migration. This assumption may be violated for regional migration. Individuals with regional surnames may constitute an important part of regional migration, but they are excluded from the estimation of migration when areas in the region share these surnames. Regional effects can be studied through the correlation between M and P among COROP areas (Figure 6). COROP areas C1-C10 (Figure 1), which comprise the three northern provinces of Friesland, Groningen, and Drenthe and the northern part of the province of Overijssel, have the lowest correlation (< 0.80), together with the northern part of the province of North Holland (C18) and the newly reclaimed polders of Flevoland (C40). Apart from the latter area, these areas have quite a few regional surnames of Frisian origin. In such a case, the Bayesian method may underestimate the regional migration, leading to a lower correlation between M and P. The polders of Flevoland are special in that most of the population moved there after 1969, so most surnames are new to the area and highly representative of all migration. However, these surnames also included the more popular and widespread names, and the Bayesian method then tends to weight population density heavily, with an emphasis on the COROP areas with major cities. While this is correct for the nearby Amsterdam area, it was not for the other cities, leading to a slightly lower correlation between M and P. For the rest of the country, which includes the COROP areas with the highest numbers of inhabitants, the correlation exceeds 0.80.

The tendency of the Bayesian method to be influenced by population density will not necessarily be reduced in subsequent iterations. Our results even show a gradual degradation: the figures for migration tend to dissociate with an increasing number of iterations, as the higher migration percentages increase and the lower percentages converge to zero (as exemplified by the decreasing number of pgo_ij ≠ 0 in Table 2). As concluded above, the best estimates are likely obtained with a limited number of iterations corresponding to stabilization of the convergence criterion ε.

In this investigation, we have used a data set that is special in several aspects: it is a complete and closed set of individuals for whom place of birth and place of residence in 2007 are known exactly. Excluded were individuals born between 1950 and 1969 who moved out of the country or died after 1969, as well as those who were born between 1969 and 2007 or immigrated from abroad. For historical data sets, such a selection cannot be made, and much more diffuse surname distributions can be anticipated. The estimated migration patterns derived by the Bayesian method on the basis of these surname distributions will then be less precise than found for the present data. Nevertheless, one can conclude that, despite the limitations of surname records, the Bayesian method provides a reasonably efficient and useful proxy of the origin of migrants. Best estimates are achieved by performing a few iterations, ensuring preservation of the level of diversity across areas. With the limitations of the method in mind, as observed in the present study, its application is useful in a historical context, where surname distributions are often the only available data that can shed some light on migration issues.

Received 22 February 2013; revision accepted for publication 17 May 2013.

Literature Cited

Bloothooft, G., K. Mandemakers, L. Brouwer et al. 2012. Data mining in the Dutch (historical) Civil Registration (1811-present). Hum. Biol. 84:177-184.

Brunet, G., P. Darlu, and B. Desjardins. 2012. Writing the history of the Québec populations using surname frequencies. Hum. Biol. 84:188-194.

Chareille, P., and P. Darlu. 2010. Anthroponymie et migration: Quelques outils d'analyse et leur application à l'étude des déplacements dans les domaines de Saint-Germain-des-Près au IXe siècle. In Anthroponymie et migrations dans la chrétienté médiévale, Monique Bourin and Pascual Martínez Sopeña, eds. Collection de la Casa Velasquez, Madrid 116. Madrid: Casa de Velázquez, 41-73.

Degioanni, A., and P. Darlu. 2001. A Bayesian approach to infer geographical origins of migrants through surnames. Ann. Hum. Biol. 28:537-545.

Darlu, P., G. Brunet, and D. Barbero. 2011. Spatial and temporal analyses of surname distributions to estimate mobility and changes in historical demography: The example of Savoy (France) from the XVIIIth to XXth century. In Navigating Time and Space in Population Studies, M. P. Gutmann, G. D. Deane, E. R. Merchant et al., eds. International Studies in Population. London: Springer, 99-114.

Darlu, P., and A. Degioanni. 2007. Localisation de l'origine géographique de migrants par la méthode patronymique: L'exemple de quelques villes de France au début du XXème siècle. Espace Geogr. 36:251-265.

Wijsman, E., G. Zei, A. Moroni et al. 1984. Surnames in Sardinia II. Computation of migration matrices from surname distributions in different periods. Ann. Hum. Genet. 48:65-78.

GERRIT BLOOTHOOFT (1) AND PIERRE DARLU (2)*

(1) Utrecht Institute of Linguistics-OTS, Utrecht University, The Netherlands.

(2) UMR7206 Eco-anthropology and Ethnobiology, Museum national d'Histoire naturelle, Centre national de la recherche scientifique, University Denis Diderot, Paris, France.

*Correspondence to: Pierre Darlu, UMR7206 Eco-anthropology and Ethnobiology, MNHN, CP135, 57 rue Cuvier, 75231 Paris Cedex 05. E-mail:

(c) 2013 Wayne State University Press
