Demographic Change in France: The Numbers

A few years back, certain medical data made big waves in right-wing circles without ever quite spilling over into mainstream media. The data in question consisted of percentages of newborns tested for sickle cell anemia in mainland France. In France, only newborns that have at least one parent originating from a region in which sickle cell anemia is common are tested for the disease. As sickle cell anemia is mostly prevalent in Africa, these percentages were taken as a stand-in for the percentage of French newborns of African heritage.

The screening data suggested that in 2000, 19 percent of babies born in mainland France (excluding overseas departments) were of African origin, a number that rose steadily to 38.9 percent in 2015. This is certainly surprising. If these numbers are correct, France’s ethnic makeup seems poised to jump from entirely European to basically Brazil within two generations.

To me, these data are worth investigating for several reasons.

During the last decades the media fed us a steady diet of articles about the French family friendly policies that were the reason for the birth rate collapse failing to materialize in France. It would certainly be interesting if that was just nonsense and the real reason was a more fecund (or just bigger) class of immigrants.

Ethnic replacement is a centerpiece of right-wing agitation. Of course, the media tells us that it is just a conspiracy theory. Just as with the French birth rate, I am very much interested in the extent of lies told to me by mainstream media outlets. Call it a desire for informational emancipation.

Ethnicity correlates with lots of variables of interest. Quantifying such a rapid change would allow predictions of crime rates, economic growth, human capital, unemployment, etc. Rapid change of any sort is often accompanied by many dangers. If you don’t know about the change, you can’t look out for the dangers.

There are several arguments against equating sickle cell screening with African origin. Among the countries that provided significant numbers of immigrants to France, sickle cell anemia is prevalent in Italy, Greece and Turkey aside from the Maghreb, Subsaharan Africa and the Caribbean. However, the number of recent European immigrants from sickle cell regions is too small to account for more than a few percentage points.

It has also been argued that some hospitals do not distinguish by origin, but instead test all newborns. That is entirely possible. However, it leads to a dilemma. If the absolute number of newborns at risk for sickle cell anemia is overestimated, the growth rate has to be underestimated!
Or to put it differently: If the 19% in 2000 were actually just 10% because 9% were due to unnecessary testing, then to get to 39% in 2015 the percentage of actual kids at risk had to triple from 10% to 30% instead of double from 19% to 39%.
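The dilemma is easy to check with a few lines of arithmetic. This is only a minimal sketch: the 9 percentage points of unnecessary testing are the hypothetical figure from the example above, the overtesting is assumed to stay constant over time, and the function name is made up for illustration.

```python
def growth_factor(share_start, share_end, overtesting=0.0):
    """Growth of the at-risk share after subtracting a constant share
    of newborns that were tested without actually being at risk."""
    return (share_end - overtesting) / (share_start - overtesting)

# Screened shares in percent, rounded as in the example: 19% (2000), 39% (2015).
naive = growth_factor(19.0, 39.0)
corrected = growth_factor(19.0, 39.0, overtesting=9.0)  # hypothetical 9 points

print(f"no overtesting:       {naive:.2f}-fold")
print(f"9 points overtesting: {corrected:.2f}-fold")
```

The larger the assumed constant overtesting, the steeper the implied growth of the remaining share.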

Alternatively, the number of hospitals just testing everybody has steadily risen. In which case the entire data is worthless. Or the original study could just be a hoax by a devious far-right physician. Who knows?

So the first point on our agenda is trying to independently verify the plausibility of the data.

To this end, I downloaded the data for given names in France provided by the French bureau for statistics, INSEE [1]. I also created a list of 2211 popular Muslim names, specifically Arab and Turkish names. Not all of the sickle cell tested babies will be of Arab or Turkish origin. And not all kids of Arab and Turkish origin will be given Arab and Turkish names. Additionally, my list probably doesn’t cover more than a small chunk of all actual Arab and Turkish names. But it still allows us to track the increase of a certain subset of all kids that would be subject to sickle cell testing.

A first quick and dirty run of the numbers: In 2000, out of 800039 kids my list covers 46718 or 5.8%. In 2015, my list covers 80387 out of 777746 names, or 10.3%. This amounts to an estimated 1.77-fold increase of Arab/Turkish newborns over the time span in which the sickle cell percentage roughly doubled, which is reasonably close.
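For anyone who wants to redo the quick and dirty run, here is the arithmetic as a short sketch. The counts are the ones quoted above; the variable names are my own.

```python
# Counts quoted in the text: total births and births covered by the name list.
births_2000, hits_2000 = 800039, 46718
births_2015, hits_2015 = 777746, 80387

share_2000 = hits_2000 / births_2000
share_2015 = hits_2015 / births_2015
fold = share_2015 / share_2000  # relative increase of the covered share

print(f"2000: {share_2000:.1%}  2015: {share_2015:.1%}  increase: {fold:.2f}-fold")
```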

However, out of my 2211 names only 103 and 127 actually occur in the INSEE list of given names for the years 2000 and 2015, respectively. Only 77 names occur in both lists. Some of these names are clearly not just popular among Muslims; girls’ names especially are often ambiguous. So let’s try to tighten up the method.

Now, we only look at names present in both years. We exclude all ambiguous names. Each remaining name provides a separate estimate of how much the percentage of Muslim newborns has changed between 2000 and 2015. This time the overall percentage accounted for by these 56 names almost exactly doubles from 1.95% to 3.89%. The median increase, which should be more robust against outliers (like short term trends in popularity), is also exactly 2.0.
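The tightened method can be sketched roughly as follows. The names and counts are made-up placeholders, not INSEE data, and the helper function is hypothetical. The point is only to show the per-name ratio, the pooled ratio and the median.

```python
from statistics import median

def name_ratios(counts_a, births_a, counts_b, births_b):
    """Per-name change in share of births, for names present in both years."""
    ratios = {}
    for name in counts_a.keys() & counts_b.keys():
        share_a = counts_a[name] / births_a
        share_b = counts_b[name] / births_b
        ratios[name] = share_b / share_a
    return ratios

# Placeholder counts (NOT real data); name_d only occurs in the later year.
toy_2000 = {"name_a": 100, "name_b": 400, "name_c": 50}
toy_2015 = {"name_a": 200, "name_b": 600, "name_c": 300, "name_d": 10}
births = 100000  # same number of births in both toy years

ratios = name_ratios(toy_2000, births, toy_2015, births)
pooled = (sum(toy_2015[n] for n in ratios) / births) / (
    sum(toy_2000[n] for n in ratios) / births
)
median_ratio = median(ratios.values())

print(f"pooled increase: {pooled:.2f}, median increase: {median_ratio:.2f}")
```

In the toy data name_c is a fashion outlier with a six-fold increase. It inflates a plain average of the ratios, but leaves the median untouched.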

To my mind this provides strong confirmation that the sickle cell data is correctly interpreted as showing that the percentage of a predominantly African derived immigrant population among the newborns in France has doubled between 2000 and 2015. Confirmation of the growth rate makes it rather unlikely that the absolute percentage numbers are off by any significant degree.

I did these analyses quite some time ago. At one point I became aware that my given name analysis had been scooped by a French far-right website. (Which was one motivation to finally get the blog going.) In their analysis they try to capture all Muslim names and give a definite estimate of the absolute numbers. They handle ambiguous names by just counting them as half a Muslim. According to their analysis the number of Muslim newborns more than doubled between 2000 and 2015.

This got me thinking about how to do this analysis right. Counting ambiguous names as half is a really ugly hack, likely to overcount names as long as Muslims are a minority. Instead one might use the regional and temporal variation to infer for each name separately how it contributes to the number of Muslims.

Once you have done that you can subtract a precise estimate of the number of Muslims from the sickle cell data to get an estimate of the increase of Subsaharan Africans for each region. Which allows you to do the same inference for SS-African names, which are probably much more ambiguous than the Islamic ones.

If that works, you have ended up with a method to create precise estimates for both groups directly from given names, even in the likely case that the sickle cell data stops being published. Unfortunately this takes quite a lot of time. And of course there is no guarantee that it would work. Maybe a project for the future.



Four problems with cousin marriage

Cousin marriage was prevalent all over the world with the big exception of western Europe [1]. It still is especially common in the Islamic world. Marrying your relatives has the advantage of keeping the family together. Clans of up to several hundred closely related persons are the result and especially in a pre-state context, that is a pretty useful organisational unit. However, from a genetic or evolutionary perspective there are several potential problems with cousin marriage.

The obvious one is the prevalence of homozygosity runs, i.e. sections in the genome that are identical for the chromosome coming from the father and the chromosome received from the mother. These are generally problematic, because the other chromosome copy has to step in whenever something is significantly messed up in one chromosome. Homozygosity runs mean that mutational load hits with full force for some sections of the genome. There is probably an IQ loss of several points and congenital diseases become much more common.

However, all problems caused by homozygosity runs can be fixed by a single outbreeding event. But what if living in a clan environment has reduced the selection for individual achievement in a population for hundreds of years? The welfare state is often blamed for the reduced or reversed selection for positive traits, but a clan is a form of welfare state. The clan provides you with a job, a wife, takes care of you when you fall ill or lose your ability to feed your kids. It is conceivable that the existence of clan structures prevented the slow replacement of the lower class by the middle class that is conjectured to have raised the IQ in Europe until the nineteenth century [2].

Clan borders also work to a certain degree as genetic barriers. This means that positive mutations have a much harder time sweeping the population. If the default is marrying a relative, a positive mutation will have to sweep each clan separately and additionally jump from clan to clan.

The fourth potential problem I see is a reduced response to selection. Response to selection depends on the variance of the trait in question. Variance within each clan (not necessarily within the full population) will be lower for two reasons: The reduced genetic diversity and less assortative mating. Depending on how the selection pressure is structured this might reduce the speed of genetic adaptation.

It is possible that these four factors played a role in the precipitous fall of intellectual productivity that the Islamic world has experienced since the Islamic Golden Age [3].

[1] Cousin marriage Europe

[2] Farewell to alms

[3] Cousin marriage Middle East

Too much of a good thing

The gay germ theory proposes that there is a pathogen that causes male homosexuality [1]. The reasoning behind this theory is that the fitness hit of homosexuality is too big to allow a genetic explanation. I.e. any genetic variation involved would very quickly be weeded from the gene pool and the occurrence of homosexuality would depend on de novo mutations and be very rare. Pathogens, on the other hand, can do with us whatever they want to, because we cannot out-evolve them.

However, there are several indications that an immune reaction of the mother is somehow involved [2]. Here, the idea would be that embryonic tissue that expresses proteins alien to the immune system is attacked and damaged. This would explain an underdevelopment of a hypothalamic nucleus responsible for a male heterosexual orientation, which expresses male specific genes during masculinization and defeminization in utero.

If we combine both ideas, we end up with a putative pathogen that might trigger an immune response against male specific proteins. This gives us a hint how to identify the pathogen in question: It would have to be a pathogen that shares an epitope with such a male specific protein. An epitope is a surface of a folded protein, which is recognized by the immune system’s antibodies.

Unfortunately identifying epitopes from protein sequences requires a solution to the protein folding problem. So we are not going to be able to do it just by downloading a bunch of genetic sequences.

However, it is entirely possible that there isn’t actually a pathogen involved. The effect of shared environment on sexual orientation, for example, is zero [3]. Pretty weird for an infection. Instead male homosexuality might be a case of what I like to call the “too much of a good thing”-failure mode of evolution. Occasionally evolution finds itself in dead ends, where there is a selection pressure towards a good thing, and a catastrophic failure mode whenever there is too much of the good thing.

One example of this is the trisomies. Here the good thing is having a big ovum [4]. Human egg cells are pretty big and for good reason. After fertilization they have to divide quickly and set up shop in the uterus. If they run out of gas before the placenta is in place, that’s it. One way ova get big is very unequal division in the last two rounds of cell division. One new cell keeps all the cell plasma, the other one is discarded. Very unequal cell division results in a big final cell, but it also increases the likelihood that not all of the discarded half of the chromosomes can be stashed in the small cell.

Another example might be autism. One of the symptoms of autism is neural overgrowth in some parts of the cortex [5]. And while the average autist does not do too well on an IQ test, extreme precociousness in children is often accompanied by autism. It seems growing too many neurons and learning too fast has catastrophic failure modes too.

This could be going on with male homosexuality. A strong immune system is nice, but not if it attacks vital parts of your unborn child’s brain. Or conversely, toning down the immune system to accommodate your child is a fine thing, if it doesn’t get one or both of you killed. One could argue that evolution should find some sideway avenue to avoid the failure mode. But it didn’t for the trisomies, nor did it for autism.

[1] Gay germ theory

[2] Antibodies against male specific proteins in mothers of gay sons

[3] Shared environment of homosexuality is zero.

[4] The ovum is large.

[5] Brain overgrowth in autism

Counting names

In the last blogpost we counted a lot of names to determine how NE-Asians are represented among songwriting competition winners. In this post we take a closer look at some methods to determine ethnic composition of a dataset by surname analysis.

Without the anatomical baggage, the theory of NE-Asian relative underperformance boils down to “whatever Ashkenazim have that makes them successful (beyond the absolute IQ-value), NE-Asians probably have slightly less of it than Europeans”.

So we are naturally interested in the Ashkenazi representation among songwriting competition winners. We try to calculate it using two datasets of Ashkenazi names [1],[2]. Using the first set of names we count 17% Ashkenazi names among songwriting competition winners, using the second set of names we only find 6.5%. This mostly tells us that we cannot reliably assess the number of Ashkenazim with this method. There are too many names listed that are quite common among gentiles as well. And on the other hand, it is unclear how complete the lists are.

There has been much speculation whether Ashkenazi intellectual performance is declining due to outmarriage and low fertility. Even if our method is too crude to give precise percentages, we can at least say that there is no evidence of declining performance in this dataset.

It seems counting Ashkenazim is methodologically significantly more difficult than counting NE-Asians. And of course the same holds true for African-Americans or Hispanics, who share a lot of surnames with White Americans.

Counting Ashkenazim is a time-honoured method that has been used to advance many different theories. Due to the Ashkenazi IQ advantage of 10 points, Ashkenazi overrepresentation can, for example, be used to assess the intellectual difficulty of a feat. For very high profile samples it is possible to check ethnicity by hand. For the NE-Asian songwriters this was only possible because a look at a picture is enough. Generally this is not possible for Ashkenazim which leads to a lot of potential data fudging. We would really like to do better than that.

How do we accurately and objectively assess the likely number of Ashkenazim in a sample of surnames? Let’s assume we have the probability distribution of Ashkenazi family names, i.e. the frequency with which each name appears in the Ashkenazi population. We also have the surname distribution in the general US population. Then we can create a mixed distribution of x% Ashkenazim and (100-x)% non-Ashkenazim and calculate the likelihood of our sample given this mixed distribution. The likelihood is just the product of the mixed-distribution frequencies of the names in the sample. By doing this for different x we can find the mixed distribution that leads to the maximum likelihood of our sample. The x% used to create this maximum likelihood distribution is our best guess at the Ashkenazi percentage in our sample. Possibly, we have to adjust the percentage by the fraction of the population that is actually covered by our name database.
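A minimal sketch of this maximum likelihood search, under some assumptions of mine: the two surname distributions are tiny made-up toys, names missing from both tables get a small floor probability, and the maximum is found by a plain grid search. Summing logarithms is equivalent to taking the product of frequencies but avoids numerical underflow for large samples.

```python
import math

def log_likelihood(sample, x, group_freq, rest_freq, floor=1e-12):
    """Log-likelihood of the sample under a mix of x group and (1 - x) rest."""
    total = 0.0
    for name in sample:
        p = x * group_freq.get(name, 0.0) + (1 - x) * rest_freq.get(name, 0.0)
        total += math.log(max(p, floor))  # floor avoids log(0) for unseen names
    return total

def best_fraction(sample, group_freq, rest_freq, steps=1000):
    """Grid search over x in [0, 1] for the maximum likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda x: log_likelihood(sample, x, group_freq, rest_freq))

# Toy distributions (hypothetical): "Beta" is equally common in both groups.
group_freq = {"Alpha": 0.5, "Beta": 0.5}
rest_freq = {"Beta": 0.5, "Gamma": 0.5}
sample = ["Alpha"] * 2 + ["Beta"] * 5 + ["Gamma"] * 8

x_hat = best_fraction(sample, group_freq, rest_freq)
print(f"maximum likelihood fraction: {x_hat:.3f}")
```

In this toy case the ambiguous name carries no information, so the estimate is driven entirely by the two unambiguous names. Evaluating `log_likelihood` over the whole grid also gives the shape of the likelihood curve, not just its peak.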

It may seem easier to just add the fractions of Ashkenazim for each name in the sample. So if we have ten names and in the general population for each of these names 10% are Ashkenazim, these ten names add up to one full Ashkenazi. Unfortunately, this undervalues overrepresented groups. In a sample with five-fold overrepresentation, these ten names should be treated as 50% Ashkenazi. One idea would be to calculate one estimate, use this estimate to update the expected fraction of Ashkenazim for each name and then recursively get closer to the real value, but the maximum likelihood method also gives us a distribution of likely percentages.
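The recursive idea can be sketched as a simple fixed-point iteration. This is essentially the expectation-maximization update for a mixture proportion, which is my framing and not something I have run on real data. The toy distributions and the function name are hypothetical.

```python
def iterate_fraction(sample, group_freq, rest_freq, x=0.5, rounds=100):
    """Repeatedly re-weight each name by the current estimate x and
    use the average weight as the next estimate."""
    for _ in range(rounds):
        weights = []
        for name in sample:
            g = x * group_freq.get(name, 0.0)
            r = (1 - x) * rest_freq.get(name, 0.0)
            if g + r > 0:  # skip names unknown to both tables
                weights.append(g / (g + r))
        if not weights:
            return x
        x = sum(weights) / len(weights)
    return x

# Toy distributions (hypothetical): "Beta" is equally common in both groups.
group_freq = {"Alpha": 0.5, "Beta": 0.5}
rest_freq = {"Beta": 0.5, "Gamma": 0.5}
sample = ["Alpha"] * 2 + ["Beta"] * 5 + ["Gamma"] * 8

x_iter = iterate_fraction(sample, group_freq, rest_freq)
print(f"iterated fraction: {x_iter:.3f}")
```

For this toy sample the iteration settles on the same value a direct maximum likelihood search would find, but it only returns a point estimate.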

The 2010 US census provides the surname distributions for the general US population, Whites, Blacks, Asians and Hispanics [3]. This allows us to test this method at least for these groups. For the songwriting competition winners it results in a maximum likelihood for an Asian percentage of 0.8%. This would correspond to 23 artists. Given that we only looked at NE-Asians in the last blogpost, this result is very compatible with our earlier result.

Likelihood for percentage of non-Asians peaks at 99.2%.

So we do have a method and we do have a motive, but unfortunately we don’t have a distribution of Ashkenazi surnames in the US. One way to get a distribution of Ashkenazi surnames would be by scraping the names of Holocaust victims from the Yad Vashem database [4]. However, that is certainly somewhat disrespectful and it is unclear whether the distribution is all that similar to the current American one. The other way is to collect names of American Jews from a variety of sources, here a hundred and there a hundred, until a significant fraction of the most common names is covered. However, I am currently too lazy to do that. But maybe I will get around to it, if something a lot more interesting than songwriting competition winners turns up.

[1] Ashkenazi names 1

[2] Ashkenazi names 2

[3] 2010 US Census

[4] Holocaust victim names

Verbal IQ and songwriting – NE-Asian underperformance

In my two part blog post “A theory of intelligence” I examine the unusual IQ profiles of both Ashkenazi Jews (high verbal) and NE-Asians (high math-spatial) to propose a theory of intelligence. This theory tries to explain the NE-Asian underperformance in GDP and science relative to their very high IQ, by positing that NE-Asians create fewer lateral and top-down synapses. This leads to slightly lower verbal IQ and conceptual creativity compared to Europeans and especially compared to Ashkenazim.

One of my intuitions is that verbal IQ tests do not pick up on this difference particularly well, because they also load on knowledge and pattern recognition. I wondered whether tail effects in verbally creative endeavors would maybe lend support to my theory. To this end I analyzed a dataset of songwriting competition winners [1].

The dataset consists of 2875 US artists that won prizes or honorary mentions in the years 2002-2017. To identify NE-Asian artists I compare the names against the most common Korean [2], Japanese [3] and Chinese [4] surnames. These surnames cover roughly 90%, 33% and 84.8% of these populations respectively. Each hit I then check by hand to exclude anybody who is provably not Asian (quite likely for some names like Young, Shaw or Lee).

Chinese Americans constitute 1.5% of the US population, Japanese Americans 0.4% and Korean Americans 0.8%. Multiplied with the sensitivity of our method, this leads to (0.015 × 0.848 + 0.004 × 0.33 + 0.008 × 0.90) × 2875 ≈ 61 being the expected number of hits for a perfectly proportional representation of NE-Asian Americans.
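The expected number of hits is just population share times name coverage, summed over the three groups and scaled by the size of the dataset. A short sketch with the figures quoted above (the variable names are mine):

```python
# (population share, fraction of the population covered by the surname list)
groups = {
    "Chinese": (0.015, 0.848),
    "Japanese": (0.004, 0.33),
    "Korean": (0.008, 0.90),
}
n_artists = 2875

expected_hits = n_artists * sum(share * coverage for share, coverage in groups.values())
print(f"expected hits under proportional representation: {expected_hits:.1f}")
```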

Instead we only find 13 NE-Asian names that cannot be excluded, more than a four-fold underrepresentation. Of course, one may argue that this is a result of language deficiencies due to relatively recent immigration. However, there is also no upward trend visible over these 15 years. Only seven of these artists are unambiguously Japanese, Chinese or Korean American. Of the rest, one is Japanese but not American, one is Malaysian, one is Taiwanese (not included in the 1.5%) and four I could not identify.

Also, perfectly proportional representation may be the wrong baseline to compare against. Of the 1554 winners of the US Open Music Competition 2019 [5], a competition in classical music, 1050 have NE-Asian surnames. These also have much more typical names, with the most common being Yang, Wang, Chen, Li, Truong, Zhang, Liu, Loo, Wu, Lin. This makes it plausible that we are still overcounting NE-Asians in the songwriting dataset.

There is the additional fudge factor that you can’t tell who wrote the lyrics. Christopher Tin, for example, whom we counted twice, is a classical composer. His most famous piece is Baba Yetu, the theme song of Civilization IV. Its lyrics are a Swahili version of the Lord’s Prayer.

Overall we see at least a 4.5-fold underrepresentation relative to population percentage, compared to classical music winners a 150-fold underrepresentation. I chalk this up as consistent with my theory.
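Both ratios follow directly from the counts given earlier in this post. As a sketch (using the rounded expectation of 61 hits and the counts from the two competitions; the variable names are my own):

```python
expected_hits = 61       # proportional expectation derived above
observed_hits = 13       # names that could not be excluded
songwriting_total = 2875

classical_hits = 1050    # NE-Asian surnames among the classical winners
classical_total = 1554

vs_population = expected_hits / observed_hits
vs_classical = (classical_hits / classical_total) / (observed_hits / songwriting_total)

print(f"relative to population share: {vs_population:.1f}-fold")
print(f"relative to classical music:  {vs_classical:.0f}-fold")
```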

[1] International songwriting competition

[2] Korean Surnames

[3] Japanese surnames

[4] Chinese surnames

[5] US Open Music Competition Winners

Hereditarianism III: Discussion

In the last post, we saw that for African-Americans and Hispanics, IQ varies according to ancestry. In this post we will discuss what this actually means and whether there is still leeway for the environmentalist to wriggle about.

The key idea of this kind of admixture study is to show that the differences between ethnic groups can entirely be explained by genetic factors. This is done by showing that the IQ differences within each ethnic group by ancestry extrapolate to the differences between ethnic groups. So it is essential that we only look at IQ and ancestry within each ethnic group.

Without a strict restriction to one ethnic group, it would not be enough to prove that IQ correlates with admixture. We already know that there is an IQ gap and we already know that there is an “admixture gap”. So a correlation is already a given.

But what if the self-identified ethnicity is noisy? For example some of the “Hispanics” might actually identify or be identified as White. In that case the correlation between ethnicity and IQ would bleed over into the IQ-admixture. Of course this assumption borders on paranoia. But the correlations observed are quite small, which means that admixture explains very little of the IQ variance in the data set, which might seem counterintuitive from a hereditarian perspective.

So what kind of correlation should we expect? If the European-Amerindian gap is 16 points, similar to the Hispanic standard deviation, shouldn’t we expect admixture to explain a very significant part of the variation? Well, actually not. If admixture is uniformly distributed, the mean difference in admixture between two Hispanics is only 33.3%. This means the average IQ difference explained by admixture would at most be 5-6 points. But the admixture is not uniformly distributed: Hispanics with less than 40% European admixture are notably rarer. This is why the actual standard deviation of admixture is just 23.3. So we are down to less than 4 points explained by admixture. This would lead to a correlation of 0.50 … given perfect data. But both the admixture data and especially the IQ data invariably contain noise, reducing this correlation further. So it is actually not surprising that we only see correlations between 0.17 (for the very range-restricted African Americans) and 0.41 (for much more uniformly distributed African-European Hispanics).
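The figures for the uniform case can be checked numerically. This sketch approximates a uniform admixture distribution by an evenly spaced grid (my own shortcut, not a simulation of the data set) and multiplies the spread by the 16-point gap from the text.

```python
n = 400
grid = [(i + 0.5) / n for i in range(n)]  # evenly spaced admixture values in (0, 1)

# Mean absolute difference between two independent uniform draws (exactly 1/3).
mean_abs_diff = sum(abs(a - b) for a in grid for b in grid) / n**2
# Standard deviation of a uniform distribution on [0, 1] (exactly 1/sqrt(12)).
sd_uniform = (sum((a - 0.5) ** 2 for a in grid) / n) ** 0.5

gap = 16            # European-Amerindian gap in IQ points, as assumed in the text
sd_observed = 0.233  # standard deviation of admixture quoted in the text

print(f"mean admixture difference (uniform): {mean_abs_diff:.3f}")
print(f"IQ points at the uniform mean difference: {gap * mean_abs_diff:.1f}")
print(f"sd of admixture (uniform): {sd_uniform:.3f}")
print(f"IQ points per observed sd of admixture: {gap * sd_observed:.1f}")
```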

A better way than looking at correlations to drive home the meaning of the hereditarian hypothesis is to visualize how the mean IQ of percentiles changes. The hereditarian hypothesis posits that IQ varies continuously with admixture. This means that the IQ averages of admixture percentiles will more or less linearly increase.

To show this effect for each percentile would require a much larger data set. This data set is almost too small and heterogeneous to show the effect convincingly for quartiles. For example, as we have seen, the Hispanic IQ is slightly depressed compared to the same admixture in African Americans. Because the middle region of European admixture is dominated by Hispanics this results in a depressed middle if we use the whole sample.

Instead we restrict ourselves to the Hispanic sample. Because the mean White and mean Asian IQ in our data is almost identical, we can just pool European and East Asian admixture to create a well-powered Hispanic quartile admixture plot:

n=323, slope=21.56, intercept=75.32, correlation=0.273, p-value=6.217e-07

Here, we see that the average IQs of the admixture quartiles fall pretty nicely on the regression line.
This plot perfectly illustrates the hereditarian hypothesis: The averages vary exactly according to admixture. (Note also, that if we plot a line through the first two quartile averages only, we would overshoot the mean white IQ, presumably because the lowest quartile is slightly environmentally depressed. This might be happening in the African-American sample.)

It is tough to come up with environmental causes for IQ differences that vary according to ancestry. Colorism is one of the best tries. Colorism is the idea that racism is graded by how dark somebody’s skin is, which varies according to ancestry, and that this racism somehow reduces IQ. Except when you are NE-Asian … Colorism as the reason for IQ varying with ancestry is a theory that has a lot to prove before it can be remotely taken seriously.

However, IQ varying by ancestry also doesn’t prove that the gap is fully genetic. Or, to put it differently, even if we could predict IQ perfectly directly from the genome, it remains theoretically possible that there are gene-environment feedback mechanisms involved that allow us to reduce the magnitude of the gap by improving living/learning conditions. Of course the history of intervention studies tells us not to hold our breath.

So, what are the take-aways from this series:

  1. IQ varies by ancestry within ethnic groups with the same country of birth.
  2. This intra-ethnic variation fully explains IQ differences between ethnic groups.
  3. This invalidates most environmental explanations for the IQ gaps.
  4. And strongly suggests a genetic reason for IQ gaps between ethnic groups.
  5. Ancestry nonetheless explains little individual IQ variation – people should be judged as individuals.

Hereditarianism II: Admixture Data and Gaps

In the last post, we saw that the environmentalist position about group differences in IQ is mostly based on the idea of x-factors: factors that are hard to identify, vary systematically between groups and affect IQ. Given that there are many factors that vary between ethnic groups, this is a difficult theory to disprove.

However, from a hereditarian perspective, two persons belonging to the same ethnic group can sometimes be differentiated by different amounts of a certain genetic ancestry. So in ethnic groups whose members have varying degrees of admixture of some original founding populations we can put the hereditarian hypothesis to the test. This is the case for African-Americans, who have varying degrees of European ancestry and for Hispanics, who are mostly a mixture of Europeans, Amerindians and Africans.

The hereditarian hypothesis predicts that IQ will vary within these groups with the amount of admixture for any chosen ancestral group. This type of admixture study has the power to rule out the majority of x-factors that systematically vary between ethnic groups, except for those that vary roughly according to ancestry.

A recent paper showed IQ varying by ancestry for Hispanics and African Americans [1]. These are the key figures.

The regression line of the relationship between cognitive ability and European ancestry in African Americans
And the same thing for Hispanics …

In this post we are going to reanalyze the underlying data set. This data set contains IQ scores for a couple of hundred self-identified Whites, Blacks, Hispanics, East Asians + other minorities and the percentage of their genome being European, African, Amerindian, Asian etc.

First we translate the cognitive ability measure, here given in whole sample standard deviations above the sample mean, into IQ, with white mean = 100 and white standard deviation = 15.
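The rescaling step can be sketched like this. The input scores are made-up placeholders, the function name is hypothetical, and using the population standard deviation of the reference group (rather than the sample standard deviation) is an assumption on my part.

```python
from statistics import mean, pstdev

def to_iq(scores, reference_scores):
    """Rescale scores so that the reference group has mean 100 and sd 15."""
    ref_mean = mean(reference_scores)
    ref_sd = pstdev(reference_scores)  # assumption: population sd of the reference group
    return [100 + 15 * (s - ref_mean) / ref_sd for s in scores]

# Placeholder scores in whole-sample standard deviations (NOT real data).
white_scores = [-0.4, 0.1, 0.3, 0.8, 1.2]
other_scores = [-1.0, 0.4]

white_iq = to_iq(white_scores, white_scores)
other_iq = to_iq(other_scores, white_scores)
print([round(v, 1) for v in other_iq])
```

By construction the reference group ends up with mean 100 and standard deviation 15, and every other score is expressed on that scale.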

n=137, slope=23.283, intercept=79.6, correlation=0.176, p-value=0.0392

The slope of 23.283 immediately gives us the gap between 100% European and 100% African, while the intercept provides us with the IQ of a 100% African African-American. The regression line overshoots the mean white IQ. This might be noise, or legitimately smarter white genes in the black population, or Amerindian admixture in the whites reducing the mean, or a slight environmental downward bent of the left part of the plot. But whether we take the estimated gap, or the difference between actual white mean IQ and the 100% African IQ, the result is always strikingly close to Galton’s estimate.

Of course this is just a very small sample with a very restricted range. However, we can immediately replicate this regression line with those Hispanics that have predominantly African and European admixture.

n=79, slope=23.837, intercept=73.33, correlation=0.416, p-value=0.000134

This gives us a virtually identical gap. But the whole line is shifted down. This jibes well with other results, see for example [2]. The average Hispanic IQ in this sample is only 89.5, compared to a usual US Hispanic IQ of 92-93, so it might still be missing a few points of Flynn effect. Note, however, that this seems to affect the entire IQ range in the same fashion.

The combined sample of African Americans and Euro-African Hispanics of course also validates Galton’s estimate of the gap almost perfectly.

n=257, slope=22.282, intercept=77.979, correlation=0.401, p-value=2.34e-11

For comparison, for Hispanics with predominantly European and Amerindian admixture the plot looks like this.

n=323, slope=16.65, intercept=80.024, correlation=0.233, p-value=2.231e-05

The gap is some 7 points smaller and the percentage of European admixture is generally quite high, which is why despite the missing Flynn effect points, the average Hispanic IQ is 89.5 vs 83.7 for African Americans.

[1] Biogeographic Ancestry, Cognitive Ability and Socioeconomic Outcomes

[2] A study of intelligence of children in Brazil